Missing data analysis: - A study of complete case analysis, single imputation and multiple imputation. Filip Lindhfors and Farhana Morko


Bachelor thesis, Department of Statistics
Kandidatuppsats, Statistiska institutionen, Nr 2014:5

Missing data analysis: - A study of complete case analysis, single imputation and multiple imputation

Filip Lindhfors and Farhana Morko

Bachelor's degree thesis in Statistics (15 credits), spring term 2014
Supervisor: Nicklas Pettersson

Abstract

Is there a way to solve the problem of missing values in a data set? In this paper four missing data methods are studied and applied to a data set of 129 countries, which are treated as the global population of all the countries in the world. The methods are complete case analysis, single imputation and two multiple imputation methods (one frequentist and one Bayesian approach). The aim is to compare the results of the methods for a mean estimator. All data in this paper are on country level for the year 2002 and contain three variables: corruption is the dependent variable with missing values, and GDP per capita and civil liberties are the explanatory variables with complete observations. A data set with only one variable with missing values was chosen to simplify for the reader. The missingness is assumed to be at least missing at random. To get better and more general results the four methods are investigated in a simulation study. The sample size is equal to the entire population size, which in turn generates a large data set with many missing values. A population imputation procedure, which is very rarely used, is implemented. After applying each of the four methods, the resulting mean estimators are compared. The conclusion drawn from the results is that, of the methods studied, the Bayesian multiple imputation method is the best to use for our data set.

Key words: Complete case analysis, single imputation, (frequentist and Bayesian) multiple imputation, population imputation, missing data, simulation study.

Preface

We would like to thank our supervisor Nicklas Pettersson for the guidance through our work.

Table of contents

1 Introduction
1.1 Background
1.2 Methods
1.3 Purpose
1.4 Delimitation
1.5 Outline
2 Missing data
2.1 Missing data mechanisms
2.1.1 Missing completely at random (MCAR)
2.1.2 Missing at random (MAR)
2.1.3 Not missing at random (NMAR)
2.1.4 Missing data mechanism for our data set
2.2 Missing data patterns
3 Theoretical part of missing data methods
3.1 Complete case (CC) analysis
3.2 Single imputation (SI)
3.3 Multiple imputation (MI)
3.3.1 Variance formula
3.3.2 Multiple imputation vs single imputation
3.3.3 Differences between a frequentist and Bayesian approach
3.3.4 Estimates of imputation uncertainty for missing values
4 Description of the data set
5 Practical/empirical part in software programs SAS and R
6 Results
7 Discussion and conclusion
8 References
Appendix 1: SAS codes
Appendix 2: R codes
Appendix 3: R output

1 Introduction

1.1 Background

Missing data can be difficult to handle and can be missing for many different reasons. When we looked at different data on country level there appeared to be a pattern in which countries had missing values more often than others: missingness seemed to be more widespread among less developed countries. These countries perhaps do not have the same capacity to store and retrieve statistics as more developed countries. Since handling of missing data has not been covered in our previous statistical courses (75 credits) we found this to be a good topic for this thesis.

There are several methods to choose between when dealing with missing data. Little and Rubin discuss many of them in their book Statistical Analysis with Missing Data, but the focus in this paper is on the following methods: complete case (CC) analysis, single imputation (SI) and multiple imputation (MI). These methods were chosen because they are widely used (especially the first two). In complete case analysis (also named listwise deletion) all observations with at least one missing value are deleted, so that a complete data set is created. With a complete data set it is possible to apply ordinary statistical methods. The second method is single imputation, where each missing value is replaced by one imputed [1] value, which also creates a complete data set. The third method is multiple imputation, which creates complete data sets in a similar way as the SI method. The difference is that instead of imputing one value for each missing item, MI creates several values for each missing value. That is, several complete data sets are generated. The estimates from each imputed data set are then combined into one single estimate.
1.2 Methods

This thesis includes both a theoretical part, in which the methods are presented, and a practical/empirical part, where the methods are implemented on a data set using the software programs SAS and R. A population imputation is performed in the latter part. This is not a common approach (since it is often not possible to observe the entire population) and can be seen as our contribution to research. The idea is to compare the results of the methods to find out which method is the best to use. For instance, the bias, mean square error (MSE) and estimated standard error are calculated for each method. A simulation of missing data is performed before the methods are implemented. The simulation of missing data will be done [...] times. The methods used are complete case analysis, one single imputation method and two multiple imputation methods (one frequentist and one Bayesian approach).

[1] An imputed value is a value which is filled in to replace a missing value.

The data set used in this paper consists of three variables: one dependent variable (the corruption perceptions index), which has 28 missing values, and two independent variables (GDP/capita and an index of civil liberties), which are fully observed. These variables are explained further in section 4, Description of the data set. GDP/capita and civil liberties were selected as explanatory variables because they are suspected to be connected with corruption. All data are on country level, and a total of 129 countries for the year 2002 are included in the study. These 129 countries are treated as the entire (global) population of all the countries in the world. The focus is on how to handle the missing values of the corruption variable in the data set. Corruption was chosen as the variable of interest for a simple reason: in an assignment in a prior statistics course we originally chose the corruption variable, but it contained missing values for several countries. How to handle a data set with missing values was not part of our earlier courses, so the only solution at that time was to choose another variable to study. This earlier experience of not being able to use corruption as the variable of interest made us choose to investigate corruption in this thesis.

1.3 Purpose

This study tries to find a solution for handling missing values in a data set. The aim is to investigate and compare the results of four different methods when estimating the mean: complete case analysis, single imputation and two versions of multiple imputation (one frequentist and one Bayesian approach). Which of these methods is preferred when dealing with nonresponse (missing values) in the data set?

1.4 Delimitation

There are some delimitations of this thesis. The first is that the empirical results will be based on only one data set.
If other data sets had been studied as well, the empirical results would perhaps have been different. There are two main reasons for working with only one data set: the first is the time limitation and the second is to make the thesis as easy as possible for the reader to follow. There are several other methods available for treatment of missing data that are not investigated in this paper. The reasons for not studying more methods are the same as for not using several data sets: limited time and clarity for the reader. CC, SI and MI were chosen (and not some other methods) because they are widely used (particularly the first two).

1.5 Outline

Chapter 2 is an introduction describing different causes of missing data. The three mechanisms of missing data are presented: missing completely at random, missing at random and not missing at random. Different missing data patterns are also presented. Chapter 3, the theoretical part, describes the missing data methods studied in this thesis: complete case analysis, single imputation and multiple imputation. Chapter 4 describes the variables in the data set, where corruption is the not fully observed variable of interest and GDP per capita and civil liberties are the fully observed independent variables. Some descriptive statistics, such as means and correlations, are also presented in that chapter. Chapter 5 is the practical part, where the nonresponse mechanism is estimated and the simulation procedure is described. In chapter 6 the results of the different methods are presented. A discussion of the four methods based on the results and a short summary are presented in chapter 7.

2 Missing data

2.1 Missing data mechanisms

Before using any method for dealing with missing data it is important to understand why the data is missing. There can be many reasons. Missing data could arise if some respondents in a survey cannot participate for reasons such as sickness. Another common cause is that the data collector types in some of the data incorrectly. Other reasons could be that respondents are on vacation or that some refuse to answer on principle. Some missing data could arise because of language difficulties or because respondents are stressed and do not have enough time to participate in the survey. [2] Another reason could be that the equipment used, for instance in a medical study, did not work correctly. Data can be missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR), and these three missing data mechanisms are presented in the sections below. [3]

2.1.1 Missing completely at random (MCAR)

Let us assume that there is only one variable with missing data, called Y, and another variable X, which is fully observed. Let R=1 if the variable Y has a missing value and R=0 if the value of Y is observed (this notation is used throughout this thesis).
The MCAR assumption can then be written:

$P(R = 1 \mid X, Y) = P(R = 1)$   (2.1)

The probability of Y being missing does not depend on the observed variable X or on Y itself (2.1). [4] In other words, the assumption of missing completely at random means that the probability of an observation ($Y_i$) being missing does not depend on the value of $Y_i$ or on the value of any other variable in the data set (X in our two-variable example). If respondents with low income (Y=low) are as likely to report their income as those with high income (Y=high), irrespective of their level of education (X), then missingness can be considered MCAR. [5]

[2] Hörngren, Jan
[3] Howell, David C.
[4] Allison, Paul D., Missing Data, p.73
[5] Howell, David C.

2.1.2 Missing at random (MAR)

The MAR assumption can be written as:

$P(R = 1 \mid X, Y) = P(R = 1 \mid X)$   (2.2)

Let us assume that only Y has some missing data and that the other variable X has only observed values. The probability of Y being missing does not depend on Y itself, but it can depend on X (2.2). If the probability of reporting income (Y) depends on the level of education (X), then the missingness can be considered to be MAR. MCAR is a special case of MAR, which means that if the data is MCAR it is also MAR. MAR is a weaker assumption than MCAR.

2.1.3 Not missing at random (NMAR)

Not missing at random is when the MAR assumption has been violated and the variable Y is not missing at random. [6] This can be written:

$P(R = 1 \mid X, Y) = P(R = 1 \mid X, Y)$   (2.3)

That is, the expression (2.3) cannot be simplified if the data is not missing at random. There can be several reasons why the data is not missing at random. One example is the income case described above, if respondents with low income (Y) tend to report their income less frequently. Another example of NMAR is if missingness depends on the level of education (X), whether it is observed or not.

2.1.4 Missing data mechanism for our data set

Regarding the variable corruption, the reason why values are missing is not known, but values seem to be missing to a greater extent for more corrupt countries and countries with a low degree of liberty. These could be reasons why data is missing on the corruption index, but this is not very likely. Why not? The corruption index is a combined index representing different surveys and valuations of perceptions of corruption, from reliable institutions that make judgments about corruption in many countries. Therefore one could conclude that the reason for a missing value on corruption does not depend on how corrupt the country is.
It is more likely that the institutions simply did not study some countries in the year 2002. It is interesting to look at the latest survey of the corruption perceptions index, from 2013. In that study 177 countries were represented (almost all of the world's countries). [7] What can be concluded from this is that it is reasonable to assume that the data on corruption is at least missing at random.

[6] Allison, Paul D., Missing Data, p 74
[7] Transparency International
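To make the difference between the mechanisms concrete, the following minimal Python sketch generates missingness under MCAR (2.1) and MAR (2.2) and compares the complete case mean of Y with the mean of the full data. It is an illustration only and not part of the thesis code (which uses SAS and R); the simulated variables, the missingness probabilities and the function name simulate_missingness are assumptions made for this example.

```python
import random
import statistics


def simulate_missingness(x, mechanism, rng):
    """Return R for each unit (1 = Y missing, 0 = Y observed)."""
    x_median = statistics.median(x)
    r = []
    for xi in x:
        if mechanism == "MCAR":
            # Constant probability: P(R=1 | X, Y) = P(R=1)
            p_missing = 0.3
        elif mechanism == "MAR":
            # Probability depends on the fully observed X only: P(R=1 | X)
            p_missing = 0.5 if xi < x_median else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        r.append(1 if rng.random() < p_missing else 0)
    return r


rng = random.Random(2002)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]  # Y is correlated with X

full_mean = statistics.fmean(y)
cc_means = {}
for mechanism in ("MCAR", "MAR"):
    r = simulate_missingness(x, mechanism, rng)
    observed_y = [yi for yi, ri in zip(y, r) if ri == 0]
    cc_means[mechanism] = statistics.fmean(observed_y)
    print(mechanism, "complete case mean minus full mean:",
          round(cc_means[mechanism] - full_mean, 3))
```

Under MCAR the complete case mean should stay close to the full-data mean, while under this MAR setup units with low X are missing more often, so the complete case mean of Y is pulled upwards.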

2.2 Missing data patterns

The missing data pattern shows which values are observed and which are missing. There are different patterns of missing data. One example is the monotone missing data pattern. This pattern is common in studies following respondents over time. One example of such a study could be a study of 1000 unemployed persons who are interviewed each quarter during five years. The monotone missing data pattern arises if a fraction of the respondents leave the study between quarters. This leads to a pattern where all subsequent observations (after the drop-out) are missing. The missing data could also be arbitrary. In such a case the missing values have a random pattern. This pattern is common in surveys where many respondents have not answered one or several of the questions. Another pattern is the univariate missing data pattern, in which only one of the variables has missing values and all other variables are fully observed. [8] This is the pattern of the data set used in the practical part, in which the GDP/capita and civil liberties variables have no missing values while the corruption index is missing for 28 countries.

3 Theoretical part of missing data methods

3.1 Complete case (CC) analysis

CC analysis is a method for handling missing data and is one of the methods used in this thesis. The method is also called listwise deletion and is the most common method for handling missing data, [9] probably because of its simplicity. Complete case analysis is in many cases the default in statistical programs (for example in SAS when linear regression analysis is performed). [10] An example clarifying how to use the method is given in Table 3.1 below. The data is not taken from any source (it is made up to illustrate the method).

[8] Little & Rubin, p. [...]
[9] Howell, David C.
[10] Institute for Digital Research and Education

Table 3.1: Missing values for the variables in the example data set.
(- = missing value; [...] = value not recoverable from the source)

Subject  Age    Gender  Education    Monthly income (in SEK)
1        32     -       High school  [...]
2        [...]  Male    University   [...]
3        -      Male    University   [...]
4        [...]  Male    High school  [...]
5        [...]  -       -            [...]
6        [...]  Female  High school  -
7        33     Female  High school  [...]
8        [...]  Female  University   [...]

The data set consists of 8 subjects, with one dependent variable (monthly income) and three explanatory variables. Subject 1 has no data for gender, while the age of the third subject is unknown. The fifth subject has missing data on both gender and educational level, while the income of subject 6 is missing. The method is very simple: all observations with at least one missing value are deleted. In the example above subjects 1, 3, 5 and 6 have at least one missing item. Therefore these observations are deleted, and by doing so a complete data set is created. Four observations (number 2, 4, 7 and 8) remain, so the sample size has decreased to 4 and half of the observations have been deleted. A regression analysis, for instance, can then be performed on these subjects. In general, if a regression model contains many variables it is common that many of the observations have at least one missing value, leading to a big reduction of the data set when CC analysis is implemented.

An investigator performing a complete case analysis assumes that the observed complete cases are a random sample of the originally targeted sample. In other words, the researcher assumes that the subjects with no missing values are a random sample from the whole population. This method is acceptable to use when there is a small amount of missing values in the data set, since the effect is not considered to be too big regardless of whether the data is MCAR, MAR or NMAR. [11]

There are some advantages of complete case analysis. The first advantage is that the method is very easy to perform. The second advantage is the possibility to compare univariate statistics.
[12] This can be done because all statistics are based on the same subjects after the deletion of incomplete cases (that is, after all observations with at least one missing value are deleted from the analysis). [13] The third advantage is that if the assumption of MCAR holds then the parameter estimate will be unbiased. [14]

There are also some drawbacks of complete case analysis. One disadvantage is that incomplete cases are not considered in the analysis, which leads to a loss of information. According to Little and Rubin this loss shows up both as reduced precision and as bias if the assumption of MCAR is not satisfied. The method could be considered acceptable because of its simplicity if the bias and loss of precision are very small. [15] Even when the data satisfies the assumption of missing completely at random there is a loss in power using CC analysis, particularly if a large proportion of subjects is deleted. [16] Complete case analysis is often used but not recommended when dealing with missing data.

Some formulas of CC analysis

$\hat{\theta}_{CC}$ is a complete case estimate of the average. In our practical situation it is an estimate of the average corruption perceptions index from the complete cases. The increase in variance of $\hat{\theta}_{CC}$ compared to $\hat{\theta}_{NM}$, an estimate of the corruption index without missing values (NM = not missing), is given by (3.1):

$\mathrm{Var}(\hat{\theta}_{CC}) = \mathrm{Var}(\hat{\theta}_{NM})\,(1 + \Delta_{CC})$   (3.1)

$\Delta_{CC}$ is the proportional increase in the variance coming from the loss of information when the incomplete cases are discarded. The overall mean is given by:

$\theta = \pi_{CC}\,\theta_{CC} + (1 - \pi_{CC})\,\theta_{IC}$   (3.2)

In words, this is the proportion of complete cases ($\pi_{CC}$) times the mean of the complete cases ($\theta_{CC}$) plus the proportion of incomplete cases ($1 - \pi_{CC}$) times the mean of the incomplete cases ($\theta_{IC}$, which is usually not known), (3.2). The bias of the complete case analysis for the mean can be written as:

$\theta_{CC} - \theta = (1 - \pi_{CC})\,(\theta_{CC} - \theta_{IC})$   (3.3)

The bias is zero if the mean of the incomplete cases is the same as the mean of the complete cases (3.3), that is, if the data is missing completely at random. [17]

[11] Pigott, Therese D., p. [...]
[12] Univariate analysis is the procedure when each variable in a data set is explored separately. A univariate statistic is a summary measure for a single variable, for instance the mean or the standard deviation.
[13] Little & Rubin, p. [...]
[14] Howell, David C.
[15] Little & Rubin, p. [...]
[16] Howell, David C.
[17] Little & Rubin, p. [...]
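The decomposition (3.2) and the bias expression (3.3) can be checked numerically. The sketch below is only an illustration in Python: the values of Y and the missingness flags are hypothetical, and the values behind the missing entries are treated as known, which is possible in a constructed example but not in practice.

```python
import statistics

# Hypothetical population values of Y and flags marking which would be missing.
y_true = [2.0, 3.5, 4.0, 7.5, 9.0, 1.5, 2.5, 8.0]
is_missing = [False, False, True, False, False, True, True, False]

complete = [v for v, m in zip(y_true, is_missing) if not m]
incomplete = [v for v, m in zip(y_true, is_missing) if m]

pi_cc = len(complete) / len(y_true)       # proportion of complete cases
theta_cc = statistics.fmean(complete)     # mean of the complete cases
theta_ic = statistics.fmean(incomplete)   # mean of the incomplete cases (unknown in practice)
theta = statistics.fmean(y_true)          # overall mean

overall_from_parts = pi_cc * theta_cc + (1 - pi_cc) * theta_ic   # right-hand side of (3.2)
bias_direct = theta_cc - theta                                    # left-hand side of (3.3)
bias_formula = (1 - pi_cc) * (theta_cc - theta_ic)                # right-hand side of (3.3)

print(overall_from_parts, theta)
print(bias_direct, bias_formula)
```

In this constructed example the complete cases have a mean of 6.0 and the incomplete cases a mean of about 2.67, so the complete case mean overestimates the overall mean of 4.75 by 1.25, which is what both sides of (3.3) give.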

3.2 Single imputation (SI)

The second approach for treating missing data (in this thesis) is the single imputation method. As mentioned earlier this method is one of the most widely used. Single imputation can be performed in many ways and is the umbrella term for methods where one value is imputed for each missing value. Imputations can be either draws or means from a predictive distribution of the missing values. In this thesis D is the number of imputed data sets, and single imputation is the special case D=1.

There are many ways to fill in values for missing observations. One simple way is to take the mean of the observed cases of the variable of interest and impute this unconditional mean for each missing value. After imputing all values a complete data set is created. An example of this simple single imputation procedure is presented in Table 3.2 below (the same data is used as in the complete case analysis example above and the procedure is illustrated for the income variable).

Table 3.2: Unconditional mean imputation on monthly income.
(- = missing value; [...] = value not recoverable from the source)

Subject  Age    Gender  Education    Monthly income (in SEK)
1        32     -       High school  [...]
2        [...]  Male    University   [...]
3        -      Male    University   [...]
4        [...]  Male    High school  [...]
5        [...]  -       -            [...]
6        [...]  Female  High school  The imputed value: [...]
7        33     Female  High school  [...]
8        [...]  Female  University   [...]

The imputed value of [...] SEK is the unconditional mean, which is the sum of all the observed values divided by the number of observations with no missing value on monthly income. [18] The biggest problem with this method is that it produces biased estimates of the mean unless the data is MCAR. With this simple mean imputation the variance of the variable of interest is also underestimated. Imputing the mean for the missing values does not account for the variation that would probably exist if the missing values were actually observed, because the actual observations are probably not exactly equal to the mean.
Another reason for this underestimation is that the imputation procedure increases the number of observations in the data set. The increased sample size leads to a smaller standard error, which does not reflect the actual uncertainty in the data. [19]

Another method is conditional mean imputation, which can be done in several ways. This method imputes the conditional means given the observed values. One example of conditional mean imputation is regression imputation. In this method missing values are replaced by predicted values from a regression analysis. A simple example would be to estimate the simple linear regression model, with age as the only explanatory variable, based on the subjects with values on both monthly income and age (subjects 1, 2, 4, 5, 7 and 8), given by (3.4). This is done in the statistical software program R and the R code is attached in Appendix 2: R codes.

$\text{Monthly income} = \alpha + \beta \cdot \text{Age}$ = [...] + [...] · Age   (3.4)

After the estimation a prediction is made (for a person at age 28): Monthly income(28) = [...] SEK. In this procedure, by conditioning on X (Age), the bias from the unconditional case can be reduced. The variance is still underestimated, since it is unreasonable to assume that the values lie exactly on the regression line.

There are other imputation methods that are not treated in this paper. Hot deck imputation substitutes missing values with values from observations in the sample with similar characteristics. Substitution replaces missing observations with similar observations not originally included in the sample. Cold deck imputation is similar to hot deck imputation, but the replacement values come from another source (for instance from the same survey from last year).

Conditional mean imputation corrects bias and is preferred to unconditional mean imputation, but it still underestimates the variance. The recommended method is a procedure of conditional draws, which is used in the practical part of this paper for single imputation. Conditional draws are the recommended single imputation method under both the MCAR and the MAR assumption. The formula for imputing a conditional draw with the regression approach is:

$y_{ik} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + z_{ik}$   (3.5)

The last term is a random normal deviate with a mean of zero.

[18] Calculation of the imputed value from unconditional mean imputation: ([...])/7 = [...]/7 = [...]
[19] Pigott, Therese D., p. [...]
The inclusion of the error term is important: it is what makes the single imputation a draw from the predictive distribution of the missing values instead of the conditional mean (3.5). Even though such a method in general is an improvement over conditional mean imputation, there are still some disadvantages. One is that the random draws result in an efficiency loss. Another is that standard errors of the parameter estimates from the imputed data are systematically too small. This uncertainty issue can be handled by changing to multiple imputation. [20] A Bayesian approach to this method, where the regression parameters are drawn from a distribution, could be especially preferable, because the uncertainty from imputation is not fully accounted for if the parameters of the estimated model are assumed to be fixed. This Bayesian approach is used for multiple imputation in our empirical study.

[20] Little & Rubin, p.3-4, 62-66, 72
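The effect of the three single imputation variants on the variance can be illustrated with a small simulation. The Python sketch below is a hedged illustration, not the thesis implementation (which is done in SAS and R): the data are simulated, only one explanatory variable is used instead of the two in (3.5), missingness is MCAR for simplicity, and names such as impute and residual_sd are introduced for this example.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]   # true variance of y is 5
missing = [rng.random() < 0.3 for _ in range(n)]     # about 30 % of y set missing

x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Least-squares fit of y on x using the complete cases only.
mean_x = statistics.fmean(x_obs)
mean_y = statistics.fmean(y_obs)
sxx = sum((xi - mean_x) ** 2 for xi in x_obs)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x_obs, y_obs))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
residual_sd = math.sqrt(
    sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x_obs, y_obs)) / (len(x_obs) - 2)
)


def impute(method):
    """Return a completed copy of y with missing values filled in by `method`."""
    completed = []
    for xi, yi, m in zip(x, y, missing):
        if not m:
            completed.append(yi)
        elif method == "unconditional mean":
            completed.append(mean_y)
        elif method == "conditional mean":
            completed.append(b0 + b1 * xi)
        elif method == "conditional draw":
            # Prediction plus a normal deviate with mean zero, as in (3.5).
            completed.append(b0 + b1 * xi + rng.gauss(0, residual_sd))
        else:
            raise ValueError(f"unknown method: {method}")
    return completed


variances = {
    method: statistics.variance(impute(method))
    for method in ("unconditional mean", "conditional mean", "conditional draw")
}
for method, value in variances.items():
    print(f"{method}: variance of completed y = {value:.2f}")
```

In this setup the unconditional mean should give the smallest variance of the completed variable, the conditional mean a larger one, and the conditional draw the one closest to the variance of the fully observed data.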

In our empirical case, because the whole population is imputed, the estimated standard error from single imputation is zero, since the finite population correction (FPC) is zero. [21]

3.3 Multiple imputation (MI)

The third approach for handling missing data is the multiple imputation (MI) method. The assumption for MI and other imputation methods is that the data is at least MAR. It is also a good idea to check the relationship between the variable with missing values and the other variables in the model before applying any imputation method. In multiple imputation the missing values are filled in D times, which gives D complete data sets; each missing value is replaced by a vector of D ≥ 2 imputed values (in single imputation D=1). [22] The usual multiple imputation procedure suggested by Rubin (1977) is done in several steps. First, imputation is made using a suitable model that takes the random variation into account. Following our notation above, this is done D times, creating D complete data sets. According to Rubin's original suggestion it is often sufficient to do this 3-5 times (but more is better). After values have been imputed, each complete data set is analyzed. An average of the variable of interest (y) is calculated for the data set at hand:

$\hat{\theta}_d = \frac{1}{n}\sum_{i=1}^{n} y_i$   (3.6)

In (3.6) n is the sample size in each imputed data set. The mean is calculated for each imputed data set, giving $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_D$. Finally the averages from the D data sets are combined into one single point estimate. [23] The overall mean is calculated by adding the means $\hat{\theta}_1 + \hat{\theta}_2 + \cdots + \hat{\theta}_D$ and dividing by D (3.7). [24]

$\hat{\theta}_{overall} = \frac{1}{D}\sum_{d=1}^{D}\hat{\theta}_d = \frac{\hat{\theta}_1 + \hat{\theta}_2 + \cdots + \hat{\theta}_D}{D}$   (3.7)

In the practical/empirical part two multiple imputation methods are implemented on the data set: one frequentist approach and one Bayesian approach.
The reason for using two multiple imputation approaches is to see whether there are any differences in the results. The frequentist method imputes missing values by a regression analysis approach.

[21] The finite population correction is explained in the section Variance formula.
[22] Little & Rubin, p. [...]
[23] Allison, Paul D., Multiple Imputation for Missing Data: A Cautionary Tale, p.4
[24] Little & Rubin, p. [...]

Missing values are replaced using random draws around the fitted linear regression line. The Bayesian imputation method is quite similar, but uses a Bayesian linear regression approach. [25]

Nowadays the suggested number of imputations is higher than 3-5, according to Allison. In his article from November 2012 Allison describes what different authors suggest regarding the number of imputations. If 27 % of the cases in the data set have missing values on at least one variable, approximately 30 imputations are recommended. Another suggestion presented in the article is that 20 imputations should be made if 10 % to 30 % of the observations have missing values. [26] In the data set used in the practical section 28 values (21.7 %) are missing. Following these recommendations, 30 imputations are used for the multiple imputation methods in the practical/empirical part.

3.3.1 Variance formula

The total variance formula used for multiple imputation in this thesis is:

$\mathrm{Var}(\theta \mid Y_{obs}) \approx \bar{V} + (1 + D^{-1})\,B$   (3.8)

$\bar{V}$ is the within-imputation variance: the sum of the variances from each data set divided by D. The within variance is simply the usual single imputation variance estimator, the only difference being that there are now D data sets over which an average is taken. B is the between-imputation variance, and the between and within variances add up to the total variance (3.8). [27]

The finite population correction (FPC) is used when the population is finite. In this paper the whole population size (129 countries) is equal to the sample size (N=n). That is, the whole population is imputed. The FPC cannot be ignored, as it could have been if N were (much) larger than n. The stepwise calculation of the variance is presented below. The within variance ($\bar{V}$) is zero due to the finite population correction (FPC). [28] In our case FPC=0, as given by (3.9): [29]

$\mathrm{FPC} = \frac{N - n}{N - 1} = \frac{N - N}{N - 1} = 0$   (3.9)

[25] The Comprehensive R Archive Network, p.61,63
[26] Allison, Paul, Why You Probably Need More Imputations Than You Think
[27] Little & Rubin, p. [...]
[28] Starsinic, Michael
[29] West Chester University

The within variance $\bar{V}$ is calculated below (3.10), where $V_d$ is the estimated within variance of imputed data set d and D = 30:

$\bar{V} = \frac{1}{D}\sum_{d=1}^{D} V_d \cdot \mathrm{FPC} = \frac{1}{D}\sum_{d=1}^{D} V_d \cdot 0 = 0$   (3.10)

The only variance left in equation (3.8) is the between variance, B. The total variance of multiple imputation is then given by (3.11):

$\mathrm{Var}(\theta \mid Y_{obs}) \approx 0 + (1 + 30^{-1})\,B = (1 + 30^{-1})\,B$   (3.11)

3.3.2 Multiple imputation vs single imputation

Multiple imputation has the same advantages as single imputation, for instance the possibility of using standard complete-data methods. A problem with single imputation is that the user may be tricked into believing that the single imputed value is true: the uncertainty is not considered, and the missing value could in fact be an outlier with a very high or very low value. Multiple imputation takes the uncertainty into account, which is a considerable advantage compared to single imputation. The disadvantage of multiple imputation compared to single imputation is that it takes more time and effort to make the imputations and analyze the results. This drawback of MI is not very important given the computer programs available today. [30]

3.3.3 Differences between a frequentist and Bayesian approach

One difference is that the frequentist approach treats the data as a repeatable random sample and the parameters as fixed, while the Bayesian approach treats the parameters as unknown and the data as fixed. There are also other differences; for example, the frequentist method uses a sampling distribution of the data, while the Bayesian method assumes a prior distribution before the data have been seen, based on previous studies. [31]

3.3.4 Estimates of imputation uncertainty for missing values

There are several ways to account for the additional variance (uncertainty) caused by nonresponse (the missing values). One way is described above (for multiple imputation).
There are also other ways to estimate the additional uncertainty caused by missing values, but they will not be presented here since they are not used in this thesis. For the interested reader, some of these approaches are presented in the book Statistical Analysis with Missing Data (see footnote). 32

30 Little & Rubin, p.
31 Casella, George
32 Little & Rubin, p.
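The variance calculation in (3.9) to (3.11) can be sketched in a few lines. The example below is written in Python purely for illustration (the thesis itself works in SAS and R). The function names and the numbers are hypothetical, and the between variance is assumed to use the usual divisor D - 1, which is not restated in this section.

```python
from statistics import mean, variance


def fpc(N: int, n: int) -> float:
    """Finite population correction as in (3.9): (N - n) / (N - 1)."""
    return (N - n) / (N - 1)


def total_variance(estimates, within_variances, N, n):
    """Combine D imputed-data results, applying the FPC to the within part."""
    D = len(estimates)
    v_bar = mean(within_variances) * fpc(N, n)  # within variance, cf. (3.10)
    b = variance(estimates)                     # between variance B (assumed divisor D - 1)
    return v_bar + (1 + 1 / D) * b              # total variance, cf. (3.11)


# Hypothetical estimates and within variances, for illustration only
# (these are not results from the thesis).
theta_hats = [4.25, 4.31, 4.28, 4.22, 4.29]
v_hats = [0.040, 0.042, 0.039, 0.041, 0.040]

between = variance(theta_hats)
total = total_variance(theta_hats, v_hats, N=129, n=129)
print(fpc(129, 129), between, total)
```

With n = N the within part vanishes, so the total variance reduces to (1 + 1/D) times the between variance, which is the point made by (3.11).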

4 Description of the data set

In this section the data set used in the practical part (section 5) is presented. All our data are on country level with a total of 129 countries, which are treated as all countries in the world (the total population), and the year of interest is 2002. The data are assumed to be missing at random. In a simulation study, performed in the practical part, nonresponse will be randomly created using the estimated nonresponse mechanism (=phat). 33 The simulation procedure is explained further in Section 5: Practical part in SAS and R.

The variable corruption is the dependent variable. It is an index of perceived corruption, collected from the home page of Transparency International. This anti-corruption organization ranks countries and territories based on the level of corruption in a country's public sector. The index is a measure of abuse of power, secret dealings and bribery in the world. It is a score on a scale from 0 to 10: countries with a value close to 0 are highly corrupt, and a value close to 10 means that the country is very clean.

In the study of corruption in 2002 there were 102 countries with observed values and 28 countries with missing observations 34 while in the following year, 2003, there were 133 countries with observed values. 35 Of these 133 countries Palestine had missing values on both GDP/capita and civil liberties in 2002. This is probably because the status of Palestine is controversial. Values on GDP/capita were missing for both Cuba and Iraq (a country in insecurity, close to war at the time). 36 The value on civil liberties is missing for Hong Kong. 37 These four countries were deleted from the analysis, and the 129 countries that remain are included in the study of average global corruption.
28 of these countries have missing values on the corruption perceptions index (101 countries have observed values). Both explanatory variables, GDP/capita and civil liberties, are fully observed.

Seven out of ten countries have a corruption index under 5, which indicates that the majority of the countries in the world are corrupt. The countries with a corruption index above 9.0 for 2002 are Finland, Sweden, Singapore, New Zealand, Iceland and Denmark, and the countries at the bottom of the ranking, scoring under 2, are Angola, Bangladesh, Indonesia, Kenya, Madagascar, Nigeria and Paraguay. 38 Negative side effects of corruption are the undermining of democratic institutions, economic slowdown and government instability. 39

Not surprisingly, the corruption scores for 2002 and 2003 are very similar for most countries, since corruption perceptions usually do not change much between two years. A few countries have had notable changes in corruption. One of the countries with the biggest changes is Botswana, which had a corruption score of 6.4 in 2002 and 5.7 in 2003: a decrease in the index of 0.7 units and a corresponding increase in corruption. Other countries that became more corrupt from 2002 to 2003 are Namibia, Ethiopia and Haiti. Madagascar is the country with the biggest positive change in corruption perceptions, from 1.7 in 2002 to 2.6 in the following year. As mentioned, most countries have, as expected, almost the same corruption index in both years; Sweden and Finland are two examples.

33 phat is the probability of missing data on the corruption index given values on the explanatory variables civil liberties and GDP
34 Transparency International
35 Transparency International
36 International Monetary Fund
37 Freedom House, Data
38 Transparency International, 2002 (follow the press release link)
39 UNODC

GDP per capita, current prices (in U.S. dollars), from 2002 is the gross domestic product divided by the midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. GDP is unfortunately not a measure of personal income. 40 The countries with the highest GDP/capita are Luxembourg, Norway and Switzerland, while Ethiopia, Myanmar and Tajikistan are at the bottom.

Civil liberties is a variable indicating the degree of liberty (freedom) in a country. Freedom in a country can be measured, for instance, by an index of political rights or an index of civil liberties from the independent organization Freedom House; in this thesis the variable chosen is civil liberties. The main reason is that the political rights index overlaps with growth in real GDP per capita and political corruption, which are variables already included in our data set. The civil liberties index is a measure of freedom in the world and covers freedom of expression, assembly, association, education and religion, the allowance of free economic activity, and whether men, women and minority groups are equal and have the same opportunities.
Civil liberties is measured on a scale from 1 to 7, where 1 indicates the highest degree of liberty. 41

Completecorruption02 is a new variable that has been created. The corruption index of 2002 has 28 missing values, while the corruption index of 2003 is fully observed. This property of the data is used to create the variable completecorruption02, which will be considered the (true) answer sheet. The answer sheet is created as follows:

1. First, the 101 observed values of corruption for 2002 are included in the answer sheet.
2. Then the 28 missing values in the corruption data of 2002 are replaced with the corresponding values from the corruption data for 2003.

R, the nonresponse indicator for corruption in 2002, is created as a binary 1/0 dummy variable 42. R has the value 1 if the corresponding country is missing a value on the corruption index, and R=0 indicates that the value on corruption in 2002 is observed. As mentioned before, 28 values are missing for corruption. This means that there are 28 observations (21.7 %) where R=1 and 101 observations (78.3 %) where R=0. The data set with all variables can be found in Appendix 1: SAS codes.

40 The World Bank
41 Freedom House, Methodology
42 Note that both the variable R and the statistical software program R have the same name. Do not get confused!

Table 4.1: Correlations between the variables (Pearson correlation coefficients, Prob > |r| under H0: Rho=0, and number of observations for Corruption02, Liberties, GDP and Corruption03).

Table 4.1 shows the Pearson correlation coefficients (r), which measure the degree of linear relationship between the variables and lie in the range -1 ≤ r ≤ 1. When r is close to (or equal to) -1 the correlation is strongly negative, and when r is close to (or equal to) 1 the correlation is strongly positive. The correlation between corruption in 2002 and corruption in 2003 is 0.992, indicating a significant (p-value below 0.001) strong positive linear relationship. Because of the strong relationship between the corruption indexes in the two years, it is reasonable to use the observed values from 2003 (for the countries missing in 2002) as part of the answer sheet. The negative relationship between corruption and civil liberties, r = -0.68, is significant. Note that the relationship is negative only because a high value of the corruption index indicates that the country is less corrupt, while a high value of civil liberties indicates that a country is less free. The same interpretation applies to the correlation between corruption in 2003 and civil liberties. The correlation between GDP/capita and corruption in 2002 is significant with a value of 0.83; the corresponding correlation for 2003 is reported in Table 4.1. These results are reasonable since corruption is almost the same in both years (the corruption perceptions index is usually quite constant or changes very slowly over time). The relationships between the variables are shown visually in Figure 4.1 below.
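The construction of the answer sheet and of the nonresponse indicator R, together with a Pearson correlation of the kind reported in Table 4.1, can be sketched as follows. This is a minimal illustration in Python (the thesis itself uses SAS), and the six scores are made-up toy values, not the thesis data.

```python
from math import sqrt

# Hypothetical toy scores; None marks a missing 2002 value.
corruption02 = [9.7, None, 2.4, 1.2, None, 4.0]
corruption03 = [9.7, 2.6, 2.5, 1.3, 3.0, 3.9]

# Nonresponse indicator: R = 1 if the 2002 value is missing, else 0.
R = [1 if y02 is None else 0 for y02 in corruption02]

# Answer sheet: keep observed 2002 values, fill missing ones with the 2003 value.
completecorruption02 = [
    y03 if y02 is None else y02
    for y02, y03 in zip(corruption02, corruption03)
]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# The correlation between the two years uses only pairs observed in both years.
pairs = [(a, b) for a, b in zip(corruption02, corruption03) if a is not None]
r = pearson_r([a for a, _ in pairs], [b for _, b in pairs])
print(R, completecorruption02, round(r, 3))
```

Using only the pairs observed in both years mirrors how a correlation between Corruption02 and Corruption03 has to be computed when the 2002 variable has missing values.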

Figure 4.1: The relationships between the variables Corruption02, GDP/capita, Liberties and Corruption03.

Table 4.2: Description of the variables (N, Mean, Variance, Std Dev, Minimum and Maximum for Liberties, GDP, Corruption02, Corruption03, R and Completecorruption02).

In the descriptive Table 4.2 the variables civil liberties, GDP/capita, corruption in 2003, R and completecorruption02 are fully observed, while the corruption index of 2002 has only 101 observations. The estimates of the corruption means, variances, standard deviations etcetera are quite similar in both years. The corruption index in 2002 has 28 missing values and an average of 4.52 over the observed values (given in Table 4.2). Interestingly, filling in the missing values lowers the average corruption index by 0.25 units, which corresponds to an increase in perceived average global corruption. The average corruption index of the 28 new countries in the study is approximately 3.35. 43 That is, the countries with missing values in 2002 are considerably more corrupt. The variable completecorruption02 has a sample size of 129 with no missing values, since the missing values have been replaced from the fully observed corruption index for 2003. Its mean is given in Table 4.2.

To be able to compare the change in corruption between 2002 and 2003, only the 101 countries with values for both years are included. The average corruption in 2003 for these 101 countries is obtained in SAS by creating a new variable called Corruption03reduced; the SAS codes for this are not attached in the appendix. Table 4.3 shows this average. The conclusion is that the corruption index for the 101 countries has decreased by approximately 0.06 units: the world has become a little more corrupt (as expected, a very small change).
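As a rough consistency check of the figures quoted in the text, the mean of the completed variable can be recomputed as a weighted combination of the 101 observed values and the 28 filled-in values. The sketch below uses only numbers stated in the text (the observed mean 4.52 and the total 93.7 from footnote 43); because the inputs are rounded, the result is approximate, and the variable names are chosen here for illustration.

```python
n_obs, mean_obs = 101, 4.52   # observed 2002 values and their mean (from the text)
n_mis, sum_mis = 28, 93.7     # filled-in countries and their 2003 total (footnote 43)

mean_mis = sum_mis / n_mis                                      # mean of the 28 filled-in values
mean_complete = (n_obs * mean_obs + sum_mis) / (n_obs + n_mis)  # mean over all 129 countries
drop = mean_obs - mean_complete                                 # effect of filling in the 28 values

print(round(mean_mis, 2), round(mean_complete, 2), round(drop, 2))
# prints: 3.35 4.27 0.25
```

The drop of about 0.25 units agrees with the decrease reported above, which supports reading the 28 countries with missing 2002 values as noticeably more corrupt than the 101 observed ones.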
Table 4.3: Average corruption in 2003 for the 101 countries with available corruption data in 2002 (analysis variable Corruption03reduced: N, Mean, Variance, Std Dev, Minimum and Maximum).

According to Table 4.2, Bangladesh has the minimum value and is thus the most corrupt country both in 2002 (with a score of 1.2) and in 2003 (with a score of 1.3). Finland is the least corrupt, with a score of 9.7 in both years.

43 The value of 3.35 is calculated from the data set, which is available in Appendix 1: SAS codes (countries within brackets): [2.6 (Algeria) + ... (Armenia) + ... (United Arab Emirates) + ... (Yemen)] / 28 = 93.7/28 ≈ 3.35

Civil liberties has a minimum value of 1, representing countries with much freedom such as Australia, Austria, Belgium, Canada, Chile, Denmark, France, Germany, Iceland etcetera, while the maximum value of 7 represents less democratic countries where freedom is restricted. There are five countries with a civil liberties index of 7: Libya, Myanmar, Saudi Arabia, Sudan and Syria. Ethiopia is the country with the lowest GDP per capita in 2002, while Luxembourg is at the top of the list.

5 Practical/empirical part in software programs SAS and R

All data used are on country level with 129 countries. These countries are treated as if they were all countries in the world, and it is this population the paper makes statements about, since the intention is to discuss corruption on a global level.

Population imputation (mass imputation) is the term used when a large data set with many variables with missing values is subject to imputation. The term can also be used in our case, where there is missingness in a large data set and the sample size (n) is equal to the entire population size (N). Large data sets with missingness can be problematic and are not always easy to manage. Population imputation tries to correct the nonresponse problem by filling in large blocks of missingness in the data set. The assumption of at least MAR has to be fulfilled. The imputation process is then repeated D>1 times, and it is an approximation of the multiple imputation posterior distribution 44 from a frequentist or Bayesian procedure. 45 A positive aspect of population imputation is that the bias can be reduced. 46

A simulation study will be performed on our data set. First, the nonresponse mechanism (phat) will be estimated, that is, the probability of missing data on the corruption index of 2002 given values on liberties and GDP.
Then data sets, with both missing and observed values on the variable completecorruption02, will be simulated using this nonresponse mechanism.

The data set was first imported into the statistical program Statistical Analysis System (SAS). The version used in this thesis is SAS 9.3. All SAS codes used in the analysis are presented in Appendix 1: SAS codes.

44 Pettersson, Nicklas, Multiple Kernel Imputation - A Locally Balanced Real Donor Method, p.
45 Rässler, Susanne, p.
46 Black, Stephen; Creel, Darryl & Krotki, Karol

The logistic procedure was used to model the probability of a missing value on corruption in 2002 as a function of GDP/capita and civil liberties. This probability is named phat in the SAS codes and is the so-called nonresponse mechanism. How the nonresponse mechanism is calculated is shown below. The left side of equation (5.1) is the logit of the probability of nonresponse and the right side is the estimated logistic regression model. 47

$$\operatorname{logit}(\text{phat}) = \ln\frac{\text{phat}}{1-\text{phat}} = \hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties}$$

$$\frac{\text{phat}}{1-\text{phat}} = e^{\hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties}}$$

$$\text{phat} = P(R = 1 \mid \text{GDP}, \text{Liberties}) = \frac{e^{\hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties}}}{1 + e^{\hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties}}} \qquad (5.1)$$

Table 5.1: Output from the logistic procedure in SAS, which shows the estimates of the parameters in the nonresponse mechanism (Analysis of Maximum Likelihood Estimates: DF, Estimate, Standard Error, Wald Chi-Square and Pr > ChiSq for Intercept, GDP and Liberties; Pr > ChiSq is below .0001 for Intercept and Liberties).

Table 5.1 shows that the parameter estimate for Liberties ($\hat\beta_2$) is statistically significant. The conclusion is that missingness of corruption in 2002 seems to have a (strong) relationship with the degree of liberty in the country. The parameter estimate for GDP ($\hat\beta_1$) is far from statistically significant, with a high p-value, suggesting that GDP could be removed from the model. GDP was nevertheless kept as part of the nonresponse mechanism, because of the strong relationship between corruption and GDP: keeping GDP in the model can reduce the variance. The nonresponse mechanism (phat), following from equation (5.1), can be written as:

$$\text{phat} = P(R = 1 \mid \text{GDP}, \text{Liberties}) = \frac{e^{(\hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties})}}{1 + e^{(\hat\beta_0 + \hat\beta_1\,\text{GDP} + \hat\beta_2\,\text{Liberties})}}$$

with the estimates from Table 5.1 inserted for the coefficients.

47 Carnegie Mellon University
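The inverse-logit step in (5.1) can be illustrated with a short function. This is a Python sketch for illustration only (the thesis fits the model with the logistic procedure in SAS). The coefficient values and the GDP figure are hypothetical placeholders, not the estimates of Table 5.1; the GDP coefficient is set to zero here simply to keep the toy example readable.

```python
from math import exp


def phat(gdp, liberties, b0, b1, b2):
    """Probability of nonresponse: inverse logit of b0 + b1*GDP + b2*Liberties, cf. (5.1)."""
    eta = b0 + b1 * gdp + b2 * liberties
    return exp(eta) / (1 + exp(eta))


# Hypothetical coefficients, chosen only so that less free countries
# (higher Liberties score) get a higher probability of a missing value.
B0, B1, B2 = -4.0, 0.0, 0.6

p_free = phat(gdp=4000.0, liberties=1, b0=B0, b1=B1, b2=B2)    # most free
p_unfree = phat(gdp=4000.0, liberties=7, b0=B0, b1=B1, b2=B2)  # least free
print(p_free, p_unfree)
```

With a positive coefficient on Liberties, the probability of a missing corruption value rises with the liberties score, which is the qualitative pattern described for the fitted model.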

By inserting the values of GDP and Liberties for each country, a value of phat is obtained for each country; the logistic procedure does this in SAS.

Table 5.2: Presentation of phat for six countries (columns Obs, Country, R, GDP, Liberties and phat; rows Uruguay, Chile, Peru, Ecuador, Libya and Saudi Arabia).

Table 5.2 shows values of phat, the probability of having missing data on corruption given values on GDP/capita and the civil liberties index. For illustration purposes, only the two countries with the lowest, the two countries representing the median and the two countries with the highest probability of being missing are presented. Uruguay has a value of 1 on the liberties variable. It is the country with the lowest probability of having a missing value on the corruption index and, as the R variable shows, corruption for Uruguay is observed in the actual data. The table shows that Saudi Arabia, according to the missing data mechanism, is the country with the highest probability of a missing value. The less free a country is, the higher is the probability of having a missing value on corruption. Peru and Ecuador are the countries with the median probability of missingness (13.1 percent).

Table 5.3: Description of the variable phat (analysis variable phat, estimated probability: N, Mean, Variance, Std Dev, Median, Minimum and Maximum).

Table 5.3 presents some descriptive statistics for phat. In the original data 21.7 percent of the observations were missing for corruption in 2002, and therefore the mean of phat is 0.217.

The next step is to implement the missing data methods on the data set. The original idea was to do this only once on our data set, but such an approach could not lead to any strong conclusions. A much better approach is a simulation study. How this simulation is done is explained below.
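The idea of creating nonresponse at random from phat can be sketched as below: in each simulated data set a country is set to missing with probability phat, and the complete-case mean is computed from the remaining values. This is a Python illustration under stated assumptions, not the thesis code. The answer-sheet values, the probabilities, the number of simulations (1000) and the function name are all hypothetical.

```python
import random


def simulate_complete_case_means(y, phat, n_sim, seed=0):
    """Draw R_i ~ Bernoulli(phat_i) n_sim times and return the complete-case means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sim):
        # A value stays observed (R = 0) with probability 1 - phat_i.
        observed = [yi for yi, p in zip(y, phat) if rng.random() >= p]
        if observed:  # skip the (very unlikely) draw with no observed values
            means.append(sum(observed) / len(observed))
    return means


# Hypothetical answer sheet and missingness probabilities (not the thesis data):
# low scores (more corrupt countries) are more likely to be missing.
y = [9.5, 8.7, 7.1, 4.5, 3.9, 3.0, 2.6, 2.2, 1.7, 1.2]
phat = [0.02, 0.03, 0.05, 0.13, 0.13, 0.30, 0.35, 0.40, 0.45, 0.50]

cc_means = simulate_complete_case_means(y, phat, n_sim=1000)
true_mean = sum(y) / len(y)
avg_cc = sum(cc_means) / len(cc_means)
print(true_mean, avg_cc)
```

In this toy setting the complete-case mean tends to lie above the mean of the full answer sheet, because the low-scoring units are the ones most often removed; this is the kind of bias the simulation study is meant to expose.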
The idea was to do the simulation in SAS, but because of difficulties with the simulation, or more specifically with the loop, and after correspondence with our supervisor 48, the decision was to use the MICE package in the statistical software program R. The version

48 Pettersson, Nicklas, Correspondence


Nuts and Bolts Research Methods Symposium Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Topics to Discuss: Types of Variables Constructing a Variable Code Book Developing Excel Spreadsheets

More information

Chapter 5. School Resources for Teaching Mathematics

Chapter 5. School Resources for Teaching Mathematics Chapter 5 School Resources for Teaching Mathematics The most successful schools tend to have students that are relatively economically affluent, speak the language of instruction, and begin school with

More information

Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Spoka Meet Audio Calls Rates Dial-In UK

Spoka Meet Audio Calls Rates Dial-In UK Spoka Meet Audio Calls Rates Dial-In UK Country Toll/Toll Free Landline/Mobile GBP Argentina Toll Landline 0 Australia Toll Landline 0 Austria Toll Landline 0 Bahrain Toll Landline 0 Belgium Toll Landline

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis Applied Mathematical Sciences, Vol. 5, 2011, no. 57, 2807-2818 Simulation Study: Introduction of Imputation Methods for Missing Data in Longitudinal Analysis Michikazu Nakai Innovation Center for Medical

More information

1 Downloading files and accessing SAS. 2 Sorting, scatterplots, correlation and regression

1 Downloading files and accessing SAS. 2 Sorting, scatterplots, correlation and regression Statistical Methods and Computing, 22S:30/105 Instructor: Cowles Lab 2 Feb. 6, 2015 1 Downloading files and accessing SAS. We will be using the billion.dat dataset again today, as well as the OECD dataset

More information

What Are the Background Characteristics of Mathematics Teachers?

What Are the Background Characteristics of Mathematics Teachers? Chapter 6 Teachers of To help place students mathematics achievement in the context of their school and classroom situations, the mathematics teachers of the students tested were asked to complete questionnaires

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) *

Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) * OpenStax-CNX module: m39305 1 Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) * Free High School Science Texts Project This work is produced by OpenStax-CNX

More information

UAE and the NRI A brief introduction. December 2016

UAE and the NRI A brief introduction. December 2016 UAE and the NRI A brief introduction December 2016 UAE Vision 2021 We aim to make the UAE among the best countries in the world by the Golden Jubilee of the Union. 1 UAE Vision 2021 Gov entities working

More information

Digital Opportunity Index. Michael Minges Telecommunications Management Group, Inc.

Digital Opportunity Index. Michael Minges Telecommunications Management Group, Inc. Digital Opportunity Index Michael Minges Telecommunications Management Group, Inc. Digital Opportunity Index (DOI) Why How Preliminary results Conclusions WSIS Plan of Action E. Follow-up and evaluation

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Exploring Econometric Model Selection Using Sensitivity Analysis

Exploring Econometric Model Selection Using Sensitivity Analysis Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover

More information

- 1 - Fig. A5.1 Missing value analysis dialog box

- 1 - Fig. A5.1 Missing value analysis dialog box WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation

More information

Chapter 6. Teachers of Science

Chapter 6. Teachers of Science Chapter 6 Teachers of Science Since the teacher is central in creating a classroom environment that supports learning science, Chapter 6 presents information about the preparation and background of science

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

PLEASE NOTE: firms may submit one set of research questionnaires covering both China and Hong Kong or separate sets for each jurisdiction

PLEASE NOTE: firms may submit one set of research questionnaires covering both China and Hong Kong or separate sets for each jurisdiction Americas Argentina (Banking and finance; Capital markets; M&A; Project development) Bahamas (Financial and corporate) Barbados (Financial and corporate) Bermuda (Financial and corporate) Bolivia (Financial

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Telephone Survey Response: Effects of Cell Phones in Landline Households

Telephone Survey Response: Effects of Cell Phones in Landline Households Telephone Survey Response: Effects of Cell Phones in Landline Households Dennis Lambries* ¹, Michael Link², Robert Oldendick 1 ¹University of South Carolina, ²Centers for Disease Control and Prevention

More information

Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering. Soo Ling Lim

Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering. Soo Ling Lim Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering Soo Ling Lim Analysis of app store data reveals what users do in the app store. We want to know why

More information

A STOCHASTIC METHOD FOR ESTIMATING IMPUTATION ACCURACY

A STOCHASTIC METHOD FOR ESTIMATING IMPUTATION ACCURACY A STOCHASTIC METHOD FOR ESTIMATING IMPUTATION ACCURACY Norman Solomon School of Computing and Technology University of Sunderland A thesis submitted in partial fulfilment of the requirements of the University

More information

Econometrics I: OLS. Dean Fantazzini. Dipartimento di Economia Politica e Metodi Quantitativi. University of Pavia

Econometrics I: OLS. Dean Fantazzini. Dipartimento di Economia Politica e Metodi Quantitativi. University of Pavia Dipartimento di Economia Politica e Metodi Quantitativi University of Pavia Overview of the Lecture 1 st EViews Session I: Convergence in the Solow Model 2 Overview of the Lecture 1 st EViews Session I:

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Flash Eurobarometer 468. Report. The end of roaming charges one year later

Flash Eurobarometer 468. Report. The end of roaming charges one year later The end of roaming charges one year later Survey requested by the European Commission, Directorate-General for Communications Networks, Content & Technology and co-ordinated by the Directorate-General

More information

EE Pay Monthly Add-Ons & Commitment Packs. Version

EE Pay Monthly Add-Ons & Commitment Packs. Version EE Pay Monthly Add-Ons & Commitment Packs Version 1A Available from 28 October 2015 1 COMMITMENT PACKS In addition to the allowances included in our Standard and EE Extra plans for both Pay Monthly handset

More information

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 6 SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

Univariate descriptives

Univariate descriptives Univariate descriptives Johan A. Elkink University College Dublin 18 September 2014 18 September 2014 1 / Outline 1 Graphs for categorical variables 2 Graphs for scale variables 3 Frequency tables 4 Central

More information

Week 4: Simple Linear Regression II

Week 4: Simple Linear Regression II Week 4: Simple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Algebraic properties

More information

Shell Global Helpline - Telephone Numbers

Shell Global Helpline - Telephone Numbers Shell Global Helpline - Telephone Numbers The Shell Global Helpline allows reports to be submitted by either a web-based form at or by utilising one of a number of telephone lines that will connect you

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Exam Review: Ch. 1-3 Answer Section

Exam Review: Ch. 1-3 Answer Section Exam Review: Ch. 1-3 Answer Section MDM 4U0 MULTIPLE CHOICE 1. ANS: A Section 1.6 2. ANS: A Section 1.6 3. ANS: A Section 1.7 4. ANS: A Section 1.7 5. ANS: C Section 2.3 6. ANS: B Section 2.3 7. ANS: D

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Students Backgrounds And Attitudes Toward Mathematics

Students Backgrounds And Attitudes Toward Mathematics Chapter 4 Students Backgrounds And Attitudes Toward Mathematics In describing the educational context in which learning takes place, TIMSS focuses primarily on curricular, instructional, and school resource

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

Introduction to mixed-effects regression for (psycho)linguists

Introduction to mixed-effects regression for (psycho)linguists Introduction to mixed-effects regression for (psycho)linguists Martijn Wieling Department of Humanities Computing, University of Groningen Groningen, April 21, 2015 1 Martijn Wieling Introduction to mixed-effects

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Data Mining. ❷Chapter 2 Basic Statistics. Asso.Prof.Dr. Xiao-dong Zhu. Business School, University of Shanghai for Science & Technology

Data Mining. ❷Chapter 2 Basic Statistics. Asso.Prof.Dr. Xiao-dong Zhu. Business School, University of Shanghai for Science & Technology ❷Chapter 2 Basic Statistics Business School, University of Shanghai for Science & Technology 2016-2017 2nd Semester, Spring2017 Contents of chapter 1 1 recording data using computers 2 3 4 5 6 some famous

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

Organizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013

Organizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Learning Objectives Identify Different Types of Variables Appropriately Naming Variables Constructing

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

Flash Eurobarometer 443. e-privacy

Flash Eurobarometer 443. e-privacy Survey conducted by TNS Political & Social at the request of the European Commission, Directorate-General for Communications Networks, Content & Technology (DG CONNECT) Survey co-ordinated by the European

More information

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors (Section 5.4) What? Consequences of homoskedasticity Implication for computing standard errors What do these two terms

More information

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years. 3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the

More information

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017 Machine Learning in the Wild Dealing with Messy Data Rajmonda S. Caceres SDS 293 Smith College October 30, 2017 Analytical Chain: From Data to Actions Data Collection Data Cleaning/ Preparation Analysis

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

Statistics, Data Analysis & Econometrics

Statistics, Data Analysis & Econometrics ST009 PROC MI as the Basis for a Macro for the Study of Patterns of Missing Data Carl E. Pierchala, National Highway Traffic Safety Administration, Washington ABSTRACT The study of missing data patterns

More information

Collaborative Regulation in the APP Economy

Collaborative Regulation in the APP Economy ITU Regional Economic and Financial Forum of Telecommunications/ICT for Africa Victoria Falls, ZIMBABWE, 30 31 January 2017 Collaborative Regulation in the APP Economy Carmen Prado Wagner Regulatory and

More information