Missing data analysis: - A study of complete case analysis, single imputation and multiple imputation. Filip Lindhfors and Farhana Morko


Bachelor thesis, Department of Statistics
Kandidatuppsats, Statistiska institutionen
Nr 2014:5

Missing data analysis: A study of complete case analysis, single imputation and multiple imputation

Filip Lindhfors and Farhana Morko

Bachelor's degree thesis in Statistics (15 credits), spring term 2014
Supervisor: Nicklas Pettersson

Abstract

Is there a way to solve the problem of missing values in a data set? In this paper four missing data methods are studied and applied to a data set of 129 countries, which are treated as the global population of all the countries in the world. The methods are complete case analysis, single imputation and two multiple imputation methods (one frequentist and one Bayesian approach). The aim is to compare the results of the methods for a mean estimator. All data in this paper are on country level for year 2002 and contain three variables: corruption is the dependent variable with missing values, and GDP per capita and civil liberties are the explanatory variables with complete observations. The reason for choosing a data set with only one variable with missing values is to simplify for the reader. The missingness is assumed to be at least missing at random. To get better and more general results the four methods are investigated in a simulation study. The sample size is equal to the entire population size, which in turn generates a large data set with many missing values. A population imputation procedure is implemented; this procedure is very rarely used. After applying each of these four methods, the resulting mean estimators are compared. The conclusion that can be drawn from the results is that, of the methods studied, the Bayesian multiple imputation method is the best to use for our data set.

Key words: Complete case analysis, single imputation, (frequentist and Bayesian) multiple imputation, population imputation, missing data, simulation study.

Preface

We would like to thank our supervisor Nicklas Pettersson for his guidance throughout our work.

Table of contents

1 Introduction
    Background
    Methods
    Purpose
    Delimitation
    Outline
2 Missing data
    Missing data mechanisms
    Missing completely at random (MCAR)
    Missing at random (MAR)
    Not missing at random (NMAR)
    Missing data mechanism for our data set
    Missing data patterns
3 Theoretical part of missing data methods
    Complete case (CC) analysis
    Single imputation (SI)
    Multiple imputation (MI)
    Variance formula
    Multiple imputation vs single imputation
    Differences between a frequentist and Bayesian approach
    Estimates of imputation uncertainty for missing values
4 Description of the data set
5 Practical/empirical part in software programs SAS and R
6 Results
7 Discussion and conclusion
8 References
Appendix 1: SAS codes
Appendix 2: R codes
Appendix 3: R output

1 Introduction

1.1 Background

Missing data can be difficult to handle, and data can be missing for many different reasons. When we looked at different data on country level there appeared to be a pattern in which countries had missing values: missingness seemed to be more widespread among less developed countries. These countries perhaps do not have the same capacity to store and retrieve statistics as more developed countries. Since handling of missing data has not been covered in our previous statistical courses (75 credits) we found this to be a good topic for this thesis.

There are several different methods to choose between when dealing with missing data. Little and Rubin discuss many different methods in their book Statistical Analysis with Missing Data, but the focus in this paper will be on the following methods: complete case (CC) analysis, single imputation (SI) and multiple imputation (MI). The reason for choosing these methods is that they are widely used (especially the first two).

In complete case analysis (also named listwise deletion) all observations with at least one missing value are deleted, which leaves a complete data set. With a complete data set it is possible to apply ordinary statistical methods. The second method is single imputation, where each missing value is replaced by one imputed [1] value, which also creates a complete data set. The third method is multiple imputation, which creates complete data sets in a similar way as the SI method. The difference is that instead of imputing one value for each missing item, MI creates several values for each missing value. That is, several complete data sets are generated. The estimates from each imputed data set are then combined into one single estimate.

1.2 Methods

This thesis includes both a theoretical part in which the methods are presented and a practical/empirical part where the methods are implemented on a data set using the software programs SAS and R. A population imputation will be performed in the latter part. This is not a very usual approach (since it is often not possible to observe the entire population); this can be seen as our contribution to research. The idea is to compare the results of the methods to find out which method is the best to use. For instance, the bias, mean square error (MSE) and estimated standard error will be calculated for each method. A simulation of missing data will be performed before the methods are implemented. The simulation of missing data will be done times. The methods used are complete case analysis, one single imputation method and two multiple imputation methods (one frequentist and one Bayesian approach).

[1] An imputed value is a value which is filled in to replace a missing value.

The data set used in this paper consists of three variables: one dependent variable (the corruption perceptions index), which has 28 missing values, and two independent variables (GDP/capita and an index of civil liberties), which are fully observed. These variables are explained further in section 4, Description of the data set. GDP/capita and civil liberties were selected as explanatory variables because of a suspected connection with corruption. All data are on country level and a total of 129 countries for year 2002 have been included in the study. These 129 countries are treated as the entire (global) population of all the countries in the world. The focus will be to analyze how to handle the missing values of the variable corruption in the data set.

Corruption was chosen as the variable of interest for a simple reason: in an assignment in a prior statistical course we originally chose the corruption variable, but it contained missing values for several countries. How to handle a data set with missing values was not a part of our earlier courses, so the only solution at that time was to choose another variable to study. This earlier experience of not being able to use corruption as the variable of interest is what made us choose to investigate corruption in this thesis.

1.3 Purpose

This study will try to find a solution to the problem of missing values in a data set. The aim is to investigate and compare the results of four different methods when estimating the mean: complete case analysis, single imputation and two versions of multiple imputation (one frequentist and one Bayesian approach). Which of these methods is preferred when dealing with nonresponse (missing values) in the data set?

1.4 Delimitation

There are some delimitations of this thesis. The first is that the empirical results will be based on only one data set. If other data sets had been studied as well, the empirical results would perhaps have been different. There are two main reasons for working with only one data set: the first is the time limitation and the second is to simplify the understanding for the reader as much as possible. There are several other methods available for treatment of missing data that will not be investigated in this paper. The reason for not studying more methods is the same as for not using several data sets: the limitation of time and the wish to make things clearer for the reader. The reason for choosing CC, SI and MI (and not some other methods) is that these methods are widely used (particularly the first two).

1.5 Outline

Chapter 2 begins with an introduction describing different causes of missing data. The three mechanisms of missing data are presented: missing completely at random, missing at random and not missing at random. Different missing data patterns are also presented. The theoretical part in chapter 3 describes the different missing data methods studied in this thesis: complete case analysis, single imputation and multiple imputation. In chapter 4 a description of the variables in the data set is presented, where corruption is the not fully observed variable of interest and GDP per capita and civil liberties are the fully observed independent variables. Some descriptive statistics, such as means and correlations, are also presented in that chapter. Chapter 5 is the practical part where the nonresponse mechanism is estimated and the simulation procedure is described. In chapter 6 the results of the different methods are presented. A discussion about the four methods based on the results and a short summary are presented in chapter 7.

2 Missing data

2.1 Missing data mechanisms

Before using any method for dealing with missing data it is important to understand why the data is missing. There can be many reasons why data is missing. Missing data could arise if some respondents in a survey cannot participate due to reasons such as sickness. Another common cause of missing data is that the data collector types in some of the data incorrectly. Other reasons for missing data could be that respondents are on vacation or that some refuse to answer on principle. Some missing data could arise because of difficulties with the language or because respondents are stressed and do not have enough time to participate in the survey. [2] Another reason for missing data could be that the equipment used, for instance in a medical study, did not work correctly. Data can be missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR), and these three missing data mechanisms are presented in the sections below. [3]

2.1.1 Missing completely at random (MCAR)

Let us assume that there is only one single variable with missing data, called Y, and another variable X, which is fully observed. Let R = 1 if the variable Y has a missing value and R = 0 if the value of Y is observed (this notation will be used throughout this thesis). The MCAR assumption can then be written:

$$P(R = 1 \mid X, Y) = P(R = 1) \qquad (2.1)$$

The probability of Y being missing does not depend on the observed variable X or on Y itself (2.1). [4] In other words, the assumption of missing completely at random means that the probability of an observation ($Y_i$) being missing does not depend on the value of $Y_i$ or on the value of any other variable in the data set (X in our two-variable example). If respondents with low income (Y = low) are as likely to report their income as those with high income (Y = high), irrespective of their level of education (X), then missingness can be considered MCAR. [5]

[2] Hörngren, Jan
[3] Howell, David C.
[4] Allison, Paul D., Missing Data, p.73
[5] Howell, David C.
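The MCAR condition (2.1) can be illustrated with a small simulation. The following is a minimal Python sketch rather than the SAS or R code of the thesis; the function name `simulate_mcar`, the generated values and the 20 % missingness rate are hypothetical choices made only for illustration.

```python
import random
import statistics


def simulate_mcar(y, p_missing, seed=0):
    """Set each value of y to None with the same probability p_missing.

    Because the probability is constant, P(R = 1 | X, Y) = P(R = 1),
    which is the MCAR condition.
    """
    rng = random.Random(seed)
    return [None if rng.random() < p_missing else value for value in y]


# Hypothetical fully known variable Y (not data from the thesis).
data_rng = random.Random(1)
y = [data_rng.gauss(5.0, 2.0) for _ in range(20000)]

y_mcar = simulate_mcar(y, p_missing=0.2)
observed = [value for value in y_mcar if value is not None]

true_mean = statistics.fmean(y)
cc_mean = statistics.fmean(observed)          # mean of the observed values only
missing_rate = 1 - len(observed) / len(y)

print(f"missing rate: {missing_rate:.3f}")
print(f"true mean: {true_mean:.3f}, mean of observed values: {cc_mean:.3f}")
```

Under these assumptions the mean of the observed values should stay close to the mean of all values, since the deleted values are a random subset.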

2.1.2 Missing at random (MAR)

The MAR assumption can be written as:

$$P(R = 1 \mid X, Y) = P(R = 1 \mid X) \qquad (2.2)$$

Let us assume that only Y has some missing data and the other variable X has only observed values. The probability of Y being missing does not depend on Y itself, but it can depend on X (2.2). If the probability of reporting income (Y) depends on the level of education (X) but, given the level of education, not on the income itself, then the missingness can be considered to be MAR. MCAR is a special case of MAR, which means that if the data is MCAR it is also MAR. MAR is a weaker assumption than MCAR.

2.1.3 Not missing at random (NMAR)

Not missing at random is when the MAR assumption has been violated and the variable Y is not missing at random. [6] This can be written:

$$P(R = 1 \mid X, Y) = P(R = 1 \mid X, Y) \qquad (2.3)$$

That is, the formula (2.3) cannot be simplified if the data is not missing at random. There can be several reasons why the data is not missing at random. One example is the situation described above, in which the respondents with low income (Y) tend to report their income less frequently. Another example of NMAR is if missingness depends on the level of education (X), whether it is observed or not.

2.1.4 Missing data mechanism for our data set

Regarding the variable corruption, the reason why values are missing is not known, but values seem to be missing to a greater extent for more corrupt countries and for countries with a low degree of liberty. These could be reasons why the data is missing on the corruption index, but this is not very likely. Why not? The corruption index is a combined index representing different surveys and assessments of perceptions of corruption, from reliable institutions that make judgments about corruption in many countries. Therefore one could conclude that whether a value on corruption is missing does not depend on how corrupt the country is. It is more likely that the institutions simply did not study some countries in the year studied. It is interesting to look at the latest survey of the corruption perceptions index. In that study 177 countries in the world were represented (almost all of the world's countries). [7] What could be concluded from this is that it is reasonable to assume that the data on corruption is at least missing at random.

[6] Allison, Paul D., Missing Data, p.74
[7] Transparency International,
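The difference between MAR (2.2) and NMAR (2.3) can be made concrete with simulated data. The sketch below is a hedged Python illustration, not the procedure of the thesis: the linear relation between X and Y, the logistic form of the missingness probability and all coefficients are assumptions chosen only for the example.

```python
import math
import random
import statistics


def logistic(t):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


rng = random.Random(42)
n = 20000

# Hypothetical data: X is fully observed, Y is positively related to X.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [5.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

# MAR: P(R = 1 | X, Y) depends on X only (low X -> more often missing).
r_mar = [1 if rng.random() < logistic(-1.0 - xi) else 0 for xi in x]

# NMAR: P(R = 1 | X, Y) depends on Y itself (low Y -> more often missing).
r_nmar = [1 if rng.random() < logistic(-1.0 - (yi - 5.0)) else 0 for yi in y]


def complete_case_mean(values, r):
    """Mean of the values whose missingness indicator is 0 (observed)."""
    return statistics.fmean([v for v, ri in zip(values, r) if ri == 0])


true_mean = statistics.fmean(y)
bias_mar = complete_case_mean(y, r_mar) - true_mean
bias_nmar = complete_case_mean(y, r_nmar) - true_mean
print(f"complete case bias under MAR:  {bias_mar:.3f}")
print(f"complete case bias under NMAR: {bias_nmar:.3f}")
```

In both settings the mean of the observed values is too high, because low values are under-represented. The difference is that under MAR the missingness is explained by the observed X, so a method that conditions on X can correct for it, whereas under NMAR it cannot.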

2.2 Missing data patterns

The missing data pattern shows which of the values are observed and which are missing. There are different patterns of missing data. One example is the monotone missing data pattern. This pattern is common in studies following respondents over time. One example of such a study could be a study of 1000 unemployed persons who are interviewed each quarter during five years. The monotone missing data pattern arises if a fraction of the respondents leave the study between quarters. This leads to a pattern where all the subsequent observations (after the drop out) are missing. The missing data could also be arbitrary. In such a case the missing values follow no particular pattern. This is common in surveys, where many respondents have not answered one or several of the questions. Another pattern is the univariate missing data pattern, in which only one of the variables has missing values and all other variables are fully observed. [8] This is the pattern of the data set used in the practical part, in which the GDP/capita and civil liberties variables have no missing values while the corruption index is missing for 28 countries.

3 Theoretical part of missing data methods

3.1 Complete case (CC) analysis

CC analysis is a method for handling missing data and is one of the methods used in this thesis. The method is also called listwise deletion and is the most common method for handling missing data, [9] probably because of its simplicity. Complete case analysis is in many cases the default in statistical programs (for example in SAS when linear regression analysis is performed). [10] An example clarifying how to use the method is given in Table 3.1 below. The data is not taken from any source (it is made up to illustrate the method in question).

[8] Little & Rubin, p.
[9] Howell, David C.
[10] Institute for Digital Research and Education

Table 3.1: Missing values for the variables in the example data set.

Subject   Age       Gender    Education     Monthly income (in SEK)
1         32        missing   High school
2                   Male      University
3         missing   Male      University
4                   Male      High school
5                   missing   missing
6                   Female    High school   missing
7         33        Female    High school
8                   Female    University

The data set consists of 8 subjects, with one dependent variable (monthly income) and three explanatory variables. Subject 1 has no data for gender while the age of the third subject is unknown. The fifth subject has missing data on both gender and educational level while the income of subject 6 is missing. The method is very simple: all observations with at least one missing value are deleted. In the example above subjects 1, 3, 5 and 6 have at least one missing item. Therefore these observations are deleted, which creates a complete data set. Four observations (number 2, 4, 7 and 8) remain, so the sample size has decreased to 4 and half of the observations have been deleted. A regression analysis, for instance, can then be performed on these subjects. In general, if a regression model contains many variables it is common that many of the observations have at least one missing value, leading to a big reduction of the data set when CC analysis is implemented.

An investigator performing a complete case analysis assumes that the observed complete cases are a random sample of the originally targeted sample. In other words, the researcher assumes that the subjects with no missing values are a random sample from the whole population. This method is acceptable to use when there is a small amount of missing values in the data set, since the effect is not considered to be too big regardless of whether the data is MCAR, MAR or NMAR. [11]

There are some advantages of complete case analysis. The first advantage is that the method is very easy to perform. The second advantage is the possibility to compare univariate statistics. [12] This can be done because all statistics are based on the same subjects after the deletion of incomplete cases (that is, after all observations with at least one missing value are deleted from the analysis). [13] The third advantage is that if the assumption of MCAR holds then the parameter estimate will be unbiased. [14]

There are also some drawbacks of complete case analysis. One disadvantage is that incomplete cases are not considered in the analysis. This inevitably leads to a loss of information. According to Little and Rubin the cost is both a loss of precision and, if the assumption of MCAR is not satisfied, bias. The method could be considered acceptable because of its simplicity if the bias and loss of precision are very small. [15] Even when the data satisfies the assumption of missing completely at random there is a loss in power using CC analysis, particularly if a large proportion of subjects is deleted. [16] Complete case analysis is often used but not recommended when dealing with missing data.

Some formulas of CC analysis

$\theta_{CC}$ is the complete case estimate of the average. In our practical situation it is an estimate of the average corruption perceptions index from the complete cases. The increase in variance of $\theta_{CC}$ compared to $\theta_{NM}$, an estimate of the corruption index without missing values (NM = not missing), is given by (3.1):

$$\mathrm{Var}(\theta_{CC}) = \mathrm{Var}(\theta_{NM})\,(1 + \Delta_{CC}) \qquad (3.1)$$

$\Delta_{CC}$ is the proportional increase in the variance coming from the loss of information caused by discarding the incomplete cases. The overall mean is given by:

$$\theta = \pi_{CC}\,\theta_{CC} + (1 - \pi_{CC})\,\theta_{IC} \qquad (3.2)$$

In words, this is the proportion of complete cases ($\pi_{CC}$) times the mean of the complete cases ($\theta_{CC}$) plus the proportion of incomplete cases ($1 - \pi_{CC}$) times the mean of the incomplete cases ($\theta_{IC}$, which is usually not known) (3.2). The bias of the complete case analysis for the mean can be written as:

$$\theta_{CC} - \theta = (1 - \pi_{CC})\,(\theta_{CC} - \theta_{IC}) \qquad (3.3)$$

The bias will be zero if the mean of the incomplete cases is the same as the mean of the complete cases (3.3), which is what is expected if the data is missing completely at random. [17]

[11] Pigott, Therese D., p.
[12] Univariate analysis is the procedure when each variable in a data set is explored separately. A univariate statistic is a summary measure for a single variable, for instance the mean or the standard deviation.
[13] Little & Rubin, p.
[14] Howell, David C.
[15] Little & Rubin, p.
[16] Howell, David C.
[17] Little & Rubin, p.
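The decomposition (3.2) and the bias expression (3.3) can be checked numerically. The following Python sketch uses a small made-up vector in which the values of the incomplete cases are known, something that is normally not available in practice; all numbers and variable names are hypothetical.

```python
import statistics

# Made-up values of Y and a flag telling which cases are incomplete.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 3.0, 5.0, 7.0]
incomplete = [False, False, True, False, True, False, False, False]

y_cc = [v for v, flag in zip(y, incomplete) if not flag]   # complete cases
y_ic = [v for v, flag in zip(y, incomplete) if flag]       # incomplete cases

pi_cc = len(y_cc) / len(y)             # proportion of complete cases
theta_cc = statistics.fmean(y_cc)      # complete case mean
theta_ic = statistics.fmean(y_ic)      # mean of the incomplete cases

theta = pi_cc * theta_cc + (1 - pi_cc) * theta_ic          # (3.2)
bias_direct = theta_cc - theta                             # left side of (3.3)
bias_formula = (1 - pi_cc) * (theta_cc - theta_ic)         # right side of (3.3)

print(f"overall mean from (3.2): {theta:.4f}")
print(f"bias of the complete case mean: {bias_direct:.4f}")
```

Here the incomplete cases have a higher mean than the complete cases, so the complete case mean underestimates the overall mean.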

3.2 Single imputation (SI)

The second approach for treating missing data (in this thesis) is the single imputation method. As mentioned earlier this method is one of the most widely used. Single imputation can be performed in many ways and is the umbrella term for methods where one value is imputed for each missing value. Imputations could be either draws or means from a predictive distribution of the missing values. In this thesis D is the number of imputed data sets, where D = 1 is the special case of single imputation.

There are many ways to fill in values for missing observations. One simple way is to use the mean of the observed cases of the variable of interest and impute this unconditional mean for each missing value. After all values have been imputed a complete data set is created. An example of this simple single imputation procedure is presented in Table 3.2 below (the same data is used as in the complete case analysis example above and the procedure is illustrated for the income variable).

Table 3.2: Unconditional mean imputation on monthly income.

Subject   Age       Gender    Education     Monthly income (in SEK)
1         32        missing   High school
2                   Male      University
3         missing   Male      University
4                   Male      High school
5                   missing   missing
6                   Female    High school   The imputed value:
7         33        Female    High school
8                   Female    University

The imputed value (in SEK) is the unconditional mean, which is the sum of all the observed values divided by the number of observations with no missing value on monthly income. [18] The biggest problem with this method is that it produces biased estimates of the mean, unless the data is MCAR. With this simple mean imputation the variance of the variable of interest will also be underestimated. Imputing the mean for the missing values does not account for the variation that probably would exist if the missing values actually were observed, because the actual observations are probably not exactly equal to the mean value. Another reason for this underestimation is that the imputation procedure increases the number of observations in the data set. The sample size is increased, leading to a smaller standard error, and this will not reflect the actual uncertainty in the data. [19]

Another method is conditional mean imputation, which can be done in several ways. This method imputes the conditional means given the observed values. One example of conditional mean imputation is regression imputation. In this method missing values are substituted by predicted values from a regression analysis. A simple example would be to estimate the simple linear regression model, with age as the only explanatory variable, based on the subjects with values on both monthly income and age (subject 1, 2, 4, 5, 7 and 8), given by (3.4). This is done in the statistical software program R and the R codes are attached in Appendix 2: R codes.

$$\text{Monthly income} = \alpha + \beta \cdot \text{Age} \qquad (3.4)$$

After the estimation a prediction is made (for a person at age 28) by inserting Age = 28 into the fitted line (3.4), which gives the imputed monthly income in SEK. In this procedure, by conditioning on X (Age), the bias from the unconditional case can be reduced. The variance is still underestimated, since it is unreasonable to assume that the values lie exactly on the regression line.

There are other imputation methods that will not be treated in this paper. Hot deck imputation substitutes missing values with values from observations in the sample with similar characteristics. Substitution replaces missing observations with similar observations not originally included in the sample. Cold deck imputation is similar to hot deck imputation but the replacement of missing values is done from another source (for instance from the same survey from last year).

Conditional mean imputation reduces the bias but still underestimates the variance; it is preferred to unconditional mean imputation. The recommended method is a procedure of conditional draws, which is used in the practical part of this paper for single imputation. The conditional draws method is the recommended single imputation method both under the assumption of MCAR and under the MAR assumption. The formula for imputing a conditional draw with the regression approach is:

$$y_{ik} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + z_{ik} \qquad (3.5)$$

The last term is a random normal deviate with a mean of zero. The inclusion of this error term is important and is what makes the single imputation a draw from the predictive distribution of the missing values, instead of the conditional mean (3.5). Even though such a method in general is an improvement on conditional mean imputation there are still some disadvantages. One disadvantage is that the random draws result in an efficiency loss. Another disadvantage is that standard errors of the parameter estimates from the imputed data are systematically too small. This uncertainty issue could be handled by changing to the method of multiple imputation. [20] A Bayesian approach to this method, where the regression parameters are drawn from a distribution, could be especially preferable. This is because the uncertainty from imputation is not fully considered if the parameters of the estimated model are assumed to be fixed. This additional Bayesian approach will be used for multiple imputation in our empirical study.

[18] The imputed value is the sum of the seven observed monthly incomes divided by 7.
[19] Pigott, Therese D., p.
[20] Little & Rubin, p.3-4, 62-66, 72
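The three single imputation variants discussed above (unconditional mean, conditional mean from a regression, and conditional draw as in (3.5)) can be compared in a short simulation. This is a minimal Python sketch under simplifying assumptions, not the SAS or R code of the thesis: it uses one explanatory variable instead of two, MCAR missingness, and made-up data; `fit_line` and `impute` are hypothetical helper names.

```python
import random
import statistics


def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1, residual sd)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, (rss / (len(x) - 2)) ** 0.5


def impute(x, y, method, rng):
    """Fill in the None entries of y by 'mean', 'regression' or 'draw'."""
    x_obs = [xi for xi, yi in zip(x, y) if yi is not None]
    y_obs = [yi for yi in y if yi is not None]
    y_mean = statistics.fmean(y_obs)
    b0, b1, sd = fit_line(x_obs, y_obs)
    fill = {
        "mean": lambda xi: y_mean,                             # unconditional mean
        "regression": lambda xi: b0 + b1 * xi,                 # conditional mean
        "draw": lambda xi: b0 + b1 * xi + rng.gauss(0.0, sd),  # conditional draw
    }[method]
    return [fill(xi) if yi is None else yi for xi, yi in zip(x, y)]


rng = random.Random(7)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_full = [5.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y_miss = [None if rng.random() < 0.3 else yi for yi in y_full]

variances = {m: statistics.variance(impute(x, y_miss, m, rng))
             for m in ("mean", "regression", "draw")}
print(f"variance of fully observed Y: {statistics.variance(y_full):.3f}")
for method, value in variances.items():
    print(f"variance after {method} imputation: {value:.3f}")
```

With data generated this way, mean imputation shrinks the variance the most, regression imputation less, and the conditional draw should come closest to the variance of the fully observed variable, in line with the argument in the text.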

In our empirical case, because the whole population is imputed, the estimated standard error from single imputation will be zero, since the finite population correction (FPC) is zero. [21]

3.3 Multiple imputation (MI)

The third approach for handling missing data is the multiple imputation (MI) method. The assumption for MI and other imputation methods is that the data is at least MAR. It is a good idea to also check the relationship between the variable with missing values and the other variables in the model (before applying any imputation methods). In multiple imputation each missing value is replaced by a vector of D ≥ 2 imputed values, which gives D complete data sets (in single imputation D = 1). [22]

The usual multiple imputation procedure suggested by Rubin (1977) is done in several steps. First, imputation is made using a suitable model that takes the random variation into account. Following our notation above, this is done D times, creating D complete data sets. According to this original suggestion from Rubin it is often sufficient to do this 3-5 times (but more is better). After values have been imputed each complete data set is analyzed. An average of the variable of interest (y) is calculated for the data set at hand:

$$\theta_d = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (3.6)$$

In (3.6) n is the sample size in each imputed data set. The mean is calculated for each imputed data set, giving $\theta_1, \theta_2, \dots, \theta_D$. Finally the averages from the D data sets are combined into one single point estimate. [23] The overall mean is calculated by summing the means $\theta_1 + \theta_2 + \dots + \theta_D$ and dividing by D (3.7). [24]

$$\theta_{overall} = \frac{1}{D}\sum_{d=1}^{D}\theta_d = \frac{\theta_1 + \theta_2 + \dots + \theta_D}{D} \qquad (3.7)$$

In the practical/empirical part two multiple imputation methods will be implemented on the data set: one frequentist approach and one Bayesian approach. The reason for using two multiple imputation approaches is that it could be interesting to see if there are any differences in the results. The frequentist method imputes missing values by a regression analysis approach. Missing values are replaced using random draws around the fitted linear regression line. The Bayesian imputation method is quite similar, but uses a Bayesian linear regression approach. [25]

Nowadays the number of imputations suggested is higher than 3-5 according to Allison. In his article from November 2012 Allison describes what different authors suggest regarding the number of imputations. If 27 % of the cases in the data set have missing values on at least one variable it is recommended to use approximately 30 imputations. Another suggestion presented in the article is that 20 imputations should be made if 10 % to 30 % of the observations have missing values. [26] In the data set used in the practical section 28 values (21.7 %) are missing. Following these recommendations, 30 imputations will be used for the multiple imputation methods in the practical/empirical part.

Variance formula

The total variance formula used for multiple imputation in this thesis is:

$$\mathrm{Var}(\theta \mid Y_{obs}) \approx \bar{V} + (1 + D^{-1})\,B \qquad (3.8)$$

$\bar{V}$ is the within-imputation variance: the sum of the variances from each data set divided by D. The within variance is the same as the usual single imputation variance estimator, except that it is averaged over the D created data sets. B is the between-imputation variance, and the between and within components add up to the total variance (3.8). [27]

The finite population correction (FPC) is used when the population is finite. In this paper the sample size is equal to the whole population size of 129 countries (N = n). That is, the whole population is imputed. The FPC cannot be ignored as it could have been if N was much larger than n. The stepwise calculation of the variance is presented below. The within variance ($\bar{V}$) is zero due to the finite population correction. [28] In our case FPC = 0, as given by (3.9): [29]

$$\mathrm{FPC} = \frac{N - n}{N - 1} = \frac{N - N}{N - 1} = 0 \qquad (3.9)$$

[21] The finite population correction is explained in the section Variance formula.
[22] Little & Rubin, p.
[23] Allison, Paul D., Multiple Imputation for Missing Data: A Cautionary Tale, p.4
[24] Little & Rubin, p.
[25] The Comprehensive R Archive Network, p.61,63
[26] Allison, Paul, Why You Probably Need More Imputations Than You Think
[27] Little & Rubin, p.
[28] Starsinic, Michael
[29] West Chester University
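The effect of the finite population correction (3.9) is easy to check with a few lines of code. The Python sketch below is only an illustration; the function names are hypothetical, and `variance_of_mean` uses the standard textbook expression for the variance of a sample mean under simple random sampling without replacement, which is assumed here and not quoted from the thesis.

```python
def fpc(population_size, sample_size):
    """Finite population correction (N - n) / (N - 1), as in (3.9)."""
    return (population_size - sample_size) / (population_size - 1)


def variance_of_mean(population_variance, population_size, sample_size):
    """Variance of a sample mean under simple random sampling without
    replacement, where population_variance uses the divisor N."""
    return population_variance / sample_size * fpc(population_size, sample_size)


print(fpc(129, 129))                     # whole population included: 0.0
print(fpc(129, 65))                      # about half the population: 0.5
print(variance_of_mean(4.0, 129, 129))   # no sampling variance left: 0.0
```

When N = n the correction is exactly zero, which is why the within variance vanishes when the whole population is imputed.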

16 The within variance V is calculated below (3.10). V is the estimated within variance D times, D = 30 in (3.10): V = V fpc D V = 0 = V 0 D = V (3.10) The only variance left in equation (3.8) is the between variance, B. The total variance of multiple imputation is then given by (3.11): Var (θ Y obs ) 0 + ( )B = ( )B. (3.11) Multiple imputation vs single imputation Multiple imputation has the same advantages as single imputation, for instance the possibility of using standard complete-data methods. A problem of single imputation is that when imputing a single value the user may be tricked to believe that the imputed value is true, the uncertainty is not being considered, it could be that the missing value is an outlier with a very high or very low value. Multiple imputation takes the uncertainty into account, which is a considerable advantage compared to single imputation. The disadvantage of multiple imputation compared to single imputation is that it takes more time and effort to make the imputations and analyze the results. This drawback of MI is not very important because of the computer programs available today Differences between a frequentist and Bayesian approach A difference between these two is that the frequentist approach has repeatable samples at random and fixed parameters, while the Bayesian approach has unknown parameters and fixed data. There are also other differences, for example; the frequentist method uses a sampling distribution of the data while the Bayesian method assumes a prior distribution before the data have been seen, based on previous studies Estimates of imputation uncertainty for missing values There are several ways to account for the additional variance (uncertainty) because of nonresponse (the missing values). One way is described above (for multiple imputation). 
There are also other ways to estimate the additional uncertainty caused by missing values, but they are not presented here since they are not used in this thesis. For the interested reader, some of these approaches are presented in the book Statistical Analysis with Missing Data (see footnote). 32

30 Little & Rubin, p
31 Casella George
32 Little & Rubin, p
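The combination of (3.9) to (3.11) can be illustrated with a minimal Python sketch. It assumes Rubin's usual combining rules and is not the SAS or R code used in the thesis; the function name `pool_mi` and the three example estimates are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def pool_mi(estimates, within_variances, fpc=0.0):
    """Combine D multiply imputed estimates (sketch of Rubin's rules).

    fpc scales the within-imputation variance. fpc = 0 corresponds to
    n = N as in (3.9), so only the between variance remains.
    """
    d = len(estimates)
    theta_bar = mean(estimates)            # pooled point estimate
    v_bar = fpc * mean(within_variances)   # within variance, cf. (3.10)
    b = variance(estimates)                # between variance (divisor D - 1)
    total = v_bar + (1 + 1 / d) * b        # total variance, cf. (3.11)
    return theta_bar, v_bar, b, total


# Hypothetical mean estimates from D = 3 imputed data sets (illustration only).
theta_bar, v_bar, b, total = pool_mi([4.2, 4.3, 4.4], [0.05, 0.06, 0.05])
```

With `fpc=0.0` the within variance vanishes and the total variance reduces to (1 + 1/D) times the between variance, which is the simplification described in the text.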

4 Description of the data set

In this section the data set used in the practical part (section 5) is presented. All our data are on country level with a total of 129 countries, which are treated as all countries in the world (the total population), and the year of interest is 2002. The data are assumed to be missing at random. In a simulation study (of simulations), performed in the practical part, nonresponses will be randomly created using the estimated nonresponse mechanism (=phat). 33 The simulation procedure is explained further in Section 5: Practical part in SAS and R.

The variable corruption is the dependent variable. It is an index of perceived corruption, collected from the home page of Transparency International. The anti-corruption organization ranks countries and territories based on the level of corruption in a country's public sector. The index is a measure of abuse of power, secret dealings and bribery in the world. It is a score on a scale from 0 to 10: countries with a value close to 0 are highly corrupt, and a value close to 10 means that the country is very clean. In the study of corruption in 2002 there were 102 countries with observed values and 28 countries with missing observations 34 while in the following year of 2003 there were 133 countries with observed values. 35 Of these 133 countries, Palestine had missing values on both GDP/capita and civil liberties in 2002. This is probably because the status of Palestine is controversial. Values on GDP/capita were missing for both Cuba and Iraq (a country in insecurity, close to war at the time). 36 The value on civil liberties is missing for Hong Kong. 37 These four countries were deleted from the analysis, so 129 countries remain and are included in the study of average global corruption.
28 of these countries have missing values on the corruption perceptions index (101 countries have observed values). Both of the explanatory variables, GDP/capita and civil liberties, are fully observed. Seven out of ten countries have a corruption index under 5, which indicates that the majority of the countries in the world are corrupt. The countries with a corruption index above 9.0 for year 2002 are Finland, Sweden, Singapore, New Zealand, Iceland and Denmark, and the countries at the bottom of the ranking, scoring under 2, are Angola, Bangladesh, Indonesia, Kenya, Madagascar, Nigeria and Paraguay. 38 Negative side effects of corruption are the undermining of democratic institutions, economic slowdown and government instability. 39

Not surprisingly, the corruption scores for 2002 and 2003 are very similar for most countries. Corruption perceptions usually do not change much between two consecutive years. A few countries have had notable changes in corruption. One of the countries with the biggest changes is Botswana, which had a corruption score of 6.4 in 2002 and 5.7 in 2003. This is a decrease in the corruption index of 0.7 units and a corresponding increase in corruption. Other countries that became more corrupt from 2002 to 2003 are Namibia, Ethiopia and Haiti. Madagascar is the country with the biggest positive change in corruption perceptions, from 1.7 in 2002 to 2.6 in the following year. As mentioned, most countries have, as expected, almost the same corruption index in both years. Sweden and Finland are two examples.

33 phat is the probability of missing data on the corruption index given values on the explanatory variables civil liberties and GDP
34 Transparency International,
35 Transparency International,
36 International Monetary Fund
37 Freedom House, Data
38 Transparency International, 2002, (push the press release link)
39 UNODC

GDP per capita, current prices (in U.S. dollars) from 2002, is the gross domestic product divided by the midyear population. GDP is the sum of the gross value added by all resident producers in the economy plus product taxes, minus subsidies that are not included in the value of the products. GDP is, unfortunately, not a measure of personal income. 40 The countries with the highest GDP/capita are Luxembourg, Norway and Switzerland, while Ethiopia, Myanmar and Tajikistan are at the bottom.

Civil liberties is a variable indicating the degree of liberty (freedom) in a country. Freedom in a country could be measured, for instance, by an index of political rights or an index of civil liberties from the independent organization Freedom House. In this thesis the variable chosen is civil liberties. The main reason for choosing civil liberties over political rights is that the political rights index overlaps with growth in real GDP per capita and political corruption, which are variables already included in our data set. The civil liberties index is a measure of freedom in the world. It indicates freedom of expression, assembly, association, education and religion, allowance of free economic activity, and that men, women and minority groups are equal and have the same opportunities.
Civil liberties is measured on a scale from 1 to 7, where 1 indicates the highest degree of liberty. 41

Completecorruption02 is a new variable that has been created. The corruption index of 2002 has 28 missing values, while the corruption index of 2003 is fully observed and has no missing values. This property of the data is used to create the variable completecorruption02, which will be considered as the (true) answer sheet. The answer sheet is created as follows:
1. First the 101 observed values of corruption for year 2002 are included in the answer sheet.
2. Then the 28 missing values from the corruption data of 2002 are replaced with the corresponding values from the corruption data for year 2003.

R, the nonresponse indicator for corruption in 2002, is created as a binary 1/0 dummy variable. 42 R = 1 if the corresponding country lacks a value on the corruption index, and R = 0 indicates that the value on corruption in 2002 is observed. As mentioned before, 28 values are missing for corruption. This means that there are 28 observations (21.7 %) where R=1 and 101 observations (78.3 %) where R=0. The data set with all variables can be found in Appendix 1: SAS codes.

40 The World Bank
41 Freedom House, Methodology
42 Note that both the variable R and the statistical software program R have the same name. Do not get confused!

Table 4.1: Correlations between the variables.
Pearson Correlation Coefficients, Prob > |r| under H0: Rho=0, Number of Observations
Variables (rows and columns): Corruption02, Liberties, GDP, Corruption03

Table 4.1 above shows the output of Pearson correlation coefficients (r). The degree of linear relationship between the variables is measured on the scale $-1 \le r \le 1$. When r is close to (or equal to) -1 the correlation is strongly negative, and when r is close to (or equal to) 1 the correlation is strongly positive. The correlation between corruption in 2002 and corruption in 2003 is 0.992, indicating a significant (p-value < 0.001) strong positive linear relationship. Because of the strong relationship between the corruption indexes in both years, it is reasonable to use the observed values from 2003 (which are missing in 2002) as part of the answer sheet.

The negative relationship between corruption and civil liberties, r= -0.68, is significant. Note that the relationship is negative only because the scales run in opposite directions: a high value of corruption indicates that the country is less corrupt, while a high value of civil liberties indicates that a country is less free. The same interpretation applies to corruption in 2003 and civil liberties, with a value of r= . The correlation between GDP/capita and corruption in 2002 is significant with a value of 0.83, and the correlation between GDP/capita and corruption in 2003 is . These results are reasonable since the corruption is almost the same in both years (the corruption perceptions index is usually quite constant or changes very slowly over time). The relationships between the variables are shown visually in Figure 4.1 below.

Figure 4.1: The relationships between the variables Corruption02, GDP/capita, Liberties and Corruption03.
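The construction of the answer sheet and of the nonresponse indicator R described in this section can be sketched in a few lines of Python. This is only an illustration of the two steps and not the SAS code of the thesis. The function name `build_answer_sheet` is hypothetical, and the three example countries use scores quoted in the text and footnotes (Botswana, Madagascar, and Algeria, whose 2002 score is assumed here to be one of the missing values).

```python
def build_answer_sheet(corruption02, corruption03):
    """Return (completecorruption02, R) for parallel lists of scores.

    None marks a missing 2002 score. A missing score is replaced by the
    2003 score, and R is 1 for missing and 0 for observed.
    """
    complete, r = [], []
    for c02, c03 in zip(corruption02, corruption03):
        missing = c02 is None
        r.append(1 if missing else 0)
        complete.append(c03 if missing else c02)
    return complete, r


# Botswana, Madagascar, Algeria (2002 score treated as missing).
c02 = [6.4, 1.7, None]
c03 = [5.7, 2.6, 2.6]
complete, r = build_answer_sheet(c02, c03)
print(complete, r)
```

Observed 2002 values are kept unchanged, so the 2003 values only enter where R = 1.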

Table 4.2: Description of the variables.
Columns: Variable, N, Mean, Variance, Std Dev, Minimum, Maximum
Rows: Liberties, GDP, Corruption02, Corruption03, R, Completecorruption02

In the descriptive Table 4.2 the variables civil liberties, GDP/capita, corruption in 2003, R and completecorruption02 are fully observed, while the corruption index of 2002 has only 101 observations. The estimates of the corruption means, variances, standard deviations etcetera are quite similar in both years. The corruption index in 2002 has 28 missing values, and the average of the observed values is 4.52 (given in Table 4.2). One interesting thing is that the missingness contributes to a decrease in the corruption index of 0.25 units ( ), that is, to an increase in the perceived average global corruption. The average corruption of the 28 new countries of the study is approximately 3.35. 43 That is, the countries with missing values in 2002 are considerably more corrupt. The variable completecorruption02 has a sample size of 129 with no missing values, since the missing values have been replaced with values from the fully observed corruption index for 2003. The mean for completecorruption02 is .

To be able to compare the change in corruption between 2002 and 2003, only the 101 countries with values for both years are included. The average corruption in 2003 (for the same 101 countries) is obtained in SAS by creating a new variable called Corruption03reduced. SAS codes for this will not be attached in the appendix. Table 4.3 shows that the average is . The conclusion is that the corruption index for the 101 countries has decreased by approximately 0.06 units: the world has become a little more corrupt (as expected, a very small change).
Table 4.3: Average corruption in 2003 for the 101 countries with available corruption data in 2002.
Analysis Variable : Corruption03reduced
Columns: N, Mean, Variance, Std Dev, Minimum, Maximum

According to Table 4.2, Bangladesh represents the minimum value as the most corrupt country both in 2002 (with a score of 1.2) and in 2003 (with a score of 1.3). Finland is the least corrupt, with a score of 9.7 in both years.

43 The value of 3.35 is calculated from the data set which is available in Appendix 1: SAS codes (countries within brackets): [2.6 (Algeria) (Armenia) (United Arab Emirates) (Yemen)] / 28 = 93.7/28
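The arithmetic behind these averages can be checked with a short Python sketch. It uses only figures quoted in the text (101 observed countries with mean 4.52, and a sum of 93.7 over the 28 filled-in countries). Because 4.52 is itself a rounded value, the results are approximations and not the exact entries of Table 4.2.

```python
n_obs, mean_obs = 101, 4.52   # observed 2002 scores, mean as quoted in the text
n_new, sum_new = 28, 93.7     # 2003 scores filling the gaps (footnote 43)

mean_new = sum_new / n_new                                   # average of the 28 countries
mean_complete = (n_obs * mean_obs + sum_new) / (n_obs + n_new)  # pooled mean over 129
shift = mean_obs - mean_complete                             # drop caused by filling in

print(round(mean_new, 2), round(mean_complete, 2), round(shift, 2))
# prints: 3.35 4.27 0.25
```

The pooled mean is simply the size-weighted average of the two groups, which is why adding 28 countries with a lower average pulls the overall mean down by about 0.25 units.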

Civil liberties has a minimum value of 1, representing countries with much freedom such as Australia, Austria, Belgium, Canada, Chile, Denmark, France, Germany, Iceland etcetera, while the maximum value of 7 represents less democratic countries where freedom is restricted. There are five countries with a civil liberties index of 7: Libya, Myanmar, Saudi Arabia, Sudan and Syria. Ethiopia is the country with the lowest GDP per capita in 2002, while Luxembourg with a GDP/capita of approximately dollars is the country at the top of the list.

5 Practical/empirical part in software programs SAS and R

All our data are on country level with 129 countries. These countries are treated as if they were all countries in the world, and it is this population the paper makes statements about, since the intention is to discuss corruption on a global level.

Population imputation (mass imputation) is the term used when a large data set with many variables with missing values is subject to imputation. The term can also be used in our case, when there is missingness in a large data set and the sample size (n) is equal to the entire population size (N). Large data sets with missingness can be problematic and are not always easy to manage. Population imputation tries to correct the nonresponse problem by filling in large blocks of missingness in the data set. The assumption of at least MAR has to be fulfilled. The imputation process is then repeated D>1 times, and it is an approximation of the multiple imputation posterior distribution 44 from a frequentist or Bayesian procedure. 45 A positive aspect of population imputation is that the bias can be reduced. 46

A simulation study will be performed on our data set. First, the nonresponse mechanism (phat) will be estimated, that is, the probability of missing data on the corruption index of 2002 given values on liberties and GDP.
Then data sets, with both missing and observed values on the variable completecorruption02, will be simulated using this nonresponse mechanism. The number of simulations run in the simulation study is . The data set was first imported into the statistical program Statistical Analysis System (SAS). The version used in this thesis is SAS 9.3. All SAS codes used in the analysis are presented in Appendix 1: SAS codes.

44 Pettersson, Nicklas, Multiple Kernel Imputation - A Locally Balanced Real Donor Method, p
45 Rässler, Susanne, p
46 Black, Stephen; Creel, Darryl & Krotki, Karol,

The logistic procedure was used to model the probability of a missing value on corruption in 2002 as a function of GDP/capita and civil liberties. This probability is named phat in the SAS codes and is the so-called nonresponse mechanism. How the nonresponse mechanism is calculated is shown below. The left side of equation (5.1) is the logit of the probability of nonresponse, and the right side is the estimated logistic regression model. 47

$$\mathrm{logit}(\hat{p}) = \ln\frac{\hat{p}}{1 - \hat{p}} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}$$

$$\frac{\hat{p}}{1 - \hat{p}} = e^{\hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}}$$

$$\hat{p} = P(R = 1 \mid \mathrm{GDP}, \mathrm{Liberties}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}}} \tag{5.1}$$

Table 5.1: Output from the logistic procedure in SAS, which shows the estimates of the parameters in the nonresponse mechanism.
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept                                                      <.0001
GDP
Liberties                                                      <.0001

Table 5.1 shows that the parameter estimate of Liberties ($\hat{\beta}_2$) is statistically significant. The conclusion is that missing data on corruption in 2002 seems to have a (strong) relationship with the degree of liberty in the country. The parameter estimate of GDP ($\hat{\beta}_1$) is far from statistically significant with a p-value of , suggesting that it is possible to remove GDP from the model. Even though the p-value is high, GDP was still included as a part of the nonresponse mechanism. This is because of the strong relationship between corruption and GDP: by keeping GDP in the model the variance can be reduced. The nonresponse mechanism (phat) follows from equation (5.1) with the estimates from Table 5.1 inserted:

$$\hat{p} = P(R = 1 \mid \mathrm{GDP}, \mathrm{Liberties}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1\,\mathrm{GDP} + \hat{\beta}_2\,\mathrm{Liberties}}}$$

47 Carnegie Mellon University
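How a value of phat follows from (5.1) can be illustrated with a small Python sketch. The coefficients below are hypothetical placeholders chosen for illustration and are not the estimates of Table 5.1. The function name `nonresponse_probability` is likewise an assumption, and the thesis itself obtains phat from the logistic procedure in SAS.

```python
import math


def nonresponse_probability(gdp, liberties, b0, b1, b2):
    """phat = P(R = 1 | GDP, Liberties) under the logistic model (5.1)."""
    eta = b0 + b1 * gdp + b2 * liberties   # linear predictor, equal to logit(phat)
    return 1.0 / (1.0 + math.exp(-eta))    # same as e^eta / (1 + e^eta)


# Hypothetical coefficients, NOT the estimates reported in Table 5.1.
B0, B1, B2 = -4.0, 0.0, 0.7

p_free = nonresponse_probability(gdp=5000, liberties=1, b0=B0, b1=B1, b2=B2)
p_unfree = nonresponse_probability(gdp=5000, liberties=7, b0=B0, b1=B1, b2=B2)
print(p_free, p_unfree)
```

With a positive coefficient on Liberties, a less free country (a higher index value) gets a higher probability of a missing corruption score, which is the pattern described for Table 5.2.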

By inserting the values of GDP and Liberties for each country, different values of phat are obtained. The logistic procedure does this in SAS:

Table 5.2: Presentation of phat for six countries.
Columns: Obs, Country, R, GDP, Liberties, phat
Rows (in order): Uruguay, Chile, Peru, Ecuador, Libya, Saudi Arabia

Table 5.2 shows values of phat, the probability of having missing data on corruption given values on GDP/capita and the civil liberties index. For illustration purposes, only the two countries with the lowest probability, the two countries representing the median and the two countries with the highest probability of being missing are presented. Uruguay has a GDP/capita of dollars and a value of 1 on the liberties variable. Uruguay is the country with the lowest probability of having a missing value on the corruption index, and the R variable shows that corruption for Uruguay is observed in the actual data. The table shows that Saudi Arabia, according to the missing data mechanism, is the country with the highest probability of missing a value. The less free a country is, the higher the probability of having a missing value on corruption. Peru and Ecuador are the countries with a median probability of missingness (13.1 percent).

Table 5.3: Description of the variable phat.
Analysis Variable : phat Estimated Probability
Columns: N, Mean, Variance, Std Dev, Median, Minimum, Maximum

Table 5.3 presents some descriptive statistics for phat. In the original data 21.7 percent of the observations were missing for corruption in 2002, and therefore the mean of phat is 0.217.

The next step is to implement the missing data methods on the data set. The original idea was to do this on our data set only once, but such an approach could not lead to any strong conclusions. A much better approach is to do a simulation study. How this simulation is done will be explained below.
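The idea of simulating nonresponse from phat can be sketched as follows. This is a minimal Python illustration under stated assumptions and not the MICE-based R code of the thesis: the function name `simulate_complete_case`, the five phat values and the number of simulations are hypothetical, while the five scores are corruption values quoted earlier in the text. Each simulated data set draws R = 1 with probability phat and then computes the complete case mean.

```python
import random
from statistics import mean


def simulate_complete_case(values, phat, n_sims, seed=0):
    """Complete case means over simulated nonresponse patterns."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        # A unit is observed (R = 0) when the uniform draw is at least phat.
        observed = [y for y, p in zip(values, phat) if rng.random() >= p]
        if observed:  # skip the rare pattern where everything is missing
            estimates.append(mean(observed))
    return estimates


values = [9.7, 6.4, 1.7, 2.6, 1.2]     # scores quoted in the text
phat = [0.05, 0.10, 0.30, 0.40, 0.50]  # hypothetical nonresponse probabilities
est = simulate_complete_case(values, phat, n_sims=1000)
bias = mean(est) - mean(values)        # average error of the complete case mean
print(round(bias, 2))
```

Because the low-scoring units are given the highest phat in this toy setup, the complete case mean tends to overestimate the true mean, which mirrors the observation that countries with missing values are more corrupt on average.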
The idea was to do the simulation in SAS, but because of difficulties with the simulation, or more specifically with the loop, and after correspondence with our supervisor 48, the decision was made to use the MICE package in the statistical software program R. The version

48 Pettersson, Nicklas, Correspondence
