WELCOME! Lecture 3 Thommy Perlinger


1 Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger

2 Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values

3 Graphical examination of the data It is important to understand, evaluate, and interpret results from multivariate analyses, which might be complex. This requires a thorough understanding of the basic characteristics of the underlying data and relationships. Graphical techniques are used to complement the empirical measures, to provide a visual representation of the basic relationships in order to feel confident in the understanding of these relationships.

4 Recap: Quantitative variables Discrete variables: Take only some values, often integer values. A clear indicator that your variable is discrete is that its name begins with "Number of", e.g. number of children. Continuous variables: Take any value in an interval. A more precise measurement would always give more decimals, e.g. weight.

5 Recap: Describing distributions Bar charts display distributions of categorical variables and of discrete (quantitative) variables. Histograms display distributions of continuous variables. [Figure: histogram of Weight, with Count on the vertical axis.]

6 Recap: Describing distributions To interpret a histogram, think about: the general shape; the center and spread; deviations from the general shape.

7 Recap: Graphs for continuous variables Relationship between two variables (bivariate relationships): Scatter plot

8 Examining relationships To interpret a scatterplot: The pattern of points represents the relationship. A strong organization of points along a straight line implies a linear relationship or correlation. A curved set of points may denote a nonlinear relationship. A seemingly random pattern of points may indicate that there is no relationship.

9 Scatterplot matrix in the book [Figure: scatterplot matrix. Scatterplots: bivariate (pairwise) scatterplots. Histograms: univariate graphs.]

10 Recap: Correlation The Pearson correlation coefficient (r) measures the direction (positive/negative) and strength of the linear relationship between two variables X and Y. The correlation is always a number between -1 and 1: -1 ≤ r ≤ 1. The population correlation coefficient is denoted ρ (the Greek letter rho).

11 Recap: Correlation If there is a strong positive linear relationship between X and Y, the value of the correlation coefficient (r) is close to 1. If there is a strong negative linear relationship between X and Y, the value of the correlation coefficient (r) is close to -1. If there is no linear relationship at all between X and Y, the value of the correlation coefficient (r) is close to 0. [Figure: three scatter plots of Y against X, with r = 1, r = -1, and r = 0.]
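
The three cases on this slide can be checked numerically. The following is a minimal sketch in Python of the sample Pearson coefficient; the function name `pearson_r` and the small data sets are made up for illustration and are not part of the lecture or of SPSS.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Illustrative data (hypothetical values).
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on a rising straight line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a falling straight line
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # no linear trend
print(r_up, r_down, r_none)
```

Note that a value of r near 0 only rules out a linear relationship; a curved pattern can still be present.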

12 Examples of scatter plots of data with various correlation coefficients [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0.]

13 Scatterplot matrix in the book [Figure: scatterplot matrix. Correlations: bivariate (pairwise) correlations. Scatterplots: bivariate (pairwise) scatterplots. Histograms: univariate graphs.]

14 Scatterplot matrix in SPSS [Figure: scatterplot matrix from SPSS. One set of panels shows the bivariate (pairwise) scatterplots; the other set shows the same scatterplots with X and Y reversed.]

15 Examining group differences Groups can be formed from the categories of a nonmetric variable. Differences between groups on one or more metric variables are often of interest. Assessing group differences is done through univariate analyses such as t-tests, or multivariate techniques such as MANOVA (multivariate analysis of variance). The graphical method used for this task is the boxplot.

16 Recap: Boxplot (for any data where the median is appropriate) [Figure: boxplot of Age (years). 25% of the observations lie below the box, 50% within the box, and 25% above it.]

17 Recap: Boxplot [Figure: boxplot of Age (years), marking Min, Q1, Median, Q3, Max, and the IQR (the box from Q1 to Q3).] If the median lies near one end of the box, skewness is indicated.

18 Recap: Boxplot Outliers Outlier: an observation more than 1.5 interquartile ranges below Q1 or above Q3. Extreme outlier (* in SPSS): an observation more than 3 interquartile ranges below Q1 or above Q3.
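
The two outlier rules can be written out as a short Python sketch. It assumes quartiles computed with `statistics.quantiles(..., method="inclusive")`; SPSS may compute the box edges slightly differently, so borderline observations can be classified differently there. The function name and the age values are hypothetical.

```python
import statistics

def classify_outliers(values):
    """Split values into mild and extreme outliers using the IQR rules."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance outside the box; negative when v lies between Q1 and Q3.
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    return mild, extreme

# Hypothetical ages: Q1 = 24.5, Q3 = 29.5, so IQR = 5.
ages = [21, 23, 24, 25, 26, 27, 28, 29, 30, 40, 80]
mild, extreme = classify_outliers(ages)
print(mild, extreme)  # [40] [80]
```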

19 Examining group differences Boxplots are used as a complement to the statistical tests to get descriptive information that adds to our understanding of the group differences.

20 Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values

21 Missing data When values on one or more variable(s) are not available for analysis, we say that we have missing data. Missing data are a fact of life in multivariate analysis. Data entry errors or data collection problems, or the respondent refusing to answer (among other things) can lead to missing data.

22 Missing data If there is a lot of missing data, the results can be biased. The larger the rate of missing data, the larger the risk of making incorrect generalizations to the target population. There are different ways of dealing with missing data; e.g., you can impute data (substitute the missing data with some values). It is important to identify any patterns and relationships underlying the missing data, in order to stay as close as possible to the original distribution of values when any remedy is applied.

23 The impact of missing data The missing data processes, especially those based on actions by the respondent (e.g. nonresponse to some questions), are rarely known beforehand. Questions to be investigated regarding missing data: 1) Are the missing data scattered randomly throughout the observations, or are distinct patterns identifiable? 2) How prevalent are the missing data (what is the extent of the missing data)?

24 The impact of missing data The practical impact is the reduction of the sample size available for analysis. Since several variables are included in the analysis, any individual with a missing value on any of the variables will not be a part of the analysis. It has been shown that if 10% of the data is randomly missing in a set of five variables, the sample is reduced to only 40% of the original size. In such situations, you must either gather additional observations, or find a remedy for the missing data.
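
How quickly complete cases disappear can be illustrated with a small calculation. The sketch below makes a simplifying assumption that is not stated on the slide: every value is missing independently with the same probability. Under that assumption the expected share of complete cases with five variables and 10% missing values is 0.9^5, roughly 59%. The figure quoted on the slide is lower, so it presumably rests on different conditions; the sketch only shows the mechanism. The function names are hypothetical.

```python
def expected_complete_share(p_missing, n_variables):
    """Expected share of complete cases if each value is missing
    independently with probability p_missing (an assumption)."""
    return (1 - p_missing) ** n_variables

def complete_cases(rows):
    """Keep only the rows without any missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]

share = expected_complete_share(0.10, 5)

# Tiny hypothetical data set: 4 cases, 3 variables, None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [1.5, None, 2.5],
    [None, 2.2, 3.1],
    [0.9, 1.8, 2.7],
]
kept = complete_cases(rows)
print(round(share, 3), len(kept))  # 0.59 2
```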

25 The impact of missing data From a substantive perspective, any statistical results based on data with a nonrandom missing data process could be biased. If, e.g., individuals that don't provide their household income tend to be those in the higher income brackets, the results will be erroneous. We still get results from the analysis even without the missing data, but it is important to consider the validity of the results.

26 A four-step process for identifying missing data and applying remedies Step 1: determine the type of missing data Is the missing data part of the research design and under the control of the researcher? Or are the causes and impacts of the missing data truly unknown? If the missing data are expected and part of the research design, they are termed ignorable. The missing data process is then operating at random (the observed values are a random sample of the total set of values) and no specific remedies are needed.

27 Example: ignorable missing data 1. Have you experienced any pain during the past 7 days? (If you answered no to question 1, proceed to question 3.) 2. How strong was your pain at its worst? (Scale from "No pain" to "Worst pain imaginable".) Missing data on question 2 are part of the research design, and it would be inappropriate to attempt to remedy them.

28 A four-step process for identifying missing data and applying remedies Step 1: Is the missing data ignorable? Yes: apply specialized techniques for ignorable missing data.

29 Non-ignorable missing data In general, missing data that cannot be classified as ignorable fall into two classes based on their source: 1) Known processes. Missing data that can be identified due to procedural factors, such as errors in data entry, failure to complete the entire questionnaire, etc. 2) Unknown processes. Most often directly related to the respondent, e.g. refusal to respond to certain questions (common when questions are of a sensitive nature), or when the respondent has no opinion or not enough knowledge to answer.

30 Non-ignorable missing data When non-ignorable missing data occur in a random pattern, some remedies may be applicable to mitigate (ease) the effect of the missing data.

31 A four-step process for identifying missing data and applying remedies Step 1: Is the missing data ignorable? Yes: apply specialized techniques for ignorable missing data. No: continue to Step 2. Step 2: Is the extent of missing data substantial enough to warrant action?

32 A four-step process for identifying missing data and applying remedies Step 2: determine the extent of missing data Determine the extent of missing data for individual variables, individual cases (subjects/objects), and even overall. Determine whether the amount of missing data is low enough to not affect the results, even if it is non-random. If the extent is sufficiently low, then any of the approaches for remedying missing data may be applied.

33 Assessing the extent of missing data To identify the extent of missing data, and any exceptionally high levels of missing data that occur for individual cases or observations, tabulate the following: 1) The percentage of variables with missing data for each case/individual/object 2) The number of cases with missing data for each variable separately This can be done using the Missing Values Analysis option in SPSS.
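
Outside SPSS, the two tabulations can be sketched in a few lines of Python. The example assumes the data are stored as one dictionary per case, with `None` marking a missing value; the variable names and values are made up for illustration.

```python
def missing_pct_by_case(rows):
    """Percentage of variables with missing data, for each case."""
    return [100 * sum(v is None for v in row.values()) / len(row) for row in rows]

def missing_count_by_variable(rows):
    """Number of cases with missing data, for each variable separately."""
    return {var: sum(row[var] is None for row in rows) for var in rows[0]}

# Hypothetical data: 4 cases, 3 variables.
data = [
    {"v1": 4.0, "v2": 1.9, "v3": None},
    {"v1": None, "v2": 2.1, "v3": None},
    {"v1": 3.8, "v2": 2.0, "v3": 8.1},
    {"v1": 4.2, "v2": None, "v3": 7.9},
]
pct_per_case = missing_pct_by_case(data)
count_per_variable = missing_count_by_variable(data)
print([round(p, 1) for p in pct_per_case])  # [33.3, 66.7, 0.0, 33.3]
print(count_per_variable)                   # {'v1': 1, 'v2': 1, 'v3': 2}
```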

34 Assessing any patterns of missing data Using the tabulations for each case and each variable: Look for any nonrandom patterns in the data, such as concentration of missing data in a specific set of questions, or signs of individuals not completing the questionnaire, etc.

35 Imputation Imputation is the process of substituting the missing value with a valid value based on other variables and/or cases in the sample. The reason for imputation is that it is desirable to keep as much information as possible in your data set. If the extent of missing data is acceptably low, and no specific nonrandom patterns appear, an imputation technique can be used without biasing the results too much.
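
As one simple illustration of imputation, the sketch below replaces each missing value with the mean of the observed values of the same variable (mean substitution). This is only one of several possible techniques and is chosen here for brevity; the slide does not prescribe a method. The function name and data are hypothetical.

```python
import statistics

def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing values.
scores = [5.0, None, 6.0, 4.0, None]
scores_imputed = impute_mean(scores)
print(scores_imputed)  # [5.0, 5.0, 6.0, 4.0, 5.0]
```

Mean substitution keeps the sample size but shrinks the variance of the variable, which is one reason to use it only when the extent of missing data is low.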

36 How much missing data is too much? Rules of thumb Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g. concentration in a specific set of questions, missing answers at the end of the questionnaire implying non-completion, etc.) The number of cases with no missing data must be sufficient for the selected analysis technique if values will not be substituted (imputed) for the missing data (complete-case analysis).

37 Deleting individual cases and/or variables Consider the simplest approach of remedying missing data, i.e. deleting cases and/or variables with high levels of missing data. You may find that the missing data are concentrated in a small subset of cases and/or variables, and the exclusion of these might substantially reduce the extent of the missing data. In cases where a nonrandom pattern of missing data is present, this might be the most efficient solution.

38 Deletions based on missing data Rules of thumb Variables or cases with 50% or more missing data should always be deleted. Variables with as little as 15% missing data are candidates for deletion, but higher levels of missing data (20% to 30%) can often be remedied. Be sure that the overall decrease in missing data is large enough to justify deleting an individual variable or case.
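
The percentage thresholds in these rules of thumb can be turned into a small screening helper. This is a sketch only: the labels are made up, the example percentages are hypothetical, and the final decision should still weigh whether the overall decrease in missing data justifies the deletion.

```python
def deletion_status(pct_missing):
    """Screen a variable by its percentage of missing data."""
    if pct_missing >= 50:
        return "delete"
    if pct_missing >= 15:
        # Levels of 20% to 30% can often be remedied instead of deleted.
        return "candidate for deletion"
    return "keep"

# Hypothetical missing-data percentages per variable.
pct_missing = {"a": 55.0, "b": 24.0, "c": 9.0}
status = {var: deletion_status(pct) for var, pct in pct_missing.items()}
print(status)
```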

39 Deletions based on missing data Rules of thumb, cont'd Cases with missing data for dependent/response variable(s) typically are deleted to avoid any artificial increase in relationships with independent variables. When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable. Always consider performing the analysis both with and without the deleted cases or variable(s) to identify any marked differences.

40 Example: HBAT missing data A pretest of a questionnaire used to collect the HBAT data, consisting of n=70 individuals and 14 variables.

41 Example: HBAT missing data Step 1: Is the missing data ignorable? No: all the missing data in this example stem from unknown processes (nonresponse by the respondents) and are thus not ignorable. Step 2: Is the extent of missing data substantial enough to warrant action?

42 Example: HBAT missing data SPSS: Analyze >> Missing Value Analysis
Univariate Statistics
Variable   N    Mean     Std. Deviation   Missing Count   Missing Percent   Extremes Low (a)   Extremes High (a)
v1         49   4,008    ,…               …               …,0               0                  0
v2         57   1,944    ,…               …               …,6               0                  0
v3         53   8,062    1,…              …               …,3               0                  0
v4         63   5,168    1,…              …               …,0               0                  0
v5         61   2,856    ,…               …               …,9               0                  0
v6         64   2,611    ,…               …               …,6               0                  0
v7         61   6,823    1,…              …               …,9               1                  0
v8         …    …,033    9,…              …               …,9               0                  0
v9         63   4,759    ,…               …               …,0               0                  0
v10        …                              …               …,9
v11        …                              …               …,9
Categorical variables
v12        …                              …               …,9
v13        …                              …               …,4
v14        …                              …               …,9
a. Number of cases outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
V1, V2, and V3 are possible candidates for deletion.

43 Example: HBAT missing data SPSS: Analyze >> Missing Value Analysis. Click Pattern, mark Cases with missing values

44 Example: HBAT missing data

45 Example: HBAT missing data 6 individuals have 50% missing data and are candidates for deletion. All missing values for the categorical variables occur in these 6 cases.

46 Example: HBAT missing data SPSS: Analyze >> Missing Value Analysis. Click Pattern, mark Tabulated cases. 26 cases have complete data (no missing values). Deleting V3 alone gives only one more complete case. Deleting V1 alone gives 6 more complete cases (32-26=6). Deleting both V1 and V3 gives 11 more complete cases (37-26=11).
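
The same kind of what-if count can be sketched in Python: count the complete cases again after leaving out one or more variables. The data below are a small made-up set, not the HBAT data, and the function name is hypothetical.

```python
def count_complete_cases(rows, dropped=()):
    """Number of cases with no missing value (None) among the kept variables."""
    kept = [var for var in rows[0] if var not in dropped]
    return sum(all(row[var] is not None for var in kept) for row in rows)

# Hypothetical data: 5 cases, 3 variables.
rows = [
    {"v1": None, "v2": 1.0, "v3": 8.0},
    {"v1": 4.0, "v2": 2.0, "v3": None},
    {"v1": None, "v2": 2.5, "v3": None},
    {"v1": 3.5, "v2": 1.5, "v3": 7.0},
    {"v1": 4.1, "v2": None, "v3": 7.5},
]
baseline = count_complete_cases(rows)
without_v1 = count_complete_cases(rows, dropped=("v1",))
without_v1_v3 = count_complete_cases(rows, dropped=("v1", "v3"))
print(baseline, without_v1, without_v1_v3)  # 1 2 4
```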

47 A four-step process for identifying missing data and applying remedies Step 1: Is the missing data ignorable? No: continue to Step 2. Step 2: Is the extent of missing data substantial enough to warrant action? Yes: should cases and/or variables be deleted due to high levels of missing data? If yes, delete cases and/or variables with high missing data. Step 3: Are the missing data processes MAR (nonrandom) or MCAR (random)? Step 4: Do you want to replace the missing data with values?

48 Example: HBAT missing data Step 1: Is the missing data ignorable? No: continue to Step 2. Step 2: Is the extent of missing data substantial enough to warrant action? Yes. Should cases and/or variables be deleted due to high levels of missing data? Yes: 2 variables (V1 and V3) and 6 cases are to be deleted. Step 3: Are the missing data processes MAR (nonrandom) or MCAR (random)? Step 4: Do you want to replace the missing data with values?

49 A four-step process for identifying missing data and applying remedies Step 3: diagnose the randomness of the missing data processes If the extent of missing data is substantial enough to warrant action, the degree of randomness in the missing data has to be ascertained. A nonrandom missing data process is present between the two variables X and Y when significant differences in the values of X occur between cases that have valid data for Y versus those cases with missing data on Y.

50 Levels of randomness of the missing data process Two levels of randomness of missing data: Missing At Random (MAR), which requires special methods to accommodate a nonrandom component. Missing Completely At Random (MCAR), which is sufficiently random to accommodate any type of missing data remedy. The distinction between these two levels is in the generalizability to the population.

51 Missing at random (MAR) If the missing values of Y depend on the variable X, but not on Y, the data are missing at random. The observed values of Y represent a random sample of the actual Y values for each observed value of X, but the observed data for Y do not necessarily represent a truly random sample of all Y values. The missing data process is random in the sample, but the observed values are not generalizable to the population.

52 Example: Missing at random (MAR) X = gender of the respondents (assumed to be known), Y = household income. Missing data are random for both males and females, but occur much more frequently for males. The missing data are random within each gender, but the observed data are not generalizable to the population, since they do not reflect the ultimate distribution of the household income values.

53 Missing completely at random (MCAR) Data are missing completely at random if the observed values of Y are truly a random sample of all Y values, with no underlying process that introduces bias to the observed data. There is no property of the cases that distinguishes those with missing data from cases with complete data.

54 Example: Missing completely at random (MCAR) X = gender of the respondents (assumed to be known), Y = household income. Missing data are random for both males and females, and occur in equal proportions for both genders. In this missing data process, any remedy can be applied without having to consider the impact of any other variable or missing data process.
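
The difference between the two gender/income examples can be made concrete with a small simulation. The sketch below is illustrative only: the income scale, the mean incomes per gender, and the missing-data probabilities are all hypothetical. In both scenarios the chance that income is missing depends at most on gender (X), never on income (Y) itself.

```python
import random
import statistics

# Hypothetical mean incomes per gender (illustrative units, not from the lecture).
MEAN_INCOME = {"male": 60.0, "female": 50.0}

def simulate(n_per_gender, p_missing, seed=0):
    """Return (gender, true_income, observed_income) rows.

    p_missing maps gender -> probability that the income value is missing.
    """
    rng = random.Random(seed)
    rows = []
    for gender in ("male", "female"):
        for _ in range(n_per_gender):
            income = rng.gauss(MEAN_INCOME[gender], 10.0)
            observed = None if rng.random() < p_missing[gender] else income
            rows.append((gender, income, observed))
    return rows

def missing_rate(rows, gender):
    """Share of missing income values within one gender."""
    flags = [obs is None for g, _, obs in rows if g == gender]
    return sum(flags) / len(flags)

def mean_bias(rows):
    """Mean of the observed incomes minus the mean of all true incomes."""
    observed = [obs for _, _, obs in rows if obs is not None]
    everyone = [inc for _, inc, _ in rows]
    return statistics.fmean(observed) - statistics.fmean(everyone)

# MAR: missing much more often for males. MCAR: equal proportions.
mar = simulate(5000, {"male": 0.40, "female": 0.10})
mcar = simulate(5000, {"male": 0.20, "female": 0.20})

for name, rows in (("MAR", mar), ("MCAR", mcar)):
    print(name, missing_rate(rows, "male"), missing_rate(rows, "female"), mean_bias(rows))
```

In the MAR scenario the observed mean income drifts away from the true mean because males are underrepresented among the observed values, while within each gender the observed incomes are still a random sample. In the MCAR scenario the observed mean stays close to the true mean.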

55 Diagnostic tests for levels of randomness There are two diagnostic tests that can be used to assess the level of randomness (MAR or MCAR): 1) Two groups of individuals are formed: one with missing values of Y, and another with valid values of Y. Then statistical tests (e.g. t-tests) are performed to see if differences exist between the two groups based on other variables of interest. Significant differences indicate the possibility of nonrandom missing data. A number of variables should be examined to find any consistent pattern. Either a large number of differences or a systematic pattern may indicate a nonrandom component (MAR).
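
The first diagnostic can be sketched as follows: split the cases by whether Y is missing, then compare the two groups on another variable X with a t statistic. The sketch uses the unequal-variance (Welch) form of the t-test, which is a choice made here and not something the slide specifies. It returns the statistic and its approximate degrees of freedom; a p-value would need a t distribution table or a statistics package. All names and values are hypothetical.

```python
import math
import statistics

def split_by_missingness(rows, y, x):
    """Values of x for cases where y is missing, and where y is valid."""
    missing = [r[x] for r in rows if r[y] is None and r[x] is not None]
    valid = [r[x] for r in rows if r[y] is not None and r[x] is not None]
    return missing, valid

def welch_t(group_a, group_b):
    """Welch t statistic and approximate degrees of freedom."""
    se2_a = statistics.variance(group_a) / len(group_a)
    se2_b = statistics.variance(group_b) / len(group_b)
    t = (statistics.fmean(group_a) - statistics.fmean(group_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical data: y has missing values, x is fully observed.
rows = [
    {"y": None, "x": 5.0}, {"y": None, "x": 6.0}, {"y": None, "x": 7.0},
    {"y": 1.0, "x": 2.0}, {"y": 2.0, "x": 3.0}, {"y": 1.5, "x": 4.0},
]
x_when_y_missing, x_when_y_valid = split_by_missingness(rows, "y", "x")
t, df = welch_t(x_when_y_missing, x_when_y_valid)
print(round(t, 3), round(df, 1))  # 3.674 4.0
```

A large absolute t value, as in this made-up example, suggests that cases with missing Y differ systematically on X, which points toward a nonrandom component.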

56 Diagnostic tests for levels of randomness 2) An overall test of randomness compares patterns of missing data on all variables with the pattern expected for random missing data. If no significant differences are found, the missing data can be classified as MCAR. If significant differences are found, the nonrandom missing data processes have to be investigated. As a result of these tests, the missing data process is classified as either MAR or MCAR.

57 Example: HBAT missing data 1) Two groups of individuals are formed: one with missing values of e.g. V2, and another with valid values of V2. Then, t-tests are performed to see if differences exist between the two groups based on all other numerical variables of interest.

58 Variable that the groups are based on Variables used to test for differences between the groups

59 Example: HBAT missing data Three significant differences between groups based on V2. Only one significant difference among the rest of the tests. SPSS: Analyze >> Missing Value Analysis. Click Descriptives, mark t tests with groups formed by indicator variables

60 Example: HBAT missing data 2) An overall test of randomness (two-sided p-value). H0: The observed pattern of missing data does not differ from a random pattern. Ha: The observed pattern of missing data differs from a random pattern. SPSS: Analyze >> Missing Value Analysis. To the right under Estimation, mark EM (for Little's MCAR test).

61 Example: HBAT missing data This result, together with the analysis showing minimal differences in a nonrandom pattern, allows us to conclude that the missing data process is MCAR. If the MCAR test had been significant, or a nonrandom pattern had been obvious in the previous analysis, the missing data process would have been concluded to be MAR.
