Exploratory Data Analysis


Opal Henderson

Chapter 12: Exploratory Data Analysis

Definition of Exploratory Data Analysis (page 410)

Definition: Exploratory data analysis (EDA) is a subfield of applied statistics concerned with investigating collected or transformed data to reveal patterns, peculiarities, and relationships using visual displays, resistant statistics, and a thorough examination of the residuals.

EDA is a preliminary step in data analysis. It can be used to determine whether the planned method of analysis is appropriate for the collected data.

Four major themes describe the methods used in EDA: revelation, resistance, reexpression, and residuals.

REVELATION

EDA reveals the essential features of the dataset, usually through simple graphical displays such as the stem-and-leaf display and the boxplot.

These graphs give a general idea of the distribution: its center and other quantiles, spread, symmetry, and kurtosis.
Graphs can help detect sources of problems in the analysis, such as outliers and multimodality.
Graphs can also help reveal patterns and possible relationships among the variables in the study.

RESISTANCE

Definition 12.2: A statistic is said to be resistant if its value is not adversely affected (i) when we replace some of the values in a dataset with totally different values; or (ii) when there are minor changes in all of the data values, possibly due to rounding.

Note: The mean and the variance are not resistant statistics, which is why they are seldom used in EDA.
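To make the idea of resistance concrete, the sketch below replaces one value in a small dataset with an outlying value of 1,000 and compares the mean and the median before and after. The data values are hypothetical (they are not taken from the textbook example); they were chosen only so that the mean and median both start at 74.

```python
from statistics import mean, median

# Hypothetical data, chosen so that mean = median = 74 (not the textbook's values).
data = [70, 72, 73, 74, 75, 76, 78]

# Replace the first observation with an outlying value of 1,000.
modified = data.copy()
modified[0] = 1000

print("original: mean =", mean(data), "median =", median(data))
print("modified: mean =", round(mean(modified), 2), "median =", median(modified))
```

In this sketch the mean jumps from 74 to about 206.86 while the median moves only from 74 to 75, which is the behaviour Definition 12.2 describes for a non-resistant and a resistant statistic, respectively.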

Example 12.1 (page 412)

Original dataset: mean and median = 74.
[Table: observation number (i) and value (X_i); data values not recoverable]

Let us examine the effect on the sample mean and median if we replace one value in the dataset with an outlying value such as 1,000.
[Table: for each observation number (i), the modified mean and the modified median (Md); values not recoverable]

Definition of Stem-and-Leaf Display (page 416)

Definition: The stem-and-leaf display (SALD) is a histogram-like display of the data in which the digits of the data values replace the bars in representing the frequencies.

Example: [Stem | Leaf display, unit = 0.1; rows not recoverable]

Note: We can retrieve a data value from the display by joining the digits in the stem and the leaf, then multiplying the resulting number by the specified unit. For example, the smallest data value in the SALD above is 22 x 0.1 = 2.2. In the third row, the observations are 4.5, 4.6 and

Steps in Constructing the SALD (page 416)

Step 1. Choose the common division point at which each data value will be split into its stem and leaf components.

Example 12.3: [smallest and largest data values not recoverable]
Choices of division point (example value for Abra: 235.9; the lower end of each stem range is not recoverable):
  between the ones and tenths places: stem values up to 1394
  between the tens and ones places: stem values up to 139
  between the hundreds and tens places: stem values up to 13
  between the thousands and hundreds places: stem values up to 1

Step 2. In a vertical column, list the stem values from the smallest up to the largest, using increments of 1 unit.

Step 3. Draw a vertical line to the right of the stem values.

Example: [column of stem values; not recoverable]

Steps in Constructing the SALD (page 417)

Step 4. Record the leaf portion of the first observation in the row corresponding to its stem value. Do the same for all of the observations.

Step 5. Sort the leaves within each stem row from lowest to highest. Maintain uniform spacing between the leaves in every row. By doing so, the stem with the most leaves (observations) will have the longest line; that is, it will appear to have the longest bar.

Step 6. Indicate the unit of the leaves so that the actual data values can be recreated from the display. For example, depending on the stated unit, the same stem and leaf digits can represent 35.6, 356, or 3,560. The unit may also carry the measurement label, as in (Unit = 0.1 million pesos).
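The six steps can be sketched in a few lines of Python. This is an illustrative sketch, not the textbook's procedure verbatim: the function names `stem_and_leaf` and `render` are made up for this example, the data are hypothetical, and the code assumes non-negative values already recorded to the precision of the chosen unit, with the leaf taken as the last digit.

```python
from collections import defaultdict

def stem_and_leaf(values, unit):
    """Return the display as a list of (stem, sorted leaves) rows."""
    rows = defaultdict(list)
    for v in values:
        digits = round(v / unit)         # e.g. 4.5 with unit 0.1 -> 45
        stem, leaf = divmod(digits, 10)  # Step 1: split off the last digit as the leaf
        rows[stem].append(leaf)          # Step 4: record the leaf in its stem row
    lo, hi = min(rows), max(rows)
    # Steps 2 and 5: list every stem from smallest to largest, leaves sorted
    return [(s, sorted(rows[s])) for s in range(lo, hi + 1)]

def render(rows, unit):
    """Format the rows as text, with the unit stated (Steps 3 and 6)."""
    lines = [f"Stem | Leaf (unit = {unit})"]
    for stem, leaves in rows:
        lines.append(f"{stem:>4} | " + " ".join(str(d) for d in leaves))
    return "\n".join(lines)

# Hypothetical data, not the values from the slides.
data = [2.2, 2.7, 3.1, 3.8, 4.5, 4.6, 4.9]
rows = stem_and_leaf(data, unit=0.1)
print(render(rows, unit=0.1))
```

Stems with no observations are kept as empty rows, so the shape of the display still reflects gaps in the data.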

Split Stem-and-Leaf Display (page 421)

If the number of leaves in some of the rows is too large, we may split each stem into two groups. The first group includes all leaves with leading digits from 0 to 4; the second group includes all leaves with leading digits from 5 to 9. We mark the stem of the first group with * and the stem of the second group with a period (.).

If the number of leaves is still too large, we can divide each stem into five groups:
  *  leaves with leading digits 0-1
  t  leaves with leading digits 2-3
  f  leaves with leading digits 4-5
  s  leaves with leading digits 6-7
  .  leaves with leading digits 8-9

Example: Below are the starting salaries of a sample of 100 computer science majors who earned their baccalaureate degrees during a recent year.
[Table: Starting Salaries (P000); data values not recoverable] Values range from 18.5 to

Example of Split SALD

[Stem | Leaf display, unit = 0.1 thousand pesos, with each stem split into lines marked *, t, f, s, and .; rows not recoverable]

Note: Outlying values can be reported inside parentheses on a special row: a first row labelled "low" if the values are extremely low, or a last row labelled "hi" if the values are extremely large. For example, if the starting salaries of two graduates are as large as 120.3 and 150.4, then we add the following row at the bottom of the SALD:

  hi (120.3, 150.4)

Definition of Depth (page 419)

Definition 12.5: If we determine the two ranks of a data value by recording its position from each end of the array, then its depth is the smaller of these two ranks.
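Definition 12.5 translates directly into code. The sketch below computes both ranks and the depth for every value in a small hypothetical array; the function name `depths` is an assumption made for this illustration.

```python
def depths(values):
    """Return (value, rank from lowest, rank from highest, depth) for each value."""
    array = sorted(values)
    n = len(array)
    table = []
    for i, x in enumerate(array):
        rank_a = i + 1   # position counted from the lowest value
        rank_b = n - i   # position counted from the highest value
        table.append((x, rank_a, rank_b, min(rank_a, rank_b)))
    return table

# Hypothetical array of five observations.
for row in depths([12, 3, 20, 8, 15]):
    print(row)
```

The depths rise from 1 at each extreme to their largest value at the middle of the array, which is why the median is the "deepest" observation.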

Example 12.4 (page 419)

The array and the corresponding ranks and depths of each observation are as follows:
[Table with columns: Array; Rank A (from lowest to highest); Rank B (from highest to lowest); Depth (the smaller of Rank A and Rank B); data values not recoverable. Marked values: Q_1 = 10, Md = 18, Q_3 = 25]

Observe that the depths of the 1st and 3rd quartiles are both equal to (n+1)/4 = (11+1)/4 = 3, and the depth of the median is (n+1)/2 = 6.

Five-Number Summary

Definition 12.6: A letter value is a statistic whose value depends on its defined depth, which we tag using a particular letter. The median is a letter value whose depth is (n+1)/2, and its tag is M.

Definition 12.7: The extremes are the two data values in the array with depths equal to 1.

Definition 12.8: The fourths, or hinges, are the two data values in the array with the following depth:

\[
\text{depth of fourth} =
\begin{cases}
\dfrac{(\text{depth of median}) + 1}{2} & \text{when } n \text{ is odd} \\[1ex]
\dfrac{(\text{depth of median}) + 0.5}{2} & \text{when } n \text{ is even}
\end{cases}
\]

We use the letter F as our tag for the fourth.

Definition 12.9: The five-number summary is a collection of letter values consisting of the median, the fourths, and the extremes.
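The depth formulas in Definitions 12.6 to 12.9 can be sketched as follows. The function names are assumptions for this illustration, and the 15-value dataset is hypothetical: it was constructed only so that its median and fourths agree with the boxplot example in these slides (Med = 22, lower fourth 16.5, upper fourth 23.5). When a depth ends in .5, the sketch averages the two adjacent ordered values, as the note on the fourth describes.

```python
def value_at_depth(array, depth):
    """Return the (lower, upper) letter values at a depth in a sorted array."""
    k = int(depth)  # whole part of the depth
    if depth == k:
        return array[k - 1], array[-k]
    # Depth ends in .5: take the midpoint of the two adjacent ordered values.
    lower = (array[k - 1] + array[k]) / 2
    upper = (array[-k] + array[-k - 1]) / 2
    return lower, upper

def five_number_summary(values):
    """Return (lower extreme, lower fourth, median, upper fourth, upper extreme)."""
    array = sorted(values)
    n = len(array)
    d_median = (n + 1) / 2
    # Definition 12.8: add 1 when n is odd, 0.5 when n is even, then halve.
    d_fourth = (d_median + 1) / 2 if n % 2 == 1 else (d_median + 0.5) / 2
    med_lo, med_hi = value_at_depth(array, d_median)
    f_lower, f_upper = value_at_depth(array, d_fourth)
    return array[0], f_lower, (med_lo + med_hi) / 2, f_upper, array[-1]

# Hypothetical data (n = 15), built to reproduce the summary values on the slides.
data = [1, 10, 12, 15, 18, 19, 20, 22, 22, 23, 23, 24, 25, 26, 28]
print(five_number_summary(data))
```

Note that the fourths are close to, but not always identical with, the quartiles reported by statistical software, because the depth rule differs.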

Note on the Fourth (page 426)

The fourths can be viewed as the two observations that are halfway between the median and the corresponding extremes.

[Figure: positions of the lower fourth, the median, and the upper fourth along the ordered array]

The depth of the fourth is either a whole number or has a remainder of ½, since the numerator is always a whole number and the denominator is 2. Interpolation is needed only when the depth has a remainder of ½. In this case, take the midpoint of the two values adjacent to the fourth.

Example:
n = 6: depth of fourth = (depth of median + 0.5)/2 = ((6+1)/2 + 0.5)/2 = 2. The fourths are the 2nd and the 2nd-to-last order statistics.
n = 7: depth of fourth = (depth of median + 1)/2 = ((7+1)/2 + 1)/2 = 2.5. The fourths are interpolated values. On each end of the array, the fourth is computed as the average of the two observations with depths equal to 2 and 3.

Definition of Box-and-Whisker Plot (page 430)

Definition: The box-and-whisker plot, or boxplot, is a simple graphical display of the data used to display the five-number summary.

Note: The boxplot displays the following features of the data: (i) location, (ii) spread, (iii) symmetry, (iv) extremes, and (v) outliers.

Steps in Constructing the Boxplot

Step 1: Construct a rectangle with one end at the lower fourth ($F_L$) and the other end at the upper fourth ($F_U$).

Step 2: Draw a line across the interior of the rectangle at the median.

Example (n = 15):
Depth of median = (15 + 1)/2 = 8, so Med = 22.
Depth of fourth = (depth of median + 1)/2 = 9/2 = 4.5, so $F_L = (15 + 18)/2 = 16.5$ and $F_U = (24 + 23)/2 = 23.5$.

Step 3: Compute the fourth-spread ($d_F$), the lower fence, and the upper fence as follows:

\[
d_F = F_U - F_L, \qquad
\text{lower fence} = F_L - 1.5\, d_F, \qquad
\text{upper fence} = F_U + 1.5\, d_F
\]

The lower and upper fences are outlier cutoffs. We consider all data points smaller than the lower fence or larger than the upper fence to be outliers.

Example:
$d_F = 23.5 - 16.5 = 7$
Lower fence $= 16.5 - (1.5)(7) = 6$
Upper fence $= 23.5 + (1.5)(7) = 34$
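The Step 3 arithmetic is easy to check in code. The helper below is a minimal sketch; the name `fences` and the `k` parameter are assumptions introduced for illustration, with `k = 1.5` matching the multiplier used in the slides.

```python
def fences(f_lower, f_upper, k=1.5):
    """Return (fourth-spread, lower fence, upper fence)."""
    d_f = f_upper - f_lower
    return d_f, f_lower - k * d_f, f_upper + k * d_f

# Fourths from the worked example: F_L = 16.5 and F_U = 23.5.
d_f, lower_fence, upper_fence = fences(16.5, 23.5)
print(d_f, lower_fence, upper_fence)  # 7.0 6.0 34.0
```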

Steps in Constructing the Boxplot (cont'd)

Step 4: Excluding outliers, identify the two data values that are closest to the lower fence and the upper fence, respectively. Draw a line from each of these values to the nearer side of the rectangle. We sometimes refer to these lines as the whiskers.

Step 5: Plot each outlier at its corresponding value, using an x-mark or any other distinctive mark. We consider an outlying observation that is less than $F_L - 3d_F$ or greater than $F_U + 3d_F$ to be an extreme outlier. We sometimes distinguish extreme outliers from other outliers by placing a circle at their actual location instead of an x.

[Figure: boxplot of the example data]
Lower fence = 6, $F_L$ = 16.5, Med = 22, $F_U$ = 23.5, Upper fence = 34
Outlier: 1
Closest data point to the lower fence that is not an outlier: 10
Closest data point to the upper fence that is not an outlier: 28

Remarks (page 431)

The height of the rectangle is usually arbitrary and has no specific meaning. If several boxplots appear together, however, the height is sometimes made proportional to the different sample sizes. This is rarely done because an accurate representation is very difficult to achieve.

Different statistical software packages present varying versions of the boxplot. For example, instead of plotting the sides of the rectangle at the lower and upper fourths, some plot them at related summary measures, the 1st and 3rd quartiles respectively, and compute the fences as follows:

\[
\text{lower fence} = Q_1 - 1.5\,\mathrm{IQR}, \qquad
\text{upper fence} = Q_3 + 1.5\,\mathrm{IQR}
\]
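Steps 4 and 5 amount to sorting the observations into three groups: values inside the fences (whose smallest and largest members end the whiskers), ordinary outliers, and extreme outliers. The sketch below shows one way to do this. The function name is an assumption, and the dataset is hypothetical: it was constructed to agree with the figure's summary values (outlier 1, whisker ends 10 and 28), not copied from the slides.

```python
def whiskers_and_outliers(values, f_lower, f_upper):
    """Classify values using the 1.5 and 3 fourth-spread cutoffs."""
    d_f = f_upper - f_lower
    inner_lo, inner_hi = f_lower - 1.5 * d_f, f_upper + 1.5 * d_f  # fences
    outer_lo, outer_hi = f_lower - 3 * d_f, f_upper + 3 * d_f      # extreme cutoffs
    inside = [x for x in values if inner_lo <= x <= inner_hi]
    outliers = [x for x in values if x < inner_lo or x > inner_hi]
    return {
        # Step 4: whiskers end at the non-outliers closest to each fence.
        "whisker_ends": (min(inside), max(inside)),
        # Step 5: outliers within the extreme cutoffs are plotted with an x ...
        "outliers": [x for x in outliers if outer_lo <= x <= outer_hi],
        # ... and those beyond the cutoffs are extreme outliers (circle).
        "extreme_outliers": [x for x in outliers if x < outer_lo or x > outer_hi],
    }

# Hypothetical data (n = 15) consistent with the figure's summary values.
data = [1, 10, 12, 15, 18, 19, 20, 22, 22, 23, 23, 24, 25, 26, 28]
result = whiskers_and_outliers(data, f_lower=16.5, f_upper=23.5)
print(result)
```

With these fourths the extreme cutoffs are -4.5 and 44.5, so the value 1 is an ordinary outlier rather than an extreme one.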

Interpreting the Boxplot (page 433)

1. The line inside the rectangle shows the location of the median, our measure of central tendency.
2. The sides of the rectangle, which are plotted either at the fourths or at the quartiles, indicate where the middle 50% of the observations lie.
3. The length of the rectangle represents the magnitude of either the fourth-spread or the interquartile range, our measure of dispersion.
4. The position of the line inside the rectangle relative to its sides indicates the degree and direction of asymmetry, because it shows the respective distances of the median from the lower and upper fourths. A line in the middle of the rectangle indicates that the distribution is symmetric; a line closer to the lower fourth (or 1st quartile) indicates that the distribution is skewed to the right; and a line closer to the upper fourth (or 3rd quartile) indicates that the distribution is skewed to the left.
5. If there are no outliers, then the ends of the whiskers indicate the respective values of both extremes; if there are outliers, then the farthest outlier is our extreme.
6. The outliers are clearly identified by the distinctive marks used to plot them.

[Figure: three boxplots illustrating a symmetric distribution, a negatively skewed distribution, and a positively skewed distribution]
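Item 4 can be turned into a rough numerical check by comparing the distances from the median to each fourth. The sketch below is only a heuristic reading of the box, not a formal test of skewness; the function name and the `tol` parameter (how close the two distances must be to count as symmetric) are assumptions made for illustration.

```python
def box_shape(f_lower, median, f_upper, tol=0.0):
    """Describe symmetry from the median's position inside the box."""
    lower_gap = median - f_lower   # distance from the lower fourth to the median
    upper_gap = f_upper - median   # distance from the median to the upper fourth
    if abs(lower_gap - upper_gap) <= tol:
        return "roughly symmetric"
    # Median closer to the lower fourth -> longer upper part -> skewed right.
    return "skewed right" if lower_gap < upper_gap else "skewed left"

# Values from the worked boxplot example: F_L = 16.5, Med = 22, F_U = 23.5.
print(box_shape(16.5, 22, 23.5))
```

For the worked example the median sits much closer to the upper fourth, so the box suggests a distribution skewed to the left.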

Comparing Distributions using the Boxplot

[Figure: side-by-side boxplots of Total Financial Resources Generated by Major Geographic Region (Luzon, Visayas, Mindanao), in million pesos]

Assignment

Use the data on page 434, Exercise 3, on the illiteracy rate among the male and female populations, 15 years of age and over, in Asia in [year missing].

1. Construct a split stem-and-leaf display of the illiteracy rate among the male population. Let the common division point be between the tens and ones digits. Split each stem into two lines.
2. Compute the median, lower fourth, upper fourth, fourth-spread, lower fence, and upper fence of the illiteracy rate of:
   i. the male population
   ii. the female population
3. Use the values computed in no. 2 to draw the boxplot of the illiteracy rate among the male population. On the same plotting area, draw the boxplot of the illiteracy rate among the female population.
