Exploring and Understanding Data Using R.

Garry Parrish
6 years ago

Loading the data into an R data frame:

> variable <- read.csv(file path, stringsAsFactors = BOOLEAN, header = BOOLEAN)
> usedcars <- read.csv(..., stringsAsFactors = FALSE)

The stringsAsFactors option controls how text columns are stored in R, which matters especially for nominal values. With TRUE, character columns are converted to factors; with FALSE, they stay character vectors. The default value was TRUE before R 4.0.0 and is FALSE from R 4.0.0 onward. Note that R is case sensitive, so the argument must be spelled stringsAsFactors.

The header option states whether or not the first line of the data file contains column names. Its default value for read.csv() is TRUE.

Exploring the structure of data:

One of the first questions to ask is how the data is organized. The str() function provides a method for displaying the structure of a data frame, or of any R data structure, including vectors and lists.

> str(variable)
> str(usedcars)

For a data frame, this function reports the number of data items (examples) and the number of variables (features), followed by the name, type and first few values of each variable.

Exploring numeric variables:

To investigate the numeric variables in the data set, we employ a commonly used set of measurements for describing such data, using the summary() function. The summary() function displays several common statistics:

> summary(variable$feature)
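The loading, str() and summary() steps above can be tried without the original file. The following is a minimal sketch that writes a small made-up CSV to a temporary file; the column names and values are assumptions for illustration only and are not the usedcars data.

```r
# Hypothetical data written to a temporary CSV file (not the usedcars file).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "model,price,mileage,color",
  "SE,14000,36000,Silver",
  "SEL,21000,11000,Black",
  "SES,16000,44000,Red",
  "SE,9000,95000,White"
), csv_path)

# Keep text columns as character vectors.
cars_chr <- read.csv(csv_path, stringsAsFactors = FALSE, header = TRUE)

# Convert text columns to factors instead.
cars_fct <- read.csv(csv_path, stringsAsFactors = TRUE, header = TRUE)

# str() shows the number of rows and columns and the type of each column.
str(cars_chr)   # model and color are listed as chr
str(cars_fct)   # model and color are listed as Factor

# Summary statistics for a single numeric feature.
summary(cars_chr$price)
```

Comparing the two str() outputs shows the effect of stringsAsFactors directly, independent of which R version supplies the default.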

> summary(usedcars$price)

If we want summary statistics for several numeric variables at the same time, we use the c() function, which combines the column names into a vector.

> summary(dataset[c("f1", "f2", ...)])
> summary(usedcars[c("price", "mileage")])

Measuring the central tendency: mean and median

Measures of central tendency are a class of statistics used to identify a value that falls in the middle of a set of data. The most common measure is the average. When something is deemed average, it means that it falls between the extreme ends of the scale. In statistics, the average is also known as the mean, a measurement defined as the sum of all values divided by the number of values.

R provides a mean() function, which calculates the mean of a vector of numbers. The values must be combined into a single vector with c(); if they are passed as separate arguments, only the first is treated as data.

> mean(c(val1, val2, ..., valn))
> mean(c(36000, 44000, 56000))

Instead of listing values, we can give the name of the feature for which we want the mean:

> mean(dataset$feature)

Although the mean is the most commonly cited statistic for measuring the center of a dataset, it is not always the most appropriate. Another common measure of central tendency is the median, which is the value that occurs halfway through an ordered list of values. R provides a median() function, which is used in the same way.

> median(c(val1, val2, ..., valn))
> median(c(36000, 44000, 56000))

The middle value is 44000.
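As a small illustration of the calls above, the sketch below uses the three example values from the text plus a made-up data frame (the mileage figures are assumptions, not values from the usedcars data).

```r
# The three example values used in the text.
values <- c(36000, 44000, 56000)

value_mean   <- mean(values)     # sum of the values divided by 3
value_median <- median(values)   # middle value of the ordered list

print(value_mean)
print(value_median)

# Hypothetical data frame to show summary() on several columns at once.
cars <- data.frame(
  price   = values,
  mileage = c(12000, 30000, 8000)
)
summary(cars[c("price", "mileage")])
```

Wrapping the values in c() matters: mean() and median() take the data as their first argument only.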

If the dataset has an even number of values, there is no middle value. In this case, the median is commonly calculated as the average of the two values at the center of the ordered list.

It may appear that the median and mean are very similar measures. So why have two measures of central tendency? The reason is that the mean and the median are affected differently by values falling at the far ends of the range. The mean is highly sensitive to outliers, that is, values that are atypically high or low relative to the majority of the data. The median is much less sensitive to outliers.

Measuring spread: quartiles

Measuring the mean and median of our data provides one way to quickly summarize values. But the measures of center tell us little about whether or not there is diversity in the measurements. To measure diversity, we need another type of summary statistic that is concerned with the spread of the data, or how tightly or loosely the values are spaced. Knowing the spread provides a sense of the data's highs and lows and of whether most values are like the mean and median.

The five-number summary is a set of five statistics that depicts the spread of a dataset:

1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
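The different sensitivity of the mean and the median to outliers can be seen with a few made-up numbers. This is only an illustrative sketch; the vectors below are assumptions chosen to make the arithmetic easy to follow.

```r
# Four values without an outlier: an even count, so the median is the
# average of the two central values (20 and 30).
typical <- c(10, 20, 30, 40)

# The same values with the largest one replaced by an extreme value.
with_outlier <- c(10, 20, 30, 400)

mean(typical)          # 25
median(typical)        # 25

mean(with_outlier)     # 115, pulled upward by the outlier
median(with_outlier)   # 25, unchanged

# summary() reports the five-number summary together with the mean.
summary(with_outlier)
```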

The minimum and maximum are the most extreme values found in the dataset, indicating the smallest and largest values respectively. R provides the min() and max() functions to calculate these values.

The span between the minimum and maximum value is known as the range. In R, the range() function returns both the minimum and the maximum value. Combining range() with the difference function diff() returns the range of the data as a single number.

> range(dataset$feature)
> range(usedcars$price)
> diff(range(dataset$feature))
> diff(range(usedcars$price))

The first quartile, Q1, refers to the value below which one quarter of the values are found. The third quartile, Q3, refers to the value above which one quarter of the values are found. Along with the median (Q2), the quartiles divide a dataset into four portions, each with the same number of values.

Quartiles are a special case of statistics called quantiles, which are numbers that divide the data into equally sized quantities. In addition to quartiles, there are tertiles (3 parts), quintiles (5 parts), deciles (10 parts) and percentiles (100 parts). Percentiles are often used for ranking.

The middle 50% of the data, between Q1 and Q3, is of interest because it is a simple measure of spread. The difference between Q1 and Q3 is known as the interquartile range (IQR) and can be calculated using the IQR() function.

The quantile() function provides a tool for identifying quantiles for a set of values. By default, the quantile() function returns the five-number summary, just displayed in terms of quantiles.

> quantile(dataset$feature)
> quantile(usedcars$price)
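A short sketch of the spread functions named above, run on a made-up vector of nine prices (an assumption for illustration, not the usedcars data). Nine values were chosen so that the default quartiles fall exactly on observed values.

```r
# Hypothetical prices, already in increasing order.
prices <- c(4000, 7000, 10000, 11000, 13000, 14000, 15000, 18000, 22000)

price_range <- range(prices)       # minimum and maximum
price_span  <- diff(price_range)   # maximum minus minimum

# Default quantile() call: 0%, 25%, 50%, 75% and 100%.
price_quantiles <- quantile(prices)

# Interquartile range: Q3 minus Q1.
price_iqr <- IQR(prices)

print(price_range)
print(price_span)
print(price_quantiles)
print(price_iqr)
```

With these values, Q1, the median and Q3 are the third, fifth and seventh entries of the ordered vector.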

The output is labeled 0%, 25%, 50%, 75% and 100%.

If we specify additional probabilities (the probs parameter) using a vector, we can obtain arbitrary quantiles such as 1% or 99%:

> quantile(usedcars$price, probs = c(0.01, 0.99))

Measuring spread: variance and standard deviation

A distribution allows us to characterize a large dataset using a small number of parameters. A normal distribution can be defined with just two values: center and spread. The center of a normal distribution is defined by its mean, and its spread is measured by a statistic called the standard deviation.

In order to calculate the standard deviation, we must first obtain the variance, which is defined as the average of the squared differences between each value and the mean value. The standard deviation is the square root of the variance. Note that R's var() and sd() functions compute the sample versions, which divide the sum of squared differences by n - 1 rather than n.

To obtain the variance and standard deviation in R, the var() and sd() functions can be used:

> var(dataset$feature)
> var(usedcars$price)
> sd(dataset$feature)
> sd(usedcars$price)

When interpreting the variance, larger numbers indicate that the data are spread more widely around the mean. The standard deviation indicates, on average, how much each value differs from the mean.
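To make the definition concrete, the sketch below computes the variance by hand and compares it with var() and sd(). The price vector is made up for illustration. The manual version divides by n - 1, which is what R's var() does; dividing by n instead gives the population variance described in the prose.

```r
# Hypothetical prices (not the usedcars data).
prices <- c(4000, 7000, 10000, 11000, 13000, 14000, 15000, 18000, 22000)

# Arbitrary quantiles via the probs parameter.
quantile(prices, probs = c(0.01, 0.99))

# Variance by hand: squared differences from the mean, summed,
# then divided by n - 1 (sample variance).
n <- length(prices)
squared_diffs <- (prices - mean(prices))^2
manual_var <- sum(squared_diffs) / (n - 1)

# Built-in equivalents.
builtin_var <- var(prices)
builtin_sd  <- sd(prices)

print(manual_var)
print(builtin_var)
print(builtin_sd)   # square root of the variance
```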

Exploring categorical variables:

In contrast to numeric data, categorical data is examined using tables rather than summary statistics. A table that represents a single categorical variable is known as a one-way table. The table() function can be used to generate one-way tables for our data:

> table(dataset$feature)
> table(usedcars$year)

The table output lists the categories of the nominal variable and a count of the number of values (the frequency) in each category.

R can also calculate table proportions, using the prop.table() command on a table produced by the table() command.

> model_table <- table(usedcars$model)
> prop.table(model_table)

This will produce the proportion of each model in the dataset. The result can be combined with other R functions to transform the output.

> model_table <- table(usedcars$model)
> model_pct <- prop.table(model_table) * 100
> round(model_pct, digits = 1)

This will give the proportions as percentages, rounded to one decimal place.

Measuring the central tendency: mode

In statistics, the mode of a feature is the value occurring most often. Like the mean and median, the mode is another measure of central tendency. It is often used for categorical data, since the mean and median are not defined for nominal variables.

There is no built-in function in R that returns the statistical mode of a categorical feature; R's mode() function returns the storage type of an object instead. To find the statistical mode, simply look at the table output for the category with the greatest number of values.
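The one-way table, the percentages and the mode can be sketched on a small made-up vector of model names (an assumption for illustration). Since base R has no function for the statistical mode, the last step picks the category with the largest count using which.max(), which returns only the first such category if there is a tie.

```r
# Hypothetical model labels (not the usedcars data).
models <- c("SE", "SES", "SE", "SEL", "SE", "SES", "SE", "SEL", "SES", "SE")

# One-way table of counts; categories are sorted alphabetically.
model_table <- table(models)

# Proportions converted to percentages with one decimal place.
model_pct <- round(prop.table(model_table) * 100, digits = 1)

# Statistical mode: the name of the category with the highest count.
model_mode <- names(model_table)[which.max(model_table)]

print(model_table)
print(model_pct)
print(model_mode)
```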

The mode is used in a qualitative sense to gain an understanding of important values in a dataset. Yet it is advisable not to put too much emphasis on the mode alone. It is best to think about the mode in relation to the other categories. Is there a category that dominates all others, or are there several? From there, we may examine what the most common values tell us about the variable being measured.

Exploring relationships between variables

Bivariate relationships are relationships between two variables. Examining them can answer questions such as:

1) Does the price data imply that we are only examining economy-class cars?
2) How does the price change in relation to the mileage?

Relationships of more than two variables are called multivariate relationships.

Visualizing relationships: scatterplots

A scatterplot is a diagram that visualizes a bivariate relationship. It is a two-dimensional figure in which dots are drawn on a coordinate plane, using the values of one feature to provide the horizontal x coordinates and the values of another feature to provide the vertical y coordinates. Patterns in the placement of the dots reveal underlying associations between the two features.

To answer our question about the relationship between price and mileage, we will examine a scatterplot. We will use the plot() function along with the main, xlab and ylab parameters.

To use plot(), we need to specify x and y vectors containing the values used to position the dots on the figure. The convention is that the y variable is the one that is presumed to depend on the other (and is thus known as the dependent variable). Since the odometer reading cannot be changed by the seller, it cannot be dependent on the car's price. Instead, we will hypothesize that the car's price depends on the odometer mileage, so we will use the price as y, the dependent variable. The full command to create our scatterplot is:

> plot(x = usedcars$mileage, y = usedcars$price,
       main = "Scatterplot of Price vs. Mileage",
       xlab = "Used Car Odometer (mi.)",
       ylab = "Used Car Price ($)")

Using the scatterplot, we notice a clear relationship between the price of a used car and the odometer reading: the price of a car gets lower as the mileage increases. A fitted line can be added to the plot:

> abline(lm(usedcars$price ~ usedcars$mileage), col = "red")

The lm() function fits a linear model of price on mileage, and abline() draws the fitted line on the existing plot.

The absence of many points with both high price and high mileage provides evidence to support the conclusion that our data is unlikely to include any high-mileage luxury cars.

The relationship between mileage and price is known as a negative association, because it forms a pattern of dots in a line sloping downward. A positive association would form a line sloping upward. A flat line, or seemingly randomly scattered dots, is evidence that the two variables are not associated at all. The strength of a linear association between two variables is measured by a statistic known as correlation.
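The scatterplot, the fitted line and the correlation can be sketched on made-up data as follows. The mileage and price vectors are assumptions chosen to show a negative association, and cor() is base R's correlation function, which the text mentions only by concept.

```r
# Hypothetical mileage and price values (not the usedcars data).
mileage <- c(10000, 30000, 50000, 70000, 90000, 110000)
price   <- c(21000, 17500, 15000, 13000, 10500, 8000)

# Scatterplot with price as the dependent (y) variable.
plot(x = mileage, y = price,
     main = "Scatterplot of Price vs. Mileage",
     xlab = "Used Car Odometer (mi.)",
     ylab = "Used Car Price ($)")

# Fit a linear model of price on mileage and draw it in red.
fit <- lm(price ~ mileage)
abline(fit, col = "red")

# A negative slope and a negative correlation both indicate
# a negative association.
slope <- coef(fit)[["mileage"]]
r <- cor(mileage, price)

print(slope)
print(r)
```

When run non-interactively, R sends the plot to its default graphics device (typically a PDF file) rather than to a window.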

Examining nominal relationships: two-way cross-tabulations

To examine the relationship between two nominal variables, a two-way cross-tabulation is used (also known as a crosstab or a contingency table). A cross-tabulation is similar to a scatterplot in that it allows you to examine how the values of one variable vary with the values of another. The format is a table in which the rows are the levels of one variable and the columns are the levels of the other. The counts in each of the table cells indicate the number of values falling into a particular row and column combination.

To answer a question about a possible relationship between model and color, we will examine a crosstab. There are several commands that produce a two-way table in R. The table() function can be used for two-way cross tables, but we will use the CrossTable() function in the gmodels package.

- Install the gmodels package using the drop-down menu or install.packages("gmodels").

After the package installs, type library(gmodels) to load the package. You will need to load the package during each R session in which you plan on using the CrossTable() function.

We would like to group all colors that we deem conservative under a single name, so that the data is easier to analyze. For that, we will divide the nine colors into two groups. The first group will include the conservative colors (Black, Grey, Silver and White). The second group will include the rest of the colors (Blue, Green, Gold, Red and Yellow). We will create a Boolean variable that indicates, for each car, whether its color attribute is in the first group (TRUE) or among the non-conservative colors (FALSE). The comparison is case sensitive, so the strings must match the spelling and capitalization used in the data exactly.

> usedcars$conservative <- usedcars$color %in% c("Black", "Grey", "Silver", "White")
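The grouping step can be tried on a small made-up vector of colors (an assumption for illustration). One lowercase entry is included on purpose to show that the comparison performed by %in% is case sensitive.

```r
# Hypothetical colors; "black" is deliberately lowercase.
colors <- c("Black", "Red", "Silver", "black", "Yellow", "White", "Grey")

# TRUE where the color is one of the conservative colors, FALSE otherwise.
conservative <- colors %in% c("Black", "Grey", "Silver", "White")

print(conservative)

# Counts of FALSE and TRUE values.
table(conservative)
```

The lowercase "black" is flagged FALSE because it does not match "Black" exactly.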

The %in% operator returns TRUE or FALSE for each value in the vector on the left-hand side of the operator, depending on whether the value is found in the vector on the right-hand side.

Applying the table() function to the newly created variable returns the counts of TRUE and FALSE values:

> table(usedcars$conservative)

The cross-tabulation will show how the proportion of conservatively colored cars varies by model. Since we are assuming that the model of the car influences the choice of color, we'll treat conservative as the dependent (y) variable. The CrossTable() command is therefore:

> CrossTable(x = usedcars$model, y = usedcars$conservative)

The output is a table headed "Total Observations in Table: 150", with one row for each level of usedcars$model (SE, SEL, SES) plus a Column Total row, and with columns for FALSE, TRUE and Row Total.
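Since the text notes that table() can also build two-way tables, here is a minimal base R sketch that does not require gmodels. It is not the CrossTable() output, which prints more per cell; the data frame is made up for illustration.

```r
# Hypothetical cars with a model and a conservative-color flag.
cars <- data.frame(
  model = c("SE", "SE", "SE", "SE", "SEL", "SEL", "SES", "SES", "SES", "SES"),
  conservative = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Two-way table: rows are models, columns are FALSE/TRUE.
counts <- table(cars$model, cars$conservative)

# Row proportions: share of FALSE/TRUE within each model.
row_props <- prop.table(counts, margin = 1)

print(counts)
print(round(row_props, 2))
```

Using margin = 1 makes each row sum to 1, which corresponds to the row proportions discussed for the cross-tabulation.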

The values of interest are the row proportions of conservative cars (the TRUE column) for each model: SE 56%, SEL 48%, SES 57%. These percentages are close to one another, so the table alone is not very conclusive.

The CrossTable() function is also described as Cross Tabulation with Tests for Factor Independence, in that it can test whether nominal attributes are independent or not. The most common test is Pearson's Chi-squared test for independence between two variables. This test measures how likely it is that the differences in cell counts in the table are due to chance alone. If the probability is very low, it provides strong evidence that the two variables are associated.

To obtain the Chi-squared test results, we add the parameter chisq = TRUE when calling the CrossTable() function. R is case sensitive, so the value must be written TRUE, not true.

> CrossTable(x = usedcars$model, y = usedcars$conservative, chisq = TRUE)

In our case, the probability is about 62%, which means that the variation in cell counts between FALSE and TRUE could easily be due to chance alone. The test therefore gives no evidence that car model and car color are associated.
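The same kind of test is available in base R through chisq.test(), shown here as a sketch on a made-up table of counts. The counts are assumptions for illustration and are not the values from the usedcars cross-tabulation, so the resulting probability will differ from the 62% quoted above.

```r
# Hypothetical counts: rows are models, columns are the conservative flag.
counts <- matrix(
  c(30, 10, 20,    # FALSE column for SE, SEL, SES
    40, 12, 28),   # TRUE column for SE, SEL, SES
  nrow = 3,
  dimnames = list(model = c("SE", "SEL", "SES"),
                  conservative = c("FALSE", "TRUE"))
)

# Pearson's Chi-squared test of independence.
test <- chisq.test(counts)

print(test)

# A large p-value means the differences in counts are easily explained
# by chance; a very small one is evidence of association.
p_value <- test$p.value
print(p_value)
```

For a table with 3 rows and 2 columns, the test has (3 - 1) x (2 - 1) = 2 degrees of freedom.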
