Exploring and Understanding Data Using R.
- Garry Parrish
- 6 years ago
Loading the data into an R data frame:

> variable <- read.csv("file path", stringsAsFactors = BOOLEAN, header = BOOLEAN)
> usedcars <- read.csv(., stringsAsFactors = FALSE)

Argument names in R are case-sensitive, so the option must be spelled stringsAsFactors. Setting it to TRUE converts character columns to factors, which is how R stores nominal values; setting it to FALSE keeps them as plain character strings. The default was TRUE before R 4.0.0 and has been FALSE since. The header option states whether the first line of the data file holds the column names. Its default value for read.csv() is TRUE.

Exploring the structure of data:

One of the first questions to ask is how the data is organized. The str() function provides a method for displaying the structure of a data frame, or of any R data structure, including vectors and lists.

> str(variable)
> str(usedcars)

This function reports the number of data items (examples) and the number of variables (features) in the data frame.

Exploring numeric variables:

To investigate the numeric variables in the data set, we employ a commonly used set of measurements for describing such data, using the summary() function. The summary() function displays several common statistics:

> summary(variable$feature)
> summary(usedcars$price)

If we want summary statistics for several numeric variables at the same time, we use the c() function to build a vector of column names.

> summary(dataset[c("f1", "f2", ...)])
> summary(usedcars[c("price", "mileage")])

Measuring the central tendency - mean and median:

Measures of central tendency are a class of statistics used to identify a value that falls in the middle of a set of data. The most common measure is the average. When something is deemed average, it means that it falls between the extreme ends of the scale. In statistics, the average is also known as the mean, a measurement defined as the sum of all values divided by the number of values. R provides a mean() function, which calculates the mean of a vector of numbers. The values must be wrapped in c(), because mean() treats only its first argument as data.

> mean(c(val1, val2, ..., valn))
> mean(c(36000, 44000, 56000))

Instead of listing values, we can give the name of the feature for which we want the mean, as in

> mean(dataset$feature)

Although the mean is the most commonly cited statistic for measuring the center of a dataset, it is not always the most appropriate. Another common measure of central tendency is the median, which is the value that occurs halfway through an ordered list of values. R provides a median() function, which is applied in the same way.

> median(c(val1, val2, ..., valn))
> median(c(36000, 44000, 56000))

The middle value is 44000.
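The loading, structure and summary steps described so far can be tried end to end with a small sketch. The used-car file itself is not available here, so the six rows below are made-up values written to a temporary CSV. The column names price, mileage and model follow the text; the values, the sample_cars name and the temporary file are assumptions for illustration only.

```r
# Hypothetical stand-in for the used-car file: six made-up rows.
sample_cars <- data.frame(
  price   = c(36000, 44000, 56000, 12000, 15000, 21000),
  mileage = c(8000, 12000, 5000, 98000, 76000, 54000),
  model   = c("SEL", "SES", "SEL", "SE", "SE", "SES")
)

# Write the rows to a temporary CSV so read.csv() has something to load.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_cars, csv_path, row.names = FALSE)

# header = TRUE is the read.csv() default; stringsAsFactors = FALSE keeps
# text columns as character vectors instead of converting them to factors.
usedcars <- read.csv(csv_path, stringsAsFactors = FALSE, header = TRUE)

# Structure: number of rows, number of columns, and each column's type.
str(usedcars)

# Summary statistics for one numeric column, then for two at once.
print(summary(usedcars$price))
print(summary(usedcars[c("price", "mileage")]))

# mean() and median() each take a single vector, hence the c().
first_three_mean   <- mean(c(36000, 44000, 56000))
first_three_median <- median(c(36000, 44000, 56000))
print(first_three_mean)
print(first_three_median)
```

Writing mean(36000, 44000, 56000) without c() runs without error but only averages the first number, which is why the vector form is used throughout.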
If the dataset has an even number of values, there is no middle value. In this case, the median is commonly calculated as the average of the two values at the center of the ordered list.

It may appear that the median and mean are very similar measures. So why have two measures of central tendency? The reason is that the mean and the median are affected differently by values falling at the far ends of the range. The mean is highly sensitive to outliers, that is, values that are atypically high or low relative to the majority of the data. The median is much less sensitive to outliers.

Measuring spread - quartiles:

Measuring the mean and median of our data provides one way to quickly summarize values. But the measures of center tell us little about whether or not there is diversity in the measurements. To measure diversity, we need another type of summary statistic, one concerned with the spread of the data, or how tightly or loosely the values are spaced. Knowing the spread provides a sense of the data's highs and lows and whether most values are like the mean and median.

The five-number summary is a set of five statistics that depicts the spread of a dataset:

1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
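The difference in outlier sensitivity, and the even-count rule for the median, can be checked with a short sketch. The prices below are made up for illustration, and the variable names are hypothetical.

```r
# Made-up prices; the value appended below is deliberately extreme.
prices <- c(12000, 13500, 14000, 15500, 16000, 17000)
prices_with_outlier <- c(prices, 150000)

# Even number of values: the median is the average of the two middle ones,
# here (14000 + 15500) / 2.
median_even <- median(prices)

# Adding the outlier moves the mean a lot and the median only a little.
mean_before  <- mean(prices)
mean_after   <- mean(prices_with_outlier)
median_after <- median(prices_with_outlier)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_even, median_after = median_after))

# summary() prints the five-number summary, plus the mean.
print(summary(prices))
```

One extreme price more than doubles the mean, while the median shifts only to the next value in the ordered list.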
The minimum and maximum are the most extreme values found in the dataset, indicating the smallest and largest values respectively. R provides the min() and max() functions to calculate these values. The span between the minimum and maximum value is known as the range. In R, the range() function returns both the minimum and maximum value. Combining range() with the difference function diff() returns the range as a single number.

> range(dataset$feature)
> range(usedcars$price)
> diff(range(.))
> diff(range(usedcars$price))

The first quartile, Q1, refers to the value below which one quarter of the values are found. The third quartile, Q3, refers to the value above which one quarter of the values are found. Along with the median (Q2), the quartiles divide a dataset into four portions, each with the same number of values.

Quartiles are a special case of statistics called quantiles, which are numbers that divide the data into equally sized quantities. In addition to quartiles, there are tertiles (3 parts), quintiles (5 parts), deciles (10 parts) and percentiles (100 parts). Percentiles are often used for ranking.

The middle 50% of the data, between Q1 and Q3, is of interest because it is a simple measure of spread. The difference between Q1 and Q3 is known as the interquartile range (IQR) and can be calculated using the IQR() function.

The quantile() function provides a tool for identifying quantiles for a set of values. By default, the quantile() function returns the five-number summary, just displayed in terms of quantiles.

> quantile(dataset$feature)
> quantile(usedcars$price)
The output is labeled with the quantiles 0%, 25%, 50%, 75% and 100%.

If we specify additional probabilities (the probs parameter) using a vector, we can obtain arbitrary quantiles such as 1% or 99%.

> quantile(usedcars$price, probs = c(0.01, 0.99))

Measuring spread - variance and standard deviation:

A distribution allows us to characterize a large dataset using a small number of parameters. A normal distribution can be defined with just two values: center and spread. The center of a normal distribution is defined by its mean, and its spread is measured by a statistic called the standard deviation. In order to calculate the standard deviation, we must first obtain the variance, which is defined as the average of the squared differences between each value and the mean value. The standard deviation is the square root of the variance. To obtain the variance and standard deviation in R, the var() and sd() functions can be used. Both compute the sample versions, which divide by n - 1 rather than n.

> var(dataset$feature)
> var(usedcars$price)
> sd(dataset$feature)
> sd(usedcars$price)

When interpreting the variance, larger numbers indicate that the data are spread more widely around the mean. The standard deviation indicates, on average, how much each value differs from the mean.
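The spread measures described above (range, IQR, quantiles, variance and standard deviation) can be compared side by side in a short sketch. The prices are made up for illustration, and manual_var is a hypothetical helper value included only to show the n - 1 divisor that var() uses.

```r
# Made-up prices for illustration only.
prices <- c(12000, 13500, 14000, 15500, 16000, 17000, 21000, 36000)

price_range <- range(prices)      # minimum and maximum
price_span  <- diff(price_range)  # maximum minus minimum
price_iqr   <- IQR(prices)        # Q3 minus Q1

# Default quantiles: 0%, 25%, 50%, 75%, 100%.
price_quartiles <- quantile(prices)
# Arbitrary quantiles via the probs parameter.
price_tails <- quantile(prices, probs = c(0.01, 0.99))

# var() and sd() use the sample formulas (divisor n - 1).
price_var  <- var(prices)
price_sd   <- sd(prices)
manual_var <- sum((prices - mean(prices))^2) / (length(prices) - 1)

print(price_range)
print(price_span)
print(price_quartiles)
print(price_tails)
print(c(variance = price_var, manual = manual_var, sd = price_sd))
```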
Exploring categorical variables:

In contrast to numeric data, categorical data is examined using tables rather than summary statistics. A table that represents a single categorical variable is known as a one-way table. The table() function can be used to generate one-way tables for our data.

> table(dataset$feature)
> table(usedcars$year)

The table output lists the categories of the nominal variable and a count of the number of values (the frequency) in each. R can also calculate table proportions by applying the prop.table() command to a table produced by the table() command.

> model_table <- table(usedcars$model)
> prop.table(model_table)

This produces the proportion of each model in the dataset. The result can be combined with other R functions to transform the output.

> model_table <- table(usedcars$model)
> model_pct <- prop.table(model_table) * 100
> round(model_pct, digits = 1)

This gives the proportions as percentages, rounded to one decimal place.

Measuring the central tendency - mode:

In statistics, the mode of a feature is the value occurring most often. Like the mean and median, the mode is another measure of central tendency. It is often used for categorical data, since the mean and median are not defined for nominal variables. R has no built-in function that returns the statistical mode of a categorical feature; its mode() function reports how a variable is stored instead. To find the statistical mode, simply look at the category with the greatest number of values in the table() output.
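The one-way table, the percentages and the "look for the largest count" rule for the mode can be sketched as follows. The model labels are made up for illustration, and picking the mode with which.max() is one possible approach, not something the text prescribes.

```r
# Made-up model labels for illustration.
models <- c("SE", "SES", "SE", "SEL", "SE", "SES", "SEL", "SE", "SES", "SE")

# One-way table: a count per category.
model_table <- table(models)

# Proportions converted to percentages and rounded to one decimal place.
model_pct <- round(prop.table(model_table) * 100, digits = 1)

# The statistical mode is the category with the largest count.
# Caveat: which.max() returns only the first maximum, so ties go unreported.
model_mode <- names(model_table)[which.max(model_table)]

print(model_table)
print(model_pct)
print(model_mode)
```

Reading the full table alongside the mode shows whether one category dominates or several are close.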
The mode is used in a qualitative sense to gain an understanding of important values in a dataset. Yet it is advisable not to put too much emphasis on the mode alone. It is best to think about the mode in relation to the other categories. Is there a category that dominates all others, or are there several? From there, we may examine what the most common values tell us about the variable being measured.

Exploring relationships between variables:

Bivariate relationships are relationships between two variables. Examining them can answer questions such as:

1) Does the price data imply that we are only examining economy-class cars?
2) How does the price change in relation to the mileage?

Relationships of more than two variables are called multivariate relationships.

Visualizing relationships - scatterplots:

A scatterplot is a diagram that visualizes a bivariate relationship. It is a two-dimensional figure in which dots are drawn on a coordinate plane, using the values of one feature to provide the horizontal x coordinates and the values of another feature to provide the vertical y coordinates. Patterns in the placement of dots reveal underlying associations between the two features.

To answer our question about the relationship between price and mileage, we will examine a scatterplot. We will use the plot() function along with the main, xlab and ylab parameters.
To use plot(), we need to specify x and y vectors containing the values used to position the dots on the figure. By convention, the y variable is the one that is presumed to depend on the other (and is thus known as the dependent variable). Since the odometer reading cannot be changed by the seller, it cannot depend on the car's price. Instead, we will hypothesize that the car's price depends on the odometer mileage, and therefore use the price as y, the dependent variable. The full command to create our scatterplot is:

> plot(x = usedcars$mileage, y = usedcars$price,
       main = "Scatterplot of Price vs. Mileage",
       xlab = "Used Car Odometer (mi.)",
       ylab = "Used Car Price ($)")

Using the scatterplot, we notice a clear relationship between the price of a used car and the odometer reading: the price of a car gets lower as the mileage increases.

> abline(lm(usedcars$price ~ usedcars$mileage), col = "red")

The lm() function fits a linear model of price on mileage, and abline() draws the fitted line on the existing plot.

The absence of many points with both high price and high mileage provides evidence to support a conclusion that our data is unlikely to include any high-mileage luxury cars.

The relationship between mileage and price is known as a negative association, because it forms a pattern of dots in a line sloping downward. A positive association would form a line sloping upward. A flat line, or seemingly random scattered dots, is evidence that the two variables are not associated at all. The strength of a linear association between two variables is measured by a statistic known as correlation.
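The direction and strength of the association can also be read off numerically. The sketch below is an illustration on made-up mileage and price pairs: it fits the same kind of linear model with lm() and computes the correlation with base R's cor(), which the text does not call itself. It skips the plotting step so that it runs without a graphics device.

```r
# Made-up mileage/price pairs; price tends to fall as mileage rises.
mileage <- c(5000, 12000, 30000, 54000, 76000, 98000, 120000)
price   <- c(21000, 19500, 17000, 14500, 12500, 11000, 8500)

# Linear model of price on mileage, as in the abline() call in the text.
fit <- lm(price ~ mileage)

# A negative slope corresponds to a line sloping downward.
slope <- unname(coef(fit)["mileage"])

# Pearson correlation: sign gives direction, magnitude gives strength.
r <- cor(mileage, price)

print(slope)
print(r)
```

Passing fit to abline() after a plot() call would draw this fitted line over the scatterplot.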
Examining nominal relationships - two-way cross-tabulations:

To examine the relationship between two nominal variables, a two-way cross-tabulation is used (also known as a crosstab or a contingency table). A cross-tabulation is similar to a scatterplot in that it allows you to examine how the values of one variable vary with the values of another. The format is a table in which the rows are the levels of one variable and the columns are the levels of the other. The count in each table cell indicates the number of values falling into that particular row and column combination.

To answer a question about a possible relationship between model and color, we will examine a crosstab. There are several commands that produce a two-way table in R. The table() function can be used for two-way cross tables. We will use the CrossTable() function in the gmodels package.

- Download the gmodels package and install it using the drop-down menu or install.packages("gmodels").

After the package installs, type library(gmodels) to load the package. You will need to load the package during each R session in which you plan on using the CrossTable() function.

We would like to group all the colors that we deem conservative under a single label, so that the data is easier to analyze. For that, we will divide the nine colors into two groups. The first group includes the conservative colors (Black, Grey, Silver and White). The second group includes the rest of the colors (Blue, Green, Gold, Red and Yellow). We will create a Boolean variable that indicates, for each car, whether its color is in the first group (TRUE) or in the second (FALSE). Because the comparison is case-sensitive, the color names must be spelled exactly as they appear in the data.

> usedcars$conservative <- usedcars$color %in% c("Black", "Grey", "Silver", "White")
The %in% operator returns TRUE or FALSE for each value in the vector on the left-hand side of the operator, depending on whether the value is found in the vector on the right-hand side. Applying table() to the newly created variable returns the counts of TRUE and FALSE values.

> table(usedcars$conservative)

The cross-tabulation will show how the proportion of conservatively colored cars varies by model. Since we are assuming that the model of the car influences the choice of color, we'll treat conservative as the dependent (y) variable. The CrossTable() command is therefore:

> CrossTable(x = usedcars$model, y = usedcars$conservative)

The output is a table of 150 total observations, with one row per model (SE, SEL and SES), one column each for FALSE and TRUE, and row and column totals.
The values of interest are the row proportions of conservative cars (the TRUE column) for each model: SE 56%, SEL 48%, SES 57%. These percentages are close to one another and to an even split, so the table alone is not very conclusive.

The CrossTable() function is also described as Cross Tabulation with Tests for Factor Independence, in that it can test whether nominal attributes are independent or not. The most common test of independence between two variables is Pearson's Chi-squared test. This test measures how likely it is that the differences in cell counts in the table are due to chance alone. If the probability is very low, it provides strong evidence that the two variables are associated.

To obtain the Chi-squared test results, we add the parameter chisq = TRUE when calling CrossTable(). The value must be written in capitals, since R does not recognize true.

> CrossTable(x = usedcars$model, y = usedcars$conservative, chisq = TRUE)

In our case, the probability is about 62%, which means that the variation in cell counts between FALSE and TRUE could easily be due to chance alone. The data therefore gives no evidence that car model and car color are associated.
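The grouping, cross-tabulation and independence test can be tried without installing anything by using base R's table(), prop.table() and chisq.test() in place of gmodels::CrossTable(). This is a swapped-in approach, not the command used above. The twelve cars below are made up for illustration, and the names sample_cars and conservative_colors are hypothetical.

```r
# Made-up cars for illustration; the labels follow the text.
sample_cars <- data.frame(
  model = c("SE", "SE", "SE", "SE", "SEL", "SEL", "SEL",
            "SES", "SES", "SES", "SES", "SES"),
  color = c("Black", "Red", "Silver", "Blue", "White", "Green", "Grey",
            "Yellow", "Black", "Gold", "Silver", "Red"),
  stringsAsFactors = FALSE
)

# %in% is case-sensitive, so spellings must match the data exactly.
conservative_colors <- c("Black", "Grey", "Silver", "White")
sample_cars$conservative <- sample_cars$color %in% conservative_colors

# Counts of FALSE and TRUE.
print(table(sample_cars$conservative))

# Two-way table: rows are models, columns are FALSE/TRUE.
cross <- table(sample_cars$model, sample_cars$conservative)
print(cross)

# Row proportions: share of conservative colors within each model.
row_props <- prop.table(cross, margin = 1)
print(round(row_props, 2))

# Pearson's Chi-squared test of independence. With so few rows R warns
# that the approximation may be poor, so the warning is silenced here.
test <- suppressWarnings(chisq.test(cross))
print(test$p.value)
```

A large p-value here, as in the text, means the table gives no evidence of an association; it does not prove that the variables are independent.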
Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationSTA 570 Spring Lecture 5 Tuesday, Feb 1
STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row
More informationManaging and Understanding Data
Managing and Understanding Data A key early component of any machine learning project involves managing and understanding the data you have collected. Although you may not find it as gratifying as building
More informationSTA Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationSTA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationVocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.
5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table
More informationSTP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES
STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use
More informationTable of Contents (As covered from textbook)
Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression
More informationChapter 6: DESCRIPTIVE STATISTICS
Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling
More informationData Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures
More informationChapter 2 Describing, Exploring, and Comparing Data
Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative
More information15 Wyner Statistics Fall 2013
15 Wyner Statistics Fall 2013 CHAPTER THREE: CENTRAL TENDENCY AND VARIATION Summary, Terms, and Objectives The two most important aspects of a numerical data set are its central tendencies and its variation.
More informationSTA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures
STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and
More informationChapter 1. Looking at Data-Distribution
Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw
More informationFurther Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
More information3. Data Analysis and Statistics
3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual
More informationBar Charts and Frequency Distributions
Bar Charts and Frequency Distributions Use to display the distribution of categorical (nominal or ordinal) variables. For the continuous (numeric) variables, see the page Histograms, Descriptive Stats
More information2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationData can be in the form of numbers, words, measurements, observations or even just descriptions of things.
+ What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and
More informationAverages and Variation
Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus
More informationMeasures of Dispersion
Measures of Dispersion 6-3 I Will... Find measures of dispersion of sets of data. Find standard deviation and analyze normal distribution. Day 1: Dispersion Vocabulary Measures of Variation (Dispersion
More informationCHAPTER 3: Data Description
CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a
More informationMATH& 146 Lesson 8. Section 1.6 Averages and Variation
MATH& 146 Lesson 8 Section 1.6 Averages and Variation 1 Summarizing Data The distribution of a variable is the overall pattern of how often the possible values occur. For numerical variables, three summary
More informationAND NUMERICAL SUMMARIES. Chapter 2
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationMeasures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.
Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in
More informationUnderstanding and Comparing Distributions. Chapter 4
Understanding and Comparing Distributions Chapter 4 Objectives: Boxplot Calculate Outliers Comparing Distributions Timeplot The Big Picture We can answer much more interesting questions about variables
More informationMeasures of Central Tendency
Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of
More informationCreate a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different?
A frequency table is a table with two columns, one for the categories and another for the number of times each category occurs. See Example 1 on p. 247. Create a bar graph that displays the data from the
More informationMeasures of Position
Measures of Position In this section, we will learn to use fractiles. Fractiles are numbers that partition, or divide, an ordered data set into equal parts (each part has the same number of data entries).
More informationTo calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.
3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the
More informationResearch Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel
Research Methods for Business and Management Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel A Simple Example- Gym Purpose of Questionnaire- to determine the participants involvement
More informationWeek 4: Describing data and estimation
Week 4: Describing data and estimation Goals Investigate sampling error; see that larger samples have less sampling error. Visualize confidence intervals. Calculate basic summary statistics using R. Calculate
More informationChapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data
Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically
More informationLearner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display
CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &
More informationChapter 4: Analyzing Bivariate Data with Fathom
Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative
More informationThings you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.
1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.
More informationMATH NATION SECTION 9 H.M.H. RESOURCES
MATH NATION SECTION 9 H.M.H. RESOURCES SPECIAL NOTE: These resources were assembled to assist in student readiness for their upcoming Algebra 1 EOC. Although these resources have been compiled for your
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationM7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.
M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. Population: Census: Biased: Sample: The entire group of objects or individuals considered
More informationPart I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures
Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated
More information2.1: Frequency Distributions and Their Graphs
2.1: Frequency Distributions and Their Graphs Frequency Distribution - way to display data that has many entries - table that shows classes or intervals of data entries and the number of entries in each
More informationThe first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.
Instructions: You are given the following data below these instructions. Your client (Courtney) wants you to statistically analyze the data to help her reach conclusions about how well she is teaching.
More informationLecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #
Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series by Mario F. Triola Chapter 2 Summarizing and Graphing Data 2-1 Review and Preview 2-2 Frequency Distributions 2-3 Histograms
More informationDAY 52 BOX-AND-WHISKER
DAY 52 BOX-AND-WHISKER VOCABULARY The Median is the middle number of a set of data when the numbers are arranged in numerical order. The Range of a set of data is the difference between the highest and
More information1 Overview of Statistics; Essential Vocabulary
1 Overview of Statistics; Essential Vocabulary Statistics: the science of collecting, organizing, analyzing, and interpreting data in order to make decisions Population and sample Population: the entire
More informationSummarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester
Summarising Data Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 09/10/2018 Summarising Data Today we will consider Different types of data Appropriate ways to summarise these
More informationAP Statistics Summer Assignment:
AP Statistics Summer Assignment: Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this
More informationChapter 2: Descriptive Statistics
Chapter 2: Descriptive Statistics Student Learning Outcomes By the end of this chapter, you should be able to: Display data graphically and interpret graphs: stemplots, histograms and boxplots. Recognize,
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationBar Graphs and Dot Plots
CONDENSED LESSON 1.1 Bar Graphs and Dot Plots In this lesson you will interpret and create a variety of graphs find some summary values for a data set draw conclusions about a data set based on graphs
More informationPrepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.
Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good
More informationIntroduction to Geospatial Analysis
Introduction to Geospatial Analysis Introduction to Geospatial Analysis 1 Descriptive Statistics Descriptive statistics. 2 What and Why? Descriptive Statistics Quantitative description of data Why? Allow
More informationMaking Science Graphs and Interpreting Data
Making Science Graphs and Interpreting Data Eye Opener: 5 mins What do you see? What do you think? Look up terms you don t know What do Graphs Tell You? A graph is a way of expressing a relationship between
More informationBIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA
BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA Learning objectives: Getting data ready for analysis: 1) Learn several methods of exploring the
More informationCHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.
1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed
More information10.4 Measures of Central Tendency and Variation
10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode
More information10.4 Measures of Central Tendency and Variation
10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode