Univariate Statistics Summary

Similar documents
Chapter 3 - Displaying and Summarizing Quantitative Data

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

MATH NATION SECTION 9 H.M.H. RESOURCES

AND NUMERICAL SUMMARIES. Chapter 2

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Chapter 2 Describing, Exploring, and Comparing Data

Measures of Central Tendency

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

AP Statistics Prerequisite Packet

STA 570 Spring Lecture 5 Tuesday, Feb 1

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

Name Date Types of Graphs and Creating Graphs Notes

CHAPTER 2: SAMPLING AND DATA

CHAPTER 2 DESCRIPTIVE STATISTICS

1.3 Graphical Summaries of Data

DAY 52 BOX-AND-WHISKER

Averages and Variation

Box Plots. OpenStax College

Mean,Median, Mode Teacher Twins 2015

AP Statistics Summer Assignment:

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

UNIT 1A EXPLORING UNIVARIATE DATA

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Understanding Statistical Questions

Section 9: One Variable Statistics

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

No. of blue jelly beans No. of bags

Middle Years Data Analysis Display Methods

2.1: Frequency Distributions and Their Graphs

Descriptive Statistics: Box Plot

Chapter 1. Looking at Data-Distribution

Using a percent or a letter grade allows us a very easy way to analyze our performance. Not a big deal, just something we do regularly.

How individual data points are positioned within a data set.

Processing, representing and interpreting data

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015

MAT 110 WORKSHOP. Updated Fall 2018

NOTES TO CONSIDER BEFORE ATTEMPTING EX 1A TYPES OF DATA

Frequency Distributions

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Downloaded from

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS

Chapter 6: DESCRIPTIVE STATISTICS

Homework Packet Week #3

1.2. Pictorial and Tabular Methods in Descriptive Statistics

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1

Table of Contents (As covered from textbook)

Numerical Descriptive Measures

Measures of Position

15 Wyner Statistics Fall 2013

Day 4 Percentiles and Box and Whisker.notebook. April 20, 2018

Learning Log Title: CHAPTER 7: PROPORTIONS AND PERCENTS. Date: Lesson: Chapter 7: Proportions and Percents

Measures of Dispersion

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

Date Lesson TOPIC HOMEWORK. Displaying Data WS 6.1. Measures of Central Tendency WS 6.2. Common Distributions WS 6.6. Outliers WS 6.

Exploratory Data Analysis

The Normal Distribution

Basic Statistical Terms and Definitions

Lecture Notes 3: Data summarization

6th Grade Vocabulary Mathematics Unit 2

WHOLE NUMBER AND DECIMAL OPERATIONS

Round each observation to the nearest tenth of a cent and draw a stem and leaf plot.

Measures of Dispersion

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

Chapter 5snow year.notebook March 15, 2018

Section 2-2 Frequency Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc

The main issue is that the mean and standard deviations are not accurate and should not be used in the analysis. Then what statistics should we use?

Section 6.3: Measures of Position

CHAPTER 3: Data Description

Chapter 2 Organizing and Graphing Data. 2.1 Organizing and Graphing Qualitative Data

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

MATH& 146 Lesson 10. Section 1.6 Graphing Numerical Data

8 Organizing and Displaying

Section 3.2 Measures of Central Tendency MDM4U Jensen

Chapter 2 - Graphical Summaries of Data

LESSON 3: CENTRAL TENDENCY

This chapter will show how to organize data and then construct appropriate graphs to represent the data in a concise, easy-to-understand form.

Numerical Summaries of Data Section 14.3

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks,

Descriptive Statistics

Measures of Central Tendency

Create a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different?

Bar Graphs and Dot Plots

Descriptive Statistics

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis.

Chapter 3 Analyzing Normal Quantitative Data

MATH 112 Section 7.2: Measuring Distribution, Center, and Spread

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

Name Geometry Intro to Stats. Find the mean, median, and mode of the data set. 1. 1,6,3,9,6,8,4,4,4. Mean = Median = Mode = 2.

Transcription:

Further Maths Univariate Statistics Summary Types of Data Data can be classified as categorical or numerical. Categorical data are observations or records that are arranged according to category. For example: the favourite colour of a class of students; the mode of transport that each student uses to get to school; the rating of a TV program, either a great program, average program or poor program. Postal codes such as 3011, 3015 etc. Numerical data are observations based on counting or measurement. Calculations can be performed on numerical data. There are two main types of numerical data Discrete data, which takes only fixed values, usually whole numbers. Discrete data often arises as the result of counting items. For example: the number of siblings each student has, the number of pets a set of randomly chosen people have or the number of passengers in cars that pass an intersection. Continuous data can take any value in a given range. It is usually a measurement. For example: the weights of students in a class. The weight of each student could be measured to the nearest tenth of a kg. Weights of 84.8kg and 67.5kg would be recorded. Other examples of continuous data include the time taken to complete a task or the heights of a group of people. Exercise 1 Decide whether the following data is categorical or numerical. If numerical decide if the data is discrete or continuous. 1. 2. Page 1 of 21

3. 4. Solutions 1a Numerical-discrete b. Categorical c. Categorical d. Numerical-Continuous 2a. Numerical b. Categorical c. Numerical d. Categorical e. Categorical 3a Continuous b. Discrete c. Continuous d. Continuous e. Discrete 4. D Representing Data Data is often represented in frequency charts, column charts, histograms and dot plots. Remember: the frequency of an observation is the number of times that observation occurs. Page 2 of 21

Frequency Univariate Statistics Summary Example 1: The following frequency distribution table gives the number of days of each weather type for the month of January. Represent the information using a column chart. Weather Type 20 15 10 5 0 Hot Warm Mild Cool Weather Example 2 Represent the data in a. a histogram. b. a frequency polygon a. b. Notice that for a histogram there is no gap between the bars and the number of visits are positioned at the centre of each bar. Page 3 of 21

Example 3 Represent the data in a. a histogram b. a frequency polygon a. b. Notice that the numbers are placed at the edges of the bars along the x-axis for grouped data. Types of Average There are three types of average which can represent a set of data. An average is a measure of central tendency. The Mean The most common average is the mean. x is used to denote the mean. x = sum of all scores number of scores Example 1: The following data gives the number of pets kept in each of 10 different households. 3, 5, 4, 4, 2, 3, 0, 1, 4, 5 The mean number of pets is given by: 3 + 5 + 4 + 4 + 2 + 3 + 0 + 1 + 4 + 5 10 Page 4 of 21 = 3.1 The mean is sometimes not the best average to use as it is affected by extreme scores or outliers.

The Median The median of a set of scores is the middle score when the data are arranged in order of size. The median s position is given by n+1 2 th score, where n is the number of scores. In example 1 the median s position is given by the 10+1 th score. This is the 5.5 th score or halfway 2 between the 5 th and 6 th score, after the scores have been arranged in order of size. Arranging the data in order of size: 0, 1, 2, 3, 3, 4, 4, 4, 5, 5 Median number of pets is: 3+4 2 The median is not affected by extreme values or outliers. The Mode = 3.5 (as there are two middle scores we take their mean. The mode of a group of scores is the score that occurs most often. That is the score with the highest frequency. In example 1 the modal number of pets is 4. More than one mode is possible. Frequency Tables Example 4 The table indicates that 6 students made 0 cinema visits, 7 students made 1 cinema visit, 4 students made 2 cinema visits etc. The mean number of visits can be found by adding an extra column to the table and multiplying the number of visits by the frequency. Number of visits (x) Frequency (f) f x 0 6 0 1 7 7 2 4 8 3 2 6 4 1 4 Total 20 25 a. mean number of visits = total of f x total of f = 25 20 = 1.25 Page 5 of 21

b. the median number of visits can be found by finding the position of the median as the number of visits are in order of size in the table. The median s position is the n+1 20+1 th score = th = 10.5th position. Halfway between the 10 th 2 2 and 11 th scores. The median s position falls within the second row and is therefore 1. c. The mode is the score with the highest frequency. The mode is 1. Alternatively, the mean and median can be found using Lists and Spreadsheet in the calculator. 1: Enter the data into Lists and Spreadsheet view 2: Hit Menu, Statistics, Stat Calculations, One Variable Statistics 3. Click OK when number of lists appears. 4. In the pop up, click in the X1 List box and select visits from the drop down list. Hit the Tab key to move to the next box and select freq from the drop down list in the Frequency List box 6. The statistical data appears. The mean is given by x = 1.25 The median is 1. n is useful as it gives the frequency total. There is no need to enter data into the other boxes. 5. Click OK. Page 6 of 21

Grouped Data When data is presented in a frequency table within class intervals, and we do not know the actual values within each class interval, we assume that all values are equal to the midpoint of the class interval in order to find the mean. Example 5: The ages of a group of 30 people attending a superannuation seminar are recorded in the frequency table below, calculate the mean age. Age (Class Intervals) Frequency 20-29 1 30-39 6 40-49 13 50-59 6 60-69 3 70-79 1 Total 30 To find the mean age, assume all people in the class interval 20-29 are 24.5 years of age (This value is obtained by finding the midpoint of 20-29), all people in the class interval 30-39 are 34.5 years of age and so on. The mean age can be found from the table below: Age (Class Intervals) Frequency f Midpoint of f m Class Interval m 20-29 1 24.5 24.5 30-39 6 34.5 207 40-49 13 44.5 578.5 50-59 6 54.5 327 60-69 3 64.5 193.5 70-79 1 74.5 74.5 Total: 30 1405 The mean age x = total of f m total of f = 1405 30 = 46.8 years (correct to 1 decimal place) The above can be more easily done using Lists and Spreadsheet on the calculator. Page 7 of 21

1: In Lists and Spreadsheets view enter the data for the midpoints and the frequency into the first two columns. Label the columns as shown. 2. Press Menu, Statistics, Stat Calculations, One Variable Statistics. 3. Leave the Number of Lists as 1 and select OK. 4. In the pop up box, click in the X1 List box and select midpoint from the drop down list. Press the Tab key to move to the Frequency List box and select freq from the drop down list. 5. Press the TAB key to move to the OK button. The mean x = 46.833 The mean age = 46.8 years Page 8 of 21

Measures of Variability or Spread It is useful to be able to measure the spread or variability of the data. How dispersed is the data? The Range The simplest measure of spread is the range. The range is the difference between the smallest score and the largest. Example 1 The set of data 3, 5, 4, 4, 2, 3, 0, 1, 4, 5 (which gave the number of pets in each of 10 households) has a range of 5 0 = 5 The Interquartile Range (IQR) The lower quartile, Q 1 is 1 of the way through the set of data. 4 The upper quartile, Q 3 is 3 of the way through the set of data. 4 The IQR = Q 3 Q 1 Page 9 of 21

Example 6 The Standard Deviation The standard deviation gives a measure of the spread of the data about the mean. The formula to find the standard deviation is complex and we usually find it directly from the calculator. On the calculator it is denoted by the symbol sx. The bigger the standard deviation, the greater the spread of data. Example 7 Find the standard deviation of the set of data: 12, 9, 4, 6, 5, 8, 9, 4, 10, 2. This is the same data as in Example 6. 1. In Lists and Spreadsheet view, enter the data. 2. Press Menu, Statistics, Stat Calculations, One Variable Statistics. 3. Leave the Number of Lists as 1 and select OK. 4. Enter the data into the pop up box as shown. 5. Click OK. Page 10 of 21

The statistical data appears. The standard deviation is given by sx = 3.17805 Notice that the lower quartile is given by Q 1 X = 4 and the upper quartile is given by Q 3 X = 9 This agrees with the solutions to Example 2. The IQR = 9 4 = 5 Stem and Leaf Plots Example 8 The data below shows the weights in kg of 20 possums arranged in order of size: 0.7 0.9 1.1 1.4 1.5 1.6 1.7 1.7 1.8 1.8 1.9 2.0 2.1 2.1 2.2 2.3 2.3 2.5 3.0 3.2 Page 11 of 21

We can represent this data in a stem and leaf plot as shown below: Key: 0 7 = 0.7kg Stem Leaf 0 7 9 1 1 4 5 6 7 7 8 8 9 2 0 1 1 2 3 3 5 3 0 2 In a stem and leaf plot the numbers are arranged in order of size. The key is given as 0 7 kg means stem 0 and leaf 7 which represents 0.7 kg. You should always include a key in the stem and leaf plot. When preparing a stem and leaf plot keep the number s in neat vertical columns because a neat plot will show the distribution of the scores. It is like a sideways bar chart or histogram. The interquartile range can be found from the stem and leaf plot. ( 20 1) 1. Find the median weight. The median weight Q2 is the th score. ie the 10.5 th score. 2 The median lies between the 10 th and 11 th scores. Count through the data to find the position of the median. It can be seen from the plot that the median lies between 1.8 and ( 1.8 1.9) 1.9. The median weight is = 1.85 kg. 2 ( 10 1) 2. The lower quartile Q1 will be the th score in the lower half. ie the 5.5 th score in the 2 lower half. Count through the data to find the position of the lower quartile. Q1 = ( 1.5 1.6) = 1.55 kg. 2 3. The upper quartile Q3 will be the 5.5 th score in the upper half of the plot. Count through the ( 2.2 2.3) data to find the position of the upper quartile. Q3 = = 2.25 kg 2 4. The interquartile range = Q3 Q1 = 2.25 1.55 = 0.7 kg 5. See diagram below: Q1 median Q2 Key: 0 7 = 0.7kg Stem Leaf 0 7 9 1 1 4 5 6 7 7 8 8 9 2 0 1 1 2 3 3 5 3 0 2 Page 12 of 21 Q3

Example 9 Find the interquartile range of the data presented in the following stem and leaf plot. Key: 15 4 = 154 Q1 median Q2 Stem Leaf 15 4 8 8 16 1 3 3 6 8 17 0 0 1 4 7 9 9 9 18 1 2 3 3 5 7 8 8 9 19 2 7 8 20 0 2 Q3 The median is the median is 179. ( 30 1) th score. ie the 15.5 th score which lies between 179 and 179. So the 2 The lower quartile Q1 will be the Q1 = 168. ( 15 1) th score in the lower half. ie the 8 th score in the lower half. 2 The upper quartile Q2 will be the 8 th score in the upper half of the data. ie 188. The interquartile range = Q3 Q1 = 188 168 = 20. See the diagram above. Using CAS. You could check your answers by entering the data into your CAS calculator to determine the median, lower and upper quartiles. Boxplots Five-number summary A five number summary is a list consisting of the lowest score (Xmin), lower quartile (Q1), median (Q2), upper quartile (Q3) and the greatest score (Xmax) of a set of data. A five number summary gives information about the spread or variability of a set of data. Box Plots A box plot is a graph of the 5-number summary. It is a powerful way of showing the spread of data. A box plot consists of a central divided box with attached whiskers. The box spans the interquartile range. The median is marked by a vertical line inside the box. The whiskers indicate the range of scores. Box plots are always drawn to scale and a scale is often attached. Page 13 of 21

Interpreting a Boxplot A boxplot divides the data into four sections. 25% of the scores lie between the lowest score and the lower quartile, 25% between the lower quartile and the median, 25% between the median and the upper quartile and 25% between the upper quartile and the greatest score. Extreme Values or Outliers Extreme values often make the whiskers appear longer than they should and hence give the appearance that the data is spread over a much greater range than they really are. If an extreme value occurs in a set of data it can be denoted by a small cross on the boxplot. The whisker is then shortened to the next largest or smallest score. When one observation lies well away from other observations in a set, we call it an outlier. For example the histogram shows the weights of a group of 5-year old boys. Clearly the weight of 33kg is an outlier. Page 14 of 21

Determining whether an Observation is an Outlier To identify possible outliers we can use the following rule: An outlier is a score, x, which is either less than Q 1 1. 5 IQR or greater than Q 3 + 1. 5 IQR Where Q 1 is the lower quartile and Q 3 is the upper quartile and IQR is the interquartile range In summary An outlier is a score, x, which lies outside the interval Example 10 Q 1 1. 5 IQR x Q 3 + 1. 5 IQR Page 15 of 21

Distribution of Data Page 16 of 21

Comparing Two Sets of Data Back to Back Stem and Leaf Plots Univariate Statistics Summary Two sets of data can be compared using back to back stem and leaf plots. The data below shows the life time of a sample of 40 batteries in hours of each of two brands when fitted into a child s toy. Some of the toys are fitted with an ordinary battery and some with Brand X. Which battery is best? Key: 6 9 = 69 hours Ordinary Brand Leaf Stem Brand X Leaf 8 6 2 0 0 6 9 9 9 9 8 8 6 4 0 7 3 5 8 8 7 5 3 1 1 1 0 8 2 4 8 9 6 6 4 2 2 2 0 0 9 0 1 4 5 5 9 8 7 5 3 1 1 1 10 0 0 2 5 8 8 9 9 4 2 11 0 0 1 1 3 3 6 7 9 12 1 4 6 6 6 7 8 8 13 3 5 14 6 The spread of each set of data can be seen graphically from the stem and leaf plot. It can be seen that although brand X showed a little more variability than the ordinary brand the batteries generally lasted longer. Parallel Box Plots The above data can also be compared by using parallel boxplots. The boxplots share a common scale. Quantitative comparisons can be made between the sets of data. The 5-Number Summaries of both types of batteries are given below. You can work them out from the stem plots or by using your calculator. Brand X Xmin Lower Quartile Q1 Median Q2 Upper Quartile Q3 Xmax 69 95 109.5 122.5 146 Ordinary Brand Xmin Lower Quartile Q1 Median Q2 Upper Quartile Q3 Xmax 60 78.5 87.5 97.5 114 Page 17 of 21

The following parallel boxplots can be drawn to compare the data. Time in Hours 50 60 70 80 90 100 110 120 130 140 150 Brand X Ordinary Brand From the box plots it can be seen that: 1. Brand X showed more variability in its performance than the ordinary brand. Brand X range = 77, ordinary brand range = 54. Brand X interquartile range = 27.5 and ordinary brand interquartile range =19.0 2. The longest lifetime recorded was that of a Brand X battery of 146 hours 3. The shortest lifetime recorded was that of an ordinary battery of 60 hours. 4. Brand X battery median lifetime (109.5 hours) was better than that of an ordinary battery (87.5 hours) 5. Over one quarter of Brand X batteries were better performers than the best ordinary brand battery (that is, had longer lifetimes than the longest of the ordinary brand batteries lifetimes) The Mean, Standard Deviation and Normal Distribution The Mean The mean of a set of data, x, is given by: x = x, where, x represents the sum of all the n observations in the data set and n represents the number of observations in the data set. For example the mean of 4, 7, 9, 12 and 18 is given by: x = 4 + 7 + 9 + 12 + 18 5 = 10 The mean is often used as the everyday average and gives a measure of the centre of a distribution. The mean is sometimes affected by outliers (extreme values) and the median is often a better average to use as it is not affected by extreme values. The Standard Deviation The standard deviation gives a measure of how the data is spread around the mean. The formula for the standard deviation is given below: (x x )2 s = n 1 Fortunately, the calculator can be used to find the standard deviation. On the calculator it is denoted by sx. Page 18 of 21

The 68%-95%-99.7% Rule applied to the Normal Distribution Many sets of data are approximately symmetric forming a "bell-shaped" curve. We refer to this type of data as a normal distribution. Examples include birth weights, people's heights etc. Data that is normally distributed have their symmetrical bell-shaped distribution centred on the mean value, x. The above data of the heights of people form a bell-shape and approximates a normal distribution. Page 19 of 21

Using the symmetry of the bell-shaped curve and the above rules various percentages can be calculated. Z-Scores To compare scores in different distributions we can make use of z-scores. The z-score, also known as the standardised score, indicates the position of a score in relation to the mean. It gives the number of standard deviations that the score is from the mean. A z score can be both positive and negative. A z-score of 0 indicates that the score is equal to the mean, a negative z-score indicates that the score is below the mean and a positive z-score indicates that the score is above the mean. A score that is exactly 1 standard deviation above the mean has a z-score of 1. A score that is exactly 2 standard deviations below the mean has a z-score of 2. Not all z-scores will be whole numbers; in fact most will not be. A whole number indicates only that the score is an exact number of standard deviations above or below the mean. For example, if the mean value of the IQ s of a group of students is 100 and the standard deviation is 15, an IQ of 88 would be represented by a z score of 0.8, as shown below. z = x x s = 88 100 15 = 0.8 The negative value indicates that the IQ of 88 is below the mean but by less than one standard deviation. Page 20 of 21

Using Z-Scores to Compare Data Page 21 of 21