
Chapter 3: Descriptive Measures

Measures of Center (Central Tendency)

These measures tell us where the center of the data lies, that is, where the most typical value of a data set is found.

Mode: the value that occurs most frequently in the data set. Obtain the frequency of each value.
1. If the greatest frequency is 1, then there is no mode.
2. If the greatest frequency is 2 or greater, then any value with that greatest frequency is a mode of the data set.
Example: 2, 3, 3, 3, 4, 4, 5. Mode = 3.

Median: divides the bottom 50% of the data from the top 50%. Arrange the data in increasing order.
1. If the number of observations is odd, then the median is the observation exactly in the middle.
2. If the number of observations is even, then the median is the mean of the two middle observations.
For $n$ observations, the position of the median is the $\left(\frac{n+1}{2}\right)$th position in the ordered distribution.

Ex. Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19. Here $n = 6$, position = (6+1)/2 = 3.5 (the median is between observations #3 and #4), Median = (10+11)/2 = 10.5 lb.
If we add one more observation, 10 lb, the data become 1 2 10 10 11 13 19. Now $n = 7$, position = (7+1)/2 = 4 (the median is observation #4), Median = 10 lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean: the sum of the observations divided by the number of observations.
Mean (arithmetic mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where the $x_i$ are the observations in the sample. In our example (the original 6 lambs), $\bar{x} = 56/6 \approx 9.33$ lb.
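The three measures of center above can be checked with a short sketch. It uses Python's standard `statistics` module. The helper names `modes_or_none` and `median_position` are made up for illustration and are not part of the notes.

```python
from collections import Counter
from statistics import mean, median

def modes_or_none(values):
    """Return the sorted list of modes, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # greatest frequency is 1: no mode
    return sorted(v for v, c in counts.items() if c == top)

def median_position(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2

lambs = [1, 2, 10, 11, 13, 19]  # weight gain in pounds, from the example

print(modes_or_none([2, 3, 3, 3, 4, 4, 5]))  # [3]
print(median_position(len(lambs)))           # 3.5
print(median(lambs))                         # 10.5
print(median(lambs + [10]))                  # 10
print(round(mean(lambs), 2))                 # 9.33
```

A position of 3.5 means the median is the average of the 3rd and 4th ordered observations, which is what `statistics.median` computes for an even number of values.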

The differences between each data point and the mean, $(x_i - \bar{x})$, are called deviations from the mean, and their sum is zero for any data set: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$. In our example the sum of all deviations is (-8.33) + (-7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding).

The mean can be visualized as the balance point of a weightless seesaw with the data points (like children) sitting on it. Unlike the median, the mean is not robust: it is influenced by any change in the data, and very strongly by extreme values. If the data have some extreme values, the median is a better measure of center for that data.

Mean vs. Median
Right-skewed distribution: Mean > Median
Left-skewed distribution: Mean < Median
Symmetric distribution: Mean = Median

Measures of Dispersion (Variability)

Range = Maximum - Minimum. It gives the overall spread of the data and is easy to calculate, but it is very sensitive to extreme data values.

Sample Standard Deviation
DEFINITION: $s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
s averages the squared deviations from the mean (using the divisor $n-1$). The square root is taken at the end, so the units of s are the same as the units of the data.
Properties:
$s \ge 0$, and $s = 0$ if all data points are the same
s has the same units as your data
a larger s indicates more variability
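As a rough numerical check of the deviations and of the definition of s, the sketch below reuses the lamb weights from the earlier example. The function names are illustrative. `statistics.stdev` from the standard library is used only as an independent check, since it also divides by n - 1.

```python
import math
from statistics import stdev

def deviations(values):
    """Deviations from the mean, x_i - x_bar."""
    x_bar = sum(values) / len(values)
    return [x - x_bar for x in values]

def sample_sd(values):
    """Sample standard deviation straight from the definition (divisor n - 1)."""
    n = len(values)
    return math.sqrt(sum(d * d for d in deviations(values)) / (n - 1))

lambs = [1, 2, 10, 11, 13, 19]
devs = deviations(lambs)

print([round(d, 2) for d in devs])   # [-8.33, -7.33, 0.67, 1.67, 3.67, 9.67]
print(abs(sum(devs)) < 1e-9)         # True: deviations sum to zero
print(max(lambs) - min(lambs))       # 18, the range
print(round(sample_sd(lambs), 2))    # 6.83
print(math.isclose(sample_sd(lambs), stdev(lambs)))  # True
```

The sum of the deviations is compared with a small tolerance instead of exactly 0 because floating-point arithmetic leaves a tiny rounding residue.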

$s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Ex. In an experiment on chrysanthemums, a botanist measured stem elongation over 7 days (in mm): 76, 72, 65, 70, 82. Here $n = 5$ and $\bar{x} = 365/5 = 73$.

$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
76        3                   9
72        -1                  1
65        -8                  64
70        -3                  9
82        9                   81
Total     0                   164

$s = \sqrt{164/4} = 6.40$ mm, variance $s^2 = 41$ mm$^2$.

s gives the typical distance of the observations from the mean; a larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

$n - 1$ is the degrees of freedom of s. As an intuitive justification for dividing by $n - 1$ rather than $n$, consider $n = 1$: the variability of a single observation cannot be computed, because one data point gives no information about variability.

COMPUTATIONAL FORMULA for the sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}}$

$x_i$     $x_i^2$
76        5776
72        5184
65        4225
70        4900
82        6724
Total     365       26809

$s = \sqrt{\dfrac{26809 - 365^2/5}{4}} = \sqrt{\dfrac{26809 - 26645}{4}} = \sqrt{\dfrac{164}{4}} = 6.40$ mm

The more variation there is in a data set, the larger its standard deviation. Like the mean, the standard deviation is not robust: it is influenced by any change in the data, and very strongly by extreme values.

Three Standard Deviations Rule: Almost all of the observations in any data set lie within three (3) standard deviations to either side of the mean.

More Precise Rules for any data set (optional):
Chebychev's rule: at least about 89% of the observations in any data set lie within three standard deviations to either side of the mean.
Chebychev's rule (more precisely): for any data set and any number $k > 1$, at least $100(1 - 1/k^2)\%$ of the observations lie within k standard deviations to either side of the mean. For $k = 3$ this gives $100(1 - 1/9)\% \approx 88.9\%$.
If the distribution is approximately bell-shaped, the Empirical Rule implies that about 99.7% of the observations lie within three standard deviations to either side of the mean. We will talk about bell-shaped distributions later.

Typical Percentages: The Empirical Rule
For a nice distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:
about 68% of all data points within the interval $(\bar{y} - SD, \bar{y} + SD)$
about 95% of all data points within the interval $(\bar{y} - 2SD, \bar{y} + 2SD)$
more than 99% of all data points within the interval $(\bar{y} - 3SD, \bar{y} + 3SD)$

Effect of Transformation of Variables

Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or to turn very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.

A linear transformation is the simplest one. Let X be the original variable with mean $\bar{x}$ and SD = s. Then $X' = aX + b$ is its linear transformation, and the mean and SD of $X'$ are $\bar{x}'$ and $s'$ respectively. This type of transformation does not change the essential shape of the distribution of X: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.
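A minimal sketch of what a linear transformation $X' = aX + b$ does to the mean and SD. The temperatures are hypothetical values invented for this illustration. The constants a = 5/9 and b = -160/9 are the usual Fahrenheit-to-Celsius conversion. The checks rely on the standard facts that the mean becomes $a\bar{x} + b$ and the SD becomes $|a|s$.

```python
import math
from statistics import mean, stdev

def linear_transform(values, a, b):
    """Apply X' = a*X + b to every observation."""
    return [a * x + b for x in values]

# Hypothetical temperatures in degrees Fahrenheit (not from the notes).
temps_f = [68.0, 75.5, 79.0, 84.5, 91.0]

a, b = 5 / 9, -160 / 9  # Fahrenheit -> Celsius: (X - 32) * 5/9
temps_c = linear_transform(temps_f, a, b)

# The mean is shifted and rescaled; the SD is only rescaled.
print(math.isclose(mean(temps_c), a * mean(temps_f) + b))     # True
print(math.isclose(stdev(temps_c), abs(a) * stdev(temps_f)))  # True

# With a negative multiplier the SD still scales by |a|.
flipped = linear_transform(temps_f, -2, 3)
print(math.isclose(stdev(flipped), 2 * stdev(temps_f)))       # True
```

Comparisons use `math.isclose` because the transformed values are floating-point numbers and may differ from the exact result in the last digits.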

How Does a Linear Transformation Affect the Mean and SD?

Only the mean (not s) is affected by an additive transformation (adding a positive or negative constant b to X), but both the mean and the SD are affected by multiplying X by a positive or negative constant a:
$\bar{x}' = a\bar{x} + b$ and $s' = |a|\,s$

Ex. Suppose X = summer temperature in some American city in 2013 in °F, with $\bar{x} = 79.6$ °F and $s = 12.7$ °F. If we would like to change X to °C, the transformation is $X' = (X - 32)\cdot\frac{5}{9} = \frac{5}{9}X - \frac{5}{9}\cdot 32$, so the new mean is $\bar{x}' = \frac{5}{9}(79.6) - \frac{5}{9}(32) = 26.44$ °C and the new SD is $s' = \frac{5}{9}(12.7) = 7.06$ °C.

Nonlinear transformations such as $X' = \sqrt{X}$, $X' = \log X$, $X' = 1/X$, $X' = X^2$ can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right skewed, for example, and we wish to make it more symmetric, we can apply a square root transformation to pull in the right-hand tail and push out the left-hand tail. A logarithmic transformation produces an even more drastic change in that regard (see the histograms given at the end of this section).

The Five-Number Summary; Boxplots

The median, percentiles, deciles, quartiles, and interquartile range are all resistant measures.

Percentiles divide the distribution into 100 equal parts ($P_1, P_2, \ldots, P_{99}$).
$P_1$ divides the bottom 1% of the data from the top 99%
$P_2$ divides the bottom 2% of the data from the top 98%
Etc. The median is the 50th percentile.

Deciles divide the distribution into 10 equal parts ($D_1, D_2, \ldots, D_9$).
$D_1$ divides the bottom 10% of the data from the top 90%
$D_2$ divides the bottom 20% of the data from the top 80%
Etc. The median is $D_5$.

Quartiles divide the distribution into 4 equal parts ($Q_1, Q_2, Q_3$).
$Q_1$ divides the bottom 25% of the data from the top 75%

$Q_2$ divides the bottom 50% of the data from the top 50%
$Q_3$ divides the bottom 75% of the data from the top 25%
The median is $Q_2$.

To find the quartiles, arrange the data in increasing order.
1. $Q_1$ is the median of the part of the data set that lies at or below the median of the entire data set.
2. $Q_2$ is the median of the entire data set.
3. $Q_3$ is the median of the part of the data set that lies at or above the median of the entire data set.

Examples:
1. $n = 7$ (odd). Data: 3, 4, 5, 6, 12, 13, 14. $Q_1 = (4+5)/2 = 4.5$, $Q_2 = 6$, $Q_3 = (12+13)/2 = 12.5$.
When n is odd, your calculator and your book have slightly different ways to calculate quartiles:
Your book: to compute the quartiles, the median is included in both the lower and the upper part of the data.
Your calculator: to compute the quartiles, the median is excluded from the computations, so you will get somewhat different values: $Q_1 = 4$, $Q_2 = 6$, $Q_3 = 13$.
2. $n = 10$ (even). Data: 1, 3, 4, 5, 6, 12, 13, 14, 15, 18. $Q_1 = 4$, $Q_2 = (6+12)/2 = 9$, $Q_3 = 14$.

Interquartile Range (IQR): the difference between the first and third quartiles, $IQR = Q_3 - Q_1$. The IQR gives (approximately) the range of the middle 50% of the observations.

The five-number summary of a data set consists of the minimum, the quartiles, and the maximum, written in increasing order: Min, $Q_1$, $Q_2$, $Q_3$, Max.

Outliers: observations well outside of the overall pattern of the data.
LL = Lower limit = $Q_1 - 1.5\,(IQR)$
UL = Upper limit = $Q_3 + 1.5\,(IQR)$
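The quartile procedure and the two conventions for odd n can be sketched as follows. The function names and the `include_median` flag are illustrative choices, not something defined in the notes. "At or below the median" is implemented here by splitting the sorted data into a lower and an upper part.

```python
from statistics import median

def quartiles(values, include_median=True):
    """Return (Q1, Q2, Q3) as medians of the lower and upper parts.

    include_median only matters when the number of observations is odd:
    True keeps the overall median in both parts (the "book" convention),
    False drops it from both parts (the "calculator" convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)

def outlier_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [3, 4, 5, 6, 12, 13, 14]               # example 1, n = 7
print(quartiles(data))                        # (4.5, 6, 12.5)
print(quartiles(data, include_median=False))  # (4, 6, 13)

q1, _, q3 = quartiles(data)
print(q3 - q1)                                # 8.0, the IQR
print(outlier_limits(q1, q3))                 # (-7.5, 24.5)
```

For an even number of observations the flag has no effect, because the median falls between two data points and belongs to neither part.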

Potential outliers are observations outside of the lower and upper limits.

Boxplot (Box-and-Whisker Diagram) and the Modified Boxplot

To construct a boxplot:
1. Determine the five-number summary (Min, $Q_1$, $Q_2$, $Q_3$, Max).
2. Draw a horizontal axis on which the numbers obtained in step 1 can be located. Above this axis, mark the quartiles and the minimum and maximum with vertical lines.
3. Connect the quartiles to each other to make a box, and then connect the box to the minimum and maximum with lines.

The following is the boxplot for example 1 above (data 3, 4, 5, 6, 12, 13, 14):
[Figure: boxplot marked at 3, 4.5, 6, 12.5, 14]

To construct a modified boxplot:
1. Determine the quartiles.
2. Determine the potential outliers and the adjacent values.
3. Draw a horizontal axis on which the numbers obtained in steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
4. Connect the quartiles to each other to make a box, and then connect the box to the most extreme observations that still lie within the upper and lower limits.
5. Plot each potential outlier with an asterisk.
The two lines stretching out on both sides are the whiskers.

Example. The data represent systolic blood pressure (in mmHg) of 7 adult males: 151 124 132 170 146 124 113.
We order the data first: 113 124 124 132 146 151 170.
Min = 113, Max = 170, Median = 132, $Q_1 = 124$, $Q_3 = 151$ (the median is excluded when we compute the quartiles).
The boxplot connects all 5 numbers in the following way; the box represents the middle half of the data.

[Figure: boxplot of the blood pressure data on a horizontal axis from 110 to 170]

Are there any outliers? In our example: IQR = 151 - 124 = 27, 1.5(IQR) = 1.5 × 27 = 40.5, lower limit = 124 - 40.5 = 83.5, upper limit = 151 + 40.5 = 191.5. All observations are within the limits, so there are no outliers in our data set.

Example. Radish growth (in mm) in the light: 4 5 5 7 7 8 9 10 10 10 10 14 20 21.
Min = 4, Max = 21, $Q_1 = 7$, Median = (9+10)/2 = 9.5, $Q_3 = 10$, IQR = 3, lower limit = 2.5, upper limit = 14.5, so 20 and 21 are outliers. The modified boxplot exposes the outliers.

[Figure: modified boxplot of the radish data on a horizontal axis from 5 to 25, with the two outliers marked by asterisks]

Descriptive Measures for Populations; Use of Samples

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample.

Notation:
              Size    Mean         SD
Sample        $n$     $\bar{x}$    $s$
Population    $N$     $\mu$        $\sigma$
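Returning to the two outlier checks worked out above (blood pressure and radish growth), the sketch below shows one way to separate potential outliers from the remaining observations once $Q_1$ and $Q_3$ are known. The quartile values are taken from the examples rather than recomputed, and the function names are illustrative.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_by_limits(values, q1, q3):
    """Return (observations within the limits, potential outliers)."""
    lower, upper = outlier_limits(q1, q3)
    inside = [x for x in values if lower <= x <= upper]
    outside = [x for x in values if x < lower or x > upper]
    return inside, outside

radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]
print(outlier_limits(7, 10))                # (2.5, 14.5)
inside, outliers = split_by_limits(radishes, q1=7, q3=10)
print(outliers)                             # [20, 21]
print(min(inside), max(inside))             # 4 14, the adjacent values

blood_pressure = [151, 124, 132, 170, 146, 124, 113]
print(outlier_limits(124, 151))             # (83.5, 191.5)
print(split_by_limits(blood_pressure, q1=124, q3=151)[1])  # []
```

The adjacent values (4 and 14 for the radishes) are where the whiskers of a modified boxplot end; the outliers 20 and 21 would be drawn as asterisks.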

Parameter: a descriptive measure for a population. Example: $\mu$, $\sigma$.
Statistic: a descriptive measure for a sample. Example: $\bar{x}$, $s$.
The sample mean $\bar{x}$ is used to estimate the population mean $\mu$. The sample SD $s$ is used to estimate the population SD $\sigma$.

Population Mean μ (Mean of a Variable): computed in the same manner as a sample mean.
For a variable x, the mean of all possible observations for the entire population is called the population mean or mean of the variable x. It is denoted by $\mu_x$ or, when no confusion will arise, simply by $\mu$. For a finite population, we have
$\mu = \dfrac{\sum x}{N}$
where N is the population size.

Population Standard Deviation σ (Standard Deviation of the Variable)
For a variable x, the standard deviation of all possible observations for the entire population is called the population standard deviation or standard deviation of the variable x. It is denoted by $\sigma_x$ or, when no confusion will arise, simply by $\sigma$. For a finite population, we have
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} = \sqrt{\dfrac{\sum x^2}{N} - \mu^2}$
where N is the population size.

Population Variance: $\sigma_x^2$

Standardized Variable
For a variable x, the variable $z = \dfrac{x - \mu}{\sigma}$ is called the standard score or z-score. z is also called the standardized version of x, or the standardized variable corresponding to the variable x. The value of the z-score tells us how many standard deviations above or below the mean a particular value of x is.
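To see that the two expressions for $\sigma$ agree, and how the divisor N differs from the sample divisor n - 1, here is a small sketch. Treating the five stem elongations from the chrysanthemum example as an entire population is an assumption made only for illustration. `statistics.pstdev` and `statistics.stdev` are the standard-library population and sample versions.

```python
import math
from statistics import pstdev, stdev

def population_sd(values):
    """sigma = sqrt( sum (x - mu)^2 / N )."""
    N = len(values)
    mu = sum(values) / N
    return math.sqrt(sum((x - mu) ** 2 for x in values) / N)

def population_sd_shortcut(values):
    """sigma = sqrt( sum x^2 / N - mu^2 )."""
    N = len(values)
    mu = sum(values) / N
    return math.sqrt(sum(x * x for x in values) / N - mu ** 2)

# Assumption for illustration: these five values are the whole population.
stems = [76, 72, 65, 70, 82]

print(round(population_sd(stems), 2))   # 5.73
print(math.isclose(population_sd(stems), population_sd_shortcut(stems)))  # True
print(math.isclose(population_sd(stems), pstdev(stems)))                  # True
print(round(stdev(stems), 2))           # 6.4, the sample SD (divisor n - 1)
```

The population value is smaller than the sample value because the same sum of squared deviations (164) is divided by 5 instead of 4.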

Properties of z-scores:
z < 0 if x is below the mean, z > 0 if x is above the mean, z = 0 if x is equal to the mean.
z-scores have mean 0 and SD 1: $\mu_z = \dfrac{\sum z}{N} = 0$, $\sigma_z = \sqrt{\dfrac{\sum (z - \mu_z)^2}{N}} = 1$.
z-scores have no units.
Most z-scores are between -3 and 3 (Three Standard Deviations Rule).

Example: Final test scores in all Mat119 classes last semester have mean $\mu = 72$ and SD $\sigma = 10$.
a) Jane scored 86 points. Find and interpret her z-score. $z = \dfrac{x - \mu}{\sigma}$, so z = (86 - 72)/10 = 1.4. Her test grade is 1.4 standard deviations above the average.
b) Jack's z-score was -1.0. What was his test score? $x = \mu + z\sigma$, so x = 72 - 1.0(10) = 62.
c) True or false? Very few final test scores are below 42 points or above 102 points. True: most of the scores are within 3 SDs of the mean, in the interval (42, 102), so very few are outside of that range.
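The z-score calculations in the example can be reproduced with a few lines. The list `scores` is hypothetical data invented here to illustrate that standardized values have mean 0 and SD 1. The function names are also illustrative.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Recover the original value from a z-score: x = mu + z * sigma."""
    return mu + z * sigma

mu, sigma = 72, 10
print(z_score(86, mu, sigma))                       # 1.4  (Jane)
print(from_z(-1.0, mu, sigma))                      # 62.0 (Jack)
print(from_z(-3, mu, sigma), from_z(3, mu, sigma))  # 42 102

# Hypothetical scores, treated as a whole population for this check.
scores = [55, 62, 70, 72, 78, 86, 81]
m, s = mean(scores), pstdev(scores)
zs = [z_score(x, m, s) for x in scores]
print(abs(mean(zs)) < 1e-9, abs(pstdev(zs) - 1) < 1e-9)  # True True
```

The last line uses a small tolerance because the standardized values are floating-point numbers.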