MAT 155: Describing, Exploring, and Comparing Data (NotesCh2-3.doc)

Notes for Chapter 2: Summarizing and Graphing Data
Chapter 3: Describing, Exploring, and Comparing Data

Frequency Distributions, Graphic Representation, Measures of Center, Variation, & Standing

In these chapters, we will study (1) visual representation of data, (2) measures of center and variation, and (3) relative standing and exploratory analysis. These three areas will include (1) frequency distributions, relative frequency distributions, cumulative frequency distributions, histograms, frequency polygons, stem-and-leaf plots, and scatter plots; (2) arithmetic mean, median, mode, midrange, weighted mean, range, standard deviation, coefficient of variation, empirical rule, and Chebyshev's Theorem; and (3) z-scores, quartiles, percentiles, outliers, and box plots (5-number summary).

Visual Representation of Data

A frequency distribution is one convenient way to represent a large amount of data in a small amount of space using two columns: (1) categories or classes and (2) frequency. There are some general guidelines that we should use when constructing a frequency distribution.

First, determine the number of classes, k, by using the "2 to the k" rule. Find the smallest integer k so that $2^k \ge n$, where n is the total number of observations or data values. For example, if n = 50 data values, we would find k = 6 classes. [$2^4 = 16$, $2^5 = 32$, and $2^6 = 64$.] Of course, we have some freedom that allows us to choose a number of classes different from the k-value when actually constructing the frequency distribution. We may choose a different k-value to make the distribution more appealing.

NOTE: Classes should be mutually exclusive and collectively exhaustive. This ensures that each data value fits into only one class, and every value belongs to a class. Also, we should try to have at least 5 and not more than 15 classes. Thus, we will try to satisfy the inequality $5 \le k \le 15$. We should avoid, if possible, open-ended classes.

Second, determine the class interval or class width.
Two guidelines that may be used to determine the class interval, i, are

$$\text{(1)}\quad i = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}} \qquad\qquad \text{(2)}\quad i = \frac{\text{largest data value} - \text{smallest data value}}{1 + 3.322(\log n)}$$

Suppose the smallest and largest values of the 50 values from above are 12 and 88, respectively.

$$\text{By (1),}\quad i = \frac{88 - 12}{6} \approx 12.667 \qquad \text{and by (2),}\quad i = \frac{88 - 12}{1 + 3.322(\log 50)} \approx 11.439$$

Again, we have some freedom to choose the class width (interval) to be a whole number if we wish. Depending on our choice for i, we may have to change the number of classes from 6.

NOTE: The class intervals should be equal.
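The two steps above can be checked with a short script. This is a minimal sketch, not part of the original notes; the function names are made up for illustration, and the logarithm in guideline (2) is assumed to be base 10, which reproduces the value 11.439 given above.

```python
import math

def number_of_classes(n):
    """Smallest integer k with 2**k >= n (the "2 to the k" rule)."""
    k = 0
    while 2 ** k < n:
        k += 1
    return k

def class_width_by_classes(smallest, largest, k):
    """Guideline (1): range divided by the number of classes."""
    return (largest - smallest) / k

def class_width_by_log(smallest, largest, n):
    """Guideline (2): range divided by 1 + 3.322 * log10(n)."""
    return (largest - smallest) / (1 + 3.322 * math.log10(n))

# Values taken from the example in the notes: n = 50, smallest = 12, largest = 88.
n, smallest, largest = 50, 12, 88
k = number_of_classes(n)
width_1 = class_width_by_classes(smallest, largest, k)
width_2 = class_width_by_log(smallest, largest, n)
print(k, round(width_1, 3), round(width_2, 3))
```

Either width is only a starting point; as the notes say, it is usually rounded to a convenient whole number such as 12 or 15.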

We will set up our classes so that the lower limit (left value) of the class is included in that class, and the upper limit (right value) of the class is not included in that class.

Returning to the 50 data values ranging from 12 to 88, let us set up the classes. If we choose i = 12 and start the first class with a lower limit of 12, we would need 7 classes in order to include the largest value of 88. If we choose i = 15 and start with 10 as the lower limit of the first class, we would need only 6 classes to include the value of 88.

NOTE: Some people recommend that the lower limit of the first class be a whole number multiple of the smallest data value. However, this is not essential, and we will use that only when it is convenient.

Based on the information presented above, we may choose either of the class setups below.

Table A                    Table B
Classes: k = 7, i = 12     Classes: k = 6, i = 15
12-24                      10-25
24-36                      25-40
36-48                      40-55
48-60                      55-70
60-72                      70-85
72-84                      85-100
84-96

Once we set up the classes, we count and record the number of values in each class. In Table A, we record, in the frequency column, the number of values so that $12 \le \text{value} < 24$, $24 \le \text{value} < 36$, etc.

Table C. 50 Data Values
57 5 1 1 74 43 70 5 78 61 88 6 3 4 0 87 79 17 39 78 13 16 69 81 73 4 73 75 19 46 48 4 19 64 41 4 81 54 0 73 16 40 70 85 7 37 64 17 46

Using the guidelines, Table A, and the data in Table C above, we get the frequency distribution in Table D below.

Table D. Frequency Distribution
Classes    Frequency
12-24      13
24-36       5
36-48       9
48-60       3
60-72       6
72-84      11
84-96       3
Sum of freq. = n = 50

The relative frequency distribution is constructed from the frequency distribution by dividing each frequency by the sum of the frequencies. For example, 13/50 = 0.26, 5/50 = 0.10, etc. Table E below is the relative frequency distribution constructed from Table D.

Table E. Relative Frequency Distribution
Classes    Frequency    Relative Frequency
12-24      13           0.26
24-36       5           0.10
36-48       9           0.18
48-60       3           0.06
60-72       6           0.12
72-84      11           0.22
84-96       3           0.06
Total      50           1.00

From Table E, we see that about 26% and 22% of the data values are in the intervals [12, 24) and [72, 84), respectively.

In addition to the relative frequency distribution, we will discuss the less than cumulative frequency distribution (LCF). The LCF (Table F) shows the accumulated frequency that is less than the upper limit value in the respective class.

Table F. Less Than Cumulative Frequency Distribution
Classes    Frequency    Less than Cumulative Frequency (<cf)
12-24      13           13
24-36       5           18
36-48       9           27
48-60       3           30
60-72       6           36
72-84      11           47
84-96       3           50
Total      50           ---

We see that 13 values are smaller than 24. The 13 in the first class plus 5 in the second class gives 18 values less than 36. [The rest of the values in the <cf column are obtained in the same way: 18 + 9 = 27, 27 + 3 = 30, 30 + 6 = 36, 36 + 11 = 47, and 47 + 3 = 50.]

The histogram is constructed by using the class limits on the horizontal axis and the frequencies on the vertical axis. The histogram below on the left was constructed using Statdisk; on the right by using Excel.
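As a rough cross-check of Tables E and F and of the histogram's shape, the relative and less-than-cumulative frequencies can be recomputed from the Table D frequencies. This is only an illustrative sketch: the variable names are arbitrary, and the row of asterisks is a plain-text stand-in for the Statdisk and Excel histograms, not a reproduction of them.

```python
from itertools import accumulate

# Classes and frequencies copied from Table D.
classes = ["12-24", "24-36", "36-48", "48-60", "60-72", "72-84", "84-96"]
frequencies = [13, 5, 9, 3, 6, 11, 3]

n = sum(frequencies)                           # sum of the frequencies
relative = [f / n for f in frequencies]        # Table E column
less_than_cf = list(accumulate(frequencies))   # Table F column

print(f"{'Classes':<8}{'f':>4}{'rel f':>8}{'<cf':>5}  histogram")
for label, f, rf, cf in zip(classes, frequencies, relative, less_than_cf):
    # One asterisk per observation gives a sideways text histogram.
    print(f"{label:<8}{f:>4}{rf:>8.2f}{cf:>5}  {'*' * f}")
```

The last entry of the cumulative column must equal n, which is a quick way to catch a counting error in the frequency column.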

[Figure: Histogram of the Table D frequency distribution. Horizontal axis: Classes (12-24 through 84-96). Vertical axis: Frequency (0 to 15).]

The stem-and-leaf plot is a good representation for raw data. All values are shown in a concise form, as shown by the Minitab output of the following data: 42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, and 44.

Table H.
Current worksheet: Cities.mtw
Character Stem-and-Leaf Display
Stem-and-leaf of Atlanta   N = 12
Leaf Unit = 1.0

   3    4  245
   5    5  11
  (3)   6  129
   4    7  2688

Interval is 10. Stems increase by 10: 40, 50, 60, 70.

The stem-and-leaf indicates that there are 12 data values, and each leaf represents 1 unit. As we read the first row, we see there are 3 values in the 40s. These are 42, 44, and 45. There are 2 (5 - 3) values in the 50s: 51 and 51. There are (3) values in the 60s: 61, 62, and 69. Finally, there are 4 values in the 70s: 72, 76, 78, and 78. The first column accumulates from the top down until we reach (3). [Don't be concerned about the meaning of this value.] Then the accumulation starts at the bottom and works upward. Adding the (3), the number above it, and the number below it gives us n = 12. [5 + 3 + 4 = 12]

Measures of Center and Variation

We will first discuss the population mean and the sample mean. When talking about the population and a sample, we refer to a parameter and a statistic, respectively. The notations for the population and sample means are $\mu$ (mu) and $\bar{x}$ (x-bar), respectively. Notice in the formulas that N (upper case) represents the total number of observations in the population, and n (lower case) represents the total number of observations in the sample.

Arithmetic Mean for Population and Sample

Type of Data   Population                                  Sample
Raw            $\mu = \frac{\sum x}{N}$                    $\bar{x} = \frac{\sum x}{n}$
Grouped        $\mu = \frac{\sum (f \cdot x)}{\sum f}$     $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$

(For grouped data, x is the class midpoint and f is the class frequency.)

The arithmetic mean (1) is calculated for interval-level and ratio-level data, (2) includes all data values, (3) is unique for a set of data, (4) is useful in comparing two or more groups of data, and (5) is affected by extremely large or extremely small values.

The median is a measure of center that requires little or no calculation for raw data. To find the median for raw data, we use the following procedure.
(1) Order the data from smallest value to largest value or vice versa.
(2) If the number of data values is odd, choose the value in the middle so that the same number of values are to the left as are to the right of the middle value.
(3) If the number of data values is even, choose the two values in the middle so that the same number of values are to the left as are to the right of the two middle values.
(4) Calculate the average of those two values.

To find the median for grouped data, we use the following procedure.
(1) In the frequency distribution, form the less than cumulative frequency (<CF) column.
(2) Find one-half the sum of the frequencies, n/2.
(3) Find the largest number in the <CF column that is not larger than n/2.
(4) Circle the row (class, frequency, <CF) below the number in Step 3. This row contains the median.
(5) Subtract the number found in Step 3 (CF) from n/2, divide by the frequency (f) circled in Step 4, and multiply by the class interval (i).
(6) Add the answer from Step 5 to the lower class limit (L) circled in Step 4. This represents the median for the grouped data.

The following formula summarizes the six-step procedure given above.

Median for grouped data:

$$\text{Median} = L + \frac{\frac{n}{2} - CF}{f}\,(i)$$

The mode is a measure of center that identifies the data value that appears most frequently. There will be no mode if all data values appear the same number of times.
There will be more than one mode if two or more data values appear with the same frequency and more frequently than the other data value(s). To find the mode for raw data, simply find the value(s) that appear most frequently. To find the mode for grouped data, find the midpoint(s) of the class(es) that has (have) the largest frequencies. The class containing the mode is called the modal class.

The midrange is midway between the largest value and the smallest value of the data.

$$\text{midrange} = \frac{\text{largest} + \text{smallest}}{2}$$
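The procedures above for the median, mode, and midrange can be sketched in code. The example below is an illustration only: the function names are invented, the raw data are the twelve Atlanta values from the stem-and-leaf display, and the grouped data are the Table D classes (lower limits 12, 24, ..., 84 with class interval 12). The "no mode" rule follows the convention stated in these notes.

```python
from collections import Counter

def median_raw(values):
    """Median of raw data: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes_raw(values):
    """Most frequent value(s); empty list if every value occurs equally often."""
    counts = Counter(values)
    if len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def midrange(values):
    return (max(values) + min(values)) / 2

def mean_grouped(lower_limits, frequencies, width):
    """Sum of f * midpoint divided by the sum of f."""
    total = sum(f * (low + width / 2) for low, f in zip(lower_limits, frequencies))
    return total / sum(frequencies)

def median_grouped(lower_limits, frequencies, width):
    """Median = L + ((n/2 - CF) / f) * i, using the six-step procedure."""
    half = sum(frequencies) / 2
    cf_before = 0  # cumulative frequency of the classes above the current row
    for low, f in zip(lower_limits, frequencies):
        if cf_before + f > half:  # first row whose <CF exceeds n/2 holds the median
            return low + (half - cf_before) / f * width
        cf_before += f
    raise ValueError("frequencies must sum to a positive number")

atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
lower_limits = [12, 24, 36, 48, 60, 72, 84]
frequencies = [13, 5, 9, 3, 6, 11, 3]

print(median_raw(atlanta), modes_raw(atlanta), midrange(atlanta))
print(mean_grouped(lower_limits, frequencies, 12))
print(median_grouped(lower_limits, frequencies, 12))
```

For Table D, n/2 = 25, the largest <CF value not larger than 25 is 18, and the row below it is the class 36-48 with f = 9, so the grouped median works out to 36 + (7/9)(12), or about 45.3.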

The weighted mean may be calculated by using the following three-step procedure: (1) multiply each value by a weight for that value, (2) sum those products, and (3) divide that sum by the sum of the weights. The following formula expresses the above procedure:

Weighted Mean:

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n},$$

where w represents the weight and x represents the data value.

Skewness tells us something about the shape of a frequency distribution. A symmetric distribution is one whose graph is symmetric with respect to a vertical line that passes through the mean, median, and mode. If a distribution is skewed to the right, the graph is elongated (or stretched) to the right side. If a distribution is skewed to the left, the graph is elongated (or stretched) to the left side. Remember that extremely large values will pull the mean to the right, thus skewing the graph (distribution) to the right. Similarly, extremely small values will pull the mean to the left, thus skewing the graph (distribution) to the left. To calculate the coefficient of skewness by hand, we use Pearson's index (coefficient) of skewness formula:

$$I_{sk} = \frac{3(\text{mean} - \text{median})}{s}$$

We will discuss variation (dispersion) for two reasons. First, variation (dispersion) can be used to indicate the presence or absence of reliability. Second, variation (dispersion) can be used to compare the spread of two or more distributions.

One measure of variation (dispersion) is the range. The range is the difference between the largest and smallest data values. The calculation of the range is the simplest of the measures of variation (dispersion). A disadvantage of using the range is that it involves only two of the data values.

$$\text{Range} = \text{Largest Value} - \text{Smallest Value} \qquad \text{(D1)}$$

We calculate the variance of data so that we can find the standard deviation. For population data, the variance is the arithmetic mean of the squared deviations from the mean. For sample data, divide the sum of the squared deviations by n - 1.
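Before moving on to the variance procedure, the weighted mean, Pearson's index of skewness, and the range (D1) defined above can be sketched as follows. This is a hedged illustration: the function names are invented, the scores and weights in the weighted-mean example are hypothetical, and the skewness example reuses the twelve Atlanta values from the stem-and-leaf display, treating them as a sample so that s is the sample standard deviation.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def pearson_skewness(values):
    """Pearson's index: 3 * (mean - median) / s, with s the sample std. dev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

def data_range(values):
    """Range (D1): largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical example: three scores weighted 3, 4, and 3.
scores, weights = [90, 80, 70], [3, 4, 3]
print(weighted_mean(scores, weights))

atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(data_range(atlanta))
print(round(pearson_skewness(atlanta), 3))
```

For the Atlanta values the mean (60.75) is slightly below the median (61.5), so the index is negative, which by the description above indicates a slight skew to the left.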
We may use the following procedure to calculate the variance for ungrouped data.
(1) Calculate the arithmetic mean.
(2) Find the difference between each data value and the mean.
(3) Square each of the differences found in Step 2.
(4) Sum the squares from Step 3.
(5) If the data is from a population, divide the sum in Step 4 by N, the total number of data values.
(6) If the data is from a sample, divide the sum in Step 4 by n - 1, where n is the total number of data values.

The above steps are summarized in the two conceptual formulas below. In the sample calculation, the denominator n - 1 is used instead of n to help correct for the error created by the smaller number of data values in the sample compared to the population. The table below shows the Conceptual Formulas and Calculation Formulas (for raw or ungrouped data) used to find the variance of data.

The standard deviation can be used to compare the dispersion of two or more populations or samples. Also, if the data values are measured in the same units and the means are close together, a small standard deviation may be used to indicate that the mean is a reliable measure of central tendency. For population data, the standard deviation is the square root of the population variance. For sample data, the standard deviation is the square root of the sample variance.

We may use the following procedure to calculate the standard deviation for ungrouped data.
(1) Calculate the variance.
(2) Find the square root of the variance from Step 1.

The above steps are summarized in the formulas below.

Formulas to Calculate the Variance of Raw Data

Conceptual, population (D3):   $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Conceptual, sample (D4):       $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Calculation, population (D5):  $\sigma^2 = \frac{N \sum x^2 - (\sum x)^2}{N^2}$
Calculation, sample (D6):      $s^2 = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}$

Formulas to Calculate the Variance and Standard Deviation of Grouped Data

Variance, population (D7):            $\sigma^2 = \frac{N \sum (f \cdot x^2) - \left[\sum (f \cdot x)\right]^2}{N^2}$
Variance, sample (D8):                $s^2 = \frac{n \sum (f \cdot x^2) - \left[\sum (f \cdot x)\right]^2}{n(n - 1)}$
Standard deviation, population (D9):  $\sigma = \sqrt{\sigma^2}$
Standard deviation, sample (D10):     $s = \sqrt{s^2}$

(For grouped data, x is the class midpoint, f is the class frequency, and N or n equals $\sum f$.)

For grouped data, the range is the difference between the upper limit of the largest class (interval) and the lower limit of the smallest class (interval).

$$\text{Range} = \text{Upper Limit of Largest Interval} - \text{Lower Limit of Smallest Interval} \qquad \text{(D1G)}$$

Relative Dispersion. If the units of measure are different or the means are not close together, the standard deviation cannot be used to compare the dispersion of data sets. Therefore, we use the coefficient of variation, which measures the dispersion relative to the mean by dividing the standard deviation by the mean and multiplying by 100 to form a percent. The coefficient of variation is calculated by using the following formula:

$$CV = \frac{s}{\bar{x}}\,(100\%) \qquad \text{(D12)}$$

Empirical Rule. The Empirical Rule applies only to distributions that are symmetrical and bell-shaped. For such distributions, the Empirical Rule states that about 68% of the data values are within plus or minus one standard deviation of the mean; about 95% are within plus or minus two standard deviations of the mean; and about 99.7% are within plus or minus three standard deviations of the mean.
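The variance, standard deviation, and coefficient of variation described above can be sketched for raw data as follows. This is an illustrative sketch under stated assumptions: the function names are invented, the data are the twelve Atlanta values from the stem-and-leaf display, and the "calculation" version uses the algebraically equivalent shortcut form n times the sum of squares minus the squared sum, divided by n(n - 1) for a sample or by N squared for a population.

```python
import math

def variance_conceptual(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 or N."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

def variance_calculation(values, sample=True):
    """Shortcut form: (n * sum(x^2) - (sum x)^2) / (n(n - 1)) or / N^2."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    numerator = n * sum_x2 - sum_x ** 2
    return numerator / (n * (n - 1) if sample else n * n)

def standard_deviation(values, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance_conceptual(values, sample))

def coefficient_of_variation(values, sample=True):
    """Standard deviation divided by the mean, expressed as a percent."""
    mean = sum(values) / len(values)
    return standard_deviation(values, sample) / mean * 100

atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(round(variance_conceptual(atlanta), 3), round(variance_calculation(atlanta), 3))
print(round(standard_deviation(atlanta), 3))
print(round(coefficient_of_variation(atlanta), 2))
```

The conceptual and calculation versions should agree up to floating-point rounding; the shortcut form simply avoids computing each deviation separately.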

Chebyshev's Theorem allows us to determine the minimum proportion of data values within a specific number (larger than one) of standard deviations of the mean for any set of data values. This minimum proportion is calculated by using the formula

$$1 - \frac{1}{k^2} \qquad \text{(D11)}$$

where k > 1 is the number of standard deviations on either side of the mean. Using this formula, we see that at least 75% of the data values are between two standard deviations below the mean and two standard deviations above the mean. Similarly, there would be at least 88.9% within three standard deviations of the mean. There would be at least 55.6% within 1.5 standard deviations of the mean:

$$k = 1.5 \text{ yields } 1 - \frac{1}{1.5^2} = 1 - \frac{4}{9} = \frac{5}{9} \approx 0.556 = 55.6\%$$

Z-score. The z-score, or standard score, is the number of standard deviations that x is from the mean.

$$\text{z-score for sample: } z = \frac{x - \bar{x}}{s} \qquad\qquad \text{z-score for population: } z = \frac{x - \mu}{\sigma}$$

Quartiles, Deciles, Percentiles. Earlier we discussed measures of center. One of those measures was the median. We found the median to be the middle value of ungrouped data, and we used a formula to find the median for grouped data. Now we will calculate quartiles, deciles, and percentiles as measures of relative standing. For ungrouped data, the following formula may be used to find the location, L, of the kth percentile:

$$L = \frac{k}{100}\,n \qquad \text{(D13)}$$

If L is a whole number, $P_k$ is midway between the Lth value and the (L+1)st value of the sorted data. If L is not a whole number, $P_k$ is the value in the next position above L. To find the location of the first quartile, simply find the location of the 25th percentile; to find the location of the second decile, simply find the location of the 20th percentile.

$$\text{Percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$$

Box Plot. A box plot is a graphical display of five values: the smallest and largest data values, the median, and the first and third quartiles.
To draw a box plot,
(1) identify the smallest and largest data values,
(2) calculate the first, second, and third quartiles,
(3) draw a rectangle with the first quartile at the left end, the third quartile at the right end, and the second quartile (median) as a vertical line segment in the rectangle,
(4) draw line segments from the left end to the smallest value and from the right end to the largest value.

As an example, consider the following: the smallest value is 50, the first quartile is 70, the second quartile (median) is 90, the third quartile is 115, and the largest value is 150. The box plot representing these values is shown below.

[Figure: Box plot with marked values 50, 70, 90, 115, and 150.]

(Copyrighted by Claude S. Moore 2004-2008)
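The relative-standing measures in this section (Chebyshev's minimum proportion, z-scores, the percentile location rule (D13), and the five values behind a box plot) can be sketched together. This is a minimal illustration, not the author's method: the function names are invented, the percentile rank k is assumed to be strictly between 0 and 100, and the data are the twelve Atlanta values from the stem-and-leaf display rather than the 50-to-150 box plot example, for which only the five summary values are given.

```python
import math

def chebyshev_min_proportion(k):
    """Minimum proportion within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def percentile(values, k):
    """k-th percentile using the location rule L = (k / 100) * n."""
    if not values or not 0 < k < 100:
        raise ValueError("need data and 0 < k < 100")
    ordered = sorted(values)
    location = k * len(ordered) / 100
    if location == int(location):
        pos = int(location)  # whole number: average the L-th and (L+1)-st values
        return (ordered[pos - 1] + ordered[pos]) / 2
    return ordered[math.ceil(location) - 1]  # otherwise take the next position up

def percentile_of_value(values, x):
    """Percent of the data values that are less than x."""
    return sum(v < x for v in values) / len(values) * 100

def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value (the box plot inputs)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))

atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(chebyshev_min_proportion(2), round(chebyshev_min_proportion(1.5), 3))
print(percentile(atlanta, 20))
print(five_number_summary(atlanta))
```

With n = 12, the 25th percentile has location L = 3, a whole number, so the first quartile is midway between the 3rd and 4th sorted values; the 20th percentile has L = 2.4, so it is the value in the 3rd position.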