Elementary Statistics

Introduction

Statistics is the collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions from the data.

Definitions and Statistical Terminology

Population - The data set consisting of all outcomes, measurements, or responses of interest.

Sample - A data set which is a subset of the population data set.

For example, if we are interested in measuring the salaries of Kenyan high-school teachers, the population data set would be a list of the salaries of every high-school teacher in Kenya. A sample data set could be obtained by selecting 100 high-school teachers from across the country and listing their salaries.

Note: There are several reasons why we usually work with samples rather than populations. Populations are usually large, and it is often impossible to get data for every object we're studying. Collecting data also has a cost, and the more items surveyed, the larger the cost.

Raw Data - Data collected in original form which have not been organised numerically.

Variable - A characteristic or attribute that can assume different values.

Qualitative Variables
Variables which assume non-numerical values, e.g. marital status, hair colour, favourite ice-cream.

Quantitative Variables
Variables which assume numerical values. These can be further divided into two types:

i) Discrete Variables
Variables which assume a finite or countable number of possible values. Usually obtained by counting, e.g. the number of children in a family.
ii) Continuous Variables
Variables which can assume any value within an interval, and so have infinitely many possible values. Usually obtained by measurement, e.g. age, weight, income.

Note: Since continuous variables are real numbers, we usually round them. This implies a boundary that depends on the number of decimal places recorded. For example, a recorded value of 64 really stands for any value 63.5 ≤ x < 64.5. Likewise, if there are two decimal places, then 64.03 really stands for any value 64.025 ≤ x < 64.035. Boundaries always have one more decimal place than the data and end in a 5.

Frequency Distributions

A frequency distribution is a table used to describe a data set. A frequency table lists intervals or ranges of data values, called data classes, together with the number of data values from the set that fall in each class. This number is called the frequency of the class.

Example 1
Some values occur more than once. Therefore we can form a table showing how many times each value occurs.
Complete a tally diagram and determine the frequency distribution of the values.

Classification of Data

If the range of values of the variable is large, it is often helpful to consider these values arranged in regular groups or classes. There are two methods of classifying data:

i) The Inclusive Method
Both class limits are included in the class they define, e.g. 10-19, 20-29, etc.

ii) The Exclusive Method (left end-point convention)
The upper class boundary of a particular class actually belongs to the next class, e.g. 10-20, 20-30, etc.

With discrete data, as in the inclusive classes above, there is no difficulty in allocating any given value to its appropriate group, since there is no value between, say, 29 and 30. With continuous data, however, the value is measured on a continuous scale and may lie in between, e.g. 29.7, 29.8. In such cases we use the exclusive method of data classification.

Example 2
Statistics exam grades. Suppose that 20 statistics students' scores on an exam are as follows:

97, 92, 88, 75, 83, 67, 89, 55, 72, 78, 81, 91, 57, 63, 67, 74, 87, 84, 98, 46

We can construct a frequency table with classes 90-99, 80-89, 70-79, etc. by counting the number of grades in each grade range.
Class     Frequency (f)
90-99     4
80-89     6
70-79     4
60-69     3
50-59     2
40-49     1

Note that the sum of the frequency column is equal to 20, the number of test scores.

In practice, where the values of the variable are all given to the same number of significant figures or decimal places, there is no trouble and we form the groups accordingly, as in the following example.

Example 3
The following data represent the record high temperatures for each of the 50 counties. Construct a grouped frequency distribution for the data using 7 classes.

112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114

Exercise
The lengths (in mm) of 40 spindles were measured with the following results:

Additional Terminology

Class Limits - The values that separate one class in a grouped frequency distribution from another. The limits can actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.
Class Width
The difference between the upper (or lower) class limits of consecutive classes. It is also the positive difference between two consecutive class marks. It is not the difference between the upper and lower limits of the same class.

Class Mark
The middle value of each data class. Also called the midpoint or central value. To find the class mark, average the upper and lower class limits:

class mark = (upper class limit + lower class limit) / 2

Class Boundaries - The boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is no gap between the upper boundary of one class and the lower boundary of the next class. The lower class boundary is found by subtracting 0.5 units from the lower class limit, and the upper class boundary is found by adding 0.5 units to the upper class limit, where one unit is the smallest step in which the data are recorded (1 for whole-number data).

Example: From the frequency table of statistics grades above:
The upper class limits are 99, 89, 79, 69, 59, and 49.
The lower class limits are 90, 80, 70, 60, 50, and 40.
The class marks are 94.5, 84.5, 74.5, 64.5, 54.5, and 44.5.
The width of each class is 10.
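The definitions above can be checked with a few lines of Python. This is only a sketch: the function names and the `unit` parameter (the smallest step in which the data are recorded, assumed to be 1 for whole-number grades) are illustrative choices, not part of the notes.

```python
# Class limits (lower, upper) from the exam-grade frequency table.
classes = [(90, 99), (80, 89), (70, 79), (60, 69), (50, 59), (40, 49)]


def class_mark(lower, upper):
    """Midpoint of a class: the average of its two class limits."""
    return (lower + upper) / 2


def class_boundaries(lower, upper, unit=1):
    """Boundaries lie half a unit outside the class limits.

    `unit` is an assumed parameter: the smallest step in which the data
    are recorded (1 for whole numbers, 0.1 for one decimal place, ...).
    """
    half = unit / 2
    return lower - half, upper + half


marks = [class_mark(lo, hi) for lo, hi in classes]
boundaries = [class_boundaries(lo, hi) for lo, hi in classes]

# Class width: difference between the lower limits of consecutive classes.
width = abs(classes[0][0] - classes[1][0])

print(marks)
print(boundaries)
print(width)
```

Note that the width also equals the distance between the two boundaries of a single class (99.5 - 89.5), but not the distance between its two limits (99 - 90).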
Creating a Frequency Table
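One possible way to build a grouped frequency table by counting is sketched below in Python, using the 20 exam scores of Example 2 and the inclusive classes 40-49, ..., 90-99. The function name and its parameters are hypothetical, and the approach assumes whole-number data with equal class widths.

```python
from collections import Counter

# Exam scores from Example 2.
scores = [97, 92, 88, 75, 83, 67, 89, 55, 72, 78,
          81, 91, 57, 63, 67, 74, 87, 84, 98, 46]


def frequency_table(data, lowest_limit, width, num_classes):
    """Return (lower limit, upper limit, frequency) for each class.

    Assumes whole-number data and inclusive classes of equal width,
    e.g. 40-49, 50-59, ... for lowest_limit=40 and width=10.
    """
    # Each value's class index is how many whole widths it lies above
    # the lowest class limit.
    counts = Counter((x - lowest_limit) // width for x in data)
    table = []
    for k in range(num_classes):
        lower = lowest_limit + k * width
        upper = lower + width - 1
        table.append((lower, upper, counts.get(k, 0)))
    return table


grade_table = frequency_table(scores, lowest_limit=40, width=10, num_classes=6)

# Print from the highest class down, matching the layout used in the notes.
for lower, upper, f in reversed(grade_table):
    print(f"{lower}-{upper}: {f}")
```

The frequencies should add up to the number of data values, which is a quick check that no value has been missed or counted twice.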
Exercises

1) The thicknesses of 20 samples of steel plate are measured and the results (in mm) to 2 s.f. are as follows:

7.3 7.1 6.6 7.0 7.8 7.3 7.5 6.2 6.9 6.7
6.5 6.8 7.2 7.4 6.5 6.9 7.2 7.6 7.0 6.8

Complete a table showing the frequency distribution for regular classes of class width 0.3 mm.

2) Construct a frequency table with 6 data classes from the following data set.

Amount of gasoline purchased by 28 drivers:
7, 4, 18, 4, 9, 8, 8, 7, 6, 2, 9, 5, 9, 12, 4, 14, 15, 7, 10, 2, 3, 11, 4, 4, 9, 12, 5, 3

Mathematical Notation

The following symbols and variables have the meanings given below.

Variables
x = data value
n = number of values in a sample data set
N = number of values in a population data set
f = frequency of a data class

Symbol
Σ indicates the sum of all values for the following variable or expression.

Cumulative Frequency

The cumulative frequency of a data class is the number of data elements in that class and all previous classes. (The classes can be accumulated in ascending or descending order.)
Example:

Class     Frequency (f)   Cumulative Frequency
90-99     4               4
80-89     6               10
70-79     4               14
60-69     3               17
50-59     2               19
40-49     1               20

Notice that the last entry in the cumulative frequency column is n = 20.

Exercise: Add a cumulative frequency column to the table of gasoline purchases.

Relative Frequency

The relative frequency is the frequency of any one data class compared with the sum of the frequencies of all classes (i.e. the total frequency). The result is generally expressed as a percentage. We can calculate the relative frequency for each class as follows:

relative frequency = f / n

Example:

Class     Frequency (f)   Cumulative Frequency   Relative Frequency (%)
90-99     4               4                      20
80-89     6               10                     30
70-79     4               14                     20
60-69     3               17                     15
50-59     2               19                     10
40-49     1               20                     5

Note: The sum of the relative frequencies should be 1, or 100 per cent:

Σ (f / n) = 1

Exercise: Add a relative frequency column to the table of gasoline purchases.
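The cumulative and relative frequency columns for the exam-grade table can be reproduced with a short Python sketch. The variable names are illustrative, and the classes are accumulated from the top class down, as in the table above.

```python
from itertools import accumulate

# Exam-grade frequency table, listed from the highest class down.
classes = ["90-99", "80-89", "70-79", "60-69", "50-59", "40-49"]
freqs = [4, 6, 4, 3, 2, 1]

n = sum(freqs)  # total number of data values

# Running total of the frequencies: each entry adds the current class
# to all classes listed before it.
cumulative = list(accumulate(freqs))

# Relative frequency f / n, expressed as a percentage.
relative_pct = [100 * f / n for f in freqs]

for label, f, cf, rf in zip(classes, freqs, cumulative, relative_pct):
    print(f"{label}  f={f}  cumulative={cf}  relative={rf:.0f}%")
```

Two quick checks follow from the definitions: the last cumulative frequency equals n, and the relative frequencies sum to 100 per cent.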
Graphical Representation of Data

DESCRIPTIVE STATISTICS

1) Histograms

A histogram is a graphical representation of the information in a frequency table in which vertical rectangular blocks are drawn so that:
a) the centre of the base indicates the class mark, and
b) the area of the rectangle represents the percentage frequency.

If the class intervals are regular (all of equal width), the frequency is then represented by the height of the rectangle.

How to draw a histogram
i) Convert counts to percentages (if necessary).
ii) For each bar, find the width and height:
    height = area / class width
iii) Draw and label the axes.
iv) Draw the bars.

Example: A histogram to represent the data for the record high temperatures for each of the 50 counties.

Notice that the bar for each class is centred at the class midpoint (class mark), and the bars for successive classes touch.

Exercise: Construct a histogram for the frequency table of gasoline purchases.
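Steps i) and ii) above can be sketched in Python as follows. The exam-grade table is used for the equal-width case; the second table, in which the two lowest classes are merged into one wider class, is a hypothetical variation introduced only to show what happens when class widths differ. The function name is also an illustrative choice.

```python
def bar_heights(boundaries, freqs):
    """Height of each histogram bar when area represents percentage frequency.

    `boundaries` holds (lower boundary, upper boundary) pairs, one per class.
    """
    n = sum(freqs)
    heights = []
    for (lower, upper), f in zip(boundaries, freqs):
        area = 100 * f / n                        # step i): count -> percent
        heights.append(area / (upper - lower))    # step ii): height = area / width
    return heights


# Exam grades, classes 40-49 up to 90-99, written as class boundaries.
equal_boundaries = [(39.5, 49.5), (49.5, 59.5), (59.5, 69.5),
                    (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
equal_freqs = [1, 2, 3, 4, 6, 4]
equal_heights = bar_heights(equal_boundaries, equal_freqs)

# Hypothetical variation: merge 40-49 and 50-59 into one class of width 20.
merged_boundaries = [(39.5, 59.5), (59.5, 69.5), (69.5, 79.5),
                     (79.5, 89.5), (89.5, 99.5)]
merged_freqs = [3, 3, 4, 6, 4]
merged_heights = bar_heights(merged_boundaries, merged_freqs)

print(equal_heights)
print(merged_heights)
```

With equal widths the heights are proportional to the frequencies. In the merged table the first two classes both have frequency 3, yet the wider bar is only half as tall, because it is the area, not the height, that represents the frequency.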
2) Frequency Polygon

A frequency polygon is a line graph representation of the information in a frequency table. Like a histogram, the vertical axis represents the percentage frequency and the horizontal axis represents the variable being measured in the data set. To construct the graph, a point is plotted for each class at its midpoint, with height given by the frequency (or percentage frequency) of the class. The points are then connected by straight lines.

If the centre points of the tops of the bars of a histogram are joined, the resulting figure is a frequency polygon. If the polygon is extended to include the midpoints of the zero-frequency classes at each end of the histogram, then the area of the complete polygon is equal to the area of the histogram and therefore represents the total frequency of the variable.

Example: using the same data for high temperatures as in the previous example.

3) Ogive

An ogive is a line graph that represents the cumulative frequencies for the classes in a frequency distribution.
x-axis: class boundaries
y-axis: cumulative frequency
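The points of an ogive can be computed before any drawing is done. The sketch below uses the exam-grade table in ascending order and assumes the usual conventions: a less-than ogive plots, at each class boundary, the number of values below that boundary, and a more-than ogive plots the number of values above it. The variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries for the exam grades (classes 40-49 up to 90-99),
# with the class frequencies in the same ascending order.
boundaries = [39.5, 49.5, 59.5, 69.5, 79.5, 89.5, 99.5]
freqs = [1, 2, 3, 4, 6, 4]
n = sum(freqs)

# Less-than ogive: nothing lies below the lowest boundary, then the
# running total is reached at each upper class boundary.
less_than_counts = [0] + list(accumulate(freqs))
less_than = list(zip(boundaries, less_than_counts))

# More-than ogive: whatever is not below a boundary lies above it.
more_than = [(b, n - c) for b, c in less_than]

for point in less_than:
    print("less than", point)
for point in more_than:
    print("more than", point)
```

Joining each list of (boundary, cumulative frequency) points with straight lines gives the corresponding ogive. The less-than curve rises from 0 to n and the more-than curve falls from n to 0.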
Example: for the high temperatures, this is an ogive.

See the illustration of less-than and more-than ogives on the following page.

Other types of graphs include bar charts, Pareto charts, pie charts, scatter plots, etc.

Exercises

1) Construct
a) a histogram
b) a frequency polygon
c) a less-than and a more-than ogive
for the lengths of spindles in the earlier exercise.

2) The table below shows a frequency distribution of the monthly wages in sterling pounds of 70 employees at a company.

Wages (£)        Frequency (f)
50.00-59.99      8
60.00-69.99      10
70.00-79.99      16
80.00-89.99      15
90.00-99.99      10
100.00-119.99    8
120.00-179.99    3

Construct a histogram for this frequency distribution.