Authors: Coman Gentiana. Asparuh Hristov. Daniel Corteso. Fernando Nunez

OUTLIER DETECTOR DOCUMENTATION
VERSION 1.0

Authors: Coman Gentiana, Asparuh Hristov, Daniel Corteso, Fernando Nunez

Copyright Team 6, 2011

Contents

1. Introduction
2. Global variables used
3. Scientific explanation of algorithms
3.1 Standard deviation and standard normal distribution
3.2 Median and quartiles
3.3 Score function
3.4 K-th nearest neighbor
3.5 Local Outlier Factor
3.6 Local Correlation Integral
4. Excel built-in functions used
5. Future implementations
6. References
7. Notes

1. Introduction

This document describes the Outlier Detector Excel add-on. It is aimed at users with basic knowledge of Excel and Visual Basic for Applications, and at future developers of the application. The document is structured as follows:

Chapter 2 - Global variables used
Explanation of the VBA variables used in the application.

Chapter 3 - Scientific explanation of algorithms
The mathematical reasoning behind the algorithms used to detect outliers, together with the advantages and disadvantages of each approach.

Chapter 4 - Excel built-in functions used
A list of the built-in Microsoft Excel functions used in the application.

The application was developed and tested entirely in Microsoft Excel and Visual Basic. The tool is also available as Outlier Detector Tool.xls, but that version might produce different results. We recommend using it with Microsoft Office.

2. Global variables used

To store the input and output related variables safely, we have used global variables that are shared by all the implemented functions. These are:

outputbook As Workbook
Stores the reference to the output workbook selected by the user.

outputsheet As Worksheet
Stores the reference to the output worksheet, from the output workbook, as selected by the user.

inputbook As Workbook
Refers to the workbook containing the input data set.

inputsheet As Worksheet
Refers to the worksheet containing the input data set, from the input workbook.

DTBook As Workbook
Stores the reference to the workbook containing the distance tables, if the user wishes to skip creating them.

inputrange As Range
Stores the reference to the input range selected from the input worksheet.

fdialog As Office.FileDialog
Global file dialog object that enables the user to browse for files.

algorithm As Integer
Stores the value of the algorithm selected in the first tab. The variable can take the following values:
1. Standard deviation algorithm
2. Median and quartiles algorithm
3. Score function algorithm
4. K-th nearest neighbor algorithm
5. Local Outlier Factor algorithm
6. Local Correlation Integral algorithm

3. Scientific explanation of algorithms

This section explains the algorithms used. For each algorithm there is an introductory note, the mathematical reasoning, a section with advantages and a section with disadvantages.

3.1 Standard deviation and standard normal distribution

Introduction

The standard normal distribution is a special case of the normal distribution.

It is a normal distribution with zero mean and unit variance, given by the probability density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

and the distribution function

$$\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

over the domain $x \in (-\infty, \infty)$.

The normal random variable of a standard normal distribution is called a standard score or a z-score. Every normal random variable X can be transformed into a z-score via the following equation:

$$z = \frac{X - \mu}{\sigma}$$

where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.

Applying the formula always produces a transformed distribution with a mean of zero and a standard deviation of one. However, the shape of the distribution is not affected by the transformation. If X is not normal, then the transformed distribution will not be normal either.

One reason the normal distribution is important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. The approach taken in the Outlier Detector application is explained in the following paragraphs.

Mathematical Reasoning

Using the N values of X from the input dataset, the mean is computed as

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

and the standard deviation σ is computed from the same values.

For each X value, its z-score is calculated. If the score is greater than 0, we take into consideration the values that lie between z and +∞. To find the probability that a Z value lies between the z-score z of value X and +∞, we calculate the following factor:

Nz = P(Z>z)

This is the probability that a standard normal random variable (Z) is greater than a given value (z), which is 1-P(Z<z). P(Z<z) represents the cumulative probability associated with a z-score, obtained from the distribution function mentioned in the introduction.

For z-scores greater than 0 we use the standard normal distribution function implemented in Microsoft Excel and subtract its value from 1, while for z-scores lower than or equal to 0 we use the value returned by that function directly. Therefore:

If z > 0: Nz = 1 - NormSDist(z)
If z <= 0: Nz = NormSDist(z)

where NormSDist is the standard normal distribution function from Microsoft Excel.

The resulting value Nz indicates what proportion of points is expected to lie at a greater distance from the average than the point X under consideration. If this value is lower than 0.05, meaning 5%, X is a potential outlier, because it lies far from the average and fewer than 5% of the points are situated farther away than it. It also indicates that the point is unlikely to belong to a standard normal distribution, which makes it an outlier according to this algorithm.

Advantages

The advantage of the standard normal distribution is that we can compare normal distributions with different means and variances after transforming them into standard normal distributions. A precomputed probability table also exists for the standard normal distribution, so it is easy to look up the probability for a z-score, or the z-score for a probability.

In the context of detecting outliers, it is useful to see whether the given dataset follows a normal distribution, because such a distribution has the property that about 95% of its values fall between -2 and +2 standard deviations from the average. If the dataset follows such a distribution, then most values are close to each other, which suggests that they are correct data. On the other hand, if the dataset does not conform to the distribution, there is a strong possibility that outliers exist among the values and that they are detectable.
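The decision rule from the Mathematical Reasoning above can be sketched in a few lines of Python. This is only an illustration and not the add-on's VBA code: the function names are hypothetical, norm_s_dist stands in for Excel's NormSDist, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def norm_s_dist(z):
    """Standard normal cumulative distribution (the role NormSDist plays in Excel)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_probability(z):
    """Nz: probability mass lying farther from the mean than z, on the same side."""
    return 1.0 - norm_s_dist(z) if z > 0 else norm_s_dist(z)


def zscore_outliers(values, threshold=0.05):
    """Return the values whose Nz falls below the threshold (5% in the text)."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    flagged = []
    for x in values:
        z = (x - mu) / sigma
        if tail_probability(z) < threshold:
            flagged.append(x)
    return flagged


if __name__ == "__main__":
    data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print(zscore_outliers(data))
```

Because the threshold applies to each tail separately, a value is flagged when fewer than 5% of a standard normal population would lie beyond it on its own side of the mean.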

Disadvantages

The most important disadvantage of this approach is that it can only be used with one-dimensional datasets, because of the average and standard deviation computations. It cannot detect density-based anomalies and relies only on the standard normal distribution and its probability density function.

3.2 Median and quartiles

Introduction

The algorithm uses the box plot method. A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared. They help to indicate whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

The statistical terms used are:

Median: The median of a dataset is the value for which the number of values greater than it equals the number of values less than it. When the dataset is sorted, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Quartiles: Quartiles, by definition, separate a quarter of the data points from the rest. Roughly speaking, the first quartile is the value under which 25% of the data lie and the third quartile is the value over which 25% of the data lie (so the second quartile is the median itself).

First Quartile, Q1: Following the definitions above, the first quartile is the median of the lower half of the data. If the number of data points is odd, the lower half includes the median.

Third Quartile, Q3: The third quartile is the median of the upper half of the data. If the number of data points is odd, the upper half includes the median.

In summary: the median divides the data into a lower half and an upper half. The lower quartile is the middle value of the lower half. The upper quartile is the middle value of the upper half.

Mathematical Reasoning

The Outlier Detector application uses the box plot method of computing lower and upper inner fences, rather than the interquartile range alone. The median and the first and third quartiles are generated using the corresponding Microsoft Excel functions.

From the two quartiles, the interquartile range is determined using the following formula:

IQ = Q3 - Q1

where IQ stands for interquartile range, a useful measure of the amount of variation in a set of data.

Using the value of the interquartile range, the lower and upper inner fences are computed:

Lower Inner Fence: LI = Q1 - 1.5*IQ
Upper Inner Fence: UI = Q3 + 1.5*IQ

The values situated between the two fences are considered normal, whereas the values outside the fences are seen as outliers.
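The fence computation can be illustrated with a short Python sketch. It is not the add-on's VBA code and the function names are hypothetical. The "inclusive" quantile method is used on the assumption that it matches the interpolation of Excel's Quartile function.

```python
from statistics import quantiles


def inner_fences(values):
    """Return (LI, UI), the lower and upper inner fences of the box plot."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1  # interquartile range
    return q1 - 1.5 * iq, q3 + 1.5 * iq


def boxplot_outliers(values):
    """Return the values lying outside the inner fences."""
    lower, upper = inner_fences(values)
    return [x for x in values if x < lower or x > upper]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print(inner_fences(data))
    print(boxplot_outliers(data))
```

For this sample, Q1 = 3 and Q3 = 7, so IQ = 4 and the fences are -3 and 13; only the value 100 falls outside them.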

The method implemented here is a variation of the box plot, because the full box plot also considers outer fences, situated at three times the interquartile range from each quartile, and distinguishes between strong and weak outliers. Only the inner fences were chosen as thresholds because this distinction is not considered very important, and because the outer fences are very restrictive and might not prove useful for outlier detection in some types of datasets.

Advantages

One advantage of this approach is that, in comparison with the average, the median is not affected by the extreme values in the dataset. The interquartile range ignores the extreme values, whereas the inner fences provide a reasonable range for normal data points.

Disadvantages

A disadvantage of this approach is that it tends to emphasize the tails of a distribution, which are the least certain data points in a set. It also hides many details of the distribution and can only be applied to one-dimensional data sets. In addition, clustering of data points cannot be taken into consideration.

3.3 Score function

Introduction

Scoring techniques assign an anomaly score to each instance in the test data depending on the degree to which that instance is considered an anomaly. In the case of the Outlier Detector, the score is the number of standard deviations by which the value lies from the average.

The standard deviation is a statistic used as a measure of the dispersion or variation in a distribution, equal to the square root of the arithmetic mean of the squares of the deviations from the arithmetic mean.

An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score. In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.

The standard deviation has proven to be an extremely useful measure of spread, in part because it is mathematically tractable. Following the definition above, its formula is:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Mathematical Reasoning

The score function implemented in the Outlier Detector relies on the standard deviation and on the distance of each value from the average.

The first step is to calculate the average and standard deviation of the dataset, after which the distance of each value to the average is computed. The average and standard deviation are computed using the built-in Microsoft Excel functions.

The score of each value is computed by dividing its distance from the average by the standard deviation and taking the absolute value, that is, $\text{score}(x) = |x - \mu| / \sigma$. The scores are therefore non-negative numbers. The score indicates how far each value lies from the average, in number of standard deviations. If a value lies very far from the average, its score will be relatively high, possibly indicating an outlier.

The outliers can be determined in two ways:
1) Taking the maximum score in the dataset and setting the threshold at half of this maximum score.
2) Taking a minimum score, given as input, as the threshold.

Any value whose score is above the threshold is considered an outlier.

Advantages

An advantage of the standard deviation, in comparison with other statistical indicators such as the range, is that it uses all the data to calculate the spread around the average. The range, for example, uses only two data items, the largest and the smallest value, so the standard deviation is a more accurate measure.

The standard deviation gives weight to the deviation of the data from the mean by squaring it, so the greater the deviation, the greater the weight after squaring. This makes it a better indicator for values that lie far from the average, because squaring the difference makes such values more noticeable.

Disadvantages

The standard deviation, like the average, is strongly influenced by the extreme values of a dataset. It can also only be applied to one-dimensional data sets.
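The scoring and the two threshold options described above can be sketched as follows. This is an illustrative Python version, not the add-on's VBA code: the function names and the min_score parameter name are hypothetical, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def scores(values):
    """Score of each value: its distance from the average in standard deviations."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    return [abs(x - mu) / sigma for x in values]


def score_outliers(values, min_score=None):
    """Flag values whose score exceeds the threshold.

    With min_score=None the threshold is half of the maximum score;
    otherwise the given minimum score is used as the threshold.
    """
    s = scores(values)
    threshold = max(s) / 2 if min_score is None else min_score
    return [x for x, score in zip(values, s) if score > threshold]


if __name__ == "__main__":
    data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print(score_outliers(data))                  # threshold: half of the maximum score
    print(score_outliers(data, min_score=0.4))   # threshold: user-supplied minimum score
```

Note that with the first option the value with the maximum score is always flagged, so the option is only meaningful when the dataset is expected to contain at least one outlier.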

3.4 K-th nearest neighbor

Introduction

The nearest neighbor method can be used to detect outliers and can achieve very good results. As the name suggests, it determines the anomalies by comparing the distance of each instance to its k-th nearest neighbor. One therefore needs to define how the distance is computed. The metric used has to be positive-definite and symmetric, but it is not required to satisfy the triangle inequality.

There are many variations of the algorithm, aiming at better and/or faster detection, but all of them can be broadly classified into one of two groups:
1) Techniques that use the distance of a data instance to its k-th nearest neighbor as the anomaly score.
2) Techniques that compute the relative density of each data instance to compute its anomaly score.

We have implemented both variations. The simple k-th nearest neighbor algorithm discussed in this section is of the former type.

Mathematical Reasoning

First of all, we need to define a metric before computing the actual distances. The algorithm normally works under the following assumption: normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors.

However, driven by the concrete topic of our project, we adapted the algorithm as much as possible to the specific theme, so that it produces better and more meaningful results. We have therefore made further assumptions. Before discussing them, note that the project is Outliers detection in health insurance. We were given a data set with patients and their spending on different kinds of medicines.

In our opinion, the metric used should take the squared differences between two patients for each medicine. In this way larger differences become more important. We assume that the health insurance company is more interested in a patient who has spent 20 euro more on a given medicine than in one who has spent 1 euro more on each of 20 different medicines ($20^2 > 20 \cdot 1^2$).

The next slight modification of the algorithm is that if a given patient hasn't spent any money on a particular medicine, then no distance should be added for this category. This assumption was made by taking into account that a patient either needs a given medicine or does not. He therefore either spends some normal amount of money on it, or he doesn't spend any money at all. If he hasn't purchased the medicine, he is of no interest to the insurance company in that category and no additional score should be added.

In other words, the distance between two patients x and y with values $x_i$ and $y_i$ in attribute i is:

$$d(x, y) = \sum_{i \,:\, x_i \neq 0} (x_i - y_i)^2$$

Now we can compute and store the distances. Note that, for storage reasons, we have implemented the method to store information only for the first 150 neighbors. The algorithm initializes two tables: one with the sorted distances and one with the corresponding number (id) of each neighbor. The rows represent the patients and the columns represent the neighbors, because the other way round we would have reached the Excel column limit with some of the input files. This means, for example, that the value in the i-th row and j-th column is the distance between the i-th patient and its j-th nearest neighbor in one of the tables, and the actual number of this j-th nearest neighbor in the other table.

Once the tables are computed, the algorithm simply sorts the patients by their distance to the k-th neighbor, and the ones with the biggest distances are the outliers.

One additional problem is that the user has to choose the number of outliers to detect. We have therefore implemented some graphical output options for this issue, helping the user to decide what the best number should be.

Advantages

1) It performs very well in terms of detected anomalies, since the likelihood that an anomaly forms a close neighborhood in the training data set is very low.
2) Adapting nearest neighbor based techniques to a different data type is straightforward and can lead to very good results.

Disadvantages

1) It doesn't take into account that different clusters may have different densities, and therefore normal instances can sometimes be detected as outliers.
2) Initializing the tables is very time-consuming.
3) If the normal instances in the data do not have enough neighbors with similar normal attributes, the false positive rate for such techniques is high.
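As an illustration of the distance and the ranking described in this section, here is a small Python sketch. It is not the add-on's VBA code: the function names are hypothetical, and the 150-neighbor distance tables are replaced by recomputing the distances directly, which is only reasonable for small inputs.

```python
def spending_distance(x, y):
    """Sum of squared differences, skipping categories where x spent nothing."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y) if xi != 0)


def kth_neighbor_distances(patients, k):
    """For each patient, the distance to its k-th nearest neighbor."""
    result = []
    for i, x in enumerate(patients):
        dists = sorted(
            spending_distance(x, y) for j, y in enumerate(patients) if j != i
        )
        result.append(dists[k - 1])
    return result


def top_outliers(patients, k, count):
    """Indices of the `count` patients with the largest k-th neighbor distance."""
    kth = kth_neighbor_distances(patients, k)
    order = sorted(range(len(patients)), key=lambda i: kth[i], reverse=True)
    return order[:count]


if __name__ == "__main__":
    # Each tuple is one patient's spending per medicine (made-up numbers).
    patients = [(10, 0, 5), (11, 0, 5), (10, 1, 6), (9, 0, 4), (40, 0, 5)]
    print(kth_neighbor_distances(patients, k=2))
    print(top_outliers(patients, k=2, count=1))
```

The sketch also reproduces the motivating comparison from the text: one medicine with a 20 euro difference contributes 400 to the distance, while 20 medicines with a 1 euro difference each contribute only 20 in total.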

3.5 Local Outlier Factor

Introduction

As indicated by the title, the local outlier factor is based on the concept of a local density, where locality is given by the k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters.

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors.

Mathematical Reasoning

Let k-distance(A) be the distance of the object A to its k-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which in the case of a "tie" can be more than k objects. We denote the set of k nearest neighbors as $N_k(A)$.

[Figure: Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor.]

This distance is used to define what is called the reachability distance:

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\}$$

In words, the reachability distance of an object A from B is the true distance between the two objects, but at least the k-distance of B. Objects that belong to the k nearest neighbors of B are considered to be equally distant. The reason for this distance is to get more stable results. Note that this is not a distance in the mathematical sense, since it is not symmetric.

The local reachability density of an object A is defined by

$$\mathrm{lrd}(A) = 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right)$$

which is the inverse of the average reachability distance of the object A from its neighbors. Note that it is not the average reachability of the neighbors from A (which by definition would be k-distance(A)), but the distance at which A can be "reached" from its neighbors.

The local reachability densities are then compared with those of the neighbors using

$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}(B)}{|N_k(A)|} \Big/ \mathrm{lrd}(A)$$

which is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.

Advantages

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance from a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.

While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. It has been shown experimentally to work very well in numerous setups, often outperforming its competitors, for example in network intrusion detection.

Disadvantages

The resulting values are quotients and hard to interpret. A value of 1 or less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another dataset and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a single dataset due to the locality of the method.
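The definitions of k-distance, reachability distance, local reachability density and LOF can be put together in a compact Python sketch. It is illustrative only and not the add-on's VBA code: the function names are hypothetical, the Euclidean distance is an assumption (the add-on works from its own distance tables), and the sketch assumes that no point has k or more exact duplicates, since that would make the average reachability distance zero.

```python
import math


def k_neighborhood(points, i, k):
    """Return (k-distance of point i, indices of its k nearest neighbors incl. ties)."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def lof_scores(points, k):
    """Local Outlier Factor of every point, following the definitions in the text."""
    n = len(points)
    info = [k_neighborhood(points, i, k) for i in range(n)]

    def reach_dist(a, b):
        # Reachability distance of a from b: true distance, but at least k-distance(b).
        return max(info[b][0], math.dist(points[a], points[b]))

    # Local reachability density: inverse of the average reachability distance.
    lrd = []
    for a in range(n):
        neighbors = info[a][1]
        avg_reach = sum(reach_dist(a, b) for b in neighbors) / len(neighbors)
        lrd.append(1.0 / avg_reach)

    # LOF: average lrd of the neighbors divided by the point's own lrd.
    return [
        sum(lrd[b] for b in info[a][1]) / (len(info[a][1]) * lrd[a])
        for a in range(n)
    ]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for p, score in zip(pts, lof_scores(pts, k=2)):
        print(p, round(score, 2))
```

In this example the four corner points of the unit square obtain a LOF of 1, while the isolated point obtains a value far above 1.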

3.6 Local Correlation Integral

Introduction

In this section, we describe the LOCI (LOcal Correlation Integral) algorithm for detecting outliers. This algorithm computes exact MDEF and $\sigma_{MDEF}$ values for all objects, and then reports an outlier whenever MDEF is more than three times larger than $\sigma_{MDEF}$ for the same radius. Thus the key to a fast algorithm is an efficient computation of the MDEF and $\sigma_{MDEF}$ values.

Mathematical Reasoning

The multi-granularity deviation factor (MDEF) can cope with local density variations in the feature space and can detect both isolated outliers and outlying clusters. The LOCI algorithm uses an exact computation of MDEF values. The MDEF is associated with the correlation integral; it is an aggregate measure.

Let the r-neighborhood of an object $p_i$ be the set of objects within distance r of $p_i$. Intuitively, the MDEF at radius r for a point $p_i$ is the relative deviation of its local neighborhood density from the average local neighborhood density in its r-neighborhood. Thus, an object whose neighborhood density matches the average local neighborhood density will have an MDEF of 0. In contrast, outliers will have MDEFs far from 0.

Let $n(p_i, \alpha r)$ be the number of objects in the αr-neighborhood of $p_i$. Let $\hat{n}(p_i, r, \alpha)$ be the average, over all objects p in the r-neighborhood of $p_i$, of $n(p, \alpha r)$.

The use of two radii serves to decouple the neighbor size radius αr from the radius r over which we are averaging.

The local correlation integral is the function $\hat{n}(p_i, r, \alpha)$ over all r.

Definition MDEF: For any $p_i$, r and α, we define the multi-granularity deviation factor (MDEF) at radius (or scale) r as:

$$MDEF(p_i, r, \alpha) = \frac{\hat{n}(p_i, r, \alpha) - n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)} = 1 - \frac{n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)}$$

For faster computation of MDEF, we will sometimes estimate both $n(p_i, \alpha r)$ and $\hat{n}(p_i, r, \alpha)$. This leads to the following definitions:

Definition Counting and sampling neighborhood: The counting neighborhood (or αr-neighborhood) is the neighborhood of radius αr, over which each $n(p, \alpha r)$ is estimated. The sampling neighborhood (or r-neighborhood) is the neighborhood of radius r, over which we collect samples of $n(p, \alpha r)$ in order to estimate $\hat{n}(p_i, r, \alpha)$.

$\sigma_{MDEF}(p_i, r, \alpha)$ is the normalized standard deviation, where $\sigma_{\hat{n}}(p_i, r, \alpha)$ is the standard deviation of $n(p, \alpha r)$ over the sampling neighborhood of $p_i$:

$$\sigma_{MDEF}(p_i, r, \alpha) = \frac{\sigma_{\hat{n}}(p_i, r, \alpha)}{\hat{n}(p_i, r, \alpha)}$$

Given the above definition of MDEF, we still have to make a number of decisions. In particular, we need to answer the following questions:

1. Sampling neighborhood: Which points constitute the sampling neighborhood of $p_i$, or, in other words, which points do we average over to compute $\hat{n}$ (and, in turn, MDEF) for the $p_i$ in question? For each point and counting radius, the sampling neighborhood is selected to be large enough to contain enough samples. We choose α = 1/2 in all exact computations.

2. Scale: Regardless of the choice of neighborhood, over what range of distances do we compare n and $\hat{n}$? This leads to the following definition, where N is the number of objects and $NN(p_i, m)$ is the m-th nearest neighbor of $p_i$.

Definition Critical Distance: For $1 \le m \le N$, we call $d(NN(p_i, m), p_i)$ a critical distance of $p_i$ and $d(NN(p_i, m), p_i)/\alpha$ an α-critical distance of $p_i$.

We need only consider radii that are critical or α-critical. In a pre-processing pass, we determine the critical and α-critical distances $D_i$ for each object $p_i$.

In our algorithm the scale $[r_{min}, r_{max}]$ depends on $n_{min}$ (the minimal number of neighbors of a point) and $n_{max}$ (the maximal number of neighbors of a point). This leads to $r_{min} = d(NN(p_i, n_{min}), p_i)$ and $r_{max} = d(NN(p_i, n_{max}), p_i)$.

3. Flagging: After computing the MDEF values (over a certain range of distances), how do we use them to choose the outliers? Considering each object $p_i$ in turn, and considering increasing radius r from $D_i$, we maintain $n(p_i, \alpha r)$, $\hat{n}(p_i, r, \alpha)$, $MDEF(p_i, r, \alpha)$ and $\sigma_{MDEF}(p_i, r, \alpha)$. A point is flagged as an outlier if, for any r in $[r_{min}, r_{max}]$,

$$MDEF(p_i, r, \alpha) > k_\sigma \, \sigma_{MDEF}(p_i, r, \alpha)$$

In the algorithm, we use $k_\sigma = 3$. The method thus selects a point as an outlier if its MDEF value deviates significantly (by more than three standard deviations) from the local averages.

Advantages

Like the state of the art, it can detect outliers and groups of outliers (or micro-clusters). It also includes several of the previous methods (or slight variants thereof) as special cases.

Going beyond previous methods, it proposes an automatic, data-dictated cut-off to determine whether a point is an outlier; in contrast, previous methods let the users decide, providing them with no hints as to what cut-off is suitable for each dataset.

This method successfully deals with both local density and multiple granularities.

The exact LOCI method can often, but not always, be computed as quickly as previous methods.

Extensive experiments on synthetic and real data show that LOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that it quickly spots outliers, expected and unexpected.

Disadvantages

The only disadvantage is that for large datasets the computations are complex and time-consuming, and processing takes a long time even with the distance tables already computed.
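To make the MDEF definitions and the flagging rule more concrete, here is a simplified Python sketch. It is not the add-on's VBA code and not the exact LOCI algorithm: the function names are hypothetical, the Euclidean distance is an assumption, the radii are passed in by the caller instead of being derived from the critical distances between rmin and rmax, and every neighborhood is assumed to contain its own center point.

```python
import math
from statistics import mean, pstdev


def count_within(points, center, radius):
    """n(center, radius): number of objects within `radius` of `center`."""
    return sum(1 for p in points if math.dist(center, p) <= radius)


def mdef(points, i, r, alpha=0.5):
    """Return (MDEF, sigma_MDEF) of point i at radius r."""
    p_i = points[i]
    # Sampling neighborhood: all objects within distance r of p_i.
    sampling = [p for p in points if math.dist(p_i, p) <= r]
    # Counting neighborhood sizes n(p, alpha * r) for every sampled object.
    counts = [count_within(points, p, alpha * r) for p in sampling]
    n_hat = mean(counts)
    n_i = count_within(points, p_i, alpha * r)
    return 1 - n_i / n_hat, pstdev(counts) / n_hat


def is_loci_outlier(points, i, radii, alpha=0.5, k_sigma=3):
    """Flag point i if MDEF > k_sigma * sigma_MDEF for any of the given radii."""
    for r in radii:
        m, s = mdef(points, i, r, alpha)
        if m > k_sigma * s:
            return True
    return False


if __name__ == "__main__":
    # Twenty evenly spaced points on a line plus one isolated point.
    pts = [(float(j),) for j in range(20)] + [(40.0,)]
    print(is_loci_outlier(pts, 20, radii=[25, 40]))  # isolated point
    print(is_loci_outlier(pts, 10, radii=[4, 8]))    # point inside the cluster
```

In this example the isolated point is flagged at radius 40, where its counting neighborhood contains only itself while the neighborhoods of the cluster points contain 20 objects each. The point inside the cluster is not flagged, because its neighborhood count is at least as large as the local average.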

4. Excel built-in functions used

This section describes the Excel built-in functions that were used in the VBA code. These functions simplified the work and also perform in a timely manner. Each function is explained in the corresponding table, together with its call in Visual Basic.

Mathematical and statistical functions

Name | Call | Description
Average | WorksheetFunction.average() | Computes the average of the input data set, which is referenced in the inputrange variable.
Standard deviation | WorksheetFunction.stDev() | Computes the standard deviation of a range of cells, taking the inputrange variable as parameter.
Standard normal distribution | WorksheetFunction.NormSDist() | Computes the standard normal distribution factor for the z-score of each data point.
Quartile | WorksheetFunction.Quartile() | Computes the specified quartile of a range of cells. The quartile is given as an integer, 1 or 3 in the case of the second algorithm.
Median | WorksheetFunction.median() | Computes the median of the range of cells specified as input, in our case the inputrange variable.
Round Down | WorksheetFunction.RoundDown() | Rounds a given real number down to the specified number of decimal places.
Round Up | WorksheetFunction.RoundUp() | Rounds a given real number up to the specified number of decimal places.
Sort fields | Worksheet.Sort.SortFields.Add() | Sorts a range of cells based on the selected column and conditions.
Maximum | WorksheetFunction.Max() | Computes the maximum value of the range parameter, in our case the inputrange.

Graphical representation functions

Name | Call | Description
Add chart | ActiveSheet.Shapes.AddChart | Creates a new chart in the output sheet, which is populated with the data points from the inputrange.
Color outliers | ActiveChart.SeriesCollection(1).Points(point).markerbackgroundcolor | Colors the outlier points of the scatter plot red.
Cut to clipboard | Selection.Cut | Cuts the selected chart to the clipboard for later placement inside the user form.

Special module

To display the graphs in the user form, we have used the module compastepicture, which copies the graph from the clipboard and places it inside forms. The module was created by Stephen Bullen and was made available as open source code on the Internet.

5. Future implementations

The program is already fairly complete, but it can always be improved in future implementations. There are some ideas to enhance its usefulness.

One of these ideas is the use of text files as input or output. At present the input has to be a worksheet, but the program would be more useful if this were not required. The input information is often available as plain text, and it is very tedious to enter all of it into a worksheet by hand. Reading plain text input would therefore be a good new feature for a future implementation.

On the other hand, the output could also be written as plain text. This would help to store the data in some cases, and the stored data could be used again as input for the program. For example, the distance tables computed by the k-th nearest neighbor algorithm could be stored in plain text and used later as input for the LOF algorithm.

6. References

[1] V. Chandola, A. Banerjee, V. Kumar (2009). Anomaly Detection: A Survey. ACM Computing Surveys.
[2] M.M. Breunig, H.-P. Kriegel, R.T. Ng, J. Sander (2000). LOF: Identifying Density-Based Local Outliers. Proceedings of ACM SIGMOD, Texas.
[3] S. Papadimitriou, H. Kitagawa, P.B. Gibbons, C. Faloutsos (2003). LOCI: Fast Outlier Detection Using the Local Correlation Integral. Proceedings of the 19th ICDE.
[4] Russell Langley (1971). Practical Statistics Simply Explained. Dover Publications, revised edition.

7. Notes

The application was created as the project for the Optimization of Business Processes course held at Vrije University in January.

The creators of the application are:
Gentiana Coman
Asparuh Hristov
Daniel Corteso
Fernando Nunez

All copyrights belong to the persons mentioned above. The application is free to use.


More information

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies. Instructions: You are given the following data below these instructions. Your client (Courtney) wants you to statistically analyze the data to help her reach conclusions about how well she is teaching.

More information

Measures of Dispersion

Measures of Dispersion Measures of Dispersion 6-3 I Will... Find measures of dispersion of sets of data. Find standard deviation and analyze normal distribution. Day 1: Dispersion Vocabulary Measures of Variation (Dispersion

More information

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years. 3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

IT 403 Practice Problems (1-2) Answers

IT 403 Practice Problems (1-2) Answers IT 403 Practice Problems (1-2) Answers #1. Using Tukey's Hinges method ('Inclusionary'), what is Q3 for this dataset? 2 3 5 7 11 13 17 a. 7 b. 11 c. 12 d. 15 c (12) #2. How do quartiles and percentiles

More information

Measures of Central Tendency

Measures of Central Tendency Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

CHAPTER 2: SAMPLING AND DATA

CHAPTER 2: SAMPLING AND DATA CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),

More information

STA Module 4 The Normal Distribution

STA Module 4 The Normal Distribution STA 2023 Module 4 The Normal Distribution Learning Objectives Upon completing this module, you should be able to 1. Explain what it means for a variable to be normally distributed or approximately normally

More information

STA /25/12. Module 4 The Normal Distribution. Learning Objectives. Let s Look at Some Examples of Normal Curves

STA /25/12. Module 4 The Normal Distribution. Learning Objectives. Let s Look at Some Examples of Normal Curves STA 2023 Module 4 The Normal Distribution Learning Objectives Upon completing this module, you should be able to 1. Explain what it means for a variable to be normally distributed or approximately normally

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

15 Wyner Statistics Fall 2013

15 Wyner Statistics Fall 2013 15 Wyner Statistics Fall 2013 CHAPTER THREE: CENTRAL TENDENCY AND VARIATION Summary, Terms, and Objectives The two most important aspects of a numerical data set are its central tendencies and its variation.

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015 MAT 142 College Mathematics Statistics Module ST Terri Miller revised July 14, 2015 2 Statistics Data Organization and Visualization Basic Terms. A population is the set of all objects under study, a sample

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd Chapter 3: Data Description - Part 3 Read: Sections 1 through 5 pp 92-149 Work the following text examples: Section 3.2, 3-1 through 3-17 Section 3.3, 3-22 through 3.28, 3-42 through 3.82 Section 3.4,

More information

Chapter 1. Looking at Data-Distribution

Chapter 1. Looking at Data-Distribution Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw

More information

+ Statistical Methods in

+ Statistical Methods in 9/4/013 Statistical Methods in Practice STA/MTH 379 Dr. A. B. W. Manage Associate Professor of Mathematics & Statistics Department of Mathematics & Statistics Sam Houston State University Discovering Statistics

More information

Lecture 3: Chapter 3

Lecture 3: Chapter 3 Lecture 3: Chapter 3 C C Moxley UAB Mathematics 12 September 16 3.2 Measurements of Center Statistics involves describing data sets and inferring things about them. The first step in understanding a set

More information

AP Statistics Summer Assignment:

AP Statistics Summer Assignment: AP Statistics Summer Assignment: Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this

More information

CHAPTER 2 DESCRIPTIVE STATISTICS

CHAPTER 2 DESCRIPTIVE STATISTICS CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of

More information

Chapter 5: Outlier Detection

Chapter 5: Outlier Detection Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.

More information

Descriptive Statistics

Descriptive Statistics Chapter 2 Descriptive Statistics 2.1 Descriptive Statistics 1 2.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Display data graphically and interpret graphs:

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

CHAPTER 3: Data Description

CHAPTER 3: Data Description CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a

More information

Measures of Position

Measures of Position Measures of Position In this section, we will learn to use fractiles. Fractiles are numbers that partition, or divide, an ordered data set into equal parts (each part has the same number of data entries).

More information

Univariate Statistics Summary

Univariate Statistics Summary Further Maths Univariate Statistics Summary Types of Data Data can be classified as categorical or numerical. Categorical data are observations or records that are arranged according to category. For example:

More information

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically

More information

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated

More information

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016) CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1 Daphne Skipper, Augusta University (2016) 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation Objectives: 1. Learn the meaning of descriptive versus inferential statistics 2. Identify bar graphs,

More information

Chapter 2: The Normal Distribution

Chapter 2: The Normal Distribution Chapter 2: The Normal Distribution 2.1 Density Curves and the Normal Distributions 2.2 Standard Normal Calculations 1 2 Histogram for Strength of Yarn Bobbins 15.60 16.10 16.60 17.10 17.60 18.10 18.60

More information

MAT 110 WORKSHOP. Updated Fall 2018

MAT 110 WORKSHOP. Updated Fall 2018 MAT 110 WORKSHOP Updated Fall 2018 UNIT 3: STATISTICS Introduction Choosing a Sample Simple Random Sample: a set of individuals from the population chosen in a way that every individual has an equal chance

More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

MATH 112 Section 7.2: Measuring Distribution, Center, and Spread

MATH 112 Section 7.2: Measuring Distribution, Center, and Spread MATH 112 Section 7.2: Measuring Distribution, Center, and Spread Prof. Jonathan Duncan Walla Walla College Fall Quarter, 2006 Outline 1 Measures of Center The Arithmetic Mean The Geometric Mean The Median

More information

Measures of Central Tendency

Measures of Central Tendency Measures of Central Tendency MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2017 Introduction Measures of central tendency are designed to provide one number which

More information

3.2-Measures of Center

3.2-Measures of Center 3.2-Measures of Center Characteristics of Center: Measures of center, including mean, median, and mode are tools for analyzing data which reflect the value at the center or middle of a set of data. We

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Chapter 2: Descriptive Statistics

Chapter 2: Descriptive Statistics Chapter 2: Descriptive Statistics Student Learning Outcomes By the end of this chapter, you should be able to: Display data graphically and interpret graphs: stemplots, histograms and boxplots. Recognize,

More information

Page 1. Graphical and Numerical Statistics

Page 1. Graphical and Numerical Statistics TOPIC: Description Statistics In this tutorial, we show how to use MINITAB to produce descriptive statistics, both graphical and numerical, for an existing MINITAB dataset. The example data come from Exercise

More information

appstats6.notebook September 27, 2016

appstats6.notebook September 27, 2016 Chapter 6 The Standard Deviation as a Ruler and the Normal Model Objectives: 1.Students will calculate and interpret z scores. 2.Students will compare/contrast values from different distributions using

More information

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Graphical Analysis of Data using Microsoft Excel [2016 Version] Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

MATH& 146 Lesson 8. Section 1.6 Averages and Variation

MATH& 146 Lesson 8. Section 1.6 Averages and Variation MATH& 146 Lesson 8 Section 1.6 Averages and Variation 1 Summarizing Data The distribution of a variable is the overall pattern of how often the possible values occur. For numerical variables, three summary

More information

Chapter 3 Analyzing Normal Quantitative Data

Chapter 3 Analyzing Normal Quantitative Data Chapter 3 Analyzing Normal Quantitative Data Introduction: In chapters 1 and 2, we focused on analyzing categorical data and exploring relationships between categorical data sets. We will now be doing

More information

MATHEMATICS Grade 7 Advanced Standard: Number, Number Sense and Operations

MATHEMATICS Grade 7 Advanced Standard: Number, Number Sense and Operations Standard: Number, Number Sense and Operations Number and Number Systems A. Use scientific notation to express large numbers and numbers less than one. 1. Use scientific notation to express large numbers

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Library, Teaching & Learning 014 Summary of Basic data Analysis DATA Qualitative Quantitative Counted Measured Discrete Continuous 3 Main Measures of Interest Central Tendency Dispersion

More information

Chapter 3: Describing, Exploring & Comparing Data

Chapter 3: Describing, Exploring & Comparing Data Chapter 3: Describing, Exploring & Comparing Data Section Title Notes Pages 1 Overview 1 2 Measures of Center 2 5 3 Measures of Variation 6 12 4 Measures of Relative Standing & Boxplots 13 16 3.1 Overview

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis.

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis. 1.3 Density curves p50 Some times the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve. It is easier to work with a smooth curve, because the histogram

More information

Understanding and Comparing Distributions. Chapter 4

Understanding and Comparing Distributions. Chapter 4 Understanding and Comparing Distributions Chapter 4 Objectives: Boxplot Calculate Outliers Comparing Distributions Timeplot The Big Picture We can answer much more interesting questions about variables

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. Population: Census: Biased: Sample: The entire group of objects or individuals considered

More information

Chapter 2 Modeling Distributions of Data

Chapter 2 Modeling Distributions of Data Chapter 2 Modeling Distributions of Data Section 2.1 Describing Location in a Distribution Describing Location in a Distribution Learning Objectives After this section, you should be able to: FIND and

More information

Section 9: One Variable Statistics

Section 9: One Variable Statistics The following Mathematics Florida Standards will be covered in this section: MAFS.912.S-ID.1.1 MAFS.912.S-ID.1.2 MAFS.912.S-ID.1.3 Represent data with plots on the real number line (dot plots, histograms,

More information

Decimals should be spoken digit by digit eg 0.34 is Zero (or nought) point three four (NOT thirty four).

Decimals should be spoken digit by digit eg 0.34 is Zero (or nought) point three four (NOT thirty four). Numeracy Essentials Section 1 Number Skills Reading and writing numbers All numbers should be written correctly. Most pupils are able to read, write and say numbers up to a thousand, but often have difficulty

More information

UNIT 1A EXPLORING UNIVARIATE DATA

UNIT 1A EXPLORING UNIVARIATE DATA A.P. STATISTICS E. Villarreal Lincoln HS Math Department UNIT 1A EXPLORING UNIVARIATE DATA LESSON 1: TYPES OF DATA Here is a list of important terms that we must understand as we begin our study of statistics

More information

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency Math 14 Introductory Statistics Summer 008 6-9-08 Class Notes Sections 3, 33 3: 1-1 odd 33: 7-13, 35-39 Measures of Central Tendency odd Notation: Let N be the size of the population, n the size of the

More information

Robust Linear Regression (Passing- Bablok Median-Slope)

Robust Linear Regression (Passing- Bablok Median-Slope) Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their

More information

Probability and Statistics. Copyright Cengage Learning. All rights reserved.

Probability and Statistics. Copyright Cengage Learning. All rights reserved. Probability and Statistics Copyright Cengage Learning. All rights reserved. 14.5 Descriptive Statistics (Numerical) Copyright Cengage Learning. All rights reserved. Objectives Measures of Central Tendency:

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

6th Grade Vocabulary Mathematics Unit 2

6th Grade Vocabulary Mathematics Unit 2 6 th GRADE UNIT 2 6th Grade Vocabulary Mathematics Unit 2 VOCABULARY area triangle right triangle equilateral triangle isosceles triangle scalene triangle quadrilaterals polygons irregular polygons rectangles

More information

You will begin by exploring the locations of the long term care facilities in Massachusetts using descriptive statistics.

You will begin by exploring the locations of the long term care facilities in Massachusetts using descriptive statistics. Getting Started 1. Create a folder on the desktop and call it your last name. 2. Copy and paste the data you will need to your folder from the folder specified by the instructor. Exercise 1: Explore the

More information

AND NUMERICAL SUMMARIES. Chapter 2

AND NUMERICAL SUMMARIES. Chapter 2 EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

Middle School Math Course 3

Middle School Math Course 3 Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,

More information








