
In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball various distances. There are several factors that could be adjusted on the statapult that might affect the distance the ball is thrown. For our example, we will look at the following factors and factor levels.

A Taguchi L8 design was run. The response we wished to measure was distance. We ran three replicates of each design matrix setup. The Data Entry sheet from DOE Wisdom software is shown here. Now let's talk about the calculations for the Analysis of Variance, or ANOVA. The worksheet for our statapult design is shown here.

If we enter our data into a DOE software package and then ask it to model the data, the result is the multiple regression output shown here. From this ANOVA, we can build a model of our experiment. There is obviously a lot of information in our ANOVA. Not only can we build a model, we can also determine how well the prediction equation models our response (distance) in the range of interest. This goodness of fit will be evaluated as a whole and in parts.
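For readers who want to reproduce this step outside of a dedicated DOE package, here is a minimal sketch in Python. The factor names A, B, C, the coded levels, and the distance values are all invented for illustration; they are not the statapult data from the lesson, and statsmodels stands in for DOE Wisdom.

```python
# Hypothetical sketch (not the lesson's actual data or software): build a
# 2^3 factorial with three replicates per run and ask statsmodels for the
# regression fit and its ANOVA table.  The distance values are invented.
import itertools
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

fake_distances = {                          # invented responses for each run
    (-1, -1, -1): [52, 55, 50], (-1, -1, 1): [61, 63, 60],
    (-1, 1, -1): [70, 68, 72],  (-1, 1, 1): [80, 79, 83],
    (1, -1, -1): [58, 57, 60],  (1, -1, 1): [67, 69, 66],
    (1, 1, -1): [76, 78, 75],   (1, 1, 1): [88, 86, 90],
}
rows = [{"A": a, "B": b, "C": c, "distance": y}
        for (a, b, c) in itertools.product([-1, 1], repeat=3)
        for y in fake_distances[(a, b, c)]]
df = pd.DataFrame(rows)

# Saturated two-level model: three main effects plus all interactions (7 terms).
model = ols("distance ~ A * B * C", data=df).fit()
print(sm.stats.anova_lm(model))             # sequential ANOVA table
print(model.summary())                      # multiple regression output
```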

To understand the greater part of a regression table, it is important for us to group the data in certain ways. We will use two of these groupings to estimate population variances. Population denotes all possible responses at the experimental levels. The data we collect in an experiment is just a sample of that population. If we repeated the same experiment, we would probably collect a different sample from that same population. We are examining variance because there are statistical tools at our disposal that lend themselves directly to comparisons of variances in samples from the same population. This will lead to judgments of significance. For simplicity, let's suppose we have only one factor at two levels and eight runs that graphically look like this. Our prediction equation ŷ provides a best fit for the data.

Overlaying the mean of the data points, ȳ, we can begin to look at the variance in the data and, using that variance, estimate the variance of all the possible population responses in this factor range. In this figure, we estimate population variance from the variance of the entire data group about ȳ.

In this figure, we estimate population variance by pooling the variances of each of the subgroups about its own mean. (This figure shows Variance within subgroups.) In this figure, we estimate population variance by finding the variance of the mean of each subgroup of data about the grand mean ȳ. (This figure shows Variance between subgroups.) We will use the Variance within subgroups estimate and the Variance between subgroups estimate to judge the significance of the model and the significance of individual factors (for a two-level design).
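As a concrete, if invented, illustration of these three views of the data, the short sketch below takes one factor at two levels with four responses at each level and computes the overall variance about ȳ, the pooled within-subgroup variance, and the between-subgroup variance of the subgroup means about ȳ. The response values are made up for the example.

```python
# Minimal sketch of the three variance estimates for one factor at two levels
# (four invented responses at each level, eight runs total).
import numpy as np

low  = np.array([50.0, 53.0, 49.0, 52.0])      # responses at the low level
high = np.array([60.0, 58.0, 62.0, 61.0])      # responses at the high level
all_y = np.concatenate([low, high])
grand_mean = all_y.mean()                       # y-bar

# Variance of the entire data group about y-bar.
overall_var = all_y.var(ddof=1)

# Variance within subgroups: pool each subgroup's variance about its own mean.
ss_within = ((low - low.mean())**2).sum() + ((high - high.mean())**2).sum()
var_within = ss_within / ((len(low) - 1) + (len(high) - 1))

# Variance between subgroups: variance of the subgroup means about y-bar,
# weighted by subgroup size (one degree of freedom for two subgroups).
ss_between = (len(low) * (low.mean() - grand_mean)**2
              + len(high) * (high.mean() - grand_mean)**2)
var_between = ss_between / (2 - 1)

print(overall_var, var_within, var_between)
```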

Using the empirical percentages for a normal distribution, we expect 68.26% of any further statapult firings to fall inside ŷ ± √MSE (one standard deviation). In our statapult example, we have three factors (A, B, and C) and one response (distance, y). This is a four-dimensional space that will be difficult to visualize. The three data points collected in a run represent a data subgroup. We have eight data subgroups. If this were a three-dimensional space, we would picture these data subgroups floating above and below the appropriate settings for factors A and B. The quantity √MSE would represent a distance above and below the surface within which we would expect 68.26% of all future values to fall. We refer to this surface as the Response Surface.

To obtain this "best" estimate of the population variance (MSE), we said that we would somehow pool the variance of subgroups. From a mechanical standpoint, we said we would compute a sum of squares and divide by degrees of freedom. For our example, the computations are shown here. Although it seems hidden in the mechanics, we have actually pooled the variances from each run. MSE is an overall measure of variation. Each subgroup is the set of data collected at a run setting. Since the predicted value for each data subgroup is the mean (for a two-level design), let's examine the SSE for one run. If we divide this by the number of data points for that run minus one, you will see the familiar form of the variance for that run.

Proceeding in this fashion, we can obtain the SSE for a two-level design simply by finding the variance of the responses for each run, multiplying each by (n_RUNi − 1), and summing them. The calculations for our example are shown here. Notice that the Sum of Squares Error, or SSE, is also known as the Sum of Squares Residual on our ANOVA.

Since the degrees of freedom have entered our computations, let's take a minute to discuss them. Suppose you know that the total of five numbers is 25. How many free choices do you have in selecting the numbers that will make this happen? We propose that the first four numbers are up to you and that the fifth is predetermined by the sum. Therefore, you have four degrees of freedom. In our statapult example, we had three data points in each run to compute a variance for that run. This gives us two degrees of freedom for each of our runs and a total of 16 degrees of freedom (2 df × 8 runs) for the Sum of Squares Error. Therefore, the Mean Square Error (or MSE) for a two-level design is shown here. This number appears on our ANOVA.
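A small numerical sketch of that pooling may help; it follows exactly the recipe just described, but uses invented replicate distances rather than the lesson's worksheet.

```python
# Sketch of the pooling, using invented replicate distances for the eight runs.
# SSE is the sum over runs of (n_run - 1) times that run's variance; the error
# degrees of freedom are the sum of (n_run - 1); MSE = SSE / df_error.
import numpy as np

runs = [
    [52, 55, 50], [61, 63, 60], [70, 68, 72], [80, 79, 83],
    [58, 57, 60], [67, 69, 66], [76, 78, 75], [88, 86, 90],
]

sse, df_error = 0.0, 0
for reps in runs:
    reps = np.asarray(reps, dtype=float)
    sse += (len(reps) - 1) * reps.var(ddof=1)   # (n_run - 1) * run variance
    df_error += len(reps) - 1                   # 2 df per run here

mse = sse / df_error                            # df_error = 16 for 8 runs x 3 reps
print(sse, df_error, mse)
```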

The remainder of our analysis is a comparison of the population variance estimate between subgroups (Mean Square Between, or MSB, considering all factors, then each separately) with the Mean Square Error. Take another look at the variance between subgroups shown here. It seems reasonable to state that the only time the between estimate will be close to the error estimate (the MSE) is when there is very little shift in the subgroups about ȳ. If there were very little shift in response between levels of all factors, then the prediction equation would not predict much better than the mean. For each factor's influence considered separately, this would indicate that the factor has little influence on the response. So, in order for the model to be a good predictor or for a factor to influence the response, the variation between subgroups must be somewhat bigger than the MSE. (Note: "somewhat bigger" will be defined by the t, z, or F test as a certain measure of confidence that the model has detected a significant shift in the mean, or that a factor should be included in the model.)

The computations of the MSB are as straightforward as those of the MSE. When we look at this between estimate for the overall model, we will refer to it as the Mean Square Regression (MSR). For individual factors, we will use MSB. First, let's attack the MSR. For our example, we will compute a sum of squares and then adjust it with the degrees of freedom we can attribute to the regression model. Once again, the computations have partially hidden what we are doing. For each run, you have probably noticed that the predicted value was reported three times, once for each data point taken during that run. Therefore, the sum of squares is nothing more than the sum, over runs, of the number of data points collected during each run times the squared difference between the predicted value for that run and the overall mean.

To use this sum of squares to estimate a population variance, we still need to adjust it with the degrees of freedom. Since our data points for MSR are the predicted values for each run setting, we have eight data points. The degrees of freedom will be 7 for the regression model. Thus, the equation for MSR is shown here.
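Continuing the same invented example, the sketch below computes SSR and MSR under the assumption that the model is saturated, so the predicted value for each run is simply that run's mean.

```python
# Sketch of SSR and MSR with the same invented replicate distances.  For a
# saturated two-level model the predicted value at each run setting is that
# run's mean, so SSR sums n_run * (run mean - grand mean)^2 over the runs.
import numpy as np

runs = [
    [52, 55, 50], [61, 63, 60], [70, 68, 72], [80, 79, 83],
    [58, 57, 60], [67, 69, 66], [76, 78, 75], [88, 86, 90],
]
all_y = np.concatenate([np.asarray(r, dtype=float) for r in runs])
grand_mean = all_y.mean()

ssr = sum(len(r) * (np.mean(r) - grand_mean)**2 for r in runs)
df_regression = len(runs) - 1                   # eight run means -> 7 df
msr = ssr / df_regression
print(ssr, df_regression, msr)
```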

Now that we have all these estimates of variance, what do we do with them? To answer this question, we need to review three types of distributions: the Z distribution (used when the number of samples is ≥ 30), the t distribution (used when the number of samples is < 30), and the F distribution (based upon samples taken from normal distributions). In our seminar we introduce a little exercise that helps our students visualize what is going on. We have forty-six small wooden balls in a sack. The balls are labeled with the numbers 0 through 11. They are distributed as shown in this figure. If we overlay the bell-shaped function, it becomes clear that our distribution is approximately normal. The Z and t distributions fall into this category. The Z distribution is shown here. The t distribution is a little flatter, depending on the number of samples in the distribution. Most commonly, these distributions are standardized with a mean of zero, a standard deviation of one, and such that the area under the curve is one. That tells us that the horizontal axis will be the number of standard deviation units we are on either side of zero, and that the vertical axis will be a measure of frequency. Usually we will be interested in the probability of being so many standard deviation units above or below zero. This is computed as the area under those curves. For example, if we wish to know the probability of being 1.6 standard deviation units to the right of zero on the Z distribution, we compute the area under the curve from −∞ to 1.6. The good news is that these values are provided by computers or tables.
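For instance, that area can be read from a Z table or computed directly; a one-line sketch with SciPy (assuming SciPy is available):

```python
# Area under the standard normal curve from minus infinity to 1.6.
from scipy.stats import norm

print(norm.cdf(1.6))    # roughly 0.945
```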

We now have each student take five balls from the sack, record the numbers, and put them back in the sack. We ask them to compute the variance of their five numbers and divide it by the variance computed by two or three other students. Afterward, each student gives us their ratios. A typical count of the ratios is shown here. Empirically, our students have built an F distribution, the graph becoming visible when we overlay a curve to the right of the stars.
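The classroom exercise can also be mimicked numerically. The sketch below uses a normal population as a stand-in for the sack of numbered balls, draws repeated samples of five, and tallies the ratios of their sample variances; the particular population parameters and sample counts are assumptions made only for illustration.

```python
# Numerical stand-in for the wooden-ball exercise: ratios of sample variances
# from repeated samples of five, drawn from an approximately normal population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.5, scale=2.0, size=46)   # stand-in for the 46 balls

ratios = []
for _ in range(2000):
    s1 = rng.choice(population, size=5, replace=False)
    s2 = rng.choice(population, size=5, replace=False)
    ratios.append(s1.var(ddof=1) / s2.var(ddof=1))
ratios = np.array(ratios)

print(np.median(ratios))       # the ratios cluster around one
print((ratios > 6.0).mean())   # only a small fraction exceed six
```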

Turning the picture more conventionally, the horizontal axis represents the F ratio and the vertical axis a measure of frequency. Although the F distribution will change with the degrees of freedom of the numerator and denominator of the F ratio, this provides a basis for understanding what is going on. What have we learned? If we compare the variances of samples taken from an approximately normal distribution, we see that most of the ratio values cluster around one and very few are greater than six. The chance of finding a specific F value is found in a manner similar to the chance of being any number of standard deviation units from zero in the Z distribution. The overall F ratio for our model is shown here: F = MSR/MSE = 107.852. This agrees with our ANOVA. We saw in our demonstration that F > 6.0 seemed rare. In fact, we use F = 6.0 as our rule of thumb for a cut-off. If F > 6.0, we will say that there is a significant shift in the response between different run settings. That means we don't believe the change in the response at different settings happened by chance; we believe it happened because there is a difference in responses between factor levels. In order to get a handle on this chance, let's look again at a picture of an F distribution. Based on our previous discussion and demonstration, we believe that 107.852 is a large F value located far out on the tail of the distribution. Since we do not believe there is a high probability that this would happen by chance, we are saying we think it occurred because the factors shifted the response. Put another way, we do not believe the two estimates of variance came from the same population. Instead, we think the response actually behaves differently (has different averages) for different factor levels.
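If we prefer an exact tail area to the rule of thumb, the sketch below asks SciPy's F distribution how much area lies to the right of F = 107.852, using the 7 regression and 16 error degrees of freedom discussed above.

```python
# Tail area (p-value) for the overall F ratio, with 7 regression df and 16 error df.
from scipy.stats import f

F_ratio = 107.852
p_value = f.sf(F_ratio, 7, 16)   # area under the F curve to the right of F_ratio
print(p_value)                   # an extremely small probability
```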

Since there is a little bit of the tail to the right of 107.852, there appears to be some chance we might be wrong in assuming the shift in response is due to a change in factor levels. In other words, there is a small probability that F = 107.852 could have happened by chance. This chance, or risk, is known as a type I error or α error. A good example which explains an α error is the decision reached by a jury regarding a defendant. This table summarizes the possibilities:

If the jury appropriately finds an innocent defendant innocent or a guilty defendant guilty, we have no disagreement. In our judicial system we guard against finding an innocent person guilty; we consider that the worst possible error. This type of error is called a type I error or α error. The other situation is a type II or β error. For our present analysis, we will concentrate only on the α error. We will choose α = 0.05 (telling us we will be 95% confident that our decision is correct). This selection is arbitrary. Relating this to our problem, let's suppose we wish to be 95% confident that our model detects a linear shift in response due to changes in factor levels. Then we will want the area under the F curve to the left of our F = 107.852 to be greater than or equal to 0.95.
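In code terms, that check amounts to asking whether the area under the F curve to the left of our observed F is at least 0.95 (equivalently, whether the p-value is at most α = 0.05); a minimal sketch:

```python
# Decision at alpha = 0.05: is at least 95% of the F curve to the left of our F?
from scipy.stats import f

print(f.cdf(107.852, 7, 16) >= 0.95)   # True; equivalent to p-value <= 0.05
```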

From our rule of thumb, this factor appears significant. The P value supports that conclusion. This concludes the math section on Sum of Squares, Mean Square and F ratio. Exit out of this section to return to the main lesson.