SYS 6021 Linear Statistical Models

Project 2: Spam Filters
Jinghe Zhang

Summary

The spambase data and time-indexed counts of spams and hams are studied to develop accurate spam filters. Static models are constructed using generalized linear regression, and the recommended model is built with principal component regression on the spam data after a log transformation of the explanatory variables. The selected static spam filter has the smallest numbers of total errors and false positives, 91 and 35, respectively. In this model, the 95% confidence interval of the coefficient of the first principal component is [-2.02, -1.58]. When there is a one-unit change in this variable while holding the others constant, the odds decrease; the lower bound of the 95% confidence interval of the percent change is -86.75%. The time series model for ham captures seasonality and random fluctuation, while that for spam captures trend and fluctuation. The recommended time series models perform better than the other candidates, with smaller MSE and AIC, and the statistical tests on the two recommended models indicate that they are adequate. In the ham seasonality model, the coefficient of Saturday is -4.70 with a confidence interval of [-5.88, -3.52], which indicates that fewer hams are received at the address on Saturdays. In the fluctuation model for spam, the coefficient of ar1 is 1.03 and the 95% confidence interval is [0.90, 1.17]. The recommended static and time series models can be combined using Bayes rule to further improve the performance of the spam filter.

Honor Pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all help and references used in its completion.

November 10, 2013

1 Problem description

1.1 Situation

Nowadays, the development of technology and the wide spread of the internet provide great convenience to our daily life. In particular, communication between different places no longer relies on traditional approaches such as letters and telephone calls. Email plays an increasingly important role: it not only facilitates communication between people but is also a very convenient way to access information. However, along with useful emails, people sometimes receive spams. Some of these junk mails are just ads, which can be annoying. Moreover, many spams carry viruses or malware that are harmful to the host computers, for example by stealing information. According to the National Technology Readiness Survey in 2004, the cost of spam is more than $21 billion annually in terms of lost productivity [1]. Hence, this problem has attracted more and more attention, and many spam filters have been created using various techniques, such as Bayesian filtering [2, 3].

1.2 Goal

The objective of this study is to develop a spam filter in order to detect junk emails. The spam filter can be a static model, a time series model, or an integrated model.

1.3 Metrics

For the static generalized linear regression model, performance is measured by the ROC curve and the number of total errors, i.e., false positives (FP) plus false negatives (FN). Moreover, it is very undesirable in this project to misclassify a good email as a spam, so a spam filter with fewer false positives is preferred. For the time series filter, the performance metric is mean squared error (MSE). For the integrated model, the performance metrics are the same as those of the static model: the ROC curve and the sum of FP and FN.

1.4 Hypotheses

There are two hypotheses in this study:

a. The variables describing word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spams.

b. The time component has some impact on the number of spam emails arriving at the address.

2. Approach

2.1 Data

In this study, spam data are analyzed to develop spam filters. The dataset contains 4601 observations and 58 variables [4]. One of the variables (V58) is the class label, which indicates whether an email is a spam or a ham. Of the other 57 explanatory variables, 48 continuous attributes (V1-V48) describe the frequency of particular words in the email, 6 variables (V49-V54) provide information on the frequency of some characters, and 3 variables (V55-V57) describe the appearance of capital letters in the email. In addition, the numbers of spams and hams received at an email address on different days are available: the spam counts cover August 1, 2004 to July 30, 2005, while the ham counts average 3.89 per day from January 13, 2000 to June 1.

To discover the relationship between the explanatory variables and the response variable, scatter plot matrices are provided. The variables in the scatter plots are divided into three categories: word frequency variables, character frequency variables, and capital letter variables. The matrix in Figure 2.1 studies the correlations between the capital letter variables V55-V57 and the response variable V58. As shown in Figure 2.1, there are strong correlations between V55 and V56, as well as between V56 and V57.

Figure 2.1 Scatter Plots Matrix of V55, V56, V57, and V58

Considering their relationship with the response variable, V57 is selected for modeling the spam filters. In addition, the predictive variables are strongly skewed, and the other scatter plots lead to the same conclusion. Therefore, a log transformation of those predictors is preferred. The scatter plot matrix of the word frequency variables V1-V10 and the response variable V58 is displayed in Figure 2.2, which also indicates strong skewness and some correlation between explanatory variables. The scatter plot matrices of the other variables are given in the Appendix.

Figure 2.2 Scatter Plots Matrix of V1-V10 and V58

Box plots are displayed to show the distribution of the word frequency variables (V1-V48), character frequency variables (V49-V54), and capital letter variables (V55-V57). These box plots show whether a variable is discriminatory in terms of spam emails. Box plots for V1 to V9 are given in Figure 2.3. It is obvious that the value of V3 is discriminatory in terms of spam and ham. The factor plots of the other variables are given in the Appendix.

Figure 2.3 Factor Plots of V1-V9

Based on the information conveyed in those box plots and the scatter plot matrices, 8 variables are selected as predictors, considering both their mutual independence and their predictive power for spam. A statistical summary of the selected predictors is provided in Table 2.1.

Table 2.1 A Statistical Summary of Selected Variables (Min, First Quartile, Median, Mean, Third Quartile, and Max for each of the eight selected variables)

The histograms of these 8 variables are given in Figure 2.4.

Figure 2.4 Histograms of Selected Predictors

We can see that the distributions of these variables are highly skewed, which is consistent with the conclusion from the scatter plot matrices. Therefore, a log transformation is applied to all the potential predictors. The scatter plot matrix in Figure 2.5 shows the relationship between the transformed predictors and the response variable. The scatter plots of the other variables after transformation are given in the Appendix.
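The report does not state the software used; as an illustration, the preprocessing above (loading the spambase data and log-transforming the skewed predictors) could look roughly like the following Python sketch. The file name, the V1-V58 column naming, and the small offset added before taking logs (needed because the frequency variables contain zeros) are all assumptions.

```python
import numpy as np
import pandas as pd

# Load the UCI spambase data (file name is an assumption; the file has no header row)
spam = pd.read_csv("spambase.data", header=None)
spam.columns = [f"V{i}" for i in range(1, 59)]   # V1-V57 predictors, V58 class label

predictors = [f"V{i}" for i in range(1, 58)]

# The frequency variables contain many zeros, so a small offset (0.1 here, an
# assumption) is added before taking logs to reduce the strong right skew.
log_spam = spam.copy()
log_spam[predictors] = np.log(spam[predictors] + 0.1)
```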

Figure 2.5 Scatter Plots Matrix of V1-V10 and V58 after Log Transformation

After the log transformation, the distributions of the predictive variables are less skewed, and there are larger correlations between the predictors and the response variable. The factor plots of some log-transformed variables are provided in Figure 2.6.

Figure 2.6 Factor Plots of V1-V9 after Log Transformation

By observing the box plots of the selected variables in Figure 2.7, we find that there are some outliers.

Figure 2.7 Factor Plots of Selected Predictors

Therefore, the extreme observations are extracted and investigated. After further investigation, the corresponding outliers are excluded from the dataset in order to eliminate bias. The 12 removed observations include 1, 752, 831, 1708, 1489, 1763, 1791, 2694, 2905, 3247, and 3913. Hence, the dataset used for modeling has 4589 observations. Table 2.2 gives a statistical summary of the data without outliers.

Table 2.2 A Statistical Summary of Selected Variables without Outliers (Min, First Quartile, Median, Mean, Third Quartile, and Max for each of the eight selected variables)

The preprocessing discussed above is used to build the static spam filter, and another two datasets are used to construct the time series spam filters. These two time series datasets consist of the date and the number of spams/hams arriving at a particular address [5]. A summary of the two datasets is provided in Table 2.3.

Table 2.3 A Summary of the Ham and Spam Datasets for Time Series Modeling (Min, First Quartile, Median, Mean, Third Quartile, and Max of the daily counts)

In the ham dataset, there are 506 observations from Jan. 13, 2000 to Jun. 1. There are 364 observations from Aug. 1, 2004 to Jul. 30, 2005 in the spam dataset.

2.2 Analysis

2.2.1 Static Analysis for Spam Filter Design

The dataset with 4589 observations and 8 predictors is used to build generalized linear models for spam detection. To examine the discriminatory power of these variables for spam detection, factor plots are shown in Figure 2.7. The dataset is randomly divided into two subsets, a training set and a test set, which account for 2/3 and 1/3 of the data, respectively. All the observations for training and testing are also log transformed so that they can be used to build logistic regression models. At first, two main effect generalized linear models are constructed using the 8 variables, one before and one after log transformation of the predictors.

GLM1 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57

GLM2 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57

A Chi-square test is performed to compare GLM1 with the null model; the p value is less than 2.2e-16, which indicates that GLM1 is significant. Also, every predictor in GLM1 has a p value less than 2.2e-16, so all 8 predictors are significant. Similarly, GLM2 is statistically significant and all the predictors in GLM2 are significant. Then, principal component analysis (PCA) is performed on the training set, and the principal components that account for 90% of the variance of the dataset are used for logistic regression. The biplots of the untransformed and transformed explanatory variables are shown in Figure 2.8 and Figure 2.9, respectively.
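A sketch of the train/test split and the main-effect logistic regression (GLM2) described above, using statsmodels; the random seed is arbitrary, and log_spam refers to the log-transformed data frame from the earlier sketch.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2
from sklearn.model_selection import train_test_split

# 2/3 training, 1/3 test (seed is an arbitrary assumption)
train, test = train_test_split(log_spam, test_size=1/3, random_state=1)

# Main-effect logistic regression on the eight selected predictors (GLM2)
formula = "V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57"
glm2 = smf.glm(formula, data=train, family=sm.families.Binomial()).fit()

# Likelihood ratio (Chi-square) test of GLM2 against the null model
null = smf.glm("V58 ~ 1", data=train, family=sm.families.Binomial()).fit()
lr_stat = 2 * (glm2.llf - null.llf)
p_value = chi2.sf(lr_stat, df=glm2.df_model - null.df_model)
print(glm2.summary(), p_value)
```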

Figure 2.8 Biplot of PCA on Untransformed Data

Figure 2.9 Biplot of PCA on Transformed Data

The variances of the two types of emails on the first two components are displayed in Figure 2.10 and Figure 2.11.

Figure 2.10 Variances of Hams and Spams on First Two Components (Before Transformation)

Figure 2.11 Variances of Hams and Spams on First Two Components (After Transformation)

According to Figures 2.10 and 2.11, before the log transformation of the explanatory variables, ham emails have larger variation on the first component, while spams have larger variation on the second component. After the transformation, however, ham and spam emails both have large variation on the first two components. Then, two PCA regression models (GLM3 and GLM4) are trained on the training data before and after log transformation. Both models are compared with the null model, and both are statistically significant. In addition, interactions between predictors are studied through interaction plots. For example, the interaction plots show some interaction between V16 and V21, and there are also interactions between V3 and V17, V3 and V19, V17 and V57, etc. Interaction terms are therefore added to the main effect models.
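Returning to the PCA regression models GLM3/GLM4 described above, the following sketch keeps the components that explain 90% of the variance and feeds their scores into a logistic regression. Whether the predictors were standardized before the PCA is not stated in the report, so that detail is left out here; the use of sklearn and the variable names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import PCA

predictors = [f"V{i}" for i in range(1, 58)]
X_train, y_train = train[predictors], train["V58"]

# Keep enough principal components to explain 90% of the variance
pca = PCA(n_components=0.90, svd_solver="full")
scores = pca.fit_transform(X_train)
comp_names = [f"Comp.{i + 1}" for i in range(scores.shape[1])]

# Logistic regression on the component scores (GLM4 when the data are log-transformed)
Z = sm.add_constant(pd.DataFrame(scores, index=X_train.index, columns=comp_names))
glm4 = sm.GLM(y_train, Z, family=sm.families.Binomial()).fit()
print(pca.explained_variance_ratio_.sum(), glm4.summary())
```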

GLM5 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57+V3*(V17+V19+V21+V52)+V7*(V21+V52)+V16*(V17+V19+V21+V52)+V17*V57+V19*(V21+V52+V57)+V21*(V52+V57)+V52*V57

GLM6 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57+V3*(V17+V19+V21+V52)+V7*(V21+V52)+V16*(V17+V19+V21+V52)+V17*V57+V19*(V21+V52+V57)+V21*(V52+V57)+V52*V57

The two models are compared with the null model using the Chi-square test, and both GLM5 and GLM6 are statistically significant. Stepwise regression is then performed on the two models, and the resulting stepwise models are listed below:

GLM7 (stepwise on GLM5): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V16:V19 + V16:V21 + V17:V57 + V19:V21 + V19:V57 + V21:V52 + V52:V57

GLM8 (stepwise on GLM6): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V7:V52 + V16:V19 + V17:V57 + V19:V21 + V52:V57

The two stepwise models are also compared with the null model using the Chi-square test and are statistically significant.

2.2.2 Time Series Analysis for Spam Filter Design

To study whether the time component has an influence on the numbers of hams and spams received on a particular day, the counts of these two types of emails are transformed into time series data and plotted in Figure 2.12.

Figure 2.12 Time Series Plots of Ham and Spam Data

The autocorrelation function (ACF) plots are provided in Figure 2.13. They indicate that there is some impact from the time component, since the autocorrelations at lags greater than 0 are significant.

Figure 2.13 ACF Plots of Ham and Spam Data
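ACF plots like those in Figure 2.13 can be produced directly from the daily counts; a sketch (the file names and column names for the ham and spam count series are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Daily counts of hams and spams (file/column names are assumptions)
ham = pd.read_csv("ham_ts.csv", parse_dates=["date"])
spam_ts = pd.read_csv("spam_ts.csv", parse_dates=["date"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(ham["count"], ax=axes[0], title="ACF of daily ham counts")
plot_acf(spam_ts["count"], ax=axes[1], title="ACF of daily spam counts")
plt.show()
```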

The test set method is used to evaluate the time series models: the observations of the last 7 days are extracted as the test set from both the ham and spam data, and the rest are used for training. Then, the trends of the two training datasets are modeled using linear regression, and the significance of the two models is assessed with an F test. For the spam data, the trend is statistically significant, with an F-test p value of 1.05E-5. However, the F test on the trend model for the ham data is not significant. Therefore, there is a significant trend in the spam data but not in the ham data. The plots in Figure 2.14 give a graphical illustration of the trends of the two types of emails.

Figure 2.14 Trends of Ham and Spam Data

To investigate the seasonality of the two datasets, periodograms are plotted in Figure 2.15 to find the peaks and compute the periods.

Figure 2.15 Periodogram Plots of Ham and Spam Data
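The periodogram peaks and implied periods (Figure 2.15) can be checked numerically; a sketch using scipy, with the daily ham counts from the previous sketch:

```python
import numpy as np
from scipy.signal import periodogram

# Periodogram of the daily ham counts (sampling frequency: one observation per day)
freqs, power = periodogram(ham["count"])
peak = freqs[np.argmax(power[1:]) + 1]   # skip the zero-frequency term
print(f"peak frequency = {peak:.3f} cycles/day, period = {1 / peak:.2f} days")
```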

The periodogram peak occurs at a frequency of about 0.14 in the ham data and at a much lower frequency in the spam data, so the corresponding periods are 6.74 and 375 days, respectively. Therefore, the seasonality in ham is weekly, and there is no meaningful seasonality in spam. Consequently, for the ham data, seasonality is modeled and trend is not, while for the spam data, trend is modeled and seasonality is not. A linear regression is built to model the seasonality of ham. The residuals of the spam trend model and the ham seasonality model are studied and plotted in Figure 2.16.

Figure 2.16 Residuals of Ham Seasonality and Spam Trend Models

The ACF and partial ACF (PACF) plots in Figures 2.17 and 2.18 show the autocorrelation of the residuals of the linear regression models for ham and spam.

Figure 2.17 ACF Plots of Residuals of Ham Seasonality and Spam Trend Models
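The weekly seasonality of ham is modeled with day-of-week indicator variables (the season.ham terms that later appear in Table 3.6); a sketch, along with a matching trend regression for spam whose residuals are used below. The date and count column names are assumptions carried over from the earlier sketch.

```python
import statsmodels.formula.api as smf

# Day-of-week dummies for the ham series
ham["day"] = ham["date"].dt.day_name()
season_ham = smf.ols("count ~ C(day)", data=ham).fit()
resid_ham = season_ham.resid          # remaining (random) fluctuation

# Linear trend for the spam series
spam_ts["t"] = range(len(spam_ts))
trend_spam = smf.ols("count ~ t", data=spam_ts).fit()
resid_spam = trend_spam.resid
print(season_ham.f_pvalue, trend_spam.f_pvalue)   # F-test p values for each model
```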

Figure 2.18 PACF Plots of Residuals of Ham Seasonality and Spam Trend Models

In Figure 2.16, the residuals of the ham seasonality model have relatively constant mean and variance, and there is a sinusoidal pattern in the corresponding ACF plot, which indicates a stationary time series. Similarly, the time series of spam residuals is also stationary. In addition, the PACF plots cut off after 2 and 3 lags, respectively. For further investigation, the first order difference of the residual series is also taken, and the differenced residuals are plotted in Figure 2.19.

Figure 2.19 First Order Difference of Residuals

The ACF and partial ACF (PACF) plots in Figures 2.20 and 2.21 show the autocorrelation of the residuals of the ham seasonality and spam trend models after taking the first order difference.

Figure 2.20 ACF Plots of First Order Difference of Residuals

Figure 2.21 PACF Plots of First Order Difference of Residuals

However, since differencing introduces large negative values in the PACF plots, it is preferred not to difference the residuals. The ACF plots in Figure 2.17 show sinusoidal patterns, and the series cut off after 2 and 3 lags in the PACF plots in Figure 2.18. Hence, the autoregressive models AR-Ham and AR-Spam are constructed with orders 2 and 3, respectively. Also, MA-Ham and MA-Spam are two moving average models built for the residual fluctuation of the ham seasonality and spam trend models; both take order 1. Then, ARMA-Ham and ARMA-Spam are constructed for ham and spam, respectively.
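The candidate AR, MA, and ARMA models for the two residual series can be fit with statsmodels' ARIMA class (order = (p, d, q) with d = 0, since no differencing is used). The automated ARIMA selection mentioned below could be reproduced with a tool such as pmdarima's auto_arima, though the report does not say which implementation was used. A sketch, reusing resid_ham and resid_spam from the earlier sketch:

```python
from statsmodels.tsa.arima.model import ARIMA

# Candidates for the ham seasonality residuals
ar_ham = ARIMA(resid_ham, order=(2, 0, 0)).fit()     # AR(2)
ma_ham = ARIMA(resid_ham, order=(0, 0, 1)).fit()     # MA(1)
arma_ham = ARIMA(resid_ham, order=(2, 0, 1)).fit()   # ARMA(2,1)

# Candidates for the spam trend residuals
ar_spam = ARIMA(resid_spam, order=(3, 0, 0)).fit()
ma_spam = ARIMA(resid_spam, order=(0, 0, 1)).fit()
arma_spam = ARIMA(resid_spam, order=(3, 0, 1)).fit()

for name, m in [("AR-Ham", ar_ham), ("MA-Ham", ma_ham), ("ARMA-Ham", arma_ham),
                ("AR-Spam", ar_spam), ("MA-Spam", ma_spam), ("ARMA-Spam", arma_spam)]:
    print(f"{name}: AIC = {m.aic:.1f}")
```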

In addition, no differencing is applied, following the discussion of the differenced residuals above. The ARMA-Ham model takes orders 2 and 1, while the ARMA-Spam model takes orders 3 and 1. Overall, there are three time series models constructed for the seasonality of ham and another three for the trend of spam. An automated method is also used to build ARIMA models, which are discussed in Section 3.2.

2.2.3 Integrated Filter Design

Both static and time series spam filters are built in this project, and they reveal spam information from different perspectives. The static generalized linear regression model returns 1 or 0, indicating whether a particular email is a spam. The time series models give the expected numbers of spams and hams on a particular day, which can be used to derive the probability of receiving a spam on that day. To improve the performance of the spam filter, the two models can be combined using Bayes rule, as shown in Equation 2.1:

Pr(E = i | S = j, T = k) = Pr(S = j, T = k | E = i) Pr(E = i) / Σ_i Pr(S = j, T = k | E = i) Pr(E = i),    (2.1)

where E, S, and T represent whether an email is a spam in reality, the prediction of the static spam filter, and the prediction from the time series model, respectively. Here, i, j, and k are binary, each taking a value in {0, 1} that indicates whether the email is a spam. Assuming the two predictions are conditionally independent given the true class, Equation 2.1 reduces to Equation 2.2:

Pr(E = i | S = j, T = k) ∝ Pr(S = j | E = i) Pr(T = k | E = i).    (2.2)

The first term, Pr(S = j | E = i), is the probability of a true positive when i = 1 and j = 1, and the second term is the probability of spam obtained from the time series model. The probability of a true positive can be computed from the score table of the selected static model using Equation 2.3:

Pr(TP) = TP / (TP + FN).    (2.3)

The probability of spam obtained from the time series model can be computed through Equation 2.4:

Pr(Spam) = #Spams / (#Spams + #Hams).    (2.4)
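One way to read Equations 2.1-2.4 in code: assuming the static prediction S and the time series prediction T are conditionally independent given the true class E, the posterior probability that a flagged email is spam combines the static model's score-table rates with the day-level spam probability. The function below is a hedged sketch of that calculation, not a quotation of the report's implementation.

```python
def combined_spam_probability(tp, fn, fp, tn, n_spam_day, n_ham_day):
    """Posterior Pr(E=1 | S=1, T) following Equations 2.1-2.4.

    tp, fn, fp, tn        -- score-table counts of the selected static model
    n_spam_day, n_ham_day -- predicted spam and ham counts for the day (time series models)
    """
    p_spam_day = n_spam_day / (n_spam_day + n_ham_day)   # Eq. 2.4
    p_pos_given_spam = tp / (tp + fn)                     # Eq. 2.3, Pr(S=1 | E=1)
    p_pos_given_ham = fp / (fp + tn)                      # false positive rate, Pr(S=1 | E=0)
    # Bayes rule (Eq. 2.1) with the conditional-independence simplification (Eq. 2.2)
    numer = p_pos_given_spam * p_spam_day
    denom = numer + p_pos_given_ham * (1 - p_spam_day)
    return numer / denom
```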

In brief, an improved spam filter is obtained by combining the static generalized linear model and the time series models using Bayes rule.

3. Evidence

3.1 Static Filter Design

To measure the performance of the GLM models in Section 2.2.1, their AIC values are shown in Table 3.1.

Table 3.1 AIC of Static Models (AIC for GLM1-GLM8)

The main effect model GLM1 without data transformation has a large AIC, so another main effect model, GLM2, is constructed after log transformation, and its performance is improved compared with GLM1. In the Chi-square tests, all the variables in the main effect models are significant. To further refine the static models, PCA regression is considered, in which the principal components that account for 90% of the variance are used to build the regression models. The PCA regression GLM3 on non-transformed data is not as good as GLM4, which uses transformed data. Sometimes the effect of a predictor on the response variable depends on the value of another variable, so interaction terms are added to the main effect models for a better regression. The interactions between explanatory variables are studied with the interaction plots described in Section 2.2.1. GLM5 and GLM6 are built based on GLM1 and GLM2, respectively. The performance of the static models improves after adding interaction terms to the main effect models. To reduce the complexity and improve the performance of these two models, stepwise regression is performed on both of them, yielding GLM7 and GLM8. From Table 3.1, we can see that the stepwise regressions based on the two antecedent models achieve some further improvement. Overall, the generalized linear models have smaller AIC when they are trained on the log-transformed data. The static models with interaction terms have better accuracy than the main effect models. Moreover, the PCA regression using transformed data provides the best results among all the models described above, which is most likely due to the latent properties discovered by PCA as well as the elimination of multicollinearity between explanatory variables.
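The AIC comparison in Table 3.1 is simply the fitted AIC of each GLM; the stepwise refinement that produces GLM7/GLM8 can be approximated with a greedy backward elimination by AIC. The helper below is only an illustration of that idea (it assumes a formula written as a plain sum of terms), not the exact procedure used in the report; glm2 and glm4 come from the earlier sketches.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def backward_eliminate_by_aic(response, terms, data):
    """Drop one term at a time whenever doing so lowers the AIC."""
    def fit(ts):
        return smf.glm(f"{response} ~ {' + '.join(ts)}", data=data,
                       family=sm.families.Binomial()).fit()
    best = fit(terms)
    improved = True
    while improved and len(terms) > 1:
        improved = False
        for t in list(terms):
            candidate_terms = [x for x in terms if x != t]
            candidate = fit(candidate_terms)
            if candidate.aic < best.aic:
                best, terms, improved = candidate, candidate_terms, True
                break
    return best

# Rank fitted models by AIC, as in Table 3.1
for name, m in sorted({"GLM2": glm2, "GLM4": glm4}.items(), key=lambda kv: kv[1].aic):
    print(f"{name}: AIC = {m.aic:.1f}")
```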

As mentioned in Section 2.2, the test set method is used in this project to evaluate the generalization ability of these models. Table 3.2 is the score table of the static models, illustrating their prediction accuracy on the test set.

Table 3.2 Score Table of Static Models (TN, FP, FN, TP, and Total Errors for GLM1-GLM8)

The goal of this project is to detect spams, and the model with a smaller number of total errors is preferred. GLM4 clearly has the smallest number of total errors. In this project especially, misclassifying a good email as a spam (FP) is very undesirable, so a small FP count is important. Overall, GLM4 has the smallest AIC, the smallest total number of errors, and a small FP count. Therefore, it is the best static model among all the GLMs discussed above. In addition, to assess model performance graphically, the ROC curves are plotted in Figure 3.1.
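The score-table entries in Table 3.2 come from thresholding each model's predicted probabilities on the test set; a sketch for one model, where the 0.5 cutoff is an assumption and glm2 and test come from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

probs = glm2.predict(test)                     # predicted Pr(spam) on the test set
pred = (probs >= 0.5).astype(int)              # 0.5 decision threshold (assumption)
tn, fp, fn, tp = confusion_matrix(test["V58"], pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}  total errors={fp + fn}")
```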

Figure 3.1 ROC Curves of Static Spam Filters

The point (0, 1) corresponds to an ideal prediction model, and a model with a larger area under the curve (AUC) is preferred. It is clear that GLM3 and GLM4 stand out. In particular, GLM4 has the best performance among all the static spam filters, while the main effect model GLM1 has the poorest accuracy. The other spam filters largely overlap with each other. Overall, the static models improve after log transformation of the explanatory variables. To better compare the static models built on log-transformed data, Figure 3.2 gives a clearer plot with only those models.
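ROC curves like those in Figures 3.1 and 3.2 can be drawn from the same predicted probabilities; a sketch for one model (the other models are overlaid the same way):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(test["V58"], probs)
auc = roc_auc_score(test["V58"], probs)

plt.plot(fpr, tpr, label=f"GLM2 (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```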

Figure 3.2 ROC Curves of Static Spam Filters with Transformed Data

In Figure 3.2, GLM4 is the most accurate model among all the spam filters trained on transformed data. GLM6 is a little better than GLM2 because of the added interaction terms. The stepwise regression model GLM8 based on GLM6 does not make much difference, since the ROC curves of these two models largely overlap. Similarly, the ROC curves of the filters built on untransformed data support the same conclusions. Overall, GLM4 provides the best prediction accuracy, so it is proposed as the recommended static spam filter. The 95% confidence intervals of the coefficients of the first five principal components in GLM4 are given in Table 3.3. The complete list of GLM4 coefficients and the associated 95% confidence intervals is given in Table 6.1 in the Appendix.

Table 3.3 Coefficients of the First Five Principal Components in GLM4 (Mean Coefficient, 2.50% Coefficient, 97.50% Coefficient, and the corresponding Percent Changes in the odds, for the Intercept and Comp.1-Comp.5)
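The percent-change columns of Table 3.3 follow from the usual odds-ratio interpretation of logistic regression coefficients: a one-unit increase in a component multiplies the odds by exp(beta), i.e., a percent change of 100(exp(beta) - 1). A quick check against the lower confidence bound quoted below:

```python
import numpy as np

def pct_change_in_odds(beta):
    """Percent change in the odds for a one-unit increase in a predictor."""
    return 100 * (np.exp(beta) - 1)

# Lower 95% bound of the first principal component's coefficient in GLM4
print(round(pct_change_in_odds(-2.02), 2))   # about -86.7%, consistent with the -86.75% quoted
```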

The first three columns in Table 3.3 provide the coefficient of each predictor and its 95% confidence interval. The last three columns show the corresponding percent changes in the odds computed from the first three columns. For example, the 95% confidence interval of the coefficient of the first principal component is [-2.02, -1.58]. When there is a one-unit change in this variable while holding the others constant, the odds decrease; the lower bound of the 95% confidence interval of the percent change is -86.75%. Similarly, for the second component, the confidence interval of the coefficient is [-0.48, -0.02]; a one-unit increase in this component, holding the others constant, changes the odds with a 95% confidence interval for the percent change of [-37.91%, -1.82%]. For the fourth component, the coefficient is positive, with a 95% confidence interval of [0.63, 1.02]. When there is a one-unit increase in this component while the other components are held constant, the odds increase; the lower bound of the 95% confidence interval of the percent increase is 87.80%.

3.2 Time Series Filter Design

The candidate models from Section 2.2.2 include two AR models, for ham and spam, with orders 2 and 3, respectively. The diagnostic plot for AR-Ham is given in Figure 3.3.

Figure 3.3 Diagnostic Plots of AR-Ham

In the diagnostic plot of AR-Ham, the residuals do not show constant variance, and the p values of the Ljung-Box statistic tests indicate that this model is not adequate, although there is no significant correlation between the residuals. To improve the model, an MA term and differencing are considered, which can be combined into an ARIMA model; it turns out that the preferred order of differencing is 0. The diagnostic plot for MA-Ham is displayed in Figure 3.4.

Figure 3.4 Diagnostic Plots of MA-Ham

In Figure 3.4, we can see that the residuals do not have constant variance, and the p values in the statistical tests show that this model is not sufficient. Therefore, ARMA-Ham is constructed, combining AR-Ham and MA-Ham. The diagnostic plot is shown in Figure 3.5.

Figure 3.5 Diagnostic Plots of ARMA-Ham

In Figure 3.5, the residuals are random and the autocorrelation between them is insignificant. Also, the p values of the Ljung-Box tests indicate that this model is sufficient. To further examine the appropriateness of the orders in the ARMA-Ham model, an ARIMA model (ARIMA-Ham) is constructed using an automated method; it takes orders 1, 0, and 2 for the AR, difference, and MA terms, respectively. The diagnostic plot for the ARIMA-Ham model is displayed in Figure 3.6.

Figure 3.6 Diagnostic Plots of ARIMA-Ham

From Figure 3.6, we can see that the residuals are random and the p values of the Ljung-Box statistic tests show that this model is adequate. The AR-Spam model described in Section 2.2.2 takes order 3, and its diagnostic plot is given in Figure 3.7.

Figure 3.7 Diagnostic Plots of AR-Spam

In Figure 3.7, the residuals show constant variance and lack of pattern, and the p values of the Ljung-Box statistic tests indicate that this model is adequate; there is no significant correlation between the residuals. Then, the MA-Spam model is constructed with order 1. The diagnostic plot for MA-Spam is displayed in Figure 3.8.

Figure 3.8 Diagnostic Plots of MA-Spam

In Figure 3.8, the residuals do not have constant variance, and the p values in the statistical tests show that this model is not sufficient. Therefore, ARMA-Spam is constructed, combining AR-Spam and MA-Spam. The diagnostic plot is shown in Figure 3.9.

Figure 3.9 Diagnostic Plots of ARMA-Spam

In Figure 3.9, we can see that the residuals have relatively constant variance, and the p values in the statistical tests show that this model is sufficient. Also, there is no significant correlation between the residuals. To further examine the appropriateness of the orders in the ARMA-Spam model, an ARIMA model (ARIMA-Spam) is constructed using an automated method; it takes orders 1, 0, and 1 for the AR, difference, and MA terms, respectively. The diagnostic plot for the ARIMA-Spam model is displayed in Figure 3.10.

Figure 3.10 Diagnostic Plots of ARIMA-Spam

From the diagnostic plot for ARIMA-Spam, the residuals are random and lack a pattern, and the p values of the Ljung-Box statistic tests indicate that this model is adequate. Table 3.4 provides the AICs of all the time series models discussed in this section.

Table 3.4 AIC of Time Series Models (AIC for AR-Ham, MA-Ham, ARMA-Ham, ARIMA-Ham, AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam)
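Residual diagnostics like those in Figures 3.3-3.10 can be reproduced with statsmodels; the Ljung-Box p values referred to throughout this section come from a test of remaining autocorrelation in the residuals. A sketch for the ARMA-Ham fit from the earlier sketch (the lag choices are assumptions):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox

# Standard diagnostic panel: standardized residuals, histogram, Q-Q plot, residual ACF
arma_ham.plot_diagnostics(figsize=(10, 6))
plt.show()

# Ljung-Box test: large p values indicate no remaining autocorrelation (adequate model)
lb = acorr_ljungbox(arma_ham.resid, lags=[5, 10, 15], return_df=True)
print(lb[["lb_stat", "lb_pvalue"]])
```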

According to the AIC scores of the time series models, the two ARIMA models are the best and the MA models perform the worst. The AR models are better than the MA models, and the ARMA models, which combine AR and MA terms, achieve relatively large improvements. Therefore, considering their AICs, ARIMA-Ham and ARIMA-Spam are the two recommended models based on AIC. The test set method is used to evaluate the predictive performance of the time series models, with the last 7 observations used for testing. The forecasting plots of AR-Ham, MA-Ham, ARMA-Ham, and ARIMA-Ham are shown in Figures 3.11 and 3.12.

Figure 3.11 Forecasting Plots of AR-Ham and MA-Ham

Figure 3.12 Forecasting Plots of ARMA-Ham and ARIMA-Ham

The black line is the number of hams received each day, and the blue line shows the prediction of the numbers of hams received on the last 7 days. The blue shading is the 95% confidence interval of the prediction. To examine the predictions more closely, Figure 3.13 shows only the actual and predicted values on the test set.

Figure 3.13 Prediction Plot of Time Series Models for Ham

According to Figure 3.13, the predictions of the ARIMA-Ham and ARMA-Ham models largely overlap, while AR-Ham and MA-Ham are close to each other. Also, ARMA-Ham and ARIMA-Ham are more accurate, since their predictions are closer to the actual observations than those of the other two models. Table 3.5 provides the predictions of these models on the test set as well as the mean squared error (MSE).

Table 3.5 Prediction Summary of Time Series Models for Ham (actual values and predictions with MSE for AR-Ham, MA-Ham, ARMA-Ham, and ARIMA-Ham)
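The 7-day test-set evaluation summarized in Table 3.5 amounts to refitting on all but the last week, forecasting that week, and computing the MSE against the held-out values. The sketch below evaluates only the residual ARMA component of the ham model; in the report, the fitted day-of-week seasonal values for the test days would be added back before comparing with the actual counts.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hold out the last 7 days of the ham residual series and forecast them
train_resid, test_resid = resid_ham[:-7], resid_ham[-7:]
fit = ARIMA(train_resid, order=(2, 0, 1)).fit()
forecast = fit.forecast(steps=7)

mse = np.mean((np.asarray(test_resid) - np.asarray(forecast)) ** 2)
print(f"7-day test MSE (residual component only) = {mse:.2f}")
```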

According to Table 3.5, ARMA-Ham provides the best prediction in terms of MSE, slightly better than the ARIMA model, while AR-Ham and MA-Ham have larger errors. Considering that the AIC of ARMA-Ham is only slightly worse than that of ARIMA-Ham while the former has higher accuracy in terms of MSE, ARMA-Ham is recommended as the preferred time series model for ham. The coefficients and the 95% confidence intervals are given in Table 3.6.

Table 3.6 Coefficients of Selected Time Series Model for Ham (Coefficient, 2.50% Coefficient, and 97.50% Coefficient for the Ham.season component: Intercept and the Monday, Saturday, Sunday, Thursday, Tuesday, and Wednesday terms; and for the Ham.arma201 component: intercept, ar1, ar2, and ma1)

The time series model for ham consists of seasonality and random fluctuation. In the seasonality model, the 95% confidence interval of the Monday coefficient is [-1.39, 0.97]. The coefficient of Saturday is -4.70, with a confidence interval of [-5.88, -3.52], which indicates that fewer hams are received at the address on Saturdays. Similarly, fewer hams arrive on Sundays. On Tuesday, Thursday, and Friday, however, the coefficients are more likely to be positive, so more hams are received during weekdays. In the fluctuation model, the coefficient of ar1 is 1.09, with a 95% confidence interval of [0.93, 1.24], and the coefficient of ma1 is negative, with a confidence interval of [-0.93, -0.69]. The time series models for spam are also evaluated using the test set method. The forecasting plots of AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam are shown in Figures 3.14 and 3.15.

Figure 3.14 Forecasting Plots of AR-Spam and MA-Spam

Figure 3.15 Forecasting Plots of ARMA-Spam and ARIMA-Spam

The black line is the number of spams received each day, and the blue line shows the prediction of the numbers of spams received on the last 7 days. The blue shading is the 95% confidence interval of the prediction. To examine the predictions more closely, Figure 3.16 shows only the actual and predicted values on the test set.

Figure 3.16 Prediction Plot of Time Series Models for Spam

According to Figure 3.16, the predictions of the ARIMA-Spam and ARMA-Spam models largely overlap, while AR-Spam and MA-Spam are close to each other. Table 3.7 provides the predictions of these models on the test set as well as the mean squared error (MSE).

Table 3.7 Prediction Summary of Time Series Models for Spam (actual values and predictions with MSE for AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam)

According to Table 3.7, MA-Spam provides the best prediction in terms of MSE. However, the AIC of MA-Spam is the worst, and the Ljung-Box test shows that this model is not adequate. Therefore, the ARMA-Spam model is recommended, since it is adequate according to the Ljung-Box test and has relatively small MSE and AIC. The coefficients and the 95% confidence intervals are given in Table 3.8.

Table 3.8 Coefficients of Selected Time Series Model for Spam (Coefficient, 2.50% Coefficient, and 97.50% Coefficient for the spam.trend component: Intercept and time.spam; and for the spam.arma301 component: ar1, ar2, ar3, ma1, and intercept)

The time series model for spam consists of trend and random fluctuation. In the trend model, the coefficient of the time component is 0.02, with a 95% confidence interval of [0.01, 0.03]. In the fluctuation model, the coefficient of ar1 is 1.03, with a 95% confidence interval of [0.90, 1.17]; the confidence interval of the ar2 coefficient is [-0.19, 0.11]; and the coefficient of ma1 is negative, with a confidence interval of [-0.98, -0.81].

In Section 1.4, two hypotheses were proposed. The first hypothesis is that word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spams. The biplot in Figure 2.8 shows the variance of the variables on the first and second components, which are two of the predictors in the selected static model GLM4. Some word frequency variables, e.g., V25 and V40, show large variance on components 1 and 2. Character frequency variables, such as V52, have larger variance on component 1. In addition, capital letter variables such as V56 and V57 show large variance on the first and second components. Hence, word frequency variables, character frequency variables, and capital letter variables are predictive of spam, and the first hypothesis is supported. The second hypothesis proposes that the time component has some impact on the number of spam emails arriving at the address. In the selected time series models for spam and ham, there is trend in spam and seasonality in ham, so the time component has some influence on the number of spams. Hence, both hypotheses are supported.

4. Recommendation

Two types of spam filters are built in this project: a static generalized linear model and time series filters. The selected static model, GLM4, is constructed using PCA regression on the spam data after log transformation of the explanatory variables. Word frequency, character frequency, and capital letter variables are predictive of spams. The performance of this model is significantly better than all the other candidate static models described in Section 2.2.1: its AIC is the smallest in Table 3.1, and according to the ROC curves in Figure 3.2, it has the best performance. In addition, when evaluated on the test set, it provides the smallest numbers of total errors and FPs, which are 91 and 35, respectively, according to Table 3.2.

In the time series modeling, the seasonality and random fluctuation of ham are modeled to predict the number of hams received at the address on a particular day, and the number of spams is modeled through trend and random fluctuation. ARMA-Ham and ARMA-Spam are recommended because of their good prediction accuracy on the test sets. The selected ham time series model achieves an MSE of 8.51 on the test set, and the MSE of the recommended spam time series model on its test set is reported in Table 3.7. Both models also have relatively small AIC and prove to be adequate according to the Ljung-Box tests. Therefore, these two models are selected to predict spams: the probability of receiving spam on a particular day can be derived from them, so they can be used as a spam filter. To further improve performance, the recommended static model and the selected ham and spam time series models can be combined using Bayes rule, as shown in Equation 2.1. The probability that a particular email is spam can then be computed from the prediction of the static model and the probability of receiving spam on that day, which is available from the two time series models.

5. Reference

[1] C. Thomas, Spam Costs Billions, Information Week, February. [Online].

[2] D. E. Brown and L. Barnes, Project 2: Spam Filters, October 10, 2013, assignment in class SYS 6021.

[3] D. E. Brown and L. Barnes, Project 2 Template, October 10, 2013, assignment in class SYS 6021.

[4] UCI Machine Learning Repository, Spambase Data Set, July. [Online].

[5] D. E. Brown and L. Barnes, Spam and Ham Data, October 10, 2013, assignment in class SYS 6021.

6. Appendix

Figure 6.1 Scatter Plot Matrix of V21-V30 and V58

Figure 6.2 Scatter Plot Matrix of V31-V40 and V58

Figure 6.3 Scatter Plot Matrix of V41-V48 and V58

Figure 6.4 Scatter Plot Matrix of V49-V54 and V58

Figure 6.5 Scatter Plots Matrix of V11-V20 and V58 after Log Transformation

Figure 6.6 Scatter Plots Matrix of V21-V30 and V58 after Log Transformation

Figure 6.7 Scatter Plots Matrix of V31-V40 and V58 after Log Transformation

Figure 6.8 Scatter Plots Matrix of V41-V49 and V58 after Log Transformation

Figure 6.9 Scatter Plots Matrix of V49-V54 and V58 after Log Transformation

Figure 6.10 Scatter Plots Matrix of V55-V57 and V58 after Log Transformation

Figure 6.11 Factor Plots of V10-V18

Figure 6.12 Factor Plots of V19-V27

Figure 6.13 Factor Plots of V28-V36

Figure 6.14 Factor Plots of V37-V45

Figure 6.15 Factor Plots of V46-V54

Figure 6.16 Factor Plots of V55-V57

Table 6.1 Coefficient List of GLM4 (Mean Coefficient, 2.50% Coefficient, 97.50% Coefficient, and the corresponding Percent Changes for the Intercept and each principal component)



More information

How to use FSBForecast Excel add-in for regression analysis (July 2012 version)

How to use FSBForecast Excel add-in for regression analysis (July 2012 version) How to use FSBForecast Excel add-in for regression analysis (July 2012 version) FSBForecast is an Excel add-in for data analysis and regression that was developed at the Fuqua School of Business over the

More information

The Automation of the Feature Selection Process. Ronen Meiri & Jacob Zahavi

The Automation of the Feature Selection Process. Ronen Meiri & Jacob Zahavi The Automation of the Feature Selection Process Ronen Meiri & Jacob Zahavi Automated Data Science http://www.kdnuggets.com/2016/03/automated-data-science.html Outline The feature selection problem Objective

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Solution to Bonus Questions

Solution to Bonus Questions Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample

More information

7. Collinearity and Model Selection

7. Collinearity and Model Selection Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D. Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D. Introduction to Minitab The interface for Minitab is very user-friendly, with a spreadsheet orientation. When you first launch Minitab, you will see

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

VCEasy VISUAL FURTHER MATHS. Overview

VCEasy VISUAL FURTHER MATHS. Overview VCEasy VISUAL FURTHER MATHS Overview This booklet is a visual overview of the knowledge required for the VCE Year 12 Further Maths examination.! This booklet does not replace any existing resources that

More information

YEAR 12 Trial Exam Paper FURTHER MATHEMATICS. Written examination 1. Worked solutions

YEAR 12 Trial Exam Paper FURTHER MATHEMATICS. Written examination 1. Worked solutions YEAR 12 Trial Exam Paper 2016 FURTHER MATHEMATICS Written examination 1 s This book presents: worked solutions, giving you a series of points to show you how to work through the questions mark allocations

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

Brief Guide on Using SPSS 10.0

Brief Guide on Using SPSS 10.0 Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

More information

Assignments Fill out this form to do the assignments or see your scores.

Assignments Fill out this form to do the assignments or see your scores. Assignments Assignment schedule General instructions for online assignments Troubleshooting technical problems Fill out this form to do the assignments or see your scores. Login Course: Statistics W21,

More information

Data Management - 50%

Data Management - 50% Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define

More information

IBM SPSS Forecasting 24 IBM

IBM SPSS Forecasting 24 IBM IBM SPSS Forecasting 24 IBM Note Before using this information and the product it supports, read the information in Notices on page 59. Product Information This edition applies to ersion 24, release 0,

More information

Video Traffic Modeling Using Seasonal ARIMA Models

Video Traffic Modeling Using Seasonal ARIMA Models Video Traffic Modeling Using Seasonal ARIMA Models Abdel-Karim Al-Tamimi and Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at:

More information

Technical Support Minitab Version Student Free technical support for eligible products

Technical Support Minitab Version Student Free technical support for eligible products Technical Support Free technical support for eligible products All registered users (including students) All registered users (including students) Registered instructors Not eligible Worksheet Size Number

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review:

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review: Following are three examples of calculations for MCP employees (undefined hours of work) and three examples for MCP office employees. Examples use the data from the table below. For your calculations use

More information

Learn What s New. Statistical Software

Learn What s New. Statistical Software Statistical Software Learn What s New Upgrade now to access new and improved statistical features and other enhancements that make it even easier to analyze your data. The Assistant Data Customization

More information

DM4U_B P 1 W EEK 1 T UNIT

DM4U_B P 1 W EEK 1 T UNIT MDM4U_B Per 1 WEEK 1 Tuesday Feb 3 2015 UNIT 1: Organizing Data for Analysis 1) THERE ARE DIFFERENT TYPES OF DATA THAT CAN BE SURVEYED. 2) DATA CAN BE EFFECTIVELY DISPLAYED IN APPROPRIATE TABLES AND GRAPHS.

More information

Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very

Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very bottom of the screen: 4 The univariate statistics command

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Machine Learning. Topic 4: Linear Regression Models

Machine Learning. Topic 4: Linear Regression Models Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from wikipedia and books by Alpaydin, Duda/Hart/ Stork, and Bishop. Updated Fall 205) Regression Learning Task There

More information

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material.

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material. Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material. Unit Map - Statistics Monday - Frequency Charts and Histograms Tuesday -

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

Evaluating Machine-Learning Methods. Goals for the lecture

Evaluating Machine-Learning Methods. Goals for the lecture Evaluating Machine-Learning Methods Mark Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Quality Checking an fmri Group Result (art_groupcheck)

Quality Checking an fmri Group Result (art_groupcheck) Quality Checking an fmri Group Result (art_groupcheck) Paul Mazaika, Feb. 24, 2009 A statistical parameter map of fmri group analyses relies on the assumptions of the General Linear Model (GLM). The assumptions

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

/4 Directions: Graph the functions, then answer the following question.

/4 Directions: Graph the functions, then answer the following question. 1.) Graph y = x. Label the graph. Standard: F-BF.3 Identify the effect on the graph of replacing f(x) by f(x) +k, k f(x), f(kx), and f(x+k), for specific values of k; find the value of k given the graphs.

More information