SYS 6021 Linear Statistical Models

Project 2: Spam Filters
Jinghe Zhang

Summary

The spambase data and time-indexed counts of spams and hams are studied to develop accurate spam filters. Static models are constructed using generalized linear regression, and the recommended model is built with principal component regression on the spam data after a log transformation of the explanatory variables. The selected static spam filter has the smallest numbers of total errors and false positives, 91 and 35, respectively. In this model, the 95% confidence interval of the coefficient of the first principal component is [-2.02, -1.58]. When there is a one-unit change in this variable while holding the others constant, the odds decrease; the lower bound of the 95% confidence interval of the percent change is -86.75%. The time series model for ham captures seasonality and random fluctuation, while that for spam captures trend and fluctuation. The recommended time series models perform better than the other candidates, with smaller MSE and AIC, and the statistical tests on the two recommended models indicate that they are adequate. In the ham seasonality model, the coefficient of Saturday is -4.70 with a confidence interval of [-5.88, -3.52], which indicates that fewer hams are received at the address on Saturdays. In the fluctuation model for spam, the coefficient of ar1 is 1.03 and the 95% confidence interval is [0.90, 1.17]. The recommended static and time series models can be combined using Bayes rule to further improve the performance of the spam filter.

Honor Pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all help and references used in its completion.

November 10, 2013

1 Problem description

1.1 Situation

Nowadays, the development of technology and the wide spread of the internet provide great convenience to our daily life. In particular, communication between different places no longer relies on traditional approaches such as letters and telephone calls. Email plays an increasingly important role: it not only facilitates communication between people but is also a very convenient way to access information. However, along with useful emails, people sometimes receive spams. Some of these junk mails are just ads, which can be annoying. Moreover, many spams carry viruses or malware that are harmful to the host computers, for example by stealing information. According to the National Technology Readiness Survey in 2004, the cost of spam is more than $21 billion annually in terms of lost productivity [1]. Hence, this problem has attracted more and more attention, and many spam filters have been created using various techniques, such as Bayesian filtering [2, 3].

1.2 Goal

The objective of this study is to develop a spam filter in order to detect junk emails. The spam filter can be a static model, a time series model, or an integrated model.

1.3 Metrics

For the static generalized linear regression model, performance is measured by the ROC curve and the number of total errors, i.e., false positives (FP) plus false negatives (FN). Moreover, it is very undesirable in this project to misclassify a good email as a spam, so a spam filter with fewer false positives is preferred. For the time series filter, the performance metric is mean squared error (MSE). For the integrated model, the performance metrics are the same as those of the static model: the ROC curve and the sum of FP and FN.

1.4 Hypotheses

There are two hypotheses in this study:

a. The variables describing word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spams.

b. The time component has some impact on the number of spam emails arriving at the address.

2. Approach

2.1 Data

In this study, spam data are analyzed to develop spam filters. The dataset contains 4601 observations and 58 variables [4]. One of the variables (V58) is the class label, which indicates whether an email is a spam or a ham. Of the other 57 explanatory variables, 48 continuous attributes (V1-V48) describe the frequency of particular words in the email, 6 variables (V49-V54) provide information on the frequency of some characters, and 3 variables (V55-V57) describe the appearance of capital letters in the email. In addition, the numbers of spams and hams received at an email address on different days are available: the spam counts cover August 1, 2004 to July 30, 2005, while the ham counts average 3.89 per day from January 13, 2000 to June 1.

To discover the relationship between the explanatory variables and the response variable, scatter plot matrices are provided. The variables in the scatter plots are divided into three categories: word frequency variables, character frequency variables, and capital letter variables. The matrix in Figure 2.1 studies the correlations between the capital letter variables V55-V57 and the response variable V58. As shown in Figure 2.1, there are strong correlations between V55 and V56, as well as between V56 and V57.

Figure 2.1 Scatter Plots Matrix of V55, V56, V57, and V58

Considering their relationship with the response variable, V57 is selected for modeling the spam filters. In addition, the predictive variables are strongly skewed, and the other scatter plots lead to the same conclusion. Therefore, a log transformation of those predictors is preferred. The scatter plot matrix of the word frequency variables V1-V10 and the response variable V58 is displayed in Figure 2.2, which also indicates strong skewness and some correlation between explanatory variables. The scatter plot matrices of the other variables are given in the Appendix.

Figure 2.2 Scatter Plots Matrix of V1-V10 and V58

Box plots are displayed to show the distribution of the word frequency variables (V1-V48), character frequency variables (V49-V54), and capital letter variables (V55-V57). These box plots show whether a variable is discriminatory in terms of spam emails. Box plots for V1 to V9 are given in Figure 2.3. It is obvious that the value of V3 is discriminatory in terms of spam and ham. The factor plots of the other variables are given in the Appendix.

Figure 2.3 Factor Plots of V1-V9

Based on the information conveyed in those box plots and the scatter plot matrices, 8 variables are selected as predictors, considering both their mutual independence and their predictive power for spam. A statistical summary of the selected predictors is provided in Table 2.1.

Table 2.1 A Statistical Summary of Selected Variables (Min, First Quartile, Median, Mean, Third Quartile, and Max for each of the eight selected variables)

The histograms of these 8 variables are given in Figure 2.4.

Figure 2.4 Histograms of Selected Predictors

We can see that the distributions of these variables are highly skewed, which is consistent with the conclusion from the scatter plot matrices. Therefore, a log transformation is applied to all the potential predictors. The scatter plot matrix in Figure 2.5 shows the relationship between the transformed predictors and the response variable. The scatter plots of the other variables after transformation are given in the Appendix.
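The report does not state the software used; as an illustration, the preprocessing above (loading the spambase data and log-transforming the skewed predictors) could look roughly like the following Python sketch. The file name, the V1-V58 column naming, and the small offset added before taking logs (needed because the frequency variables contain zeros) are all assumptions.

```python
import numpy as np
import pandas as pd

# Load the UCI spambase data (file name is an assumption; the file has no header row)
spam = pd.read_csv("spambase.data", header=None)
spam.columns = [f"V{i}" for i in range(1, 59)]   # V1-V57 predictors, V58 class label

predictors = [f"V{i}" for i in range(1, 58)]

# The frequency variables contain many zeros, so a small offset (0.1 here, an
# assumption) is added before taking logs to reduce the strong right skew.
log_spam = spam.copy()
log_spam[predictors] = np.log(spam[predictors] + 0.1)
```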

Figure 2.5 Scatter Plots Matrix of V1-V10 and V58 after Log Transformation

After the log transformation, the distributions of the predictive variables are less skewed, and there are larger correlations between the predictors and the response variable. The factor plots of some log-transformed variables are provided in Figure 2.6.

Figure 2.6 Factor Plots of V1-V9 after Log Transformation

By observing the box plots of the selected variables in Figure 2.7, we find that there are some outliers.

Figure 2.7 Factor Plots of Selected Predictors

Therefore, the extreme observations are extracted and investigated. After further investigation, the corresponding outliers are excluded from the dataset in order to eliminate bias. The 12 removed observations include 1, 752, 831, 1708, 1489, 1763, 1791, 2694, 2905, 3247, and 3913. Hence, the dataset used for modeling has 4589 observations. Table 2.2 gives a statistical summary of the data without outliers.

Table 2.2 A Statistical Summary of Selected Variables without Outliers (Min, First Quartile, Median, Mean, Third Quartile, and Max for each of the eight selected variables)

The preprocessing discussed above is used to build the static spam filter, and another two datasets are used to construct the time series spam filters. These two time series datasets consist of the date and the number of spams/hams arriving at a particular address [5]. A summary of the two datasets is provided in Table 2.3.

Table 2.3 A Summary of the Ham and Spam Datasets for Time Series Modeling (Min, First Quartile, Median, Mean, Third Quartile, and Max of the daily counts)

In the ham dataset, there are 506 observations from Jan. 13, 2000 to Jun. 1. There are 364 observations from Aug. 1, 2004 to Jul. 30, 2005 in the spam dataset.

2.2 Analysis

2.2.1 Static Analysis for Spam Filter Design

The dataset with 4589 observations and 8 predictors is used to build generalized linear models for spam detection. To examine the discriminatory power of these variables for spam detection, factor plots are shown in Figure 2.7. The dataset is randomly divided into two subsets, a training set and a test set, which account for 2/3 and 1/3 of the data, respectively. All the observations for training and testing are also log transformed so that they can be used to build logistic regression models. At first, two main effect generalized linear models are constructed using the 8 variables, one before and one after log transformation of the predictors.

GLM1 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57

GLM2 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57

A Chi-square test is performed to compare GLM1 with the null model; the p value is less than 2.2e-16, which indicates that GLM1 is significant. Also, every predictor in GLM1 has a p value less than 2.2e-16, so all 8 predictors are significant. Similarly, GLM2 is statistically significant and all the predictors in GLM2 are significant. Then, principal component analysis (PCA) is performed on the training set, and the principal components that account for 90% of the variance of the dataset are used for logistic regression. The biplots of the untransformed and transformed explanatory variables are shown in Figure 2.8 and Figure 2.9, respectively.
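A sketch of the train/test split and the main-effect logistic regression (GLM2) described above, using statsmodels; the random seed is arbitrary, and log_spam refers to the log-transformed data frame from the earlier sketch.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2
from sklearn.model_selection import train_test_split

# 2/3 training, 1/3 test (seed is an arbitrary assumption)
train, test = train_test_split(log_spam, test_size=1/3, random_state=1)

# Main-effect logistic regression on the eight selected predictors (GLM2)
formula = "V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57"
glm2 = smf.glm(formula, data=train, family=sm.families.Binomial()).fit()

# Likelihood ratio (Chi-square) test of GLM2 against the null model
null = smf.glm("V58 ~ 1", data=train, family=sm.families.Binomial()).fit()
lr_stat = 2 * (glm2.llf - null.llf)
p_value = chi2.sf(lr_stat, df=glm2.df_model - null.df_model)
print(glm2.summary(), p_value)
```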

Figure 2.8 Biplot of PCA on Untransformed Data

Figure 2.9 Biplot of PCA on Transformed Data

The variances of the two types of emails on the first two components are displayed in Figure 2.10 and Figure 2.11.

Figure 2.10 Variances of Hams and Spams on First Two Components (Before Transformation)

Figure 2.11 Variances of Hams and Spams on First Two Components (After Transformation)

According to Figures 2.10 and 2.11, before the log transformation of the explanatory variables, ham emails have larger variation on the first component, while spams have larger variation on the second component. After the transformation, however, ham and spam emails both have large variation on the first two components. Then, two PCA regression models (GLM3 and GLM4) are trained on the training data before and after log transformation. Both models are compared with the null model, and both are statistically significant. In addition, interactions between predictors are studied through interaction plots. For example, the interaction plots show some interaction between V16 and V21, and there are also interactions between V3 and V17, V3 and V19, V17 and V57, etc. Interaction terms are therefore added to the main effect models.
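Returning to the PCA regression models GLM3/GLM4 described above, the following sketch keeps the components that explain 90% of the variance and feeds their scores into a logistic regression. Whether the predictors were standardized before the PCA is not stated in the report, so that detail is left out here; the use of sklearn and the variable names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import PCA

predictors = [f"V{i}" for i in range(1, 58)]
X_train, y_train = train[predictors], train["V58"]

# Keep enough principal components to explain 90% of the variance
pca = PCA(n_components=0.90, svd_solver="full")
scores = pca.fit_transform(X_train)
comp_names = [f"Comp.{i + 1}" for i in range(scores.shape[1])]

# Logistic regression on the component scores (GLM4 when the data are log-transformed)
Z = sm.add_constant(pd.DataFrame(scores, index=X_train.index, columns=comp_names))
glm4 = sm.GLM(y_train, Z, family=sm.families.Binomial()).fit()
print(pca.explained_variance_ratio_.sum(), glm4.summary())
```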

GLM5 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57+V3*(V17+V19+V21+V52)+V7*(V21+V52)+V16*(V17+V19+V21+V52)+V17*V57+V19*(V21+V52+V57)+V21*(V52+V57)+V52*V57

GLM6 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57+V3*(V17+V19+V21+V52)+V7*(V21+V52)+V16*(V17+V19+V21+V52)+V17*V57+V19*(V21+V52+V57)+V21*(V52+V57)+V52*V57

The two models are compared with the null model using the Chi-square test, and both GLM5 and GLM6 are statistically significant. Stepwise regression is then performed on the two models, and the resulting stepwise models are listed below:

GLM7 (stepwise on GLM5): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V16:V19 + V16:V21 + V17:V57 + V19:V21 + V19:V57 + V21:V52 + V52:V57

GLM8 (stepwise on GLM6): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V7:V52 + V16:V19 + V17:V57 + V19:V21 + V52:V57

The two stepwise models are also compared with the null model using the Chi-square test and are statistically significant.

2.2.2 Time Series Analysis for Spam Filter Design

To study whether the time component has an influence on the numbers of hams and spams received on a particular day, the counts of these two types of emails are transformed into time series data and plotted in Figure 2.12.

Figure 2.12 Time Series Plots of Ham and Spam Data

The autocorrelation function (ACF) plots are provided in Figure 2.13. They indicate that there is some impact from the time component, since the autocorrelations at lags greater than 0 are significant.

Figure 2.13 ACF Plots of Ham and Spam Data
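ACF plots like those in Figure 2.13 can be produced directly from the daily counts; a sketch (the file names and column names for the ham and spam count series are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Daily counts of hams and spams (file/column names are assumptions)
ham = pd.read_csv("ham_ts.csv", parse_dates=["date"])
spam_ts = pd.read_csv("spam_ts.csv", parse_dates=["date"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(ham["count"], ax=axes[0], title="ACF of daily ham counts")
plot_acf(spam_ts["count"], ax=axes[1], title="ACF of daily spam counts")
plt.show()
```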

The test set method is used to evaluate the time series models: the observations of the last 7 days are extracted as the test set from both the ham and spam data, and the rest are used for training. Then, the trends of the two training datasets are modeled using linear regression, and the significance of the two models is assessed with an F test. For the spam data, the trend is statistically significant, with an F-test p value of 1.05E-5. However, the F test on the trend model for the ham data is not significant. Therefore, there is a significant trend in the spam data but not in the ham data. The plots in Figure 2.14 give a graphical illustration of the trends of the two types of emails.

Figure 2.14 Trends of Ham and Spam Data

To investigate the seasonality of the two datasets, periodograms are plotted in Figure 2.15 to find the peaks and compute the periods.

Figure 2.15 Periodogram Plots of Ham and Spam Data
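The periodogram peaks and implied periods (Figure 2.15) can be checked numerically; a sketch using scipy, with the daily ham counts from the previous sketch:

```python
import numpy as np
from scipy.signal import periodogram

# Periodogram of the daily ham counts (sampling frequency: one observation per day)
freqs, power = periodogram(ham["count"])
peak = freqs[np.argmax(power[1:]) + 1]   # skip the zero-frequency term
print(f"peak frequency = {peak:.3f} cycles/day, period = {1 / peak:.2f} days")
```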

The periodogram peak occurs at a frequency of about 0.14 in the ham data and at a much lower frequency in the spam data, so the corresponding periods are 6.74 and 375 days, respectively. Therefore, the seasonality in ham is weekly, and there is no meaningful seasonality in spam. Consequently, for the ham data, seasonality is modeled and trend is not, while for the spam data, trend is modeled and seasonality is not. A linear regression is built to model the seasonality of ham. The residuals of the spam trend model and the ham seasonality model are studied and plotted in Figure 2.16.

Figure 2.16 Residuals of Ham Seasonality and Spam Trend Models

The ACF and partial ACF (PACF) plots in Figures 2.17 and 2.18 show the autocorrelation of the residuals of the linear regression models for ham and spam.

Figure 2.17 ACF Plots of Residuals of Ham Seasonality and Spam Trend Models
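The weekly seasonality of ham is modeled with day-of-week indicator variables (the season.ham terms that later appear in Table 3.6); a sketch, along with a matching trend regression for spam whose residuals are used below. The date and count column names are assumptions carried over from the earlier sketch.

```python
import statsmodels.formula.api as smf

# Day-of-week dummies for the ham series
ham["day"] = ham["date"].dt.day_name()
season_ham = smf.ols("count ~ C(day)", data=ham).fit()
resid_ham = season_ham.resid          # remaining (random) fluctuation

# Linear trend for the spam series
spam_ts["t"] = range(len(spam_ts))
trend_spam = smf.ols("count ~ t", data=spam_ts).fit()
resid_spam = trend_spam.resid
print(season_ham.f_pvalue, trend_spam.f_pvalue)   # F-test p values for each model
```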

Figure 2.18 PACF Plots of Residuals of Ham Seasonality and Spam Trend Models

In Figure 2.16, the residuals of the ham seasonality model have relatively constant mean and variance, and there is a sinusoidal pattern in the corresponding ACF plot, which indicates a stationary time series. Similarly, the time series of spam residuals is also stationary. In addition, the PACF plots cut off after 2 and 3 lags, respectively. For further investigation, the first order difference of the residual series is also taken, and the differenced residuals are plotted in Figure 2.19.

Figure 2.19 First Order Difference of Residuals

The ACF and partial ACF (PACF) plots in Figures 2.20 and 2.21 show the autocorrelation of the residuals of the ham seasonality and spam trend models after taking the first order difference.

Figure 2.20 ACF Plots of First Order Difference of Residuals

Figure 2.21 PACF Plots of First Order Difference of Residuals

However, since differencing introduces large negative values in the PACF plots, it is preferred not to difference the residuals. The ACF plots in Figure 2.17 show sinusoidal patterns, and the series cut off after 2 and 3 lags in the PACF plots in Figure 2.18. Hence, the autoregressive models AR-Ham and AR-Spam are constructed with orders 2 and 3, respectively. Also, MA-Ham and MA-Spam are two moving average models built for the residual fluctuation of the ham seasonality and spam trend models; both take order 1. Then, ARMA-Ham and ARMA-Spam are constructed for ham and spam, respectively.
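The candidate AR, MA, and ARMA models for the two residual series can be fit with statsmodels' ARIMA class (order = (p, d, q) with d = 0, since no differencing is used). The automated ARIMA selection mentioned below could be reproduced with a tool such as pmdarima's auto_arima, though the report does not say which implementation was used. A sketch, reusing resid_ham and resid_spam from the earlier sketch:

```python
from statsmodels.tsa.arima.model import ARIMA

# Candidates for the ham seasonality residuals
ar_ham = ARIMA(resid_ham, order=(2, 0, 0)).fit()     # AR(2)
ma_ham = ARIMA(resid_ham, order=(0, 0, 1)).fit()     # MA(1)
arma_ham = ARIMA(resid_ham, order=(2, 0, 1)).fit()   # ARMA(2,1)

# Candidates for the spam trend residuals
ar_spam = ARIMA(resid_spam, order=(3, 0, 0)).fit()
ma_spam = ARIMA(resid_spam, order=(0, 0, 1)).fit()
arma_spam = ARIMA(resid_spam, order=(3, 0, 1)).fit()

for name, m in [("AR-Ham", ar_ham), ("MA-Ham", ma_ham), ("ARMA-Ham", arma_ham),
                ("AR-Spam", ar_spam), ("MA-Spam", ma_spam), ("ARMA-Spam", arma_spam)]:
    print(f"{name}: AIC = {m.aic:.1f}")
```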

In addition, no differencing is applied, following the discussion of the differenced residuals above. The ARMA-Ham model takes orders 2 and 1, while the ARMA-Spam model takes orders 3 and 1. Overall, there are three time series models constructed for the seasonality of ham and another three for the trend of spam. An automated method is also used to build ARIMA models, which are discussed in Section 3.2.

2.2.3 Integrated Filter Design

Both static and time series spam filters are built in this project, and they reveal spam information from different perspectives. The static generalized linear regression model returns 1 or 0, indicating whether a particular email is a spam. The time series models give the expected numbers of spams and hams on a particular day, which can be used to derive the probability of receiving a spam on that day. To improve the performance of the spam filter, the two models can be combined using Bayes rule, as shown in Equation 2.1:

Pr(E = i | S = j, T = k) = Pr(S = j, T = k | E = i) Pr(E = i) / Σ_i Pr(S = j, T = k | E = i) Pr(E = i),    (2.1)

where E, S, and T represent whether an email is a spam in reality, the prediction of the static spam filter, and the prediction from the time series model, respectively. Here, i, j, and k are binary, each taking a value in {0, 1} that indicates whether the email is a spam. Assuming the two predictions are conditionally independent given the true class, Equation 2.1 reduces to Equation 2.2:

Pr(E = i | S = j, T = k) ∝ Pr(S = j | E = i) Pr(T = k | E = i).    (2.2)

The first term, Pr(S = j | E = i), is the probability of a true positive when i = 1 and j = 1, and the second term is the probability of spam obtained from the time series model. The probability of a true positive can be computed from the score table of the selected static model using Equation 2.3:

Pr(TP) = TP / (TP + FN).    (2.3)

The probability of spam obtained from the time series model can be computed through Equation 2.4:

Pr(Spam) = #Spams / (#Spams + #Hams).    (2.4)
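One way to read Equations 2.1-2.4 in code: assuming the static prediction S and the time series prediction T are conditionally independent given the true class E, the posterior probability that a flagged email is spam combines the static model's score-table rates with the day-level spam probability. The function below is a hedged sketch of that calculation, not a quotation of the report's implementation.

```python
def combined_spam_probability(tp, fn, fp, tn, n_spam_day, n_ham_day):
    """Posterior Pr(E=1 | S=1, T) following Equations 2.1-2.4.

    tp, fn, fp, tn        -- score-table counts of the selected static model
    n_spam_day, n_ham_day -- predicted spam and ham counts for the day (time series models)
    """
    p_spam_day = n_spam_day / (n_spam_day + n_ham_day)   # Eq. 2.4
    p_pos_given_spam = tp / (tp + fn)                     # Eq. 2.3, Pr(S=1 | E=1)
    p_pos_given_ham = fp / (fp + tn)                      # false positive rate, Pr(S=1 | E=0)
    # Bayes rule (Eq. 2.1) with the conditional-independence simplification (Eq. 2.2)
    numer = p_pos_given_spam * p_spam_day
    denom = numer + p_pos_given_ham * (1 - p_spam_day)
    return numer / denom
```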

In brief, an improved spam filter is obtained by combining the static generalized linear model and the time series models using Bayes rule.

3. Evidence

3.1 Static Filter Design

To measure the performance of the GLM models in Section 2.2.1, their AIC values are shown in Table 3.1.

Table 3.1 AIC of Static Models (AIC for GLM1-GLM8)

The main effect model GLM1 without data transformation has a large AIC, so another main effect model, GLM2, is constructed after log transformation, and its performance is improved compared with GLM1. In the Chi-square tests, all the variables in the main effect models are significant. To further refine the static models, PCA regression is considered, in which the principal components that account for 90% of the variance are used to build the regression models. The PCA regression GLM3 on non-transformed data is not as good as GLM4, which uses transformed data. Sometimes the effect of a predictor on the response variable depends on the value of another variable, so interaction terms are added to the main effect models for a better regression. The interactions between explanatory variables are studied with the interaction plots described in Section 2.2.1. GLM5 and GLM6 are built based on GLM1 and GLM2, respectively. The performance of the static models improves after adding interaction terms to the main effect models. To reduce the complexity and improve the performance of these two models, stepwise regression is performed on both of them, yielding GLM7 and GLM8. From Table 3.1, we can see that the stepwise regressions based on the two antecedent models achieve some further improvement. Overall, the generalized linear models have smaller AIC when they are trained on the log-transformed data. The static models with interaction terms have better accuracy than the main effect models. Moreover, the PCA regression using transformed data provides the best results among all the models described above, which is most likely due to the latent properties discovered by PCA as well as the elimination of multicollinearity between explanatory variables.
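The AIC comparison in Table 3.1 is simply the fitted AIC of each GLM; the stepwise refinement that produces GLM7/GLM8 can be approximated with a greedy backward elimination by AIC. The helper below is only an illustration of that idea (it assumes a formula written as a plain sum of terms), not the exact procedure used in the report; glm2 and glm4 come from the earlier sketches.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def backward_eliminate_by_aic(response, terms, data):
    """Drop one term at a time whenever doing so lowers the AIC."""
    def fit(ts):
        return smf.glm(f"{response} ~ {' + '.join(ts)}", data=data,
                       family=sm.families.Binomial()).fit()
    best = fit(terms)
    improved = True
    while improved and len(terms) > 1:
        improved = False
        for t in list(terms):
            candidate_terms = [x for x in terms if x != t]
            candidate = fit(candidate_terms)
            if candidate.aic < best.aic:
                best, terms, improved = candidate, candidate_terms, True
                break
    return best

# Rank fitted models by AIC, as in Table 3.1
for name, m in sorted({"GLM2": glm2, "GLM4": glm4}.items(), key=lambda kv: kv[1].aic):
    print(f"{name}: AIC = {m.aic:.1f}")
```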

As mentioned in Section 2.2, the test set method is used in this project to evaluate the generalization ability of these models. Table 3.2 is the score table of the static models, illustrating their prediction accuracy on the test set.

Table 3.2 Score Table of Static Models (TN, FP, FN, TP, and Total Errors for GLM1-GLM8)

The goal of this project is to detect spams, and the model with a smaller number of total errors is preferred. GLM4 clearly has the smallest number of total errors. In this project especially, misclassifying a good email as a spam (FP) is very undesirable, so a small FP count is important. Overall, GLM4 has the smallest AIC, the smallest total number of errors, and a small FP count. Therefore, it is the best static model among all the GLMs discussed above. In addition, to assess model performance graphically, the ROC curves are plotted in Figure 3.1.
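The score-table entries in Table 3.2 come from thresholding each model's predicted probabilities on the test set; a sketch for one model, where the 0.5 cutoff is an assumption and glm2 and test come from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

probs = glm2.predict(test)                     # predicted Pr(spam) on the test set
pred = (probs >= 0.5).astype(int)              # 0.5 decision threshold (assumption)
tn, fp, fn, tp = confusion_matrix(test["V58"], pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}  total errors={fp + fn}")
```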

Figure 3.1 ROC Curves of Static Spam Filters

The point (0, 1) corresponds to an ideal prediction model, and a model with a larger area under the curve (AUC) is preferred. It is clear that GLM3 and GLM4 stand out. In particular, GLM4 has the best performance among all the static spam filters, while the main effect model GLM1 has the poorest accuracy. The other spam filters largely overlap with each other. Overall, the static models improve after log transformation of the explanatory variables. To better compare the static models built on log-transformed data, Figure 3.2 gives a clearer plot with only those models.
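ROC curves like those in Figures 3.1 and 3.2 can be drawn from the same predicted probabilities; a sketch for one model (the other models are overlaid the same way):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(test["V58"], probs)
auc = roc_auc_score(test["V58"], probs)

plt.plot(fpr, tpr, label=f"GLM2 (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```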

Figure 3.2 ROC Curves of Static Spam Filters with Transformed Data

In Figure 3.2, GLM4 is the most accurate model among all the spam filters trained on transformed data. GLM6 is a little better than GLM2 because of the added interaction terms. The stepwise regression model GLM8 based on GLM6 does not make much difference, since the ROC curves of these two models largely overlap. Similarly, the ROC curves of the filters built on untransformed data support the same conclusions. Overall, GLM4 provides the best prediction accuracy, so it is proposed as the recommended static spam filter. The 95% confidence intervals of the coefficients of the first five principal components in GLM4 are given in Table 3.3. The complete list of GLM4 coefficients and the associated 95% confidence intervals is given in Table 6.1 in the Appendix.

Table 3.3 Coefficients of the First Five Principal Components in GLM4 (Mean Coefficient, 2.50% Coefficient, 97.50% Coefficient, and the corresponding Percent Changes in the odds, for the Intercept and Comp.1-Comp.5)
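The percent-change columns of Table 3.3 follow from the usual odds-ratio interpretation of logistic regression coefficients: a one-unit increase in a component multiplies the odds by exp(beta), i.e., a percent change of 100(exp(beta) - 1). A quick check against the lower confidence bound quoted below:

```python
import numpy as np

def pct_change_in_odds(beta):
    """Percent change in the odds for a one-unit increase in a predictor."""
    return 100 * (np.exp(beta) - 1)

# Lower 95% bound of the first principal component's coefficient in GLM4
print(round(pct_change_in_odds(-2.02), 2))   # about -86.7%, consistent with the -86.75% quoted
```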

The first three columns in Table 3.3 provide the coefficient of each predictor and its 95% confidence interval. The last three columns show the corresponding percent changes in the odds computed from the first three columns. For example, the 95% confidence interval of the coefficient of the first principal component is [-2.02, -1.58]. When there is a one-unit change in this variable while holding the others constant, the odds decrease; the lower bound of the 95% confidence interval of the percent change is -86.75%. Similarly, for the second component, the confidence interval of the coefficient is [-0.48, -0.02]; a one-unit increase in this component, holding the others constant, changes the odds with a 95% confidence interval for the percent change of [-37.91%, -1.82%]. For the fourth component, the coefficient is positive, with a 95% confidence interval of [0.63, 1.02]. When there is a one-unit increase in this component while the other components are held constant, the odds increase; the lower bound of the 95% confidence interval of the percent increase is 87.80%.

3.2 Time Series Filter Design

The candidate models from Section 2.2.2 include two AR models, for ham and spam, with orders 2 and 3, respectively. The diagnostic plot for AR-Ham is given in Figure 3.3.

Figure 3.3 Diagnostic Plots of AR-Ham

In the diagnostic plot of AR-Ham, the residuals do not show constant variance, and the p values of the Ljung-Box statistic tests indicate that this model is not adequate, although there is no significant correlation between the residuals. To improve the model, an MA term and differencing are considered, which can be combined into an ARIMA model; it turns out that the preferred order of differencing is 0. The diagnostic plot for MA-Ham is displayed in Figure 3.4.

Figure 3.4 Diagnostic Plots of MA-Ham

In Figure 3.4, we can see that the residuals do not have constant variance, and the p values in the statistical tests show that this model is not sufficient. Therefore, ARMA-Ham is constructed, combining AR-Ham and MA-Ham. The diagnostic plot is shown in Figure 3.5.

Figure 3.5 Diagnostic Plots of ARMA-Ham

In Figure 3.5, the residuals are random and the autocorrelation between them is insignificant. Also, the p values of the Ljung-Box tests indicate that this model is sufficient. To further examine the appropriateness of the orders in the ARMA-Ham model, an ARIMA model (ARIMA-Ham) is constructed using an automated method; it takes orders 1, 0, and 2 for the AR, difference, and MA terms, respectively. The diagnostic plot for the ARIMA-Ham model is displayed in Figure 3.6.

Figure 3.6 Diagnostic Plots of ARIMA-Ham

From Figure 3.6, we can see that the residuals are random and the p values of the Ljung-Box statistic tests show that this model is adequate. The AR-Spam model described in Section 2.2.2 takes order 3, and its diagnostic plot is given in Figure 3.7.

Figure 3.7 Diagnostic Plots of AR-Spam

In Figure 3.7, the residuals show constant variance and lack of pattern, and the p values of the Ljung-Box statistic tests indicate that this model is adequate; there is no significant correlation between the residuals. Then, the MA-Spam model is constructed with order 1. The diagnostic plot for MA-Spam is displayed in Figure 3.8.

Figure 3.8 Diagnostic Plots of MA-Spam

In Figure 3.8, the residuals do not have constant variance, and the p values in the statistical tests show that this model is not sufficient. Therefore, ARMA-Spam is constructed, combining AR-Spam and MA-Spam. The diagnostic plot is shown in Figure 3.9.

Figure 3.9 Diagnostic Plots of ARMA-Spam

In Figure 3.9, we can see that the residuals have relatively constant variance, and the p values in the statistical tests show that this model is sufficient. Also, there is no significant correlation between the residuals. To further examine the appropriateness of the orders in the ARMA-Spam model, an ARIMA model (ARIMA-Spam) is constructed using an automated method; it takes orders 1, 0, and 1 for the AR, difference, and MA terms, respectively. The diagnostic plot for the ARIMA-Spam model is displayed in Figure 3.10.

Figure 3.10 Diagnostic Plots of ARIMA-Spam

From the diagnostic plot for ARIMA-Spam, the residuals are random and lack a pattern, and the p values of the Ljung-Box statistic tests indicate that this model is adequate. Table 3.4 provides the AICs of all the time series models discussed in this section.

Table 3.4 AIC of Time Series Models (AIC for AR-Ham, MA-Ham, ARMA-Ham, ARIMA-Ham, AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam)
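Residual diagnostics like those in Figures 3.3-3.10 can be reproduced with statsmodels; the Ljung-Box p values referred to throughout this section come from a test of remaining autocorrelation in the residuals. A sketch for the ARMA-Ham fit from the earlier sketch (the lag choices are assumptions):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox

# Standard diagnostic panel: standardized residuals, histogram, Q-Q plot, residual ACF
arma_ham.plot_diagnostics(figsize=(10, 6))
plt.show()

# Ljung-Box test: large p values indicate no remaining autocorrelation (adequate model)
lb = acorr_ljungbox(arma_ham.resid, lags=[5, 10, 15], return_df=True)
print(lb[["lb_stat", "lb_pvalue"]])
```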

According to the AIC scores of the time series models, the two ARIMA models are the best and the MA models perform the worst. The AR models are better than the MA models, and the ARMA models, which combine AR and MA terms, achieve relatively large improvements. Therefore, considering their AICs, ARIMA-Ham and ARIMA-Spam are the two recommended models based on AIC. The test set method is used to evaluate the predictive performance of the time series models, with the last 7 observations used for testing. The forecasting plots of AR-Ham, MA-Ham, ARMA-Ham, and ARIMA-Ham are shown in Figures 3.11 and 3.12.

Figure 3.11 Forecasting Plots of AR-Ham and MA-Ham

Figure 3.12 Forecasting Plots of ARMA-Ham and ARIMA-Ham

The black line is the number of hams received each day, and the blue line shows the prediction of the numbers of hams received on the last 7 days. The blue shading is the 95% confidence interval of the prediction. To examine the predictions more closely, Figure 3.13 shows only the actual and predicted values on the test set.

Figure 3.13 Prediction Plot of Time Series Models for Ham

According to Figure 3.13, the predictions of the ARIMA-Ham and ARMA-Ham models largely overlap, while AR-Ham and MA-Ham are close to each other. Also, ARMA-Ham and ARIMA-Ham are more accurate, since their predictions are closer to the actual observations than those of the other two models. Table 3.5 provides the predictions of these models on the test set as well as the mean squared error (MSE).

Table 3.5 Prediction Summary of Time Series Models for Ham (actual values and predictions with MSE for AR-Ham, MA-Ham, ARMA-Ham, and ARIMA-Ham)
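The 7-day test-set evaluation summarized in Table 3.5 amounts to refitting on all but the last week, forecasting that week, and computing the MSE against the held-out values. The sketch below evaluates only the residual ARMA component of the ham model; in the report, the fitted day-of-week seasonal values for the test days would be added back before comparing with the actual counts.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hold out the last 7 days of the ham residual series and forecast them
train_resid, test_resid = resid_ham[:-7], resid_ham[-7:]
fit = ARIMA(train_resid, order=(2, 0, 1)).fit()
forecast = fit.forecast(steps=7)

mse = np.mean((np.asarray(test_resid) - np.asarray(forecast)) ** 2)
print(f"7-day test MSE (residual component only) = {mse:.2f}")
```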

According to Table 3.5, ARMA-Ham provides the best prediction in terms of MSE, slightly better than the ARIMA model, while AR-Ham and MA-Ham have larger errors. Considering that the AIC of ARMA-Ham is only slightly worse than that of ARIMA-Ham while the former has higher accuracy in terms of MSE, ARMA-Ham is recommended as the preferred time series model for ham. The coefficients and the 95% confidence intervals are given in Table 3.6.

Table 3.6 Coefficients of Selected Time Series Model for Ham (Coefficient, 2.50% Coefficient, and 97.50% Coefficient for the Ham.season component: Intercept and the Monday, Saturday, Sunday, Thursday, Tuesday, and Wednesday terms; and for the Ham.arma201 component: intercept, ar1, ar2, and ma1)

The time series model for ham consists of seasonality and random fluctuation. In the seasonality model, the 95% confidence interval of the Monday coefficient is [-1.39, 0.97]. The coefficient of Saturday is -4.70, with a confidence interval of [-5.88, -3.52], which indicates that fewer hams are received at the address on Saturdays. Similarly, fewer hams arrive on Sundays. On Tuesday, Thursday, and Friday, however, the coefficients are more likely to be positive, so more hams are received during weekdays. In the fluctuation model, the coefficient of ar1 is 1.09, with a 95% confidence interval of [0.93, 1.24], and the coefficient of ma1 is negative, with a confidence interval of [-0.93, -0.69]. The time series models for spam are also evaluated using the test set method. The forecasting plots of AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam are shown in Figures 3.14 and 3.15.

Figure 3.14 Forecasting Plots of AR-Spam and MA-Spam

Figure 3.15 Forecasting Plots of ARMA-Spam and ARIMA-Spam

The black line is the number of spams received each day, and the blue line shows the prediction of the numbers of spams received on the last 7 days. The blue shading is the 95% confidence interval of the prediction. To examine the predictions more closely, Figure 3.16 shows only the actual and predicted values on the test set.

Figure 3.16 Prediction Plot of Time Series Models for Spam

According to Figure 3.16, the predictions of the ARIMA-Spam and ARMA-Spam models largely overlap, while AR-Spam and MA-Spam are close to each other. Table 3.7 provides the predictions of these models on the test set as well as the mean squared error (MSE).

Table 3.7 Prediction Summary of Time Series Models for Spam (actual values and predictions with MSE for AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam)

According to Table 3.7, MA-Spam provides the best prediction in terms of MSE. However, the AIC of MA-Spam is the worst, and the Ljung-Box test shows that this model is not adequate. Therefore, the ARMA-Spam model is recommended, since it is adequate according to the Ljung-Box test and has relatively small MSE and AIC. The coefficients and the 95% confidence intervals are given in Table 3.8.

Table 3.8 Coefficients of Selected Time Series Model for Spam (Coefficient, 2.50% Coefficient, and 97.50% Coefficient for the spam.trend component: Intercept and time.spam; and for the spam.arma301 component: ar1, ar2, ar3, ma1, and intercept)

The time series model for spam consists of trend and random fluctuation. In the trend model, the coefficient of the time component is 0.02, with a 95% confidence interval of [0.01, 0.03]. In the fluctuation model, the coefficient of ar1 is 1.03, with a 95% confidence interval of [0.90, 1.17]; the confidence interval of the ar2 coefficient is [-0.19, 0.11]; and the coefficient of ma1 is negative, with a confidence interval of [-0.98, -0.81].

In Section 1.4, two hypotheses were proposed. The first hypothesis is that word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spams. The biplot in Figure 2.8 shows the variance of the variables on the first and second components, which are two of the predictors in the selected static model GLM4. Some word frequency variables, e.g., V25 and V40, show large variance on components 1 and 2. Character frequency variables, such as V52, have larger variance on component 1. In addition, capital letter variables such as V56 and V57 show large variance on the first and second components. Hence, word frequency variables, character frequency variables, and capital letter variables are predictive of spam, and the first hypothesis is supported. The second hypothesis proposes that the time component has some impact on the number of spam emails arriving at the address. In the selected time series models for spam and ham, there is trend in spam and seasonality in ham, so the time component has some influence on the number of spams. Hence, both hypotheses are supported.

4. Recommendation

Two types of spam filters are built in this project: a static generalized linear model and time series filters. The selected static model, GLM4, is constructed using PCA regression on the spam data after log transformation of the explanatory variables. Word frequency, character frequency, and capital letter variables are predictive of spams. The performance of this model is significantly better than all the other candidate static models described in Section 2.2.1: its AIC is the smallest in Table 3.1, and according to the ROC curves in Figure 3.2, it has the best performance. In addition, when evaluated on the test set, it provides the smallest numbers of total errors and FPs, which are 91 and 35, respectively, according to Table 3.2.

In the time series modeling, the seasonality and random fluctuation of ham are modeled to predict the number of hams received at the address on a particular day, and the number of spams is modeled through trend and random fluctuation. ARMA-Ham and ARMA-Spam are recommended because of their good prediction accuracy on the test sets. The selected ham time series model achieves an MSE of 8.51 on the test set, and the MSE of the recommended spam time series model on its test set is reported in Table 3.7. Both models also have relatively small AIC and prove to be adequate according to the Ljung-Box tests. Therefore, these two models are selected to predict spams: the probability of receiving spam on a particular day can be derived from them, so they can be used as a spam filter. To further improve performance, the recommended static model and the selected ham and spam time series models can be combined using Bayes rule, as shown in Equation 2.1. The probability that a particular email is spam can then be computed from the prediction of the static model and the probability of receiving spam on that day, which is available from the two time series models.

5. Reference

[1] C. Thomas, Spam Costs Billions, Information Week, February. [Online].

[2] D. E. Brown and L. Barnes, Project 2: Spam Filters, October 10, 2013, assignment in class SYS 6021.

[3] D. E. Brown and L. Barnes, Project 2 Template, October 10, 2013, assignment in class SYS 6021.

[4] UCI Machine Learning Repository, Spambase Data Set, July. [Online].

[5] D. E. Brown and L. Barnes, Spam and Ham Data, October 10, 2013, assignment in class SYS 6021.

6. Appendix

Figure 6.1 Scatter Plot Matrix of V21-V30 and V58

Figure 6.2 Scatter Plot Matrix of V31-V40 and V58

Figure 6.3 Scatter Plot Matrix of V41-V48 and V58

Figure 6.4 Scatter Plot Matrix of V49-V54 and V58

Figure 6.5 Scatter Plots Matrix of V11-V20 and V58 after Log Transformation

Figure 6.6 Scatter Plots Matrix of V21-V30 and V58 after Log Transformation

Figure 6.7 Scatter Plots Matrix of V31-V40 and V58 after Log Transformation

Figure 6.8 Scatter Plots Matrix of V41-V49 and V58 after Log Transformation

Figure 6.9 Scatter Plots Matrix of V49-V54 and V58 after Log Transformation

Figure 6.10 Scatter Plots Matrix of V55-V57 and V58 after Log Transformation

Figure 6.11 Factor Plots of V10-V18

Figure 6.12 Factor Plots of V19-V27

Figure 6.13 Factor Plots of V28-V36

Figure 6.14 Factor Plots of V37-V45

Figure 6.15 Factor Plots of V46-V54

Figure 6.16 Factor Plots of V55-V57

Table 6.1 Coefficient List of GLM4 (Mean Coefficient, 2.50% Coefficient, 97.50% Coefficient, and the corresponding Percent Changes for the Intercept and each principal component)



More information

How to use FSBForecast Excel add-in for regression analysis (July 2012 version)

How to use FSBForecast Excel add-in for regression analysis (July 2012 version) How to use FSBForecast Excel add-in for regression analysis (July 2012 version) FSBForecast is an Excel add-in for data analysis and regression that was developed at the Fuqua School of Business over the

More information

The Automation of the Feature Selection Process. Ronen Meiri & Jacob Zahavi

The Automation of the Feature Selection Process. Ronen Meiri & Jacob Zahavi The Automation of the Feature Selection Process Ronen Meiri & Jacob Zahavi Automated Data Science http://www.kdnuggets.com/2016/03/automated-data-science.html Outline The feature selection problem Objective

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Solution to Bonus Questions

Solution to Bonus Questions Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample

More information

7. Collinearity and Model Selection

7. Collinearity and Model Selection Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D. Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D. Introduction to Minitab The interface for Minitab is very user-friendly, with a spreadsheet orientation. When you first launch Minitab, you will see

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

VCEasy VISUAL FURTHER MATHS. Overview

VCEasy VISUAL FURTHER MATHS. Overview VCEasy VISUAL FURTHER MATHS Overview This booklet is a visual overview of the knowledge required for the VCE Year 12 Further Maths examination.! This booklet does not replace any existing resources that

More information

YEAR 12 Trial Exam Paper FURTHER MATHEMATICS. Written examination 1. Worked solutions

YEAR 12 Trial Exam Paper FURTHER MATHEMATICS. Written examination 1. Worked solutions YEAR 12 Trial Exam Paper 2016 FURTHER MATHEMATICS Written examination 1 s This book presents: worked solutions, giving you a series of points to show you how to work through the questions mark allocations

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

Brief Guide on Using SPSS 10.0

Brief Guide on Using SPSS 10.0 Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

More information

Assignments Fill out this form to do the assignments or see your scores.

Assignments Fill out this form to do the assignments or see your scores. Assignments Assignment schedule General instructions for online assignments Troubleshooting technical problems Fill out this form to do the assignments or see your scores. Login Course: Statistics W21,

More information

Data Management - 50%

Data Management - 50% Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define

More information

IBM SPSS Forecasting 24 IBM

IBM SPSS Forecasting 24 IBM IBM SPSS Forecasting 24 IBM Note Before using this information and the product it supports, read the information in Notices on page 59. Product Information This edition applies to ersion 24, release 0,

More information

Video Traffic Modeling Using Seasonal ARIMA Models

Video Traffic Modeling Using Seasonal ARIMA Models Video Traffic Modeling Using Seasonal ARIMA Models Abdel-Karim Al-Tamimi and Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at:

More information

Technical Support Minitab Version Student Free technical support for eligible products

Technical Support Minitab Version Student Free technical support for eligible products Technical Support Free technical support for eligible products All registered users (including students) All registered users (including students) Registered instructors Not eligible Worksheet Size Number

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review:

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review: Following are three examples of calculations for MCP employees (undefined hours of work) and three examples for MCP office employees. Examples use the data from the table below. For your calculations use

More information

Learn What s New. Statistical Software

Learn What s New. Statistical Software Statistical Software Learn What s New Upgrade now to access new and improved statistical features and other enhancements that make it even easier to analyze your data. The Assistant Data Customization

More information

DM4U_B P 1 W EEK 1 T UNIT

DM4U_B P 1 W EEK 1 T UNIT MDM4U_B Per 1 WEEK 1 Tuesday Feb 3 2015 UNIT 1: Organizing Data for Analysis 1) THERE ARE DIFFERENT TYPES OF DATA THAT CAN BE SURVEYED. 2) DATA CAN BE EFFECTIVELY DISPLAYED IN APPROPRIATE TABLES AND GRAPHS.

More information

Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very

Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very bottom of the screen: 4 The univariate statistics command

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Machine Learning. Topic 4: Linear Regression Models

Machine Learning. Topic 4: Linear Regression Models Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from wikipedia and books by Alpaydin, Duda/Hart/ Stork, and Bishop. Updated Fall 205) Regression Learning Task There

More information

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material.

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material. Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material. Unit Map - Statistics Monday - Frequency Charts and Histograms Tuesday -

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

Evaluating Machine-Learning Methods. Goals for the lecture

Evaluating Machine-Learning Methods. Goals for the lecture Evaluating Machine-Learning Methods Mark Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Quality Checking an fmri Group Result (art_groupcheck)

Quality Checking an fmri Group Result (art_groupcheck) Quality Checking an fmri Group Result (art_groupcheck) Paul Mazaika, Feb. 24, 2009 A statistical parameter map of fmri group analyses relies on the assumptions of the General Linear Model (GLM). The assumptions

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

/4 Directions: Graph the functions, then answer the following question.

/4 Directions: Graph the functions, then answer the following question. 1.) Graph y = x. Label the graph. Standard: F-BF.3 Identify the effect on the graph of replacing f(x) by f(x) +k, k f(x), f(kx), and f(x+k), for specific values of k; find the value of k given the graphs.

More information