SYS 6021 Linear Statistical Models
Project 2: Spam Filters
Jinghe Zhang

Summary

The spambase data and time-indexed counts of spams and hams are studied to develop accurate spam filters. Static models are constructed using generalized linear regression, and the recommended model is built with principal component regression on the spam data after log transformation of the explanatory variables. The selected static spam filter has the smallest numbers of total errors and false positives, 91 and 35, respectively. In this model, the coefficient of the first principal component is -1.80, with a 95% confidence interval of [-2.02, -1.58]. For a one unit change in this variable, holding the others constant, the percent change in the odds is -83.44%, with a 95% confidence interval of [-86.75%, -79.37%]. The time series model for ham captures seasonality and random fluctuation, while that for spam captures trend and fluctuation. The recommended models perform statistically better than the other candidates, with smaller MSE and AIC, and the statistical tests on the two recommended models indicate that they are adequate. In the ham seasonality model, the coefficient of Saturday is -4.70, with a confidence interval of [-5.88, -3.52], indicating that fewer hams are received at the email address on Saturdays. In the fluctuation model of spam, the coefficient of ar1 is 1.03, with a 95% confidence interval of [0.90, 1.17]. The recommended static and time series models can be combined using Bayes rule to further improve the performance of spam filters.

Honor Pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all help and references used in its completion.

November 10, 2013

1 Problem description

1.1 Situation

Nowadays, the development of technology and the wide spread of the internet provide great convenience in our daily life. In particular, communication across distances no longer relies on traditional approaches such as letters and telephones. Email plays an increasingly important role: it not only facilitates communication between people but is also a very convenient way to access information. However, along with useful emails, people sometimes receive spam. Some junk mails are merely ads, which can be annoying; moreover, many spam emails carry viruses or malware that are harmful to the host computers, for example by stealing information. According to the National Technology Readiness Survey in 2004, spam costs more than $21 billion annually in terms of lost productivity [1]. Hence, this problem has attracted more and more attention, and many spam filters have been created using various techniques, such as Bayesian filtering [2, 3].

1.2 Goal

The objective of this study is to develop a spam filter to detect junk emails. The spam filter can be a static, time series, or integrated model.

1.3 Metrics

For the static generalized linear regression model, performance is measured by the ROC curve and the number of total errors, i.e., false positives (FP) plus false negatives (FN). Moreover, misclassifying a good email as spam is very undesirable in this project, so a spam filter with fewer false positives is preferred. For the time series filter, the performance metric is mean squared error (MSE). For the integrated model, the performance metrics are the same as those of the static model: the ROC curve and the sum of FP and FN.

1.4 Hypotheses

There are two hypotheses in this study:

a. The variables describing word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spam.

b. The time component has some impact on the number of spam emails arriving at the email address.

2. Approach

2.1 Data

In this study, spam data are analyzed to develop spam filters. The dataset contains 4601 observations and 58 variables [4]. One variable (V58) is the class label indicating whether an email is spam or ham. Of the other 57 explanatory variables, 48 continuous attributes (V1-V48) describe the frequency of particular words in the email, 6 variables (V49-V54) provide information on the frequency of certain characters, and 3 variables (V55-V57) describe the appearance of capital letters in the email. In addition, the numbers of spams and hams received at an email address on different days are available. On average, 27.37 spams were received each day from August 1, 2004 to July 30, 2005, while 3.89 hams were received each day from January 13, 2000 to June 1, 2001.

To discover the relationships between the explanatory variables and the response variable, scatter plot matrices are examined. The variables are divided into three categories: word frequency variables, character frequency variables, and capital letter variables. The matrix in Figure 2.1 shows the correlations between the capital letter variables V55-V57 and the response variable V58. As shown in Figure 2.1, there are strong correlations between V55 and V56, as well as between V56 and V57.

Figure 2.1 Scatter Plot Matrix of V55, V56, V57, and V58

Considering their relationships with the response variable, V57 is selected for modeling the spam filters. In addition, the predictive variables are strongly skewed; the other scatter plot matrices lead to the same conclusion. Therefore, a log transformation of those predictors is preferred. The scatter plot matrix of the word frequency variables V1-V10 and the response variable V58 is displayed in Figure 2.2, which also indicates strong skewness and some correlation between explanatory variables. The scatter plot matrices of the other variables are given in the Appendix.
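Since the report does not show its code, the exploration above can be illustrated with a short Python sketch; the file name and column naming are assumptions:

    import pandas as pd
    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # Spambase: 57 explanatory variables V1-V57 plus the class label V58.
    spam = pd.read_csv("spambase.data", header=None,
                       names=[f"V{i}" for i in range(1, 59)])

    # Scatter plot matrix of the capital letter variables and the label,
    # analogous to Figure 2.1.
    scatter_matrix(spam[["V55", "V56", "V57", "V58"]], figsize=(8, 8))
    plt.show()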

Figure 2.2 Scatter Plot Matrix of V1-V10 and V58

Box plots are displayed to show the distributions of the word frequency variables (V1-V48), the character frequency variables (V49-V54), and the capital letter variables (V55-V57). These box plots show whether a variable is discriminatory with respect to spam. Box plots for V1 to V9 are given in Figure 2.3; it is evident that V3 discriminates between spam and ham. The factor plots of the other variables are given in the Appendix.

Figure 2.3 Factor Plots of V1-V9

Based on the information conveyed in the box plots and the scatter plot matrices, 8 variables are selected as predictors, considering both their mutual independence and their predictive power for spam. A statistical summary of the selected predictors is provided in Table 2.1.

Table 2.1 A Statistical Summary of Selected Variables

Variable  Min   First Quartile  Median  Mean    Third Quartile  Max
V3        0.00  0.00            0.00    0.28    0.42            5.10
V7        0.00  0.00            0.00    0.11    0.00            7.27
V16       0.00  0.00            0.00    0.25    0.10            20.00
V17       0.00  0.00            0.00    0.14    0.00            7.14
V19       0.00  0.00            1.31    1.66    2.64            18.75
V21       0.00  0.00            0.22    0.81    1.27            11.11
V52       0.00  0.00            0.00    0.27    0.32            32.48
V57       1.00  35.00           95.00   283.30  266.00          15841.00

The histograms of these 8 variables are given in Figure 2.4.

Figure 2.4 Histograms of Selected Predictors

The distributions of these variables are highly skewed, which is consistent with the conclusion from the scatter plot matrices. Therefore, a log transformation is applied to all the potential predictors. The scatter plot matrix in Figure 2.5 shows the relationships between the transformed predictive variables and the response variable. The scatter plot matrices of the other transformed variables are given in the Appendix.
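A minimal sketch of the transformation, continuing the Python sketch above; the offset 0.1 is an assumption, since some offset is needed because many variables contain zeros:

    import numpy as np

    predictors = [f"V{i}" for i in range(1, 58)]
    spam_log = spam.copy()
    # Shift away from zero before taking logs; the offset value is a choice.
    spam_log[predictors] = np.log(spam_log[predictors] + 0.1)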

Figure 2.5 Scatter Plot Matrix of V1-V10 and V58 after Log Transformation

After the log transformation, the distributions of the predictive variables are less skewed, and the correlations between the predictors and the response variable are larger. The factor plots of some log-transformed variables are provided in Figure 2.6.

Figure 2.6 Factor Plots of V1-V9 after Log Transformation

The box plots of the selected variables in Figure 2.7 reveal some outliers.

Figure 2.7 Factor Plots of Selected Predictors

Therefore, the extreme observations are extracted and investigated. After further investigation, the corresponding outliers are excluded from the dataset in order to eliminate bias. The 12 removed observations are 1, 752, 831, 1708, 1489, 1763, 1791, 2694, 2905, 3247, 3913, and 4464. Hence, the dataset used for modeling has 4589 observations. Table 2.2 gives a statistical summary of the data without outliers.

Table 2.2 A Statistical Summary of Selected Variables (without outliers)

Variable  Min   First Quartile  Median  Mean    Third Quartile  Max
V3        0.00  0.00            0.00    0.28    0.42            4.54
V7        0.00  0.00            0.00    0.11    0.00            5.40
V16       0.00  0.00            0.00    0.24    0.10            10.16
V17       0.00  0.00            0.00    0.14    0.00            5.12
V19       0.00  0.00            1.31    1.66    2.64            14.28
V21       0.00  0.00            0.22    0.81    1.27            9.52
V52       0.00  0.00            0.00    0.26    0.32            19.13
V57       1.00  35.00           95.00   280.50  266.00          10062.00
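The removal could look like the following, continuing the sketch above (assuming the listed indices are 1-based row numbers):

    # The 12 extreme observations listed above, shifted to 0-based indexing.
    outliers = [1, 752, 831, 1708, 1489, 1763, 1791,
                2694, 2905, 3247, 3913, 4464]
    spam_clean = spam_log.drop(index=[i - 1 for i in outliers]).reset_index(drop=True)
    assert len(spam_clean) == 4601 - 12  # 4589 observations remain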

The preprocessing discussed above produces the dataset used to build the static spam filter; another two datasets are used to construct the time series spam filters. These two time series datasets consist of the date and the number of spams/hams arriving at a particular email address [5]. A summary of the two datasets is provided in Table 2.3.

Table 2.3 A Summary of Ham and Spam Datasets for Time Series Modeling

Dataset  Min   First Quartile  Median  Mean   Third Quartile  Max
Ham      0.00  0.00            3.00    3.89   6.00            27.00
Spam     0.00  22.00           26.00   27.37  32.00           72.00

The ham dataset contains 506 observations from Jan. 13, 2000 to Jun. 1, 2001, and the spam dataset contains 364 observations from Aug. 1, 2004 to Jul. 30, 2005.

2.2 Analysis

2.2.1 Static Analysis for Spam Filter Design

The dataset with 4589 observations and 8 predictors is used to build generalized linear models for spam detection. The factor plots in Figure 2.7 show how discriminatory these variables are for spam detection. The dataset is randomly divided into two subsets, a training set and a testing set, which account for 2/3 and 1/3 of the data, respectively. The log-transformed observations are used for both training and testing of the logistic regression models.

First, two main effect generalized linear models are constructed using the 8 variables, one before and one after log transformation of the predictors:

GLM1 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57
GLM2 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57

A Chi-square test comparing GLM1 with the null model yields a p value less than 2.2e-16, which indicates that GLM1 is significant. Every predictor in GLM1 also has a p value less than 2.2e-16, so all 8 predictors are significant. Similarly, GLM2 is statistically significant, as are all of its predictors.

Then, principal component analysis (PCA) is performed on the training set, and the principal components that capture 90% of the variance of the dataset are used for logistic regression. The biplots of the untransformed and transformed explanatory variables are shown in Figures 2.8 and 2.9, respectively.
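A sketch of the train/test split and the main effect model fit described above, assuming statsmodels; the random seed is an arbitrary choice:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from scipy import stats

    # 2/3 training, 1/3 testing.
    train = spam_clean.sample(frac=2/3, random_state=1)
    test = spam_clean.drop(train.index)

    # Main effect model on the 8 selected (log-transformed) predictors.
    glm = smf.glm("V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57",
                  data=train, family=sm.families.Binomial()).fit()

    # Likelihood ratio (Chi-square) test against the null model.
    null = smf.glm("V58 ~ 1", data=train,
                   family=sm.families.Binomial()).fit()
    lr_stat = 2 * (glm.llf - null.llf)
    p_value = stats.chi2.sf(lr_stat, df=glm.df_model)
    print(glm.summary())
    print("LR test p value:", p_value)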

Figure 2.8 Biplot of PCA on Untransformed Data

Figure 2.9 Biplot of PCA on Transformed Data
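A sketch of the PCA step behind these biplots and of the principal component regression (GLM3/GLM4), assuming scikit-learn; whether the report standardized the predictors first is an assumption:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import statsmodels.api as sm

    X_train = train.drop(columns="V58")
    Z = StandardScaler().fit_transform(X_train)

    # Keep enough components to explain 90% of the variance.
    pca = PCA(n_components=0.90).fit(Z)
    scores = pca.transform(Z)
    print(pca.n_components_, "components retained")

    # Logistic regression on the retained principal components.
    pc_glm = sm.Logit(train["V58"].to_numpy(), sm.add_constant(scores)).fit()
    print(pc_glm.summary())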

The variances of the two types of emails on the first two components are displayed in Figures 2.10 and 2.11.

Figure 2.10 Variances of Hams and Spams on First Two Components (Before Transformation)

Figure 2.11 Variances of Hams and Spams on First Two Components (After Transformation)

According to Figures 2.10 and 2.11, before log transformation of the explanatory variables, ham emails have larger variation on the first component, while spams have larger variation on the second component. After the transformation, both ham and spam emails have large variation on the first two components. Two PCA regression models (GLM3 and GLM4) are then trained using the training data before and after log transformation. Both are compared with the null model and both are statistically significant.

In addition, interactions between predictors are studied through interaction plots. For example, there is some interaction between V16 and V21, as shown in the interaction plots, as well as between V3 and V17, V3 and V19, V17 and V57, etc. Interaction terms are therefore added to the main effect models.

GLM5 (no transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57 + V3*(V17+V19+V21+V52) + V7*(V21+V52) + V16*(V17+V19+V21+V52) + V17*V57 + V19*(V21+V52+V57) + V21*(V52+V57) + V52*V57

GLM6 (log transformation): V58 ~ V3+V7+V16+V17+V19+V21+V52+V57 + V3*(V17+V19+V21+V52) + V7*(V21+V52) + V16*(V17+V19+V21+V52) + V17*V57 + V19*(V21+V52+V57) + V21*(V52+V57) + V52*V57

The two models are compared with the null model using the Chi-square test, and both GLM5 and GLM6 are statistically significant. Stepwise regression is then performed on the two models, giving:

GLM7 (stepwise on GLM5): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V16:V19 + V16:V21 + V17:V57 + V19:V21 + V19:V57 + V21:V52 + V52:V57

GLM8 (stepwise on GLM6): V58 ~ V3 + V7 + V16 + V17 + V19 + V21 + V52 + V57 + V3:V17 + V3:V21 + V7:V52 + V16:V19 + V17:V57 + V19:V21 + V52:V57

The two stepwise models are also compared with the null model using the Chi-square test, and both are statistically significant.

2.2.2 Time Series Analysis for Spam Filter Design

To study whether the time component has an influence on the numbers of hams and spams received on a particular day, the counts of these two types of emails are transformed into time series and plotted in Figure 2.12.
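A sketch of building the daily count series and the ACF plots in Figures 2.12 and 2.13; the file and column names are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf

    ham = pd.read_csv("ham_ts.csv")["count"]       # daily ham counts
    spam_ct = pd.read_csv("spam_ts.csv")["count"]  # daily spam counts

    fig, axes = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(ham, lags=30, ax=axes[0], title="ACF of ham counts")
    plot_acf(spam_ct, lags=30, ax=axes[1], title="ACF of spam counts")
    plt.show()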

Figure 2.12 Time Series Plots of Ham and Spam Data

The autocorrelation function (ACF) plots are provided in Figure 2.13. The significant autocorrelations at nonzero lags indicate that the time component has some impact on the counts.

Figure 2.13 ACF Plots of Ham and Spam Data

The test set method is used to evaluate the time series models: the observations of the last 7 days are extracted as the test set from both the ham and spam data, and the rest are used for training. The trends of the two training datasets are then modeled using linear regression, and the significance of each model is measured with an F test. For the spam data, the trend is statistically significant, since the p value of the associated F test is 1.05e-5, which is less than 0.05. For the ham data, however, the p value of the F test on the trend model is 0.4733. Therefore, there is a significant trend in the spam data but not in the ham data. The plots in Figure 2.14 give a graphical illustration of the trends for the two types of emails.

Figure 2.14 Trends of Ham and Spam Data

To investigate the seasonality of the two datasets, periodograms are plotted in Figure 2.15 to find the peaks and compute the periods.

Figure 2.15 Periodogram Plots of Ham and Spam Data
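The trend F tests and the periodogram peaks can be obtained with a sketch like the following, continuing the sketch above:

    import numpy as np
    import statsmodels.api as sm
    from scipy.signal import periodogram

    # Hold out the last 7 days of each series for testing.
    train_ham, test_ham = ham[:-7], ham[-7:]
    train_spam, test_spam = spam_ct[:-7], spam_ct[-7:]

    # Linear trend: regress the count on a time index; the F test measures
    # the significance of the trend.
    t = np.arange(len(train_spam))
    trend = sm.OLS(train_spam, sm.add_constant(t)).fit()
    print("trend F p value:", trend.f_pvalue)

    # Periodogram: the frequency of the largest peak gives the period.
    freqs, power = periodogram(train_ham)
    peak = freqs[np.argmax(power[1:]) + 1]  # skip the zero frequency
    print("peak frequency:", peak, "period:", 1 / peak)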

The peaks are at frequencies of 0.14 and 0.0027 for the two datasets, respectively, so the corresponding periods are 6.74 and 375 days. Hence, there is weekly seasonality in the ham data and no seasonality in the spam data (a period of 375 days exceeds the length of that series). Therefore, seasonality but not trend is modeled for the ham data, while trend but not seasonality is modeled for the spam data. Consequently, a linear regression is built to model the seasonality of ham. The residuals of the spam trend model and the ham seasonality model are plotted in Figure 2.16.

Figure 2.16 Residuals of Ham Seasonality and Spam Trend Models

The ACF and partial ACF (PACF) plots in Figures 2.17 and 2.18 show the autocorrelation of the residuals of the linear regression models for ham and spam.

Figure 2.17 ACF Plots of Residuals of Ham Seasonality and Spam Trend Models

Figure 2.18 PACF Plots of Residuals of Ham Seasonality and Spam Trend Models

In Figure 2.16, the residuals of the ham seasonality model have relatively constant mean and variance, and there is a sinusoidal decay pattern in the corresponding ACF plot, which indicates a stationary time series. Similarly, the time series of spam residuals is also stationary. In addition, the PACFs cut off after 2 and 3 lags, respectively. For further investigation, the first order difference of the residuals is also taken; the differenced series are plotted in Figure 2.19.

Figure 2.19 First Order Difference of Residuals

The ACF and PACF plots in Figures 2.20 and 2.21 show the autocorrelation of the residuals of the ham seasonality and spam trend models after taking the first order difference.

Figure 2.20 ACF Plots of First Order Difference of Residuals

Figure 2.21 PACF Plots of First Order Difference of Residuals

However, since there are large negative values in these plots, it is preferred not to difference the residuals. The ACF plots in Figure 2.17 show sinusoidal patterns, and the PACF plots in Figure 2.18 cut off after 2 and 3 lags. Hence, the autoregression models AR-Ham and AR-Spam are constructed with orders 2 and 3, respectively. MA-Ham and MA-Spam are moving average models, both of order 1, built to model the residuals of the ham seasonality and spam trend models. Then, ARMA-Ham and ARMA-Spam are two models

constructed for ham and spam, respectively. The difference is not modeled, following the discussion of Figure 2.21. The ARMA-Ham model takes orders 2 and 1, while the ARMA-Spam model takes orders 3 and 1. Overall, three time series models are constructed for the seasonality residuals of ham and another three for the trend residuals of spam. An automated method is also adopted to build ARIMA models, which are discussed in Section 3.2.

2.2.3 Integrated Filter Design

Both static and time series spam filters are built in this project, and they reveal spam information from different perspectives. The static generalized linear regression model outputs 1 or 0, indicating whether a particular email is spam. The time series models give the numbers of spams and hams expected on a particular day, from which the probability of receiving a spam on that day can be derived. To improve the performance of the spam filter, the two models can be combined using Bayes rule, as displayed in Equation 2.1:

Pr(E = i | S = j, T = k) = Pr(S = j, T = k | E = i) Pr(E = i) / Σ_{i'} Pr(S = j, T = k | E = i') Pr(E = i')    (2.1)

where E, S, and T represent whether an email is actually spam, the prediction of the static spam filter, and the prediction from the time series model, respectively. Here i, j, and k are binary, taking values in {0, 1}, where 1 indicates spam. Assuming S and T are conditionally independent given E, the numerator of Equation 2.1 factorizes as in Equation 2.2:

Pr(E = i | S = j, T = k) ∝ Pr(S = j | E = i) Pr(T = k | E = i) Pr(E = i)    (2.2)

The first term, Pr(S = j | E = i), is the true positive probability when i = 1 and j = 1; the second term is the probability of spam obtained from the time series model. The true positive probability can be computed from the score table of the selected static model using Equation 2.3:

Pr(TP) = TP / (TP + FN)    (2.3)

The probability of spam obtained from the time series model can be computed using Equation 2.4:

Pr(spam) = #Spams / (#Spams + #Hams)    (2.4)
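A minimal sketch of the combination in Equations 2.1-2.4. The likelihood values are assumptions derived from the GLM4 score table reported later (TP rate 567/(567+56) ≈ 0.91, FP rate 35/(35+872) ≈ 0.04), and the daily counts would come from the two time series forecasts:

    def prob_spam(static_says_spam, spam_count, ham_count,
                  p_flag_given_spam=0.91, p_flag_given_ham=0.04):
        """Posterior probability that an email is spam, given the static
        filter's label and the day's predicted spam/ham volumes."""
        prior_spam = spam_count / (spam_count + ham_count)  # Equation 2.4
        if static_says_spam:
            like_spam, like_ham = p_flag_given_spam, p_flag_given_ham
        else:
            like_spam, like_ham = 1 - p_flag_given_spam, 1 - p_flag_given_ham
        num = like_spam * prior_spam             # numerator of Equation 2.2
        den = num + like_ham * (1 - prior_spam)  # denominator of Equation 2.1
        return num / den

    # Example: the static filter flags an email on a day with 30 predicted
    # spams and 4 predicted hams.
    print(prob_spam(True, spam_count=30, ham_count=4))

Treating the time series output as a day-specific prior is what lets the two filters be combined without retraining either one.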

In brief, an improved spam filter is obtained by combining the static generalized linear model and the time series models using Bayes rule.

3. Evidence

3.1 Static Filter Design

To measure the performance of the GLM models in Section 2.2.1, their AICs are shown in Table 3.1.

Table 3.1 AIC of Static Models

GLM    AIC
GLM1   2362.82
GLM2   1971.18
GLM3   1512.35
GLM4   1071.75
GLM5   2010.32
GLM6   1894.23
GLM7   1999.73
GLM8   1881.90

The main effect model GLM1, built without data transformation, has a large AIC; the main effect model GLM2, built after log transformation, improves on it. In the Chi-square tests, all variables in the main effect models are significant. To further refine the static models, PCA regression is considered, in which the principal components accounting for 90% of the variance are used to build the regression models. The PCA regression GLM3 on untransformed data is not as good as GLM4 on transformed data. Because the effect of a predictor on the response variable can depend on the value of another variable, interaction terms are added to the main effect models for a better regression; the interactions are studied via the interaction plots discussed in Section 2.2.1. GLM5 and GLM6 are built from GLM1 and GLM2, respectively, and the performance of the static models improves after adding the interaction terms. To reduce complexity and further improve these two models, stepwise regression is performed on both, yielding GLM7 and GLM8. From Table 3.1, the stepwise regressions achieve some improvement over their antecedent models. Overall, the generalized linear models have smaller AIC when trained on the log-transformed data, the models with interaction terms are more accurate than the main effect models, and the PCA regression on transformed data provides the best results among all the models described above,

which is most likely due to the latent structure discovered by PCA as well as the elimination of multicollinearity between the explanatory variables.

As mentioned in Section 2.2, the test set method is used to evaluate the generalization ability of the models. Table 3.2 scores the prediction accuracy of the static models on the test set.

Table 3.2 Score Table of Static Models

GLM    TN   FP   FN   TP   Total Errors
GLM1   840  67   151  472  218
GLM2   845  62   123  500  185
GLM3   869  38   79   544  117
GLM4   872  35   56   567  91
GLM5   864  43   131  492  174
GLM6   852  55   117  506  172
GLM7   860  47   129  494  176
GLM8   851  56   115  508  171

The goal of this project is to detect spam, so the model with the smaller number of total errors is preferred; GLM4 has the smallest. Because misclassifying a good email as spam (FP) is especially undesirable in this project, a small FP count is also important. Overall, GLM4 has the smallest AIC, the smallest total number of errors, and the smallest FP count, so it is the optimal static model among all the GLMs discussed above. In addition, to compare model performance graphically, the ROC curves are plotted in Figure 3.1.
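A sketch of scoring a fitted model on the test set and of the ROC computation behind Figure 3.1, assuming scikit-learn and continuing the sketches above; the 0.5 threshold is an assumption:

    from sklearn.metrics import confusion_matrix, roc_curve, auc

    prob = glm.predict(test)                  # predicted spam probabilities
    pred = (prob > 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(test["V58"], pred).ravel()
    print("TN FP FN TP:", tn, fp, fn, tp, "total errors:", fp + fn)

    fpr, tpr, _ = roc_curve(test["V58"], prob)
    print("AUC:", auc(fpr, tpr))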

Figure 3.1 ROC Curves of Static Spam Filters

The point (0, 1) corresponds to an ideal prediction model, and a model with a larger area under the curve (AUC) is preferred. GLM3 and GLM4 clearly stand out; in particular, GLM4 has the best performance among all the static spam filters, while the main effect model GLM1 has the poorest accuracy. The curves of the other spam filters largely overlap. Overall, the static models improve after log transformation of the explanatory variables. For a clearer view of the static models built on transformed data, Figure 3.2 plots their ROC curves alone.

Figure 3.2 ROC Curves of Static Spam Filters with Transformed Data

In Figure 3.2, GLM4 is the most accurate among the spam filters trained on transformed data. GLM6 is slightly better than GLM2 thanks to the interaction terms, while the stepwise model GLM8 based on GLM6 makes little difference, since the ROC curves of these two models largely overlap. The ROC curves of the filters built on untransformed data support the same conclusions. Overall, GLM4 provides the best prediction accuracy, so it is proposed as the recommended static spam filter. The 95% confidence intervals of the coefficients of the first five principal components of GLM4 are given in Table 3.3; the complete list of coefficients and their 95% confidence intervals is given in Table 6.1 in the Appendix.

Table 3.3 Coefficients of the First Five Principal Components in GLM4

Term         Mean Coef.  2.50% Coef.  97.50% Coef.  % Change (Mean)  % Change (2.50%)  % Change (97.50%)
(Intercept)  -1.79824    -2.24328     -1.40168      -83.44098        -89.38904         -75.38163
Comp.1       -1.78614    -2.02100     -1.57839      -83.23943        -86.74775         -79.36923
Comp.2       -0.21760    -0.47661     0.01803       -19.55556        -37.91179         1.81924
Comp.3       -0.95678    -1.24605     -0.68441      -61.58736        -71.23623         -49.56123
Comp.4       0.81396     0.63019      1.01676       125.68319        87.79656          176.42271
Comp.5       0.22792     -0.03367     0.51549       25.59905         -3.31090          67.44535
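The percent change columns in Table 3.3 follow from the identity percent change = 100*(exp(beta) - 1); a quick check for the first principal component:

    import numpy as np

    beta, lo, hi = -1.78614, -2.02100, -1.57839  # Comp.1 row of Table 3.3
    print([round(100 * (np.exp(b) - 1), 2) for b in (beta, lo, hi)])
    # -> [-83.24, -86.75, -79.37], matching the table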

The first three columns of Table 3.3 give each coefficient and its 95% confidence interval; the last three columns show the corresponding percent changes in the odds. For example, the coefficient of the first principal component is -1.80, with a 95% confidence interval of [-2.02, -1.58]: for a one unit change in this variable, holding the others constant, the percent change in the odds is -83.44%, with a 95% confidence interval of [-86.75%, -79.37%]. Similarly, for the second component, the coefficient is -0.22, with a confidence interval of [-0.48, 0.02]; a one unit increase, holding the others constant, changes the odds by -19.56%, with a 95% confidence interval of [-37.91%, 1.82%]. For the fourth component, the coefficient is positive at 0.81, with a 95% confidence interval of [0.63, 1.02]; a one unit increase, holding the other components constant, increases the odds by 125.68%, with a confidence interval of [87.80%, 176.42%].

3.2 Time Series Filter Design

The candidate models from Section 2.2.2 include two AR models, for ham and spam, with orders 2 and 3, respectively. The diagnostic plot for AR-Ham is given in Figure 3.3.

Figure 3.3 Diagnostic Plots of AR-Ham

In the diagnostic plot of AR-Ham, the residuals do not show constant variance, and although no individual residual autocorrelation is significant, the p values of the Ljung-Box tests indicate that this model is not adequate. To improve the model, an MA term and differencing are considered, which can be combined into an ARIMA model; it turns out that a differencing order of 0 is preferred. The diagnostic plot for MA-Ham is displayed in Figure 3.4.

Figure 3.4 Diagnostic Plots of MA-Ham

In Figure 3.4, the residuals do not have constant variance, and the Ljung-Box p values show that this model is not sufficient. Therefore, ARMA-Ham is constructed, combining AR-Ham and MA-Ham. Its diagnostic plot is shown in Figure 3.5.

Figure 3.5 Diagnostic Plots of ARMA-Ham

In Figure 3.5, the residuals are random and their autocorrelation is insignificant. The p values of the Ljung-Box tests also indicate that this model is sufficient. To further examine the appropriateness of the orders in the ARMA-Ham model, an ARIMA model (ARIMA-Ham) is constructed by an automated method; it takes orders 1, 0, and 2 for the AR, difference, and MA components, respectively. The diagnostic plot for ARIMA-Ham is displayed in Figure 3.6.

Figure 3.6 Diagnostic Plots of ARIMA-Ham

From Figure 3.6, the residuals are random, and the p values of the Ljung-Box tests show that this model is adequate. The AR-Spam model described in Section 2.2.2 takes order 3, and its diagnostic plot is given in Figure 3.7.

Figure 3.7 Diagnostic Plots of AR-Spam

In Figure 3.7, the residuals show constant variance and lack of pattern, and the p values of the Ljung-Box tests indicate that this model is adequate; there is no significant correlation between residuals. Next, the MA-Spam model is constructed with order 1. Its diagnostic plot is displayed in Figure 3.8.

Figure 3.8 Diagnostic Plots of MA-Spam

In Figure 3.8, the residuals do not have constant variance, and the Ljung-Box p values show that this model is not sufficient. Therefore, ARMA-Spam is constructed, combining AR-Spam and MA-Spam. Its diagnostic plot is shown in Figure 3.9.

Figure 3.9 Diagnostic Plots of ARMA-Spam

In Figure 3.9, the residuals have relatively constant variance, and the Ljung-Box p values show that this model is sufficient; there is no significant correlation between residuals. To further examine the appropriateness of the orders in the ARMA-Spam model, an ARIMA model (ARIMA-Spam) is constructed by the automated method; it takes orders 1, 0, and 1 for the AR, difference, and MA components, respectively. The diagnostic plot for ARIMA-Spam is displayed in Figure 3.10.

Figure 3.10 Diagnostic Plots of ARIMA-Spam

From the diagnostic plot for ARIMA-Spam, the residuals are random and lack any pattern, and the p values of the Ljung-Box tests indicate that this model is adequate. Table 3.4 provides the AICs of all the models discussed in this section.

Table 3.4 AIC of Time Series Models

Model       AIC
AR-Ham      2617.38
MA-Ham      2640.39
ARMA-Ham    2606.42
ARIMA-Ham   2604.81
AR-Spam     2501.87
MA-Spam     2514.94
ARMA-Spam   2496.72
ARIMA-Spam  2491.98
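The AIC and Ljung-Box values for the ham models in Table 3.4 could be collected with a sketch like the following, continuing the sketches above; the start date and the Ljung-Box lag are assumptions:

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.stats.diagnostic import acorr_ljungbox

    # Weekday-dummy seasonality model for ham; its residuals feed the
    # AR/MA/ARMA/ARIMA models.
    df = pd.DataFrame({"count": train_ham.to_numpy()})
    df["day"] = pd.date_range("2000-01-13", periods=len(df)).day_name()
    res_ham = smf.ols("count ~ C(day)", data=df).fit().resid

    orders = {"AR-Ham": (2, 0, 0), "MA-Ham": (0, 0, 1),
              "ARMA-Ham": (2, 0, 1), "ARIMA-Ham": (1, 0, 2)}
    for name, order in orders.items():
        fit = ARIMA(res_ham, order=order).fit()
        lb_p = acorr_ljungbox(fit.resid, lags=[10])["lb_pvalue"].iloc[0]
        print(name, round(fit.aic, 2), round(lb_p, 4))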

According to the AIC scores, the two ARIMA models are the best and the MA models perform the worst; the AR models are better than the MA models, and the ARMA models, which combine the AR and MA terms, achieve relatively large improvements. Considering AIC alone, ARIMA-Ham and ARIMA-Spam would be the two recommended models. The test set method is used to evaluate the predictive performance of the time series models, with the last 7 observations held out for testing. The forecasting plots of AR-Ham, MA-Ham, ARMA-Ham, and ARIMA-Ham are shown in Figures 3.11 and 3.12.

Figure 3.11 Forecasting Plots of AR-Ham and MA-Ham

Figure 3.12 Forecasting Plots of ARMA-Ham and ARIMA-Ham

The black line is the number of hams received each day, and the blue line shows the predictions for the last 7 days; the blue shading is the 95% confidence interval of the prediction. For a clearer comparison, Figure 3.13 plots only the actual and predicted values on the test set.

Figure 3.13 Prediction Plot of Time Series Models for Ham

According to Figure 3.13, the predictions of ARIMA-Ham and ARMA-Ham largely overlap, as do those of AR-Ham and MA-Ham. ARMA-Ham and ARIMA-Ham are more accurate, since their predictions are closer to the actual observations. Table 3.5 provides the predictions of these models on the test set together with the mean squared error (MSE).

Table 3.5 Prediction Summary of Time Series Models for Ham

Model      1     2     3      4      5     6     7     MSE
Actual     0     0     0      0      0     0     1     --
AR-Ham     4.18  4.22  -0.11  0.31   4.66  5.73  5.54  15.81
MA-Ham     4.69  5.01  0.31   0.55   4.80  5.81  5.59  17.91
ARMA-Ham   3.07  2.88  -1.66  -1.28  3.11  4.24  4.13  8.51
ARIMA-Ham  3.07  2.91  -1.63  -1.24  3.16  4.30  4.20  8.68
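A sketch of the 7-day forecast and MSE computation behind Table 3.5, continuing the sketch above; adding the weekday means of the fitted seasonality model back to the residual forecasts is an assumption about how the report recombined the components:

    import numpy as np
    import pandas as pd

    arma_ham = ARIMA(res_ham, order=(2, 0, 1)).fit()
    resid_fc = arma_ham.forecast(steps=7)

    # Seasonal means for the 7 held-out days, from the weekday-dummy model.
    season_fit = smf.ols("count ~ C(day)", data=df).fit()
    test_days = pd.DataFrame(
        {"day": pd.date_range("2000-01-13", periods=len(df) + 7).day_name()[-7:]})
    pred = season_fit.predict(test_days).to_numpy() + np.asarray(resid_fc)

    mse = np.mean((test_ham.to_numpy() - pred) ** 2)
    print("ARMA-Ham 7-day MSE:", mse)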

According to Table 3.5, ARMA-Ham provides the best prediction in terms of MSE, slightly better than the ARIMA model, while AR-Ham and MA-Ham have larger errors. Considering that the AIC of ARMA-Ham is only slightly worse than that of ARIMA-Ham while its MSE is lower, ARMA-Ham is recommended as the preferred time series model for ham. Its coefficients and 95% confidence intervals are given in Table 3.6.

Table 3.6 Coefficients of Selected Time Series Model for Ham

Model Component  Term                  Coefficient  2.50% Coef.  97.50% Coef.
Ham.season       (Intercept)           5.013889     4.181983     5.845795
Ham.season       season.ham Monday     -0.21107     -1.39170     0.969556
Ham.season       season.ham Saturday   -4.70403     -5.88466     -3.52340
Ham.season       season.ham Sunday     -4.46459     -5.64522     -3.28397
Ham.season       season.ham Thursday   0.430556     -0.74594     1.607048
Ham.season       season.ham Tuesday    0.803013     -0.37762     1.983641
Ham.season       season.ham Wednesday  0.577660     -0.60297     1.758288
Ham.arma201      intercept             -0.07954     -0.93764     0.778561
Ham.arma201      ar1                   1.087639     0.934556     1.240722
Ham.arma201      ar2                   -0.15004     -0.26292     -0.03717
Ham.arma201      ma1                   -0.81073     -0.93099     -0.69046

The time series model for ham consists of seasonality and random fluctuation. In the seasonality model, the coefficient of Monday is -0.21, with a 95% confidence interval of [-1.39, 0.97]. The coefficient of Saturday is -4.70, with a confidence interval of [-5.88, -3.52], indicating that fewer hams are received at the email address on Saturdays; similarly, fewer hams are received on Sundays. On Tuesdays, Wednesdays, and Thursdays, by contrast, the coefficients are more likely to be positive, so more hams are received during weekdays. In the fluctuation model, the coefficient of ar1 is 1.09, with a 95% confidence interval of [0.93, 1.24], and the coefficient of ma1 is -0.81, with a confidence interval of [-0.93, -0.69].

The time series models for spam are also evaluated using the test set method. The forecasting plots of AR-Spam, MA-Spam, ARMA-Spam, and ARIMA-Spam are shown in Figures 3.14 and 3.15.

Figure 3.14 Forecasting Plots of AR-Spam and MA-Spam

Figure 3.15 Forecasting Plots of ARMA-Spam and ARIMA-Spam

The black line is the number of spams received each day, and the blue line shows the predictions for the last 7 days; the blue shading is the 95% confidence interval of the prediction. For a clearer comparison, Figure 3.16 plots only the actual and predicted values on the test set.

Figure 3.16 Prediction Plot of Time Series Models for Spam

According to Figure 3.16, the predictions of ARIMA-Spam and ARMA-Spam largely overlap, as do those of AR-Spam and MA-Spam. Table 3.7 provides the predictions of these models on the test set together with the MSE.

Table 3.7 Prediction Summary of Time Series Models for Spam

Model       1      2      3      4      5      6      7      MSE
Actual      5      0      0      50     54     58     13     --
AR-Spam     31.51  32.92  31.31  31.28  31.30  31.09  31.06  669.05
MA-Spam     30.34  30.83  30.85  30.87  30.89  30.91  30.93  642.86
ARMA-Spam   34.19  33.96  33.95  33.87  33.80  33.73  33.66  691.64
ARIMA-Spam  34.79  34.66  34.53  34.41  34.29  34.18  34.08  703.49

According to Table 3.7, MA-Spam provides the best prediction in terms of MSE. However, the AIC of MA-Spam is the worst, and the Ljung-Box test shows that this model is not adequate. Therefore, the ARMA-Spam model is recommended, since it is adequate according to the Ljung-Box test and has relatively small MSE and AIC. Its coefficients and 95% confidence intervals are given in Table 3.8.

Table 3.8 Coefficients of Selected Time Series Model for Spam

Model Component  Term         Coefficient  2.50% Coef.  97.50% Coef.
spam.trend       (Intercept)  23.99070     22.25844     25.72295
spam.trend       time.spam    0.019065     0.010678     0.027452
spam.arma301     ar1          1.034909     0.901712     1.168106
spam.arma301     ar2          -0.03878     -0.18853     0.110963
spam.arma301     ar3          -0.02538     -0.13713     0.086378
spam.arma301     ma1          -0.89701     -0.97999     -0.81402
spam.arma301     intercept    0.075259     -2.63591     2.786431

The time series model for spam consists of trend and random fluctuation. In the trend model, the coefficient of the time component is 0.02, with a 95% confidence interval of [0.01, 0.03]. In the fluctuation model, the coefficient of ar1 is 1.03, with a 95% confidence interval of [0.90, 1.17]; the coefficient of ar2 is -0.04, with a confidence interval of [-0.19, 0.11]; and the coefficient of ma1 is -0.89, with a confidence interval of [-0.98, -0.81].

Section 1.4 proposed two hypotheses for this project. The first is that word frequency, the appearance of capital letters, and the frequency of some characters are predictive of spam. The biplot in Figure 2.8 shows the variances of the variables on the first two components, which are two of the predictors in the selected static model GLM4. Some word frequency variables, e.g., V25 and V40, show large variance on components 1 and 2; character frequency variables such as V52 have larger variance on component 1; and capital letter variables such as V56 and V57 show large variance on the first two components. Hence, word frequency, character frequency, and capital letter variables are predictive of spam, and the first hypothesis is supported. The second hypothesis proposes that the time component has some impact on the number of spam emails arriving at the email address. In the selected time series models, there is a trend in spam and seasonality in ham, so the time component does influence the counts. Hence, both hypotheses are supported.

4. Recommendation

Two types of spam filters are built in this project: a static generalized linear model and time series filters. The selected static model, GLM4, is constructed using PCA regression on the spam data after log transformation of the explanatory variables. Word frequency, character frequency, and capital letter related variables are predictive of spam. The performance of this model is significantly better than that of all the other candidate static models described in Section 2.2.1: its AIC is the smallest in Table 3.1, it has the best ROC curve in Figure 3.2, and on the test set it yields the fewest total errors and FPs, 91 and 35, respectively, according to Table 3.2.

In the time series models, the seasonality and random fluctuation of ham are modeled to predict the number of hams received at the email address on a particular day, and the number of spams is modeled through trend and random fluctuation. ARMA-Ham and ARMA-Spam are recommended because of their good prediction accuracy on the test sets: the selected ham model achieves an MSE of 8.51, and the recommended spam model an MSE of 691.64. They also have relatively small AICs and prove to be adequate according to the Ljung-Box tests. These two models are therefore selected to predict the daily email counts, from which the probability of receiving spam on a particular day can be derived, so they can serve as a spam filter. To further improve performance, the recommended static model and the selected ham and spam time series models can be combined using Bayes rule, as shown in Equation 2.1: the probability that a particular email is spam is computed from the prediction of the static model and the probability of receiving spam on that day, which is available from the two time series models.

5. References

[1] C. Thomas, "Spam Costs Billions," InformationWeek, February 2005. [Online]. Available: http://www.informationweek.com/spam-costs-billions/59300834.
[2] D. E. Brown and L. Barnes, "Project 2: Spam Filters," October 10, 2013, assignment in class SYS 4021.
[3] D. E. Brown and L. Barnes, "Project 2 Template," October 10, 2013, assignment in class SYS 4021.
[4] UCI Machine Learning Repository, "Spambase Data Set," July 1999. [Online]. Available: http://www.ics.uci.edu/~mlearn/mlrepository.html
[5] D. E. Brown and L. Barnes, "Spam and Ham Data," October 10, 2013, assignment in class SYS 4021.

6. Appendix

Figure 6.1 Scatter Plot Matrix of V21-V30 and V58

Figure 6.2 Scatter Plot Matrix of V31-V40 and V58

Figure 6.3 Scatter Plot Matrix of V41-V48 and V58

Figure 6.4 Scatter Plot Matrix of V49-V54 and V58

Figure 6.5 Scatter Plot Matrix of V11-V20 and V58 after Log Transformation

Figure 6.6 Scatter Plot Matrix of V21-V30 and V58 after Log Transformation

Figure 6.7 Scatter Plot Matrix of V31-V40 and V58 after Log Transformation

Figure 6.8 Scatter Plot Matrix of V41-V49 and V58 after Log Transformation

Figure 6.9 Scatter Plot Matrix of V49-V54 and V58 after Log Transformation

Figure 6.10 Scatter Plot Matrix of V55-V57 and V58 after Log Transformation

Figure 6.11 Factor Plots of V10-V18

Figure 6.12 Factor Plots of V19-V27

Figure 6.13 Factor Plots of V28-V36

Figure 6.14 Factor Plots of V37-V45

Figure 6.15 Factor Plots of V46-V54

Figure 6.16 Factor Plots of V55-V57

Table 6.1 Coefficient List of GLM4

Term         Mean Coef.  2.50% Coef.  97.50% Coef.  % Change (Mean)  % Change (2.50%)  % Change (97.50%)
(Intercept)  -1.79824    -2.24328     -1.40168      -83.44098        -89.38904         -75.38163
Comp.1       -1.78614    -2.02100     -1.57839      -83.23943        -86.74775         -79.36923
Comp.2       -0.21760    -0.47661     0.01803       -19.55556        -37.91179         1.81924
Comp.3       -0.95678    -1.24605     -0.68441      -61.58736        -71.23623         -49.56123
Comp.4       0.81396     0.63019      1.01676       125.68319        87.79656          176.42271
Comp.5       0.22792     -0.03367     0.51549       25.59905         -3.31090          67.44535
Comp.6       0.35396     0.13807      0.57378       42.46954         14.80515          77.49555
Comp.7       -0.15910    -0.46181     0.11676       -14.70901        -36.98551         12.38468
Comp.8       0.03079     -0.24428     0.28600       3.12713          -21.67320         33.10873
Comp.9       -0.91602    -1.14970     -0.70221      -59.98918        -68.32686         -50.45110
Comp.10      -0.33351    -0.53508     -0.13165      -28.35970        -41.43775         -12.33527
Comp.11      0.02053     -0.22645     0.27510       2.07443          -20.26398         31.66599
Comp.12      -0.32404    -0.56553     -0.07837      -27.67801        -43.19408         -7.53798
Comp.13      -0.10720    -0.33980     0.11227       -10.16525        -28.80869         11.88167
Comp.14      0.04834     -0.17748     0.24370       4.95230          -16.26197         27.59557
Comp.15      0.01740     -0.26680     0.31833       1.75540          -23.41757         37.48273
Comp.16      0.40564     0.18839      0.62591       50.02633         20.73063          86.99527
Comp.17      -0.21334    -0.48248     0.05602       -19.21194        -38.27472         5.76136
Comp.18      0.27075     0.02498      0.52087       31.09480         2.52909           68.34951
Comp.19      -0.00454    -0.26012     0.27448       -0.45252         -22.90375         31.58528
Comp.20      0.33401     0.09827      0.58896       39.65569         10.32557          80.21102
Comp.21      0.44757     0.16078      0.75836       56.45093         17.44318          113.47638
Comp.22      0.10072     -0.13522     0.33941       10.59668         -12.64735         40.41132
Comp.23      -0.02328    -0.27874     0.22131       -2.30152         -24.32641         24.77075
Comp.24      -0.13834    -0.37121     0.09156       -12.91967        -31.00980         9.58873
Comp.25      0.04517     -0.17188     0.26170       4.62036          -15.79169         29.91426
Comp.26      -0.43951    -0.65473     -0.22788      -35.56487        -48.04182         -20.37840
Comp.27      -0.65878    -0.98165     -0.36225      -48.25169        -62.53061         -30.38911
Comp.28      0.04372     -0.22493     0.32040       4.46860          -20.14265         37.76723
Comp.29      0.29628     0.06770      0.52873       34.48446         7.00462           69.67823
Comp.30      -0.20335    -0.42612     0.01858       -18.40104        -34.69590         1.87506
Comp.31      0.74160     0.50708      0.98570       109.92838        66.04304          167.96869
Comp.32      0.04777     -0.24610     0.32707       4.89255          -21.81526         38.68945
Comp.33      -0.06047    -0.30397     0.18491       -5.86802         -26.21156         20.31120
Comp.34      0.38222     0.09919      0.68581       46.55387         10.42780          98.53860
Comp.35      0.96650     0.72937      1.21300       162.87199        107.37681         236.35491
Comp.36      -0.66101    -0.91859     -0.40823      -48.36702        -60.09203         -33.51716
Comp.37      0.27964     -0.05336     0.64943       32.26578         -5.19601          91.44403
Comp.38      0.05386     -0.21960     0.32914       5.53329          -19.71601         38.97790
Comp.39      -0.12687    -0.40383     0.14684       -11.91486        -33.22422         15.81741