A Proposal of Regression Hybrid Modeling for Combining Random Forest and X-Means Methods

Total Quality Science Vol, No

Yuma Ueno*, Yasushi Nagata
Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
*Contact author's address: uen-yum@tokiwasedajp

Abstract: To derive useful information from complicated data, many hybrid modeling strategies that combine nonparametric and parametric methods have been proposed. In this study, we propose a new hybrid modeling strategy that combines the random forest and x-means methods using linear regression analysis. We refer to this strategy as XR regression. The study has three purposes: to improve the performance of hybrid modeling by using the random forest method, to determine optimal classes automatically using the x-means method, and to compare the prediction accuracy of this method with that of other existing methods. To determine the characteristics of XR regression, we compare its prediction accuracy with that of the existing methods using Monte Carlo simulations. The simulation results show that XR regression performs well in all of the simulated settings, especially for data sets that include interaction effects.

Keywords: Parametric model, linear regression analysis, interaction, tree topology, error dispersion

1 Introduction

Linear regression analysis is widely used as a tool for data analysis, in particular to grasp data structures and to make predictions. Here, linear regression analysis is regarded as a parametric method in a broad sense because its model assumes a specific distribution. However, when data become large and complicated, a parametric method alone does not suffice to obtain all useful information. A nonparametric method, which does not assume a specific distribution, then becomes necessary. Nonparametric methods have disadvantages of their own, such as overlearning, so they cannot yield all useful information either. For this reason, previous studies proposed the semi-parametric method (Robinson (1988), Sakamoto and Shirahata (1996)) and the hybrid model (Kadowaki et al. (a, b)). The semi-parametric method assumes a specific distribution for part of the model, and the hybrid model combines a nonparametric method with a parametric method.

A hybrid model using classification and regression tree (CART) analysis was proposed in previous studies (Kadowaki et al. (a, b)); we call this model the Kadowaki hybrid model (Kadowaki HM). However, other combinations of machine learning methods were not considered in those studies, so we believe it is possible to propose a new hybrid model with higher predictability. As pilot studies, we investigated the performance of several hybrid models that combined cluster analysis, the k-means method, and the x-means method with machine learning methods such as the random forest method, the support vector machine (SVM), and the neural network. Because the hybrid model that combines the x-means method and the random forest method had the highest performance, we propose this combination in this study.

[DOI:99/tqs] Copyright Journal of the Japanese Society for Quality Control. All rights reserved.

Furthermore, we evaluate the performance of the proposed hybrid model quantitatively using Monte Carlo simulations.

The structure of this paper is as follows. In Section 2, we explain the random forest and x-means methods. In Section 3, we describe the proposed hybrid method. In Section 4, we compare the accuracy of the proposed method with that of the previous study using a real data set. In Section 5, we conduct a simulation study to evaluate the performance of the proposed method. In Section 6, we give conclusions.

2 The x-means method and the random forest method

2.1 The x-means method

The x-means method was named by Pelleg and Moore (2000) and is a method that determines the number of clusters automatically. It starts from a sufficiently small number of clusters and then repeatedly divides each cluster in two for as long as the division is judged to be suitable. In this study, we use the improved x-means method proposed by Ishioka (2000, 2006). This method proceeds as follows:

1. Determine an initial parameter k0 (the default value is 2) for a sufficiently small number of clusters.
2. Apply the k-means method with k = k0 (here, k denotes the number of clusters). This divides the whole data set; let the clusters after the division be C1, C2, ..., Ck0.
3. Repeat procedures 4 and 5 for i = 1, 2, ..., k0.
4. Apply the k-means method to the cluster Ci with k = 2. Let the clusters after the division be Ci(1) and Ci(2).
5. Compare the Bayesian information criterion after the division (BIC') with the same criterion before the division (BIC). Divide the cluster if BIC > BIC', and stop dividing it if not.
6. Finish when there is no cluster left to divide further.

2.2 The random forest method

The random forest method is a machine learning method that repeatedly constructs decision trees using different bootstrap samples from the data. The algorithm is as follows (Liaw and Wiener (2002)):

1. Draw bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree with the following modification: at each node, rather than choosing the best split among all predictors, select a random sample of the predictors and choose the best split from among these variables.
3. Predict new data by aggregating the predictions of the trees.

The random forest method has been applied in various areas. For example, Ishioka () applied it to a national test, and Niitsuma and Saito (2009) applied it to music classification.

3 Proposed hybrid model in this study (XR regression)

In this study, we propose a hybrid model called XR regression. We determine several classes in the learning data automatically using the x-means method. Using the random forest method, we identify the class to which each data set for prediction belongs. Then, we add class dummy variables as explanatory variables and execute a linear regression analysis. XR regression is intended to enhance prediction accuracy. We assume that a learning data set is at hand and that each of the data sets for prediction is predicted using the learning data set.

The detailed procedures are as follows. Assume that a learning data set with p items and n samples exists. Let the explanatory variables be xi (i = 1, ..., p) and the objective variable be y.

Procedure 1: Divide the learning data set into q classes Cj (j = 1, ..., q) using the x-means method.
Procedure 2: Add the class labels Cj (j = 1, ..., q) to the learning data set as dummy variables.

Procedure 3: Estimate the class of each data set for prediction using the random forest method, and add the estimated classes to the data sets for prediction as dummy variables.
Procedure 4: Construct regression model (3.1), which includes the dummy variables described in Procedure 2:

$$y = \beta_1 x_1 + \cdots + \beta_p x_p + \gamma_1 C_1 + \cdots + \gamma_q C_q + \varepsilon \qquad (3.1)$$

where $\beta_i$ (i = 1, ..., p) are the regression coefficients for the existing explanatory variables, $\gamma_j$ (j = 1, ..., q) are the regression coefficients for the dummy variables of the classes, and $\varepsilon$ is the error term.
Procedure 5: Apply the data sets for prediction to the estimated regression model obtained in Procedure 4 and predict them.

4 Real data analysis

4.1 Analytical procedure

In this section, we analyze real data and compare the prediction accuracy of the proposed method with that of the previous methods. The existing methods compared in this section are linear regression and the Kadowaki HM. The number of repetitions is , . We use the average absolute error, which we call the prediction error (PE) here, as the evaluation index:

$$\mathrm{PE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (4.1)$$

4.2 Boston housing price data

We use housing price data for Boston. These data are included in the MASS package of the statistical analysis software R. The Boston data set has non-linear and interaction structures and includes 506 samples and 14 variables. We divide the 506 samples into two groups of equal size at random, and we use one group as the learning data set and the other as the data set for prediction. The data frame contains the following variables:

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
indus: Proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: Nitrogen oxides concentration (parts per 10 million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built prior to 1940
dis: Weighted mean of distances to five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property-tax rate per $10,000
ptratio: Pupil-teacher ratio by town
black: The proportion of black residents by town
lstat: Percentage of the population that is lower status
medv: Median value of owner-occupied homes in $1000s
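The five procedures of XR regression and the PE index described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn offers no x-means, so ordinary k-means with a fixed number of classes `q` stands in for Procedure 1; the clustering is assumed to use the explanatory variables only; the regression is fitted with one dummy per class and no separate intercept; and the synthetic data, the function names, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression


def class_dummies(labels, q):
    """Encode integer class labels 0..q-1 as an (n, q) 0/1 dummy matrix."""
    return np.eye(q)[labels]


def xr_regression_sketch(X_learn, y_learn, X_pred, q=3, seed=0):
    """Return predictions for X_pred from an XR-regression-style pipeline."""
    # Procedure 1 (ASSUMPTION): k-means with a fixed q stands in for x-means,
    # and the clustering uses the explanatory variables only.
    km = KMeans(n_clusters=q, n_init=10, random_state=seed).fit(X_learn)
    classes_learn = km.labels_

    # Procedure 3: a random forest learns the class labels and estimates
    # the class of every observation in the prediction set.
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_learn, classes_learn)
    classes_pred = rf.predict(X_pred)

    # Procedures 2 and 4: append the class dummies and fit a linear regression.
    # ASSUMPTION: one dummy per class and no separate intercept, which keeps
    # the design matrix free of exact collinearity.
    Z_learn = np.hstack([X_learn, class_dummies(classes_learn, q)])
    Z_pred = np.hstack([X_pred, class_dummies(classes_pred, q)])
    model = LinearRegression(fit_intercept=False).fit(Z_learn, y_learn)

    # Procedure 5: predict the objective variable for the prediction set.
    return model.predict(Z_pred)


def prediction_error(y_true, y_hat):
    """Average absolute error (PE) between observed and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_hat))))


# Synthetic data (hypothetical, not from the paper): linear signal plus a jump.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 5))
y = X.sum(axis=1) + 2.0 * (X[:, 0] > 0.5) + rng.normal(0.0, 0.1, size=400)
X_learn, X_pred, y_learn, y_pred_true = X[:200], X[200:], y[:200], y[200:]

y_hat = xr_regression_sketch(X_learn, y_learn, X_pred)
pe = prediction_error(y_pred_true, y_hat)
print(f"PE = {pe:.3f}")
```

To mirror the random half-and-half split of the Boston data, the rows would be shuffled before splitting and the whole run repeated, averaging the PE over the repetitions.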

Figure 1 Accuracy comparison for Boston data

Figure 1 shows the accuracies of the three methods for the Boston data. XR regression has better accuracy than linear regression and the Kadowaki HM. However, it cannot be inferred that the difference is meaningful. Hence, we conducted Monte Carlo simulations to confirm the kinds of data features for which the proposed method performs well.

5 Performance evaluation of the hybrid model by simulation

5.1 Outline of the simulations

We conducted Monte Carlo simulations to examine what kinds of data features the proposed hybrid model is best suited for. To produce data for the simulations, we added the tree topology structure and the interaction structure to each of the linear and non-linear models, and we changed the error dispersion. The methods compared were linear regression, the Kadowaki HM, and XR regression. The detailed settings of the simulation study were as follows:

1. The number of simulations was set to be , .
2. The sample size was .
3. We assumed that the error term followed N( , ).
4. We used the average absolute error (PE) as the evaluation index.

5.2 Linear model

First, we added the tree topology structure, the interaction structure, and the change in the error dispersion to the linear model to produce data and compare the accuracy.

5.2.1 Linear model data with a tree topology structure

We executed this simulation based on linear model data with a tree topology structure. We produced the data according to formula (5.1). The number of explanatory variables was five, and we assumed that all of them followed the uniform distribution U( , ). To break the linear structure, we used function (5.2), which adds the complexity of branching with reference to a function called f(tree) used by Miyataka (); a tree topology model has the feature that it treats variables as non-continuous.

[Equation (5.1): linear model in x1, ..., x5 plus f(tree); coefficients not recoverable]

[Equation (5.2): piecewise definition of f(tree) with eight cases expressed in terms of treevalue; case conditions not recoverable]

Figure 2 shows the simulation results for the linear model plus the tree topology structure. The number of clusters is four in the XR regression. Figure 2 shows that the PE of XR regression is the best. These simulation results indicate that XR regression, whose cluster analysis uses all variables at the same time, grasps the tree topology structure better than CART, which uses one variable at a time.

Figure 2 Accuracy comparison under the linear model + the tree topology structure

5.2.2 Linear model data with an interaction structure

We executed this simulation based on linear model data with an interaction structure. We produced data according to formula (5.3). There were five explanatory variables, and all of them were quantitative. We allotted standard values called a1 and a2, which each took values of , to x1 and x2. Then, a1 and a2 affected y according to rule (5.4). To produce the interaction, we used a function called g(interaction), which Miyataka () used. Interactionvalue was a fixed number, and we changed it from to . x3, x4, and x5 were quantitative variables that followed the uniform distribution U( , ), and x1 and x2 were quantitative variables that followed the normal distributions described in Table 1.

[Equation (5.3): linear model in x1, ..., x5 plus g(interaction); coefficients not recoverable]

[Equation (5.4): g(interaction) takes one value in terms of interactionvalue when a condition on a1 and a2 holds and another otherwise; condition not recoverable]

Table 1 Distribution that each standard value a1 and a2 follows
[Table 1: four rows, each assigning a normal distribution N( , ) to a value of a1 or a2; parameter values not recoverable]

Figure 3 shows the simulation results for the linear model plus the interaction structure. The number of clusters is two in the XR regression. The PE of XR regression is the best. The interaction between explanatory variables cannot be detected well by the Kadowaki HM using CART, but it can be detected well by XR regression using clustering. We think this is because it is hard for CART to detect an interaction using only one variable at a time, whereas cluster analysis uses all of the variables and detects the interaction easily. The PE suddenly decreases from a certain point, and it can be said that the larger the interaction, the greater the usefulness of XR regression.

Figure 3 Accuracy comparison under the linear model + the interaction structure
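A Monte Carlo comparison of this kind can be sketched as follows. The sketch only shows the shape of the experiment: the data generator, the interaction rule, the distribution parameters, the sample size, and the number of repetitions are all hypothetical stand-ins rather than the paper's settings, and only ordinary linear regression is evaluated. Other methods, such as XR regression or the Kadowaki HM, would be plugged into the same loop.

```python
import numpy as np


def generate_data(n, interaction_value, rng):
    """Hypothetical generator: linear signal plus an interaction term."""
    # ASSUMPTION: a1 and a2 are latent 0/1 standard values that shift x1, x2.
    a = rng.integers(0, 2, size=(n, 2))
    x12 = rng.normal(loc=a, scale=0.5)           # x1, x2 depend on a1, a2
    x345 = rng.uniform(0.0, 1.0, size=(n, 3))    # x3, x4, x5 are uniform
    X = np.hstack([x12, x345])
    # ASSUMPTION: the interaction acts only when a1 = a2 = 1.
    g = interaction_value * (a[:, 0] * a[:, 1])
    y = X.sum(axis=1) + g + rng.normal(0.0, 1.0, size=n)
    return X, y


def linear_regression_pe(X_learn, y_learn, X_pred, y_pred):
    """Fit ordinary least squares and return the average absolute error."""
    A_learn = np.column_stack([np.ones(len(X_learn)), X_learn])
    A_pred = np.column_stack([np.ones(len(X_pred)), X_pred])
    coef, *_ = np.linalg.lstsq(A_learn, y_learn, rcond=None)
    return float(np.mean(np.abs(y_pred - A_pred @ coef)))


def monte_carlo_pe(interaction_value, n=200, repetitions=50, seed=0):
    """Average the PE of linear regression over repeated simulated data sets."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repetitions):
        # Draw 2n samples: the first n for learning, the last n for prediction.
        X, y = generate_data(2 * n, interaction_value, rng)
        errors.append(linear_regression_pe(X[:n], y[:n], X[n:], y[n:]))
    return float(np.mean(errors))


for value in (0.0, 2.0, 4.0):
    print(value, round(monte_carlo_pe(value), 3))
```

Under these assumptions the PE of plain linear regression grows with the interaction value, because a purely additive model cannot represent the a1-by-a2 effect. This is the gap that the class dummies of a hybrid model are meant to close.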

5.2.3 Linear model data with changes in the error dispersion

The influence of the error dispersion is sometimes large in real data. In this subsection, we therefore varied the error variance in the simulation to check how the error dispersion influences accuracy. We produced data according to formula (5.5). The number of explanatory variables was five, and all of them followed the uniform distribution U( , ). Up to this point we had assumed ε ~ N( , ) for the error term, but in this subsection we assumed ε ~ N( , ). We changed the error value, which means the value of the error dispersion σ, from to , and ran the simulation for each value.

[Equation (5.5): linear model in x1, ..., x5 with the error term; coefficients not recoverable]

Figure 4 Accuracy comparison under the linear model + error change

Figure 4 shows the simulation results for the linear model with changes in the error. The number of clusters is six in the XR regression. The accuracy of XR regression is the highest in a linear model with a large error dispersion. The accuracy of the Kadowaki HM is lower than that of linear regression, so the influence of the error dispersion is large for the Kadowaki HM.

5.3 Non-linear model

Real data are rarely based on a perfectly linear model; most data include some non-linear structure. In this section, we therefore assumed a multiplicative expression as the non-linear model, added the tree topology structure and the interaction structure to it, and changed the error dispersion to produce data for simulation.

5.3.1 Non-linear model data with a tree topology structure

We executed this simulation based on non-linear model data with a tree topology structure. We produced data according to formula (5.6). The number of explanatory variables was five, and all of them followed the uniform distribution U( , ). We used function (5.2) as f(tree).

[Equation (5.6): multiplicative model in x1, ..., x5 plus f(tree); exact form not recoverable]

Figure 5 Accuracy comparison under the non-linear model + the tree topology structure

Figure 5 shows the simulation results for the non-linear model with the tree topology structure. The number of clusters is nine in the XR regression. The PE of XR regression is the best. The Kadowaki HM, on the other hand, could not detect the tree topology most of the time. It can be said that XR regression copes well with the influence of the tree topology through clustering. The tree topology structure becomes more difficult to grasp in the case of a non-linear model, and the accuracy becomes worse. However, the accuracy is relatively stable for XR regression, which uses all variables.

5.3.2 Non-linear model data with an interaction structure

We executed this simulation based on non-linear model data with an interaction structure. We produced data according to formula (5.7). There were five explanatory variables, and all of them were quantitative. We used function (5.4) as g(interaction). x3, x4, and x5 were quantitative variables that followed the uniform distribution U( , ), and x1 and x2 were quantitative variables as described in Table 1.

[Equation (5.7): multiplicative model in x1, ..., x5 plus g(interaction); exact form not recoverable]

Figure 6 Accuracy comparison under the non-linear model + the interaction structure

Figure 6 shows the simulation results for the non-linear model with the interaction structure. The number of clusters is four in the XR regression. Because the influence of the interaction is small until interactionvalue becomes , the interaction effect cannot be detected well; thus, the Kadowaki HM mostly maintains the best accuracy. However, the influence of the interaction grows larger after the interaction value exceeds , and the accuracies of all of the methods except XR regression become worse. Only XR regression maintains good accuracy. That is, the Kadowaki HM is effective when the interaction value is small, and XR regression is effective when the interaction value is large. XR regression can detect the influence of the interaction. However, for non-linear models in which the value of a variable itself varies greatly without any interaction, CART, which uses one variable at a time, detects the structure better.

5.3.3 Non-linear model data with changes in the error dispersion

This time, we produced data according to formula (5.8). The number of explanatory variables was five, and all of them followed the uniform distribution U( , ).

[Equation (5.8): multiplicative model in x1, ..., x5 with the error term; exact form not recoverable]

Figure 7 shows the simulation results for the non-linear model with error changes. The number of clusters is three in the XR regression. The accuracy of XR regression is the highest even for a non-linear model with a large error dispersion. For the Kadowaki HM, the accuracy is low, similar to the result for the linear model.

Figure 7 Accuracy comparison under the non-linear model + error change

6 Conclusion

We proposed a new hybrid model that combines the random forest and x-means methods. First, in order to verify the accuracy of the proposed method, we used the Boston housing price data. The Boston data have non-linear and interaction structures, and the accuracy of XR regression was slightly better than that of the Kadowaki HM. We then conducted Monte Carlo simulations to verify for which kinds of data features XR regression performs well. When the influence of an interaction was small in a non-linear model, the Kadowaki HM showed good accuracy. However, the Kadowaki HM was not so effective in the other simulation settings. XR regression, on the other hand, maintained good accuracy in basically all situations, and we found it to be a well-balanced method overall.

There are three future challenges. First, because other methods for automatically determining the number of clusters exist besides the x-means method that we used for XR regression, we should also compare the accuracy obtained with those methods. Second, in this study, we executed the simulations only for particular data in the linear and non-linear models. Thus, we should execute other simulations using data with correlations between explanatory variables or data with many variables. Third, there might be new discoveries if we verified what happens to the hybrid effect when using a regression method other than linear regression analysis.

References

Ishioka, T. (2000), Extended K-means with an Efficient Estimation of the Number of Clusters, Japanese Journal of Applied Statistics, Vol. 29, No. 3, pp. 141-149
Ishioka, T. (2006), An Expansion of X-means -Progressive Iteration of K-means and Merging of the Clusters-, Japanese Society of Computational Statistics, Vol8, No, pp-
Ishioka, T. (), Data Imputation by Random Forest - The Principle and Its Application for National Center Test in Japan, Japanese Journal of Applied Statistics, Vol4, No, pp9-9
Kadowaki, T., Suzuki, N., Suzuki, T. and Otaki, A. (a), Application of Hybrid Modeling to POS Data Analysis, Japanese Journal of Quality, Vol, No4, pp9-
Kadowaki, T. and Otaki, A. (b), Application of Hybrid Modeling to Air Quality Data by Combining CART Analysis with Regression Model, Memoirs of the Institute of Science and Technology, Meiji University, Vol9, No9, pp69-
Liaw, A. and Wiener, M. (2002), Classification and Regression by randomForest, R News, ISSN 1609-3631
Miyataka, T. (), Study about a Hybrid Model Combined a Regression Model and a Tree Topology Model, Master's thesis, Graduate School, Waseda University
Niitsuma, M. and Saito, H. (2009), Music Genre Classification Using Random Forest, Information Processing Society of Japan, Vol, No, pp9-9
Pelleg, D. and Moore, A. (2000), X-means: Extending K-means with Efficient Estimation of the Number of Clusters, ICML
Robinson, P. M. (1988), Root-N-Consistent Semiparametric Regression, Econometrica, Vol. 56, No. 4, pp. 931-954
Sakamoto, W. and Shirahata, S. (1996), Spline Smoothing on Semiparametric Regression Problem, Japanese Society of Computational Statistics, Vol9, No, pp-

Acknowledgement: We would like to thank the anonymous referees for their valuable comments. This work was partly supported by JSPS Grants-in-Aid for Scientific Research, Grant Number K6.

Authors' biographical notes

Yuma Ueno is a graduate student in the Department of Industrial and Management System Engineering of the Graduate School of Creative Science and Engineering at Waseda University.

Yasushi Nagata is a professor in the Department of Industrial and Management System Engineering of the School of Creative Science and Engineering at Waseda University.

Received: March, 6 Revised: November, 6 Accepted: March,
