Automated Selection of Training Data and Base Models for Data Stream Mining Using Naïve Bayes Ensemble Classification

Size: px

Start display at page:

Download "Automated Selection of Training Data and Base Models for Data Stream Mining Using Naïve Bayes Ensemble Classification"

Alexander Williamson
6 years ago
Views:

1 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. Automated Selecton of Tranng Data and Base Models for Data Stream Mnng Usng Naïve Bayes Ensemble Classfcaton Patrca E.N. Lutu Abstract A data stream s a contnuous, tme-ordered sequence of data tems. Data stream mnng s the process of applyng data mnng methods to a data stream n real-tme n order to create descrptve or predctve models for the process that generates the data stream. The characterstcs of a data stream typcally change wth tme. These changes gve rse to the need to contnuously make revsons to, or completely rebuld predctve models when the changes are detected. Rebuldng and changng the predctve models needs to be a fast process because stream data may arrve at a hgh speed. Manual labelng of tranng data before creatng new models may not be able to cope wth the speed at whch model revson needs to be performed. Ths paper presents expermental results for the performance of methods for the automated selecton of hgh qualty tranng data for predctve model revson for predctve modelng usng Naïve Bayes ensemble classfcaton. The expermental results demonstrate that the use of two base model combnaton algorthms for the ensembles, results n a hgh level of confdence n the predctons. Secondly, the perodc revson of the ensemble usng the base models created from the automatcally selected tranng data produces revsed ensemble models wth hgh predctve performance. Index Terms data mnng, data stream mnng, Naïve Bayes classfcaton, ensemble classfcaton, nstance labelng A I. INTRODUCTION data stream s a contnuous, tme-ordered sequence of data tems [1]. Data stream mnng s the process of applyng data mnng methods to a data stream n real-tme n order to create descrptve or predctve models for the process that generates the data stream [1], [2], [3]. The characterstcs of a data stream typcally change wth tme. The changes are categorsed as dstrbuton changes and concept changes [2], [4]. These changes gve rse to the need to contnuously make revsons to, or completely rebuld predctve models when the changes are detected. Rebuldng and changng the predctve models needs to be a fast process because stream data may arrve at a hgh speed. Tradtonal methods of manually labelng tranng data before creatng new models may not be able to cope wth the speed at whch model revson needs to be performed. Lutu [5] has proposed a predctve modelng framework Manuscrpt receved February 03, 2017; revsed March 06, Patrca. E. N. Lutu s a Senor Lecturer n the Department of Computer Scence, Unversty of Pretora, Pretora 0002, Republc of South Afrca, phone: ; fax: ; e-mal: Patrca.Lutu@up.ac.za; web: ISSN: (Prnt); ISSN: (Onlne) for automated nstance labelng, n stream mnng. The purpose of ths paper s to report on extensons of the work reported by Lutu [5]. Expermental studes are reported on methods for the perodc revson of classfcaton ensemble models n order to ensure that the stream nstances that are classfed by ensemble models are assgned class labels wth a hgh level of confdence. Ths s necessary n order to ensure that hgh qualty automatcally labeled and selected tranng data s used for the creaton of new base models whch can be used for ensemble revson. The expermental results demonstrate that the use of two base model combnaton algorthms for the ensembles, results n a hgh level of confdence n the predctons. Secondly, the perodc revson of the ensemble usng the base models created from the automatcally labeled and selected tranng data produces revsed ensemble models wth hgh predctve performance. The rest of the paper s organsed as follows: Secton II provdes background to the reported research. Secton III presents the proposed ensemble model revson approach. Sectons IV and V respectvely provde the expermental methods and experment results. Secton VI concludes the paper. II. BACKGROUND A. Modelng actvtes for predctve data stream mnng Predctve data stream mnng nvolves the creaton of classfcaton or regresson models. For classfcaton modelng, tranng data wth class labels s used to create the model. A real-lfe data stream does not have a fnte sze or statc behavour. Ths means that a data stream cannot be stored n ts entrety, and the nstance classes cannot be correctly predcted by a model that s created as a once-off actvty. Ths makes t necessary to contnuously or perodcally revse the predctve model usng tranng data that s reasonably recent. Two maor approaches that are used for predctve data stream mnng are the sldng wndow approach and the ensemble approach. An ensemble model conssts of several base models whose predctons are combned nto one fnal predcton usng a combnaton algorthm [6]. The ntal modelng actvtes for ether approach nvolve selectng tranng data nstances, selectng relevant predctve features, creatng the ntal model, and testng the model. Thereafter, the model s used to provde predctons for data stream nstances as they arrve, and the predctve performance s montored to determne when the model should be revsed or

2 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. replaced n ts entrety. Data streams typcally exhbt the phenomena of concept change and dstrbuton change. Concept change refers to changes n the descrpton of the class varable values and may be classfed as concept drft whch s a gradual change of the concept, or concept shft whch s a more abrupt change of the concept [1], [2], [3], [7]. Dstrbuton change, on the other hand, refers to changes n the data dstrbuton of the data stream. Concept and dstrbuton change strateges typcally nvolve contnuous montorng of model performance and data characterstcs [7]. For predctve data stream mnng, concept drft and dstrbuton changes are typcally handled through contnuous or perodc revson of the predctve model. Concept shft (sudden concept change) s handled through complete replacement of the current model. A parallel actvty to model usage and montorng s the labelng of new tranng data whch s then used to test the model performance, and to create new models for model revson or model change. Masud et. al [3], Zhu et. al [8] and Zhang et. al [9] have all observed that, for predctve stream mnng, manual labelng of data s a costly and tme consumng exercse. In practce t s not possble to manually label all stream data, especally when the data arrves at a hgh speed. It s therefore common practce to label only a small fracton of the data for tranng and testng purposes. Masud et. al [3] have proposed the use of ensemble classfcaton models based on sem-supervsed clusterng, so that only a small fracton (5) of the data needs to be labeled for the clusterng algorthm. Zhu et. al [8] have proposed an actve learnng framework for solvng the nstance labelng problem. Actve learnng ams to dentfy the most nformatve nstances that should be manually labeled n order to acheve the hghest accuracy. B. Ensemble models for stream mnng Several ensemble classfcaton methods for stream mnng have been reported n the lterature. Examples of ensemble frameworks that have been reported n the lterature are the streamng ensemble algorthm (SEA) [10], the accuracy-weghted ensemble (AWE) [11], and the dynamcally weghted maorty (DWM) ensemble [12]. Lutu [5] has proposed an ensemble framework for stream mnng based on Naïve Bayes ensemble classfcaton [13],[14]. The framework conssts of an onlne component and an offlne component. The onlne component uses Naïve Bayes ensemble base models to make predctons for newly arrved data stream nstances. The offlne component conssts of algorthms to combne base model predctons, determne the relablty of the ensemble predctons, select tranng data for new base models, create new base models, and determne whether the current onlne base models need to be replaced. Two obectves of ths framework are (1) to remove the need for manual labelng of tranng nstances, and (2) to determne when model revson should be performed. C. Nave Bayes ensemble classfcaton Naïve Bayes classfcaton has been reported n the lterature as one of the deal algorthms for stream mnng, due to ts ncremental nature [15]. The Naïve Bayes classfer assgns posteror class probabltes for the query nstance x based on Bayes theorem. The tranng dataset for a classfer s charactersed by d predctor varables X 1,...,X d and a class varable C. The tranng dataset conssts of n tranng nstances, and each nstance may be denoted as ( x,c ) where x = x,...,x ) are the values of ( 1 d the predctor varables, and c { c,...,c } s the class 1 label. Gven a new query nstance x = x,...,x ) Naïve K ( 1 d Bayes classfcaton nvolves the computaton of the probablty Pr( c x ) of the nstance belongng to each class c as [13], [14] Pr( c x ) = P r( c )P r( x c ) (1) P ( x ) where Pr( x c ) = Pr( x c ) and Pr( x ) = K = 1 r P ( Π r c )P r( x c ) The values of contnuous-valued varables are typcally converted nto categores through the process of dscretsaton. The quanttes Pr( c ) for the classes, and Pr( x c ), for the predctor varables, are then estmated from the tranng data. For zero-one loss classfcaton, the class c wth the hghest probablty Pr( c x ) s selected as the predcted class for nstance x. Feature selecton nvolves the dentfcaton of features (predctor varables) that are relevant and not redundant for a gven predcton task [16]. Lu and Motoda [16], Kohav [17], John et al. [18], and Lutu [19] have dscussed the merts of conductng feature selecton for Naïve Bayes classfcaton. One straght-forward method of feature selecton, s the use the Symmetrcal Uncertanty ( SU ) coeffcent as dscussed by Lutu [19]. The SU coeffcent for a predctor varable X and the class varable C s defned n terms of the entropy functon as [16], [20] SU = 2. 0( E( X ) + E( C) E( X,C) /( E( X ) + E( C)). (2) I ( 2 = 1 where E X ) = Pr ( x ) log Pr ( x ) J ( 2 = 1 and E C ) = Pr ( c ) log Pr ( c ) are respectvely the entropy for X and C and I E ( X,C ) = Pr ( x,c ) log 2 = 1 J = 1 Pr ( x,c ). s the ont entropy for X and C [16], [20]. The SU coeffcent takes on values n the nterval [0,1] and has the same nterpretaton as Pearson s product moment correlaton coeffcent for quanttatve varables [16]. Whte and Lu [20], and Lutu [19] have observed that the entropy functons and the ont entropy functon n (2) can be computed from a contngency table. A 2-dmensonal contngency table s a cross-tabulaton whch gves the frequences of co-occurrence of the values of two categorcal varables X and Y. For Naïve Bayes classfcaton and feature selecton, X s the feature and the ISSN: (Prnt); ISSN: (Onlne)

3 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. second varable s C, whch s the class varable. Varous statstcal measures can be derved from a contngency table. The man reason why Naïve Bayes (NB) classfcaton was selected for the framework reported by Lutu [5] s because model creaton s a smple and fast actvty. For each predctve feature, a sngle contngency table of counts for the feature values and class labels provdes all the data needed for the computaton of the quanttes for the terms n (1) and (2). Addtonally, the class entropy measure for makng decsons on base model selecton, as dscussed n Secton III, can be easly computed from the contngency tables. III. PROPOSED APPROACH TO ENSEMBLE REVISION A. Automated nstance labelng and selecton for Naïve Bayes classfcaton The framework proposed by Lutu [5] uses three measures for assessng the performance of the ensemble base models. These measures are: Certanty, Relablty, and Incoherence. Certanty measures the frequency that all base models n the ensemble have predcted the same class. Relablty measures the frequency that one class s predcted by the maorty of base models n the ensemble. Incoherence measures the frequency that each base model n the ensemble has predcted a dfferent class from the other base models. The studes reported n ths paper are an extenson of the studes reported by Lutu [5]. The extensons nvolve refnements to the predcton categores and the crtera for makng decsons on ensemble model revson. For the studes reported by Lutu [5], the ensemble predctons were based on the maorty vote by the base models. A maor refnement s to use two methods to determne the class label for a query nstance. Gven a query nstance x, each base model of the ensemble provdes a probablstc score for each class. The scores provded by the base models are typcally stored n a decson matrx wth one row for each base model and one column for each class [6]. Varous combnaton methods are avalable for determnng the ensemble predcton. For the studes reported n ths paper, the mean score for each class s computed and the class selected based on the score s the one wth the hghest mean score value. Addtonally, each base model provdes a predcton based on the class scores. Ths s the class wth the hghest score. The class wth the maorty vote (by the base models) s also used to determne the predcton category. A predcton s treated as vald f the class selected by the mean score value s the same as the one selected by the maorty vote. The nformaton provded by the score-based predctons and by the maorty vote predctons, for the vald predctons, s used as a bass for determnng the predcton categores. Table I provdes a summary of the revsed predcton categores, whch are: Certan, Relable, WeaklyRelable, and NotRelable. Only those predctons that are vald are assgned the categores Certan, Relable or WeaklyRelable. For the proposed automated nstance labelng approach, the nstances that are assgned the categores Certan and Relable are selected as the tranng data for buldng NB base models that can be used to replace the current base models when the need arses. The contngency tables for the new base model can be ncrementally created off-lne as the tranng data s beng generated. At the end of a tme perod two maor actvtes are conducted. The frst actvty s to determne f the newly created base model has any predctve value. Ths assessment s based on the class entropy of the tranng data used to create the contngency tables for the model, and the number of selected (relevant) features. The second actvty s to assess the predctve performance of the current ensemble on the data for the tme perod, and then make a decson on whether the new base model should replace one of the current base models. TABLE I PREDICTION CATEGORIES Mean score range for predcted class Vote for predcted class category 0.8 to out of 3 Certan (score >= 0.8) and 2 out of 3 Certan (score <= 1.0) 1 out of 3 NotRelable 0.6 to 0.8 (score >= 0.6) and (score < 0.8) 0.5 to 0.6 (score >= 0.5) and (score < 0.6) 0.0 to 0.5 (score >= 0.0) and (score < 0.5) B. Base model selecton for ensemble revson Varous measures of ensemble performance are defned n terms of the nstance counts for the predcton categores, and the total number of predctons for a gven tme perod. The measures are defned as follows: Measures of ensemble performance: Surrogate Accuracy = ( count (Certan) + count(relable) ) / ( TotalPredctons) Certanty = count (Certan) / TotalPredctons Relablty = count (Relable) / TotalPredctons WeakRelablty = count (Weakly Relable) / TotalPredctons NonRelablty Agreement 3 out of 3 Relable 2 out of 3 Relable 1 out of 3 NotRelable 3 out of 3 WeaklyRelable 2 out of 3 WeaklyRelable 1 out of 3 NotRelable 3 out of 3 NotRelable 2 out of 3 NotRelable 1 out of 3 NotRelable = count (Not Relable) / TotalPredctons = count( predctons where base modelpredcton s the same as ensemble predcton) The proposed approach handles dstrbuton changes, concept drft and sudden concept change proactvely. Addtonally, the approach ams to ensure that frequent model revsons result n hgh confdence predctons. A maor tme perod T (e.g. after 30,000 nstances are receved) s used for makng decsons about ensemble model revson to handle possble dstrbuton changes and concept drft. Addtonally, a mnor tme perod t (e.g. after 3,000 nstances) s used to montor the possblty of sudden concept change. Rule 1 s used at the end of each mnor tme perod, to check for sudden concept change. ISSN: (Prnt); ISSN: (Onlne)

4 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. When sudden concept change s detected, then the current ensemble should be abandoned, and a new ensemble should be created usng manually labeled data. Ensemble performance s assessed at the end of each maor tme perod T, to determne whether the most recently created base Rule 1: If (Certanty < specfed value) AND (Surrogate Accuracy < specfed value) AND (WeakRelablty > specfed value) AND (Agreement for at least one base model < specfed value) then flag sudden concept change model should replace one of the current base models. If the new base model has zero class entropy (.e. all tranng nstances are of the same class) then no replacement decson s made, otherwse, Rule 2 s used to determne the need for base model replacement. Rule 2: If (class entropy for the new base mode s greater than 0) then If (base model has the lowest Agreement) AND (base model has lowest no. of selected features) AND (base model has lower class entropy than new base model) AND (base model has fewer or equal number of selected features as new base model) then replace base model wth the new base model. IV. EXPERIMENTAL METHODS The KDD Cup 1999 dataset avalable from the UCI KDD archve [21] was used for the studes. The dataset conssts of a tranng dataset and a test dataset. The 10 verson of the tranng dataset was used for the experments. Ths dataset conssts of 494,021 nstances, 41 features and 23 classes. The 23 classes may be grouped nto fve categores: NORMAL, DOS, PROBE, R2L and U2R whch can then be used as the classes [22]. A new feature (called ID) was added to the dataset wth values n the range [1, ] as a pseudo tmestamp. Fg. 1 shows a plot of the class dstrbuton for ths data stream. The data stream exhbts an extreme mbalance of the class dstrbuton over tme. It s clear from Fg. 1 that the classes DOS and NORMAL are the maorty classes. Instance count K-30K 30K-60K 60K-90K 90K-120K Class dstrbuton for KDDCup99 120K-150K 150K-180K 180K-210K 210K-240K 240K-270K The algorthms for feature selecton, Naïve Bayes classfcaton, and base model predcton combnaton, were mplemented as a GNU C++ applcaton. Detals of the 270K-300K Tme perod Fg. 1. Class dstrbuton for the KDD Cup 1999 data stream (adopted from [5] ) ISSN: (Prnt); ISSN: (Onlne) 300K-330K 330K-360K 360K-390K 390K-420K 420K-450K 450K-480K 480K-494K DOS NORMAL PROBE R2L U2R requred data structures have been presented by Lutu [19]. The decsons on concept change detecton and model revson were made through manual nspecton of the model performance measures followed by the applcaton of Rule 1 or Rule 2 presented n Secton III. These decsons can easly be automated n C++ program code. V. EXPERIMENTS FOR STREAM MINING A. Obectves of the experments Exploratory experments were conducted to answer the followng questons: (1) Do the Certan and Relable categores lead to the selecton of a hgh percentage of correctly labeled tranng data? (2) Are the Certanty and Relablty measures good estmators of predctve accuracy? (3) Do the proposed methods for ensemble model revsons result n mprovements n Certan and Relable predctons? (4) Are the Certanty, Relablty, WeakRelablty, NonRelablty measures useful for forecastng possble concept change? B. Creaton of the ntal base models The top 90,000 nstances of the KDD Cup 1999 data stream were used for the creaton of the three base models MW1, MW2 and MW3 for the ntal ensemble. The nstances were dvded nto three batches of tranng nstances for each Naïve Bayes base model. The base models were tested on the data for the tme perod W4 (90K- 120K). Table II shows the propertes of the tranng data, and the test results for the ntal ensemble base models. TABLE II PROPERTIES OF THE TRAINING DATA FOR THE INITIAL ENSEMBLE BASE MODELS C. Expermental results The ntal ensemble model {MW1, MW2, MW3} was used to provde predctons for the data stream nstances startng at tme perod W4 (90K-120K). For each predcton perod W, the nstances that were assgned the Certan or Tme perod Relable category were selected as tranng data and the base model MW was created from ths tranng data. Tranng set sze Relevant features Model Class Entropy MW1 0K-30K 30, MW2 30K-60K 30, MW3 60K-90K 30, Testng accuracy on W4 data Addtonally, a decson was made whether or not to replace one of the current base models, usng the logc of Rule 2. Tables III and IV show the descrpton of the ensemble models and the ensemble performance, for the tme perods W4,,W17 (90K to 494K). For the tme perods W6 to W11, the base models are exactly the same. Ths s because all the nstances for these tme perods belong to the DOS class. The entres n column 3 of Table III show that base model replacements were done at the begnnng of tme perods W5, W6 and W13. The results of columns 3, 4 and 5 of Table IV show that, for the perods W5 to W15, the values for Certanty, Surrogate Accuracy and real Accuracy are consstently hgh. Addtonally, the Surrogate

5 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. Accuracy values are generally very close to the (real) Accuracy values. For the perod W16, the values for Certanty, Surrogate Accuracy and real Accuracy suddenly plummet to much lower levels. Ths s an ndcaton of sudden concept change. Based on the above observatons, the answer to the queston: Are the Certanty and Relablty measures good estmators of predctve accuracy? s yes. The reader wll recall that Surrogate Accuracy s defned n terms of Certanty and Relablty. TABLE III DESCRIPTION OF THE CONTINUOUSLY REVISED ENSEMBLE Ensemble Predcton Wndow Predcton tme perod base models W4 90K-120K MW1, MW2, MW3 E1 W5 120K-150K MW2, MW3, MW4 E2 W6,..,W11 150K-330K MW2, MW4, MW5 E3 W12 330K-360K MW2, MW4, MW5 E3 W13 360K-390K MW2, MW5, MW12 E4 W14 390K-420K MW2, MW5, MW12 E4 W15 420K-450K MW2, MW5, MW12 E4 W16 450K-480K MW2, MW5, MW12 E4 W17 480K-494K MW2, MW5, MW12 E4 TABLE IV PREDICTIVE PERFORMANCE OF THE CONTINUOUSLY REVISED ENSEMBLE Predcton wndow ensemble Certanty Surrogate Accuracy real Accuracy W4 E W5 E W6,..,W11 E W12 E W13 E W14 E W15 E W16 E W17 E Table V shows the values of the performance measures for the tme perods W5 to W17. The values for the Certan and Relable categores for the perods W5 to W15 (120K to 450K clearly ndcate that the class labels assgned to nstances n these categores are generally correct (can be trusted). Ths mples that the automatcally labeled and selected tranng data for the ensemble s of good qualty. Predcton perod TABLE V ASSESSMENT OF ENSEMBLE PREDICTION CATEGORIES Certan Relable Certanty Correct Relablty Correct W W6,..,W W W W W W W However, the values for the perod W16 (450K-480K) ndcate that the assgned class labels for the Relable category cannot be trusted as they have a very low level of correctness and so, the nstances for ths tme perod should not be selected as tranng data. Ths stuaton s detected as concept change whch s dscussed below. So, do the Certan and Relable categores lead to the selecton of a hgh percentage of correctly labeled tranng data? Based on the foregong observatons, the answer to ths queston s yes. The percentage of correctly labeled tranng data s n the regon of 95, whch mples a nose level n the regon of 5. Table VI shows the propertes and predcton behavour of all the base models that were used n the perodcally revsed ensemble. It can be deduced from Table III that at the end of W4, the decson was made to replace MW1 wth MW4, to obtan the ensemble {MW2, MW3, MW4}. Based on the values n Table VI and Rule 2 of Secton III, the reason for the replacement s because MW1 has the smallest number of selected features and lowest Agreement level. Addtonally, MW1 has lower class entropy than MW4, and fewer selected features than MW4. At the end of W5, the decson was made to replace MW3 wth MW5, to obtan the ensemble {MW2, MW4, MW5}. At the end of W12, the decson was made to replace MW4 wth MW12, to obtan the ensemble {MW2, MW5, MW12}. The reasons for these replacements can be easly deduced from the values n Table VI and Rule 2. Model Name TABLE VI PROPERTIES OF THE BASE MODELS USED IN THE PERIODICALLY REVISED ENSEMBLE Agreement wth ensemble predcton n tme perod: Tme perod for tranng data Tranng set sze Class Entropy Relevant features W4 W5 W12 MW1 0K-30K 30, MW2 30K- 60K MW3 60K- 90K MW4 90K- 120K MW5 120K- 150K MW12 330K- 360K 30, , , , , Table VII provdes a comparson of the ntal ensemble and the perodcally revsed ensemble n terms of Certanty and Relablty. The results of columns 2 and 4 ndcate that Certanty s much hgher for the perodcally revsed ensemble, startng wth the frst revson n W5 (120K- 150K). Based on ths observaton, the answer to the queston: Do the proposed methods for ensemble model revsons result n mprovements n certan and relable predctons?, s yes. Addtonal analyss was conducted on the predcton results of the ensemble {MW2, MW5, MW12} for the W16 (450K-480K) predcton perod. Tables VIII and IX show the detals of the ncremental performance of the ensemble ISSN: (Prnt); ISSN: (Onlne)

6 Proceedngs of the World Congress on Engneerng 2017 Vol II, July 5-7, 2017, London, U.K. for mnor tme perods consstng of 3,000 nstances each. Results are shown for the frst sx mnor tme perods. The results show that (1) Certanty, Surrogate Accuracy and real Accuracy decrease steadly untl concept change occurs n the mnor tme perod 459K-462K, (2) WeakRelablty suddenly ncreases when concept change occurs, and (3) Agreement for one of the base models (MW2) suddenly decreases sgnfcantly when concept change occurs. So, to answer the queston: Are the Certanty, Relablty, WeakRelablty, NonRelablty measures good at forecastng possble concept change?, the answer s yes. The above observatons support the clam that the measures are good ndcators of concept change. TABLE VII COMPARISON OF INITIAL AND PERIODICALLY REVISED ENSEMBLES Predcton perod TABLE VIII INCREMENTAL PERFORMANCE FOR W16 PREDICTIONS: CERTAINTY AND RELIABILITY Predcton tme perod Intal ensemble performance Certanty Certanty Relablty Relablty VI. CONCLUSIONS Contnuously revsed ensemble performance Certanty Weakly Relably Relablty W W W6,..,W W W W W W W Non Relablty 450K-453K K-456K K-459K K-462K K-465K K-468K TABLE IX INCREMENTAL PERFORMANCE FOR W16 PREDICTIONS: ACCURACY AND AGREEMENT Accuracy Agreement Predcton tme perod Surrogate Accuracy MW2 MW5 MW12 450K-453K K-456K K-459K K-462K K-465K K-468K The obectves of the research reported n ths paper were to assess the usefulness of the proposed methods for automated selecton of hgh qualty tranng data for Naïve Bayes base model creaton and ensemble revson for predctve data stream mnng. The expermental results reported n Secton V have demonstrated that, for the KDD Cup 1999 dataset, the use of two base model predcton combnaton algorthms for the ensembles, results n a hgh level of confdence n the class labels for the automatcally selected tranng data. Secondly, the perodc revson of the ensemble usng the base models created from the selected tranng data produces revsed ensemble models wth hgh predctve performance. Thrdly, the proposed measure of Surrogate Accuracy provdes meanngful nformaton about ensemble model predctve performance. REFERENCES [1] C.C. Aggarwal, Data Streams: Models and Algorthms, Boston: Kluwer Academc Publshers, [2] J. Gao, W. Fan and J. Han, On approprate assumptons to mne data streams: analyss and practce, n Proc.7 th IEEE Int. Conf. on Data Mnng, IEEE Computer Socety, 2007, pp [3] M.M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, and B. Thurasngham, Classfcaton and novel class detecton of data streams n a dynamc feature space, n Proc. ECML PKDD 2010, LNAI, Sprnger-Verlag, 2010, pp [4] J. Gao, W. Fan, J. Han and P.S. Yu, A general framework for mnng concept-drftng data streams wth skewed dstrbutons, n Proc. SDM Conference, [5] P.E.N. Lutu, Naïve Bayes classfcaton ensembles to support modelng decsons n data stream mnng, n Proc. IEEE SSCI 2015, Cape Town, South Afrca, pp [6] L. Kuncheva, Combnng Pattern Classfers: Methods and Algorthms, Hoboken, New Jersey: John Wley & sons [7] A. Bfet, G. Holmes, R. Krkby and B. Pfahrnger, Data Stream Mnng: a Practcal Approach, Unversty of Wakato, Avalable: treammnng.pdf [8] X. Zhu, P. Zhang, X. Ln and Y. Sh, Actve learnng from data streams, n Proc. 7 th IEEE Int. Conf. on Data Mnng, 2007, pp [9] P. Zhang, X. Zhu and L. Guo, Mnng data streams wth labeled and unlabeled tranng examples, n Proc. 9th IEEE Internatonal Conference on Data Mnng, 2009, pp [10] W.N. Street and Y. Km, A streamng ensemble algorthm (SEA) for large-scale classfcaton, n Proc. 7 th ACM Int. Conf. on Knowledge Dscovery and Data Mnng,ACM Press,New York,2001,pp [11] H. Wang, W. Fan, P.S. Yu and J. Han, Mnng concept-drftng data streams usng ensemble classfers, n Proc. ACM SIGKDD, Washngton DC, 2003, pp [12] J.Z. Kolter and M.A. Maloof, Dynamc weghted maorty: an ensemble method for drftng concepts, Journal of Machne Learnng Research, vol. 8, pp , [13] T. M. Mtchell, Machne Learnng, Burr Rdge, IL:WCB/McGraw- Hll, [14] P. Gudc, Appled Data Mnng: Statstcal Methods for Busness and Industry, Chchester: John Wley and Sons, [15] R. Munro and S. Chawla, An Integrated Approach to Mnng Data Streams, Techncal Report TR-548, School of Informaton Technologes, Unversty of Sydney, Avalable: [16] H. Lu and H. Motoda, Feature Selecton for Knowledge Dscovery and Data Mnng, Boston:Kluwer Academc Publshers, [17] R. Kohav, Scalng up the accuracy of Naïve Bayes classfers: a decson tree hybrd, n Proc. KDD 1996, pp [18] G.H. John, R. Kohav and K. Pleger, Irrelevant features and the subset selecton problem, n Proc. 11th Int. Conf. on Machne Learnng, 1994, pp [19] P.E.N. Lutu, Fast feature selecton for Naïve Bayes classfcaton n data stream mnng, n Proc. World Congress on Engneerng, London, U.K., July 2013, vol. III, pp [20] A.P. Whte and W.Z. Lu, Bas n nformaton-based measures n decson tree nducton, Machne Learnng, vol. 15, pp Boston : Kluwer Academc Publcatons, [21] S.D. Bay, D. Kbler, M.J. Pazzan and P. Smyth, The UCI KDD archve of large data sets for data mnng research and expermentaton, ACM SIGKDD, vol. 2, no. 2, pp , [22] S. W. Shn and C. H. Lee, Usng attack-specfc feature subsets for network ntruson detecton, n Proc. 19th Australan conference on Artfcal Intellgence, Hobart, Australa, ISSN: (Prnt); ISSN: (Onlne)

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status Internatonal Journal of Appled Busness and Informaton Systems ISSN: 2597-8993 Vol 1, No 2, September 2017, pp. 6-12 6 Implementaton Naïve Bayes Algorthm for Student Classfcaton Based on Graduaton Status