A Multivariate Analysis of Static Code Attributes for Defect Prediction


(Research Paper)

A Multivariate Analysis of Static Code Attributes for Defect Prediction

Burak Turhan, Ayşe Bener
Department of Computer Engineering, Bogazici University, 34342, Bebek, Istanbul, Turkey
{turhanb, bener}@boun.edu.tr

Abstract

Defect prediction is important in order to reduce test times by allocating valuable test resources effectively. In this work, we propose a model that uses multivariate approaches in conjunction with Bayesian methods for defect prediction. The motivation behind using a multivariate approach is to overcome the independence assumption of univariate approaches about software attributes. Using Bayesian methods gives practitioners an idea about the defectiveness of software modules in a probabilistic framework, rather than the hard classification produced by methods such as decision trees. Furthermore, the software attributes used in this work are chosen among the static code attributes that can easily be extracted from source code, which prevents human errors or subjectivity. These attributes are preprocessed with feature selection techniques to select the most relevant attributes for prediction. Finally, we compare our proposed model with the best results reported so far on public datasets and conclude that multivariate approaches can perform better.

Keywords: Defect prediction, Software Metrics, Naïve Bayes.
Topics: Software Quality, Methods and Tools.

1. Introduction

Testing is the most costly and time consuming part of the software development lifecycle, regardless of the development process used. Therefore, effective testing leads to significant decreases in project costs and schedules. The aim of defect prediction is to give an idea about testing priorities, so that exhaustive testing is avoided. An automated model may help project managers allocate testing resources effectively. Such models can predict the degree of defectiveness if relevant features of the software are supplied to them. These relevant features are obtained by using software metrics. Researchers usually prefer focusing on the selection of a subset of available features [10].
Feature subset selection is mainly preferred because of its interpretability, since the selected features correspond to actual, and in some cases controllable, measurements from software. This gives the ability to generate rules about the desired values of metrics for 'good' software. It is easier to explain such rules to programmers and managers [6]. This is also why most studies use decision trees as predictors: decision trees can be interpreted as a set of rules and can be understood by less technically involved people [6]. But decision trees are hard classification methods that predict a module as either defective or non-defective. Alternatively, Bayesian approaches provide a probabilistic framework and yield soft classification methods with posterior probabilities attached to the predictions [1]. This is why we employ Bayesian approaches in this work. On the other hand, feature subset selection requires an exhaustive search for choosing the optimal subset. Thus, feature selection algorithms use greedy approaches like backward or forward selection [7]. In forward selection, one starts with an empty set of features, and a feature is selected only if it increases the performance of the predictor; otherwise it is discarded. Backward selection is similar in the sense that one starts with all features, and a feature is removed if it does not affect the performance of the predictor. These approaches evaluate the features one at a time and do not consider the effects of features taken as pairs, triples or n-tuples. While a single feature may not affect the estimation performance significantly, pairs, triples or n-tuples of features may [7]. In order to

overcome this problem, this study employs feature extraction techniques and compares the results with a baseline study, where the InfoGain algorithm is used to rank and select a subset of features [10].

The major contribution of this research is to incorporate multivariate approaches rather than univariate ones. Univariate approaches assume the independence of features, whereas multivariate approaches take the relations between features into consideration. Obviously, univariate models are simpler than multivariate models. While it is good practice to start modeling with simple models, the problem at hand should also be investigated with more complex models. It should then be validated, by measuring performance, whether using more complex models is worth the extra complexity introduced in the modeling. This research performs experiments with both simple and complex models and compares their performances.

In the following section, the feature extraction methods used in this research are briefly described. Then, the models used for defect prediction are explained. After describing the experimental design and the results, conclusions are given.

2. Feature Extraction Methods

In feature extraction, new features are formed by combining the existing ones. This new set of features may not be interpreted as easily as before [6]. On the contrary, there are cases where they turn out to be interpretable [5]. The new features may also lead to better prediction performances by removing irrelevant and non-informative features. An advantage of the feature extraction methods used in this study is that they project data to an orthogonal feature space. One has to decide between ease of interpretability and better prediction performance in such cases. In this research the authors prefer better performance and therefore explore feature extraction methodologies.

Principal Component Analysis (PCA) has been used in other defect prediction studies [11], [13], [8], [14], [2]. We also use PCA in this research. PCA reveals the optimum linear structure of data points, but it is unable to find nonlinear relations, if such relations exist in the data. In order to investigate nonlinear relations, we use the Isomap algorithm as another feature extraction technique.

2.1. Isomap

Isomap inherits the advantages of PCA and extends them to learn nonlinear structures that are hidden in high dimensional data. Computational efficiency, global optimality, and guarantee of asymptotic convergence are its major features [16]. In general, the Euclidean distance is used to calculate the similarity of two instances. However, using the Euclidean distance to represent pairwise distances makes the model unable to preserve the intrinsic geometry of the data. Two nearby points, in terms of Euclidean distance, may indeed be distant, because their actual distance is the path between these points along the manifold. The length of the path along the manifold is referred to as the geodesic distance [16]. A 2-D spiral is an example of a manifold, which is actually a 1-D line that is folded and embedded in 2-D (see Figure 1, adapted from [9]). Applying Isomap to the spiral unfolds it to its true structure. Isomap simply performs classical Multidimensional Scaling [4] on the pairwise geodesic-distance matrix.

Figure 1. Geodesic distance metric: Points X and Y are at distinct ends of the spiral. Using the Euclidean distance, the true structure of the spiral, i.e. a 1-D line folded and embedded in 2-D, cannot be revealed.

The geodesic distance represents (similar or different) data points more accurately than the Euclidean distance, but the question is how to estimate it. Here the local linearity principle is used: it is assumed that neighboring points lie on a linear patch of the manifold, so for nearby points the Euclidean distances correctly estimate the geodesic distances. For distant points, the geodesic distances are estimated by adding up neighboring distances over the manifold using a shortest-path algorithm. Isomap finds the true dimensionality of nonlinear structures. The interpretation of the projection axes can be meaningful in some cases [5]. Isomap uses a single parameter k to define the neighborhood of data points, i.e. for the k-nearest neighbors of a data point, pairwise geodesic distances are assumed to be equivalent to Euclidean distances. This parameter should be fine tuned, preferably by cross-validation, to obtain optimum results. The data sample is transformed to have a

linear structure in the new projection space; e.g. the spiral is unfolded to a line.

3. Predictor Models

This section explains the predictor models used for defect prediction. As a baseline, the Naive Bayes classifier is taken, since it has been shown to achieve the best results obtained so far [10]. We remove the assumptions of the Naive Bayes classifier one at a time and construct the linear and quadratic discriminants. The assumption in Naive Bayes is that the features of a data sample are independent, thus it employs the univariate normal distribution. We believe this assumption is not valid for software data, since there are correlations between software data features. So we use a multivariate normal distribution to model the correlations among features. In the next section, univariate and multivariate normal distributions are briefly explained.

3.1. Univariate vs. Multivariate Normal Distribution

In the univariate normal distribution, $x \sim N(\mu, \sigma^2)$, x is said to be normally distributed with mean $\mu$ and standard deviation $\sigma$, and the probability distribution function (pdf) is defined as:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] \quad (1)$$

The term inside the exponential in Equation 1 is the normalized Euclidean distance, where the distance of a data sample x to the sample mean $\mu$ is measured in terms of standard deviations $\sigma$. This scales the distances of different features in case feature values vary significantly. This measure does not consider the correlations among features. In the multivariate case, x is a d-dimensional vector that is normally distributed, $x \sim N(\mu, \Sigma)$, and the pdf of the multivariate normal distribution is defined as:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \quad (2)$$

where $\Sigma$ is the covariance matrix and $\mu$ is the mean vector. The term inside the exponential in Equation 2 is another distance function, called the Mahalanobis distance [1]. In this case, the distance to the mean vector is normalized by the covariance matrix, and the correlations of features are also considered. This results in less contribution from highly correlated features and features with high variance.
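The contrast between the normalized Euclidean distance and the Mahalanobis distance described above can be illustrated numerically. A minimal NumPy sketch with made-up data (the covariance values are invented for illustration, not taken from the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features, e.g. two size-related code metrics.
cov = np.array([[4.0, 3.8],
                [3.8, 4.0]])
mu = np.array([10.0, 10.0])
X = rng.multivariate_normal(mu, cov, size=2000)

m = X.mean(axis=0)               # sample mean vector
S = np.cov(X, rowvar=False)      # sample covariance matrix
S_inv = np.linalg.inv(S)

def mahalanobis(x, m, S_inv):
    """Distance of x to the mean, normalized by the covariance matrix."""
    d = x - m
    return float(np.sqrt(d @ S_inv @ d))

# A point that moves with the correlation vs. one that moves against it.
along = m + np.array([2.0, 2.0])     # consistent with the correlation
against = m + np.array([2.0, -2.0])  # violates the correlation

# Both are equally far from the mean in Euclidean terms...
assert np.isclose(np.linalg.norm(along - m), np.linalg.norm(against - m))
# ...but the Mahalanobis distance flags the correlation-violating point.
print(mahalanobis(along, m, S_inv) < mahalanobis(against, m, S_inv))  # True
```

The point moving against the correlation is roughly six times farther in Mahalanobis terms, which is exactly the information a univariate model discards.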
Our assumption is that software data features are correlated and that a multivariate model is more appropriate than the univariate model. Besides, the multivariate normal distribution is analytically simple, tractable and robust to departures from normality [1]. As the no free lunch theorem states [17], nothing comes for free, and using a multivariate model increases the number of parameters to estimate. In the univariate case, only 2 parameters, $\mu$ and $\sigma$, are estimated, while in the multivariate case, d parameters for $\mu$ and $d \times d$ parameters for $\Sigma$ need to be estimated.

3.2. Multivariate Classification

In software defect prediction, one aims to discriminate classes $C_0$ and $C_1$, where samples in $C_0$ are non-defective and samples in $C_1$ are defective. We combine the multivariate normal distribution and the Bayes rule, use different assumptions, and achieve discriminants with different complexity levels (see Table 1). We prefer the discriminant point of view, since it is geometrically interpretable. A discriminant in general is a hyperplane that separates the d-dimensional space into disjoint subspaces. The general structure of a discriminant is explained next.

Table 1. Complexities of predictors in a K-class problem with d features.

Predictor | # Parameters
QD        | (K x (d x d)) + (K x d) + K
LD        | (d x d) + (K x d) + K
NB        | d + (K x d) + K

Bayes' theorem states that the posterior distribution of a sample is proportional to the prior distribution and the likelihood of the given sample. More formally:

$$P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{P(x)} \quad (3)$$

Equation 3 is read as: "The probability of a given data instance x belonging to class $C_i$ is equal to the multiplication of the likelihood that x comes from the distribution that generates $C_i$ and the probability of observing $C_i$'s in the whole sample, normalized by the evidence." The evidence is given by:

$$P(x) = \sum_i P(x \mid C_i)\,P(C_i) \quad (4)$$

and it is a normalization constant for all classes, thus it can safely be discarded. Equation 3 then becomes:

$$P(C_i \mid x) \propto P(x \mid C_i)\,P(C_i) \quad (5)$$
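Choosing the class with the highest posterior under a multivariate normal likelihood can be sketched directly. This is a minimal illustration with synthetic, imbalanced two-feature data (all class means, covariances and priors below are invented, not the paper's estimates); it amounts to the quadratic discriminant derived in the next section:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_gaussian(X):
    """Maximum likelihood estimates m and S for one class."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def g(x, m, S, prior):
    """Discriminant g_i(x) = log P(x|C_i) + log P(C_i) with a
    multivariate normal likelihood."""
    d = x - m
    S_inv = np.linalg.inv(S)
    log_lik = -0.5 * (np.log(np.linalg.det(S))
                      + d @ S_inv @ d
                      + len(x) * np.log(2 * np.pi))
    return log_lik + np.log(prior)

# Synthetic 2-feature data: C0 = non-defective, C1 = defective (imbalanced).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=900)
X1 = rng.multivariate_normal([3, 3], [[1.0, -0.4], [-0.4, 1.0]], size=100)

(m0, S0), (m1, S1) = fit_gaussian(X0), fit_gaussian(X1)
p0, p1 = 0.9, 0.1   # priors estimated by counting

def classify(x):
    # Assign x to the class with the highest discriminant value;
    # equivalent to picking the highest posterior, since the
    # evidence P(x) is the same for both classes.
    return 0 if g(x, m0, S0, p0) > g(x, m1, S1, p1) else 1

print(classify(np.array([0.1, -0.2])), classify(np.array([3.2, 2.9])))
```

Because each class gets its own estimated covariance, this sketch is the quadratic discriminant; forcing a shared covariance would give the linear discriminant, and a shared diagonal covariance would give Naive Bayes.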

In a classification problem, we compute the posterior probabilities $P(C_i \mid x)$ for each class and choose the one with the highest posterior. This is equivalent to defining a discriminant function $g_i(x)$ for class $C_i$, where $g_i(x)$ is derived from Equation 5 by taking logarithms for convenience:

$$g_i(x) = \log P(x \mid C_i) + \log P(C_i) \quad (6)$$

In order to compute a discriminant value, one needs to compute the prior and likelihood terms. The prior probability $P(C_i)$ can be estimated from the sample by counting. The critical issue is to choose a suitable distribution for the likelihood term $P(x \mid C_i)$. This is where the multivariate normal distribution takes place: in this study the likelihood term is modeled by the multivariate normal distribution. Computing discriminant values for each class and assigning the instance to the class with the highest value is equivalent to using Bayes' theorem to choose the class with the highest posterior probability. For the 2-class case, it is sufficient to construct a single discriminant $g(x) = g_0(x) - g_1(x)$. Using the discriminant point of view, we explain the different predictors in the following sections. In all cases, an instance x is classified as $C_i$ such that $i = \arg\max_k g_k(x)$.

3.3. Quadratic Discriminant

Assumption: Each class has a distinct $\Sigma_i$ and $\mu_i$.

Derivation: Combining Equation 2 and Equation 6,

$$g_i(x) = -\frac{1}{2}\log|S_i| - \frac{1}{2}(x-m_i)^T S_i^{-1}(x-m_i) + \log P(C_i) \quad (7)$$

and by defining new variables $W_i$, $w_i$ and $w_{i0}$, the quadratic discriminant is obtained as

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0} \quad (8)$$

where

$$W_i = -\frac{1}{2} S_i^{-1} \quad (9)$$

$$w_i = S_i^{-1} m_i \quad (10)$$

$$w_{i0} = -\frac{1}{2} m_i^T S_i^{-1} m_i - \frac{1}{2}\log|S_i| + \log P(C_i) \quad (11)$$

and $S_i$, $m_i$ and $P(C_i)$ are the maximum likelihood estimates of $\Sigma_i$, $\mu_i$ and $P(C_i)$ respectively. The quadratic model considers the correlation of the features differently for each class. In the case of K classes, the number of parameters to estimate is $K(d \times d)$ for the covariance estimates and $K \times d$ for the mean estimates. Also, K prior probability estimates are needed.

3.4. Linear Discriminant

Assumption: Each class has a common $\Sigma$ and a distinct $\mu_i$.

Derivation: The assumption states that classes share a common covariance matrix.
The estimator is found either by using the whole data sample or by the weighted average of the class covariances, which is given as

$$S = \sum_i P(C_i)\, S_i \quad (12)$$

Placing this term in Equation 7 we get

$$g_i(x) = x^T S^{-1} m_i - \frac{1}{2} m_i^T S^{-1} m_i + \log P(C_i) \quad (13)$$

which is now a linear discriminant of the form

$$g_i(x) = w_i^T x + w_{i0} \quad (14)$$

where

$$w_i = S^{-1} m_i \quad (15)$$

$$w_{i0} = -\frac{1}{2} m_i^T S^{-1} m_i + \log P(C_i) \quad (16)$$

This model considers the correlation of the features but assumes the variances and correlations of features are the same for both classes. The number of parameters to estimate for the covariance matrix is now independent of K: $d \times d$ parameters for the covariance estimate, $K \times d$ for the mean estimates and K for the priors.

3.5. Naïve Bayes

Assumption: Each class has a common $\Sigma$ with off-diagonal entries equal to 0, and a distinct $\mu_i$.

Derivation: The assumption states the independence of features by using a diagonal covariance matrix. The model then reduces to the univariate model given in Equation 17.

$$g_i(x) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log P(C_i) \quad (17)$$

This model does not take the correlation of the features into account; it measures the deviation from the mean in terms of standard deviations. For Naive Bayes, d covariance, $K \times d$ mean and K prior parameters should be estimated.

4. Experiments and Results

The design of experiments and the evaluation of results in software defect prediction problems have particular importance. Most experiment designs have important flaws, such as self tests and insufficient

performance measures, as reported in [10]. Most research reported only the accuracy of predictors as a performance indicator. Examining defect prediction datasets, it is easily seen that they are not balanced. In other words, the number of defective instances is much smaller than the number of non-defective instances. As pointed out in [10], one can achieve 95% accuracy on a 5% defective dataset by building a dummy classifier that always classifies instances as non-defective. A framework of MxN experiment design, which means M replications of N (holdout cross validation) experiments, is also given in [10], and additional performance measures are reported, such as the probability of detection (pd) and the probability of false alarm (pf). This research follows the same notation. A 10-fold cross-validation approach is used in the experiments. That is, datasets are divided into 10 bins; 9 bins are used for training and 1 bin for testing. Repeating this over the 10 folds ensures that each bin is used for both training and testing while minimizing sampling bias. Each holdout experiment is also repeated 10 times, and in each repetition the datasets are randomized to overcome any ordering effect and to achieve reliable statistics. The reported results are the mean values of these 100 experiments for each dataset.

The quadratic discriminant (QD), linear discriminant (LD) and Naive Bayes (NB) are the predictors used in this research. As performance measures, pd, pf and balance (bal) are reported. pd is a measure for correctly detecting defective modules: the ratio of the number of correctly predicted defective modules to the number of actually defective modules. Obviously, higher pd's are desired. As the name suggests, pf is a measure of false alarms: the probability of predicting a module as defective while it is not. pf is desired to have low values. The balance measure is used to choose the optimal (pd, pf) pairs such that the area under the ROC curve is maximized; it is defined via the normalized Euclidean distance from the desired point (0, 1) to the (pf, pd) point in the ROC curve.

Figure 2. Experiment Design.
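The pd, pf and bal measures can be computed directly from a confusion matrix; bal is commonly taken as 1 minus the normalized distance to the ideal ROC point, as in [10]. A minimal sketch (the helper name is ours, and the labels are invented to reproduce the dummy-classifier argument above):

```python
import numpy as np

def pd_pf_bal(actual, predicted):
    """pd, pf and bal from binary labels (1 = defective)."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = np.sum(actual & predicted)    # defective, predicted defective
    fn = np.sum(actual & ~predicted)   # defective, missed
    fp = np.sum(~actual & predicted)   # false alarm
    tn = np.sum(~actual & ~predicted)
    pd = tp / (tp + fn)                # probability of detection
    pf = fp / (fp + tn)                # probability of false alarm
    # bal: 1 minus the normalized Euclidean distance from the ideal
    # ROC point (pf, pd) = (0, 1).
    bal = 1 - np.sqrt(pf ** 2 + (1 - pd) ** 2) / np.sqrt(2)
    return pd, pf, bal

# A dummy predictor on a 5% defective dataset is 95% accurate
# but detects nothing, illustrating why accuracy alone misleads.
actual = np.array([1] * 5 + [0] * 95)
always_clean = np.zeros(100, dtype=int)
print(pd_pf_bal(actual, always_clean))  # pd = 0.0, pf = 0.0
```

For the dummy classifier, bal is about 0.29 despite 95% accuracy, which is why bal rather than accuracy is used to pick the best (pd, pf) pairs.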
The experiments conducted in [10] are replicated and extended in this study. The framework for experiment design in [10] is followed and updated as in Figure 2. In order to extract features, PCA and Isomap are performed on the log-filtered data attributes. An advantage of log filtering is that it scales the features so that extreme values are handled. Another advantage is that the normal distribution then fits the data better; in other words, the data attributes are assumed to be lognormally distributed. 5 to 30 features are extracted for all datasets using PCA and Isomap. The best subset of features reported in [10] is also used in the experiments. This subset of features differs for each dataset. The best performing dimensionalities achieved by PCA and Isomap are also different for each dataset. These observations support the idea that there is no global set of features that describes software. So, the maximum possible set of software metrics should be collected and analyzed, as long as it is feasible to collect them.

Table 2. Dataset Descriptions: name, #modules and defect rate (%) for CM1, PC1, PC2, PC3, PC4, KC3, KC4 and MW1. (The numeric values were not preserved in this transcription.)

For evaluation, 8 different public datasets obtained from the NASA MDP repository [12] are used. Sample sizes vary from 125 to 5589 modules. Each dataset has 38 features representing static code attributes. As seen in Table 2, defect rates are very low, which justifies the use of the above-mentioned performance measures. All implementations are done in the MATLAB environment using standard toolboxes. Results are tabulated in Table 3: the mean results of the (pd, pf) pairs selected by the bal measure after 10x10 holdout experiments are given. For the PCA- and ISO-labeled entries, these results are selected from the 5 to 30 features obtained by PCA and Isomap respectively. For SUB-labeled entries, the best subset of features

obtained by InfoGain is used, as reported in [10].

Table 3. Results: the best mean performances (pd %, pf %, bal %) and their average over all datasets, for the winning configurations CM1: SUB+NB; PC1: PCA+NB; PC2: PCA+NB; PC3: PCA+LD; PC4: PCA+QD; KC3: ISO+NB; KC4: ISO+LD; MW1: ISO+LD. (The numeric values were not preserved in this transcription.)

In Table 3, results indicated in bold face are statistically significant against the other methods at α = 0.05 after applying a t-test on the pd performance measure. Subset selection is better than the feature extraction methods in only 1 out of 8 datasets (CM1). In the remaining datasets, the best performances are obtained by applying either PCA or Isomap instead of InfoGain. In PC1, PC2, PC3 and PC4, the best mean performances are achieved by applying PCA, while in KC3, KC4 and MW1 Isomap yields better results. It is observed that Isomap gives the best performances on relatively small datasets; as the module counts increase, PCA performs better. Except for the PC3 dataset, our replicated results are similar to the mean results reported in [10], but the variances of the replicated experiments (i.e. subsetting) are larger than those of the PCA and Isomap approaches, especially for the pf measure. NB and LD are observed to behave similarly, whereas QD results differ from NB and LD in terms of performance. It is observed for QD that, as the number of features increases, performance gets worse, especially for the pf measure, and the variances increase. A possible reason for this is the complexity of the model (i.e. too many parameters to estimate). As for the predictors, Naive Bayes (NB) is chosen 4 times, the linear discriminant (LD) 3 times and the quadratic discriminant (QD) only once. From these results, it can be concluded that claims stating any of these predictors as the 'globally' correct one should be avoided. As expected, no specific configuration of a feature selection method and a predictor is always better than the others. Even though NB is the majority winner, it is clearly seen that performances on some datasets are increased by using the multivariate methods QD and LD. Applying QD gives the best result in the PC4 dataset, but it is not statistically significant.
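A significance check of the kind used above (a t-test at α = 0.05 on pd across the 10x10 = 100 experiments) can be sketched with SciPy. The pd samples below are invented for illustration, and the Welch variant (unequal variances) is one reasonable choice; the paper does not specify which t-test variant was used:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical pd values (%) from 100 holdout experiments for two
# feature-extraction/predictor combinations (synthetic, illustrative).
pd_pca_nb = rng.normal(loc=78, scale=5, size=100)
pd_sub_nb = rng.normal(loc=70, scale=8, size=100)

# Two-sample t-test: is the mean pd difference significant at alpha = 0.05?
t_stat, p_value = stats.ttest_ind(pd_pca_nb, pd_sub_nb, equal_var=False)
print(p_value < 0.05)  # True
```

With 100 replications per configuration, even modest mean differences tend to reach significance, which is one reason the ceiling effect discussed next becomes the binding constraint.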
It can be concluded that QD can be discarded because of its complexity. In the cases where LD wins, statistical significance is observed, so the additional complexity introduced can be justified. There may be other predictors performing better than these. Constructing better predictors is an open-ended problem, and as better results are reported, the problem gets more difficult due to a ceiling effect; i.e. it is harder to confirm the hypothesis that predictor A performs better than predictor B when A and B reach, or come close to, the maximum achievable performance [3].

The overall performance of the approach improves on the best results reported so far [10]. Previous research reported mean (pd, pf) = (71, 25), which yields bal = 72, averaged over all datasets. Replication of these experiments yields mean (pd, pf) = (64, 19) and bal = 71. After experimenting with all possible combinations of InfoGain, PCA and Isomap with NB, LD and QD, an improvement is observed by picking the best combinations for all datasets. The improved results yield mean (pd, pf) = (77, 25), where bal = 76. While no change in the pf measure is observed, the pd measure is improved by 6%. A final comment should be made about the running times of the algorithms. As expected, QD takes more time than LD and NB; however, this difference is not too significant. The dominant factor that affects the running times is the sample size.

5. Conclusions and Future Work

In this research, software defect prediction is considered as a data mining problem. Several experiments are conducted, including the replication of previous research on publicly available datasets from the NASA repository. The performances of different predictors together with different feature extraction methods are evaluated. The results are compared with the best performances reported so far and some improvements are observed. Previous research advises that one should not seek a globally best subset of features, but rather focus on building predictors that combine information from multiple features. In addition, the authors also believe that research should focus on a balanced combination of those.
In other words, building successful predictors depends on how useful the information supplied to them is. While carrying out research on better predictors, research on obtaining useful information from features should also be carried out. A contribution of this research is the use of linear and nonlinear feature extraction methods in order to combine information from multiple features. In software defect prediction there is more research on feature subset selection than on feature extraction. Results

suggest that it is worth exploring further to deepen our knowledge of feature extraction. Another contribution of this research is the modeling of correlations among features: improved results are obtained by using multivariate statistical methods. Furthermore, the probabilities of the predictions are provided by employing Bayesian approaches, which can give project managers and practitioners a better understanding of the defectiveness of software modules.

Further research should investigate the validation of the lognormal distribution assumption for the software data used in this research. It is better practice to apply goodness-of-fit tests rather than to assume a normal distribution. Other exponential-family distributions should also be investigated. Another research area is to investigate filters that transform data into suitable distributions.

Acknowledgements

This research is supported in part by the Bogazici University research fund under grant number BAP-06HA104. The authors would like to thank Koray Balcı, who contributed to earlier versions of this manuscript.

References

[1] E. Alpaydin, Introduction to Machine Learning, The MIT Press, October 2004.
[2] E. Ceylan, F. O. Kutlubay, and A. B. Bener, "Software defect identification using machine learning techniques", in Proceedings of the 32nd EUROMICRO Conference on Software Engineering and Advanced Applications, IEEE Computer Society, Washington, DC, USA, 2006.
[3] P. R. Cohen, Empirical Methods for Artificial Intelligence, The MIT Press, London, England.
[4] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London.
[5] V. de Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction", in S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, MA, 2003.
[6] N. E. Fenton and M. Neil, "A critique of software defect prediction models", IEEE Transactions on Software Engineering, 25(5), 1999.
[7] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection", Journal of Machine Learning Research, 3, 2003.
[8] T. M. Khoshgoftaar and J. C. Munson, "Predicting software development errors using software complexity metrics", IEEE Journal on Selected Areas in Communications, 8(2), Feb. 1990.
[9] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen, "A robust nonlinear projection method", in Proceedings of ESANN 2000, European Symposium on Artificial Neural Networks, Bruges (Belgium), 2000.
[10] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors", IEEE Transactions on Software Engineering, 33(1), 2007, pp. 2-13.
[11] J. Munson and T. M. Khoshgoftaar, "Regression modelling of software quality: empirical investigation", 19(6), 1990.
[12] NASA/WVU IV&V Facility, Metrics Data Program.
[13] M. Neil, "Multivariate assessment of software products", Software Testing, Verification and Reliability, 1(4), 1992.
[14] D. E. Neumann, "An enhanced neural network technique for software risk analysis", IEEE Transactions on Software Engineering, 28(9), 2002.
[15] G. Boetticher, T. Menzies and T. Ostrand, PROMISE Repository of empirical software engineering data, West Virginia University, Department of Computer Science, 2007.
[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction", Science, 290, 2000.
[17] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation, 1(1), April 1997, pp. 67-82.


More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

A Semi-parametric Regression Model to Estimate Variability of NO 2

A Semi-parametric Regression Model to Estimate Variability of NO 2 Envronment and Polluton; Vol. 2, No. 1; 2013 ISSN 1927-0909 E-ISSN 1927-0917 Publshed by Canadan Center of Scence and Educaton A Sem-parametrc Regresson Model to Estmate Varablty of NO 2 Meczysław Szyszkowcz

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM Classfcaton of Face Images Based on Gender usng Dmensonalty Reducton Technques and SVM Fahm Mannan 260 266 294 School of Computer Scence McGll Unversty Abstract Ths report presents gender classfcaton based

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

EXTENDED BIC CRITERION FOR MODEL SELECTION

EXTENDED BIC CRITERION FOR MODEL SELECTION IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap Int. Journal of Math. Analyss, Vol. 8, 4, no. 5, 7-7 HIKARI Ltd, www.m-hkar.com http://dx.do.org/.988/jma.4.494 Emprcal Dstrbutons of Parameter Estmates n Bnary Logstc Regresson Usng Bootstrap Anwar Ftranto*

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Positive Semi-definite Programming Localization in Wireless Sensor Networks

Positive Semi-definite Programming Localization in Wireless Sensor Networks Postve Sem-defnte Programmng Localzaton n Wreless Sensor etworks Shengdong Xe 1,, Jn Wang, Aqun Hu 1, Yunl Gu, Jang Xu, 1 School of Informaton Scence and Engneerng, Southeast Unversty, 10096, anjng Computer

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Correlative features for the classification of textural images

Correlative features for the classification of textural images Correlatve features for the classfcaton of textural mages M A Turkova 1 and A V Gadel 1, 1 Samara Natonal Research Unversty, Moskovskoe Shosse 34, Samara, Russa, 443086 Image Processng Systems Insttute

More information

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

Recognizing Faces. Outline

Recognizing Faces. Outline Recognzng Faces Drk Colbry Outlne Introducton and Motvaton Defnng a feature vector Prncpal Component Analyss Lnear Dscrmnate Analyss !"" #$""% http://www.nfotech.oulu.f/annual/2004 + &'()*) '+)* 2 ! &

More information

Fusion Performance Model for Distributed Tracking and Classification

Fusion Performance Model for Distributed Tracking and Classification Fuson Performance Model for Dstrbuted rackng and Classfcaton K.C. Chang and Yng Song Dept. of SEOR, School of I&E George Mason Unversty FAIRFAX, VA kchang@gmu.edu Martn Lggns Verdan Systems Dvson, Inc.

More information

Relevance Assignment and Fusion of Multiple Learning Methods Applied to Remote Sensing Image Analysis

Relevance Assignment and Fusion of Multiple Learning Methods Applied to Remote Sensing Image Analysis Assgnment and Fuson of Multple Learnng Methods Appled to Remote Sensng Image Analyss Peter Bajcsy, We-Wen Feng and Praveen Kumar Natonal Center for Supercomputng Applcaton (NCSA), Unversty of Illnos at

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

The Discriminate Analysis and Dimension Reduction Methods of High Dimension

The Discriminate Analysis and Dimension Reduction Methods of High Dimension Open Journal of Socal Scences, 015, 3, 7-13 Publshed Onlne March 015 n ScRes. http://www.scrp.org/journal/jss http://dx.do.org/10.436/jss.015.3300 The Dscrmnate Analyss and Dmenson Reducton Methods of

More information

Supervised Nonlinear Dimensionality Reduction for Visualization and Classification

Supervised Nonlinear Dimensionality Reduction for Visualization and Classification IEEE Transactons on Systems, Man, and Cybernetcs Part B: Cybernetcs 1 Supervsed Nonlnear Dmensonalty Reducton for Vsualzaton and Classfcaton Xn Geng, De-Chuan Zhan, and Zh-Hua Zhou, Member, IEEE Abstract

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Laplacian Eigenmap for Image Retrieval

Laplacian Eigenmap for Image Retrieval Laplacan Egenmap for Image Retreval Xaofe He Partha Nyog Department of Computer Scence The Unversty of Chcago, 1100 E 58 th Street, Chcago, IL 60637 ABSTRACT Dmensonalty reducton has been receved much

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information