Software Fault Prediction of Unlabeled Program Modules


C. Catal, U. Sevim, and B. Diri, Member, IAENG

Abstract

Software metrics and fault data belonging to a previous software version are used to build the software fault prediction model for the next release of the software. Until now, different classification algorithms have been used to build this kind of model. However, there are cases when previous fault data are not present, and hence supervised learning approaches cannot be applied. In this study, we propose a fully automated technique which does not require an expert during the prediction process. In addition, it is not required to identify the number of clusters before the clustering phase, as required by the K-means clustering method. Software metrics thresholds are used to remove the need for an expert. Our technique first applies the X-means clustering method to cluster modules and identifies the best cluster number. After this step, the mean vector of each cluster is checked against the metrics thresholds vector. A cluster is predicted as fault-prone if at least one metric of the mean vector is higher than the threshold value of that metric. In addition to the X-means clustering-based method, we made experiments with the pure metrics thresholds method and with fuzzy clustering and K-means clustering-based methods. Experiments reveal that unsupervised software fault prediction can be fully automated and effective results can be produced using X-means clustering with software metrics thresholds. Three datasets, collected from a Turkish white-goods manufacturer developing embedded controller software, have been used for the validation.

Index Terms: Clustering, metrics thresholds, software fault prediction, and X-means clustering.

I. INTRODUCTION

The quality of software components should be tracked continuously during the development of high-assurance systems such as telecommunication infrastructures, medical devices, and avionic systems.
A quality assurance group can improve product quality by allocating the necessary budget and human resources to low-quality modules identified with different quality estimation models. Recent advances in software quality estimation yield defect predictors with a mean probability of detection of 71 percent and mean false alarm rates of 25 percent [1]. Software quality estimation is concerned not only with reliability but also with other quality characteristics such as usability, efficiency, maintainability, functionality, and portability. However, some researchers prefer using the term software quality estimation for software fault prediction modeling studies [2].

Software metrics are used as independent variables and fault data are regarded as the dependent variable in software fault prediction models. The aim of building this kind of model is to predict the fault labels (fault-prone or not fault-prone) of the modules for the next release of the software. Some benefits of using fault prediction models are [3]:

- The identification of refactoring candidate modules (fault-prone modules),
- The selection of the best design approach from design alternatives,
- The improvement of the software testing process and software quality,
- Reaching a highly dependable system.

A typical software fault prediction process includes two steps, as shown in Figure 1. First, a fault prediction model is built using previous software metrics and fault data belonging to each software module (class or method level). After this training phase, fault labels of program modules can be estimated using this model [4].

C. Catal, PhD, is with The Scientific and Technological Research Council of TURKEY, Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli, 4470, TURKEY (e-mail: cagatay.catal@bte.mam.gov.tr). U. Sevim is with the Department of Computer Engineering, Bogazici University, Bebek, Istanbul, TURKEY (e-mail: ugur.sevim@boun.edu.tr). B. Diri, Asst. Prof. Dr., is with the Department of Computer Engineering, Yildiz Technical University, Yildiz, Istanbul, TURKEY (e-mail: banu@ce.yildiz.edu.tr). This project is supported by The Scientific and Technological Research Council of TURKEY (TUBITAK) under Grant 07E23. The findings and opinions in this study belong solely to the authors, and are not necessarily those of the sponsor.

Fig. 1. Software fault prediction process [4].

The selection of the metrics type depends on the programming paradigm used in the project and on the research targets. Our systematic review, focusing on 74 papers published between 1990 and 2007, revealed that 60 percent of papers used method-level metrics [5]. Therefore, we applied method-level metrics to build our models in this study. A sample training dataset, including software metrics and known fault data, is shown in Figure 2. All the metrics are separated with commas in this figure and the last column
presents whether this module caused a fault or not during the testing phase. This last feature (column) consists of false and true values.

Fig. 2. A sample fault prediction dataset.

From a machine learning perspective, Figure 1 depicts a supervised learning approach because the modeling phase uses class labels, represented as known fault data in the figure. Most software fault prediction studies have focused on developing fault predictors using previous fault data. However, there are cases when previous fault data are not available. For example, a software company might start to work on a new project domain or might plan to build fault predictors for the first time in its development cycle. In addition, the current software version's fault data might not have been collected, and therefore there might not be any previous fault data for the next release of the software. In these cases, supervised learning approaches cannot be developed because of the absence of class labels. Figure 3 depicts this challenging problem, for which unsupervised learning approaches can be applied.

Fig. 3. The no-fault-data problem [4].

There are a few studies that have tried to build a fault prediction model when the fault labels for modules are unavailable. Zhong et al. [6] used K-means and Neural-Gas clustering methods to cluster modules, and then an expert, an engineer with 5 years of experience, labeled each cluster as fault-prone or not fault-prone by examining not only the representative of each cluster but also statistical data such as the global mean, minimum, maximum, median, 75th percentile, and 90th percentile of each metric. To remove the need for expert assistance, we developed a prediction model in our previous work [7] and validated it on three datasets collected from a Turkish white-goods manufacturer developing embedded controller software. Metrics thresholds were used to embed the expert knowledge into our model, and subjective human interaction was eliminated [7].
First, the K-means clustering method is applied and the mean vector of each cluster is checked against the metrics thresholds vector. A cluster is predicted as fault-prone if at least one metric of the mean vector is higher than the threshold value of that metric [7]. The main contribution of that study is the use of metrics thresholds with or without clustering methods and the removal of the need for expert assistance. However, our previous study [7] has one drawback: because the K-means clustering method is used in the first stage, the number of clusters K must be selected heuristically, and this number may affect the overall performance of the model. Therefore, we aimed to build a fully automated fault prediction model which can be applied when there is no previous fault data.

In this new study, we propose a new fault prediction model which does not require the heuristic selection of K. Instead, K is calculated automatically with the X-means clustering algorithm. After the identification of K and the clusters, metrics thresholds are again used as in our previous study. The main contribution of this paper is the development of an automated way of assigning fault-proneness labels to the modules and the removal of subjective expert opinion. Subjective human interaction directly affects the quality of the software fault prediction model and adds unnecessary complexity.

We explored our new approach on three datasets, collected from a Turkish white-goods manufacturer developing embedded controller software for washing machines, dishwashers, and refrigerators. These datasets, AR3, AR4, and AR5, are available in the PROMISE repository. They include 29 metrics, but we used only 6 metrics during modeling because we know the industrial thresholds only for these. Even though we validated our approach on datasets collected from a Turkish company, neither our model nor the metrics thresholds are dependent on this company. In this analysis, a module is a method because procedural programming was used.
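The labeling rule described above (a cluster is fault-prone if at least one component of its mean vector exceeds the corresponding threshold) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `label_clusters`, the sample metric vectors, the cluster assignments, and the threshold values are all hypothetical, and the cluster assignments are assumed to come from any clustering step.

```python
from statistics import mean


def label_clusters(modules, assignments, thresholds):
    """Map each cluster id to True (fault-prone) or False (not fault-prone).

    modules     -- list of metric vectors, one per module
    assignments -- cluster id of each module, as produced by any clusterer
    thresholds  -- one threshold per metric, same order as the vectors
    """
    # Group the metric vectors by cluster id.
    clusters = {}
    for metrics, cluster_id in zip(modules, assignments):
        clusters.setdefault(cluster_id, []).append(metrics)

    labels = {}
    for cluster_id, rows in clusters.items():
        # Mean vector of the cluster: average each metric column.
        mean_vector = [mean(column) for column in zip(*rows)]
        # Fault-prone if at least one mean exceeds its threshold.
        labels[cluster_id] = any(
            m > t for m, t in zip(mean_vector, thresholds)
        )
    return labels


# Hypothetical metric vectors in the order [LoC, CC, UOp, UOpd, TOp, TOpd].
modules = [
    [20, 2, 8, 10, 30, 25],
    [30, 4, 10, 14, 40, 35],
    [120, 15, 30, 50, 200, 150],
    [90, 12, 28, 45, 160, 120],
]
assignments = [0, 0, 1, 1]                # hypothetical clustering result
thresholds = [60, 10, 20, 40, 100, 80]    # illustrative values only

labels = label_clusters(modules, assignments, thresholds)
# Every module inherits the label of its cluster.
predictions = [labels[c] for c in assignments]
```

Because every module inherits the label of its cluster, a single extreme module can pull a cluster mean over a threshold and flag all of its neighbours, which is the sensitivity to noisy instances that the paper discusses under external validity.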
The results of this study show that the application of the X-means clustering method with metrics thresholds provides better performance than the pure thresholds and fuzzy clustering-based approaches. Six method-level metrics, including the primitive Halstead and McCabe metrics, were used for the development of this model. The metrics used in our experiments are lines of code, cyclomatic complexity, unique operator, unique operand, total operand, and total operator. The threshold vector [LoC, CC, UOp, UOpd, TOp, TOpd] was chosen as [65, 10, 25, 40, 25, 70]. We started the analysis with the metrics thresholds proposed by Integrated Software Metrics, Inc. (ISM). Later, the values were calibrated according to our experiments in order to achieve high-performance prediction models. We used the same threshold values as in our previous study [7].

The rest of the paper is organized as follows. Section 2 presents related work and Section 3 introduces clustering methods. Section 4 presents an empirical case study using real-world data from embedded controller software developed in Turkey. Section 5 presents the conclusion and future work.

II. RELATED WORK

There are a few software fault prediction studies which do not use prior fault data for modeling. Zhong et al. [6] used K-means and Neural-Gas algorithms to cluster modules, and an expert explored several statistical data within each cluster to label each cluster as fault-prone or not fault-prone. However, this approach depends on the capability of the expert, who should be specialized in the machine learning and software engineering areas. Furthermore, the selection of the cluster number, K, is done heuristically when the K-means clustering method is chosen, and this process can affect the model's performance drastically.

Seliya et al. [8] proposed a constraint-based semi-supervised clustering scheme that uses the K-means clustering method as the underlying clustering algorithm for this problem. They showed that this approach works better than their previous unsupervised learning based prediction approach. However, the selection of the cluster number is still a critical issue in this model, and their approach uses an expert's domain knowledge to iteratively label clusters as fault-prone or not. Therefore, this model also depends on the capability of the expert.

Catal et al. [7] proposed a clustering and metrics thresholds based software fault prediction approach and explored it on three datasets. The main contribution of their paper is the use of metrics thresholds with or without clustering methods and the removal of the need for expert assistance. However, the selection of the cluster number is done heuristically in this clustering based model too. In this study, we use the X-means clustering method, and our model does not require the selection of the cluster number. Instead of an exact cluster number, an interval is provided to the X-means algorithm.

III. CLUSTERING METHODS

A. Clustering

Clustering is an unsupervised learning approach. It belongs to the indirect data mining group, whereas classification belongs to the direct data mining group. While classification uses class labels for training, clustering does not use class labels and tries to discover relationships between the features [9]. Clustering methods can be used to group modules having similar metrics by using similarity measures or distances.
After the clustering phase, the mean values of each metric within a cluster can be checked against industrial metrics thresholds. If the limits are exceeded, the cluster can be labeled as fault-prone. Cluster analysis has four basic steps [10]:

- Feature Selection: We used 6 method-level metrics, including the primitive Halstead and McCabe metrics, because we know the thresholds of these metrics.
- Clustering Algorithm Selection: The X-means clustering algorithm was selected because it does not require the selection of the cluster number, K, prior to execution of the algorithm.
- Cluster Validation: Any clustering algorithm can generate several clusters, but they may not reflect the patterns that exist in the dataset. Therefore, evaluation parameters are required to judge the effectiveness of the algorithm. After the clustering phase, the mean vector of each cluster is checked against the metrics thresholds vector. Evaluation parameters are used after this phase, and they evaluate the overall performance of our approach. False positive rate (FPR), false negative rate (FNR), and the error values were calculated by using the confusion matrix during our experiments.
- Results Interpretation: We compared our model's performance with the pure thresholds based approach. Because the performance of our two-phase model improves, we suggest this approach. The overall interpretation was based on this comparison.
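The validation step above relies on FPR, FNR, and error computed from a confusion matrix. A minimal sketch of that computation, assuming boolean labels where True means faulty (actual) or fault-prone (predicted), is given below. The function name `evaluate` and the sample label lists are hypothetical; the formulas are the ones the paper defines in its Evaluation Parameters section, expressed as percentages.

```python
def evaluate(actual, predicted):
    """Return (FPR, FNR, error) as percentages from boolean label lists."""
    pairs = list(zip(actual, predicted))
    # Count the four confusion-matrix cells.
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)

    # FPR: share of non-faulty modules labeled fault-prone.
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    # FNR: share of faulty modules labeled not fault-prone.
    fnr = 100.0 * fn / (fn + tp) if (fn + tp) else 0.0
    # Error: share of all modules that were mislabeled.
    error = 100.0 * (fp + fn) / len(pairs)
    return fpr, fnr, error


# Hypothetical labels for five modules.
actual = [True, True, False, False, False]
predicted = [True, False, True, False, False]
fpr, fnr, error = evaluate(actual, predicted)
```

The class labels are used here only for scoring; in the unsupervised setting of this paper they play no part in building the model.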
The classification of clustering algorithms is not easy, but a categorization was created by Berkhin [11], and we provide this list as follows:

- Hierarchical Methods
  o Agglomerative Algorithms
  o Divisive Algorithms
- Partitioning Methods
  o Relocation Algorithms
  o Probabilistic Clustering
  o K-medoids Methods
  o K-means Methods
  o Density-Based Algorithms
    - Connectivity Clustering
    - Density Functions Clustering
- Grid-Based Methods
- Methods using Co-occurrence of Categorical Data
- Constraint-based Clustering
- Clustering Algorithms used in Machine Learning
  o Gradient Descent and Neural Networks
  o Evolutionary Methods
- Scalable Clustering Algorithms
- Algorithms for High Dimensional Data
  o Subspace Clustering
  o Projection Techniques
  o Co-clustering Techniques

These groups may overlap, and other researchers may create different categorizations. Another categorization is shown as follows [9]:

- Fuzzy clustering
- Hard clustering
  o Partitional
    - K-means and derivatives
    - Locality-sensitive hashing
    - Graph-theoretic methods
  o Hierarchical
    - Divisive
    - Agglomerative
      . Graph methods
      . Geometric methods

X-means is under the K-means and derivatives group.

B. Clustering Algorithms Used in Experiments

K-means: One of the simplest clustering algorithms is the K-means clustering method. The pseudocode of this algorithm is given as follows [9]:

    Require: Dataset D, number of clusters k, dimension d
    {C_i is the ith cluster}
    {1. Initialization Phase}
    (C_1, C_2, ..., C_k) = initial partition of D
    {2. Iteration Phase}
    repeat
        d_ij = distance between case i and cluster j
        n_i = arg min_j d_ij
        Assign case i to cluster n_i
        Recompute the cluster means of any changed clusters
    until no further changes of cluster membership occur
    Output results

In the initialization phase, clusters are initialized with random instances, and in the iteration phase, instances are assigned to clusters according to the distances computed between the centroid of the cluster and the instance. This iteration phase continues until no changes occur in the clusters.

X-means: One drawback of the K-means algorithm is the selection of the number of clusters, k, as an input parameter. Pelleg and Moore [12] developed an algorithm to solve this problem and used the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure for optimization [9]. Rather than choosing a specific number of clusters, k, X-means needs $k_{min}$ and $k_{max}$ values. The algorithm starts with the $k_{min}$ value and adds centroids if needed. The BIC, or Schwarz criterion, is applied to split some centroids into two, and hence new centroids are created [9]. The final centroid set is the one that has the best score. Given n objects in a dataset $D = \{x_1, x_2, \ldots, x_n\}$ in a d-dimensional space and a set of alternative models $M_j = \{C_1, C_2, \ldots, C_k\}$, the scoring of these alternative models, identified with different k values, is done by using the posterior probabilities $P(M_j \mid D)$ [9]. The Schwarz criterion is shown in Equation 1.

$$\mathrm{BIC}(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log n \tag{1}$$

$\hat{l}_j(D)$ is the loglikelihood of the jth model, and the number of parameters of $M_j$ is represented by $p_j$. The largest score reflects the true model, and it is selected as the final model [9]. The maximum likelihood estimate of the variance is calculated using Equation 2 under the identical spherical Gaussian distribution, where $\mu_{(i)}$ is the centroid which is closest to the object $x_i$ [9].

$$\hat{\sigma}^2 = \frac{1}{n-k} \sum_{i=1}^{n} \lVert x_i - \mu_{(i)} \rVert^2 \tag{2}$$

The point probabilities are calculated using Equation 3 [9].
$$\hat{P}(x_i) = \frac{|C_{(i)}|}{n} \cdot \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{d}} \exp\left(-\frac{1}{2\hat{\sigma}^2} \lVert x_i - \mu_{(i)} \rVert^2\right) \tag{3}$$

Here $|C_{(i)}|$ is the number of objects in the cluster whose centroid is $\mu_{(i)}$. The loglikelihood of the data is calculated using Equation 4.

$$l(D) = \log \prod_{i=1}^{n} \hat{P}(x_i) = \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{d}} - \frac{1}{2\hat{\sigma}^2} \lVert x_i - \mu_{(i)} \rVert^2 + \log \frac{|C_{(i)}|}{n} \right) \tag{4}$$

The Schwarz criterion is used in X-means globally to choose the best model it encounters and locally to guide all centroid splits [9].

Fuzzy C-means: The fuzzy c-means clustering method was developed by Bezdek [13]. In this algorithm, each instance can belong to every cluster with a different membership grade between 0 and 1. A dissimilarity function, shown in Equation 6, is minimized, and the centroids which minimize this function are identified. The general steps of this algorithm are as follows [14], [15]:

I. Initialize the membership function randomly according to Equation 5.
II. Calculate the centroids according to Equation 7.
III. Calculate the dissimilarity value according to Equation 6. Stop if the improvement compared to the previous iteration is below a threshold level.
IV. Calculate a new u according to Equation 8. Go to step II.

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n \tag{5}$$

$$J(U, c_1, c_2, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \tag{6}$$

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{7}$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}} \tag{8}$$

$c_i$ is the centroid of the ith cluster, $u_{ij}$ is between 0 and 1, $d_{ij}$ is the Euclidean distance between the ith centroid and the jth data point, and m is a weighting exponent which is between 1 and $\infty$ [14]. Because the first step of the algorithm uses random assignments, it may not converge to an optimal solution, and the performance depends on the initial centroids [14]. One approach to solve this problem is to use a defined procedure to identify the initial centroids, such as calculating the means of all data points [14].

We used the Fuzzy C-means clustering implementation available in MATLAB. However, the X-means implementation was taken from the WEKA open source machine learning tool. A K-means implementation exists in both MATLAB and WEKA, but we used the MATLAB implementation because we had developed some MATLAB programs to evaluate the overall performance of these algorithms. Evaluation parameters are introduced in the next section.

IV. EMPIRICAL CASE STUDY

A. Evaluation Parameters

Up to now, different evaluation parameters have been used for imbalanced datasets, specifically for the software quality classification problem. Some of these parameters are as follows:

- Area under ROC Curve (AUC) [16]
- PD (probability of detection), PF (probability of false alarm), balance [1]
- G-mean1, G-mean2, F-measure [17]
- Sensitivity, specificity, J-coefficient [18]
- Correctness, completeness [19]
- FPR (false positive rate), FNR (false negative rate), error [20]

In this study, we used the FPR, FNR, and error parameters to evaluate the models we developed. Error is the percentage of mislabeled modules, false positive rate (FPR) is the percentage of not faulty modules labeled as fault-prone by the model, and false negative rate (FNR) is the percentage of faulty modules labeled as not fault-prone [6]. The confusion matrix used to calculate the evaluation parameters is shown in Table 1. Equations 9, 10, and 11 calculate the FPR, FNR, and error values respectively.

Table 1. Confusion matrix

                     Predicted YES          Predicted NO
    Actual YES       True-Positive (TP)     False-Negative (FN)
    Actual NO        False-Positive (FP)    True-Negative (TN)

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{9}$$

$$\mathrm{FNR} = \frac{FN}{FN + TP} \tag{10}$$

$$\mathrm{Error} = \frac{FN + FP}{TP + FP + FN + TN} \tag{11}$$

All of these evaluation parameters must be minimized, but there is a trade-off between the FPR and FNR values. The FNR value is much more critical than the FPR value because a high FNR value means that a large number of fault-prone modules cannot be detected prior to system testing or operation.

B. Results and Analysis

We used four different types of unsupervised software fault predictors, three of which are based on clustering methods. Because the K-means and fuzzy c-means clustering methods require the selection of the number of clusters, we first applied the X-means clustering method to the three datasets and calculated the k value for each of them. As explained in the previous section, the X-means algorithm requires an interval to calculate the best k value. We chose the minimum k value as 2 and the maximum k value as the number of data points in that dataset. The X-means algorithm identified k as 2 for AR5 and as 3 for the AR3 and AR4 datasets. Experimental results are shown in Table 2.
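The Schwarz criterion that X-means uses to pick k within the interval (Equations 1, 2, and 4 in Section III) can be sketched for a fixed partition as follows. This is a simplified illustration, not the WEKA implementation used in the paper and not the full X-means splitting procedure: it only scores a given set of centroids and assignments. The function name `bic_score` and the sample data are hypothetical, and the parameter count $p_j = (k - 1) + d \cdot k + 1$ is an assumption taken from Pelleg and Moore's formulation, since the paper does not spell it out.

```python
import math


def bic_score(points, centroids, assignments):
    """Schwarz criterion (BIC) of one partition; a larger score is better.

    Requires more points than clusters and a non-zero pooled variance.
    """
    n = len(points)
    d = len(points[0])
    k = len(centroids)

    # Squared Euclidean distance of every point to its own centroid.
    sq_dists = [
        sum((x - c) ** 2 for x, c in zip(p, centroids[a]))
        for p, a in zip(points, assignments)
    ]
    # Pooled variance estimate under identical spherical Gaussians.
    variance = sum(sq_dists) / (n - k)
    sizes = [assignments.count(j) for j in range(k)]

    # Loglikelihood of the data, summed point by point.
    loglik = 0.0
    for sq, a in zip(sq_dists, assignments):
        loglik += (
            -0.5 * math.log(2 * math.pi)
            - 0.5 * d * math.log(variance)
            - sq / (2 * variance)
            + math.log(sizes[a] / n)
        )

    # Assumed parameter count: k-1 class priors, d*k centroid
    # coordinates, and one shared variance.
    p_j = (k - 1) + d * k + 1
    return loglik - p_j / 2 * math.log(n)


# Hypothetical one-dimensional data with two well separated groups.
points = [[0.0], [0.2], [-0.2], [0.1], [-0.1],
          [10.0], [10.2], [9.8], [10.1], [9.9]]

bic_one = bic_score(points, [[5.0]], [0] * 10)
bic_two = bic_score(points, [[0.0], [10.0]], [0] * 5 + [1] * 5)
```

On this toy data the two-cluster partition receives the higher score, which is the behaviour that lets X-means prefer a split only when the gain in likelihood outweighs the penalty for extra parameters.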
In order to evaluate the performance of our fully automated approach, which is based on the X-means clustering method, we compared it with our metrics thresholds based approach [7]. Table 2 shows that the FPR values decreased for the AR3 (from 43.63 to 34.55) and AR5 (from 32.4 to 4.29) datasets when the X-means based approach is used. Even though the FPR value increased for the AR4 dataset, its FNR value decreased from 20 to 5. As explained in the Evaluation Parameters section, the FNR value is much more critical for our models.

While our pure metrics thresholds based approach (Threshold) detects fault-prone modules according to the metrics thresholds, the X-means based approach first calculates the best k value, divides the data points into k clusters, and then checks the mean vector of each cluster against the metrics thresholds vector. The same approach is used for the fuzzy c-means and K-means based fault predictors. Therefore, our clustering based approaches have two stages, and their second stage is similar to the pure metrics thresholds based approach. According to our pure metrics thresholds based approach, a module is predicted as fault-prone if at least one metric of the module is higher than the specified threshold value of that metric. According to our clustering based approaches, a cluster is predicted as fault-prone if at least one metric of the mean vector is higher than the specified threshold value of that metric. The datasets include class labels, but we ignored this column for modeling because our purpose was to develop models for software fault prediction without prior fault data. Class labels were used only to calculate the evaluation parameters.

Table 2. Experimental results on three datasets

    Data  Prm.        Threshold  X-means  Fuzzy c  K-means
    AR3   FPR         43.63      34.55    2.73     34.27
          FNR
          Error       4.27       33.33    4.29     33.09
          # cluster   N/A
    AR5   FPR         32.4       4.29     4.29     4.28
          FNR         2.5        2.5      2.5      2.5
          Error       27.77      3.89     3.89     3.88
          # cluster   N/A
    AR4   FPR         35         44.83    4.6      4.6
          FNR
          Error       32.7       37.38    2.5      2.4
          # cluster   N/A

Because the fuzzy c-means and K-means clustering based approaches did not improve the performance compared to the X-means based approach when the same k values are used, we suggest using the X-means based prediction model. If previous fault data exist for a project, supervised learning algorithms such as Naive Bayes and Random Forests can normally be applied. However, our research focus was to build fault prediction models that can be used when the fault labels for modules are unavailable. Experiments reveal that unsupervised software fault prediction can be fully automated and effective results can be produced using X-means clustering with software metrics thresholds.

C. External Validity

In order to generalize the results of an empirical study outside the experimental setting, threats to the external validity should be discussed. The datasets used in this study were collected from an industrial environment in Turkey; the systems were developed by professional developer groups, and these systems are real industry projects. These features satisfy the requirements explained in Khoshgoftaar et al.'s study [21]. However, the development practices and project domain of this
Turkish software company may be different from those of other software companies that plan to use this prediction model. The software development model of this company is not process-oriented as in NASA, and projects were developed by centrally-controlled, top-down management teams. Therefore, open source projects are different from the kinds of projects used here. Another important point for our models is the effect of noisy instances. Because we calculate the mean vector of each cluster, noisy instances can change the mean vector drastically, and hence the performance of our models may be affected negatively. Because the projects used in these experiments are middle-sized ones and the dataset collection process was done carefully, we assume that there are no noisy instances. However, prediction on a dataset consisting of many noisy instances may not provide acceptable results.

V. CONCLUSION AND FUTURE WORK

This study proposed an unsupervised software fault prediction approach. Experiments revealed that unsupervised software fault prediction can be fully automated and effective results can be produced by using X-means clustering with software metrics thresholds. The main contribution of this paper is the development of an automated way of assigning fault-proneness labels to the modules and the removal of subjective expert opinion. There is no heuristic step in our model, as is needed in K-means clustering based fault prediction approaches. We studied three public datasets which are located in the PROMISE repository. The results are promising, and our model can be used when there is no prior fault data. Our metrics thresholds vector was created by using the thresholds proposed by Integrated Software Metrics, Inc. (ISM), and these threshold values were calibrated in our previous study [7]. Even though ISM focused on NASA datasets to calculate these threshold values, we could apply similar threshold values to our datasets, collected from a Turkish white-goods manufacturer developing embedded controller software.
Our models are not dependent on this vector, and each company can identify its own threshold vector with different approaches. Future work will consider evaluating our model on datasets which have noisy instances, such as the JM1 dataset in the PROMISE repository. Either a pre-processing step is necessary to remove noisy instances before our prediction model is applied, or we need to develop a new unsupervised fault prediction model which is insensitive to noisy instances.

REFERENCES

[1] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 32, 2007.
[2] N. Seliya, T. M. Khoshgoftaar, "Software quality estimation with limited fault data: a semi-supervised learning perspective," Software Quality Journal, vol. 5, no. 3, 2007.
[3] C. Catal, B. Diri, "Investigating the effect of dataset size, metrics set, and feature selection techniques on software fault prediction problem," Information Sciences, vol. 79, no. 8.
[4] N. Seliya, "Software quality analysis with limited prior knowledge of faults," Graduate Seminar, Wayne State University, Department of Computer Science, 2006, Webpage: u_talk.ppt
[5] C. Catal, B. Diri, "A systematic review of software fault prediction studies," Expert Systems with Applications, vol. 36, no. 4.
[6] S. Zhong, T. M. Khoshgoftaar, and N. Seliya, "Unsupervised learning for expert-based software quality estimation," Proc. of the 8th Intl. Symp. on High Assurance Systems Eng., Tampa, FL, 2004.
[7] C. Catal, U. Sevim, B. Diri, "Clustering and metrics thresholds based software fault prediction of unlabeled program modules," 6th Int'l. Conference on Information Technology: New Generations, IEEE Computer Society, Las Vegas, Nevada.
[8] N. Seliya, T. M. Khoshgoftaar, "Software quality analysis of unlabeled program modules with semi-supervised clustering," IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems and Humans, vol. 37, no. 2, 2007.
[9] G. Gan, C. Ma, J. Wu, Data clustering: theory, algorithms, and applications, Society for Industrial and Applied Mathematics, Philadelphia.
[10] R. Xu, D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 6, no. 3, 2005.
[11] P. Berkhin, "Survey of clustering data mining techniques," Technical Report, Accrue Software, San Jose, California, 2002.
[12] D. Pelleg, A. Moore, "X-means: extending k-means with efficient estimation of the number of clusters," Proceedings of the 7th International Conference on Machine Learning, 2000, Stanford University, Stanford, CA, USA.
[13] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Plenum Press, New York, 1981.
[14] S. Albayrak, F. Amasyalı, "Fuzzy c-means clustering on medical diagnostic systems," International 2. Turkish Symposium on Artificial Intelligence and Neural Networks, Turkey.
[15] J. S. R. Jang, C. T. Sun, E. Mizutani, Neuro-fuzzy and soft computing, Prentice Hall, 1997.
[16] J. Van Hulse, T. M. Khoshgoftaar, A. Napolitano, "Experimental perspectives on learning from imbalanced data," 24th Int'l. Conference on Machine Learning, Corvallis, Oregon.
[17] Y. Ma, L. Guo, B. Cukic, "A statistical framework for the prediction of fault-proneness," Advances in Machine Learning Application in Software Engineering, Idea Group Inc.
[18] K. El-Emam, W. Melo, J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, 2001.
[19] Y. Zhou, H. Leung, "Empirical analysis of object-oriented design metrics for predicting high and low severity faults," IEEE Transactions on Software Eng., vol. 32, no. 0.
[20] S. Zhong, T. M. Khoshgoftaar, and N. Seliya, "Analyzing software measurement data with clustering techniques," IEEE Intelligent Systems, vol. 9, no. 2.
[21] T. M. Khoshgoftaar, N. Seliya, N. Sundaresh, "An empirical study of predicting software faults with case-based reasoning," Software Quality Journal, vol. 4, no. 2, pp. 85-111, 2006.
