Probability Base Classification Technique: A Preliminary Study for Two Groups

Mathematical Theory and Modeling, ISSN 2224-5804 (Paper), ISSN 2225-0522 (Online), www.iiste.org

Friday Zinzendoff Okwonu (1,2)*, Abdul Rahman Othman (2)
1. Department of Mathematics and Computer Science, Delta State University, P.M.B. 1, Abraka, Nigeria.
2. Center for Mathematical Sciences, School of Distance Education, Universiti Sains Malaysia, 11800, Penang, Malaysia
* E-mail: fzokwonu_delsu@yahoo.com

Abstract
The conventional Fisher linear classification technique for the two-group problem is developed strictly from the within-group sample mean vectors and the within-group sample variance-covariance matrices. A comparable classification procedure that incorporates the within-group probabilities is considered. The conventional procedure based on Fisher's technique assumes that the within-group probabilities are equal, so its computation omits them. The new approach modifies the coefficient of Fisher's technique by applying the within-group probability of each group. The classification performance of these techniques is investigated on generated contaminated normal data sets, using homoscedastic and heteroscedastic variance-covariance matrices for various sample sizes and dimensions. The procedures are compared by setting the mean probabilities of correct classification obtained from the contaminated data set against the mean of the optimal probability computed from the uncontaminated data set. The comparison revealed that both techniques perform comparably. However, the Monte Carlo simulation indicates that, as the proportion of contamination increases, the probability base approach performs better for homoscedastic covariance matrices, whereas Fisher's technique outperforms the probability base procedure for heteroscedastic covariance matrices.
The comparative analysis indicates that the probability base approach performs comparably with the conventional procedure. This implies that classification problems can be solved by incorporating the respective within-group probabilities into the classification model.

Keywords: Classification, Homoscedastic and Heteroscedastic Covariance Matrices, Mean Probability.

1. Introduction
Conventionally, the linear classification problem for two groups is solved using the Fisher Linear Classification Analysis. This procedure depends strictly on the within-group sample mean vectors and the within-group sample variance-covariance matrices. Fisher's technique assumes that the data set is multivariate normal and that the variance-covariance matrices are homoscedastic. The sample mean vectors and sample covariance matrices are unstable because they are easily influenced by influential observations (Maronna et al. 2006; Munoz-Pichardo et al. 2011). Sajobi et al. (2012) proposed to robustify the sample mean vectors and the covariance matrices by replacing the maximum likelihood estimates with estimators computed by coordinate-wise trimming. Hubert et al. (2010) proposed a permutation invariant technique, a deterministic algorithm for the minimum covariance determinant procedure. This procedure uses a deterministic method rather than random subsets to robustify the sample mean and covariance matrix. Bouveyron & Brunet (2012) proposed a robust and flexible Fisher linear discriminant analysis based on a probabilistic concept that relaxes the equal covariance assumption. This technique does not incorporate the within-group probabilities in computing the classification coefficient. This paper considers a modification of Fisher's technique that introduces the within-group probabilities into the separation parameter w. The new procedure solves classification problems for two groups by incorporating the information that the within-group probabilities provide, with the aim of obtaining the maximum correct classification rate.
This procedure adheres strictly to the homoscedastic assumption of the covariance matrices. The performance of these methods is investigated for contaminated normal data sets with equal and unequal variance-covariance matrices. The methodology section contains the Fisher linear classification analysis followed by the probability base classification technique. Simulation results are contained in the results section, followed by the discussion and the conclusions.

2. Method
The method section consists of the Fisher linear classification analysis and the probability base classification technique. Both procedures are applied to the two-group classification problem.

2.1 Fisher Linear Classification Analysis
The two-group linear classification technique based on Fisher's technique assumes that the within-group probabilities and the misclassification costs are equal, so its classification rule omits the probabilities of each group, that is:

$$\xi - \bar{\xi} \ge \ln\left[\frac{c(1|2)\,p_2}{c(2|1)\,p_1}\right] = 0 \tag{1}$$

$$\xi - \bar{\xi} < \ln\left[\frac{c(1|2)\,p_2}{c(2|1)\,p_1}\right] = 0 \tag{2}$$

where $c(i|j)$ is the cost of assigning an observation from group $j$ to group $i$ and $p_i$ is the within-group probability of group $i$. As observed in the literature, Fisher's technique performs optimally if the data set is drawn from the multivariate normal distribution and if the variance-covariance matrices are equal. When the classification coefficient is inconsistent, the misclassification rate tends to increase. The within-group mean vectors, the variance-covariance matrices and the pooled common covariance matrix are defined as follows:

$$\bar{x}_i = \sum_{j=1}^{N_i} x_{ij} / N_i, \quad i = 1, 2 \tag{3}$$

$$S_i = \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^{\top} / (N_i - 1) \tag{4}$$

$$S_{pooled} = \frac{\sum_{i=1}^{2} (N_i - 1) S_i}{\sum_{i=1}^{2} N_i - 2} \tag{5}$$

Equations (3-5) are applied to develop Equations (1-2). Based on the equality assumptions in Equations (1-2), Fisher's procedure reduces to:

$$\xi \ge \bar{\xi} \tag{6}$$

$$\xi < \bar{\xi} \tag{7}$$

where $\xi = (\bar{x}_1 - \bar{x}_2)^{\top} S_{pooled}^{-1} x = q^{\top} x$ is the classification score and $\bar{\xi} = q^{\top} (\bar{x}_1 + \bar{x}_2)/2$ is the cutoff point. Equations (6-7) define Fisher's classification rule: an observation is allocated to group one if Equation (6) is satisfied and to group two if Equation (7) is satisfied.

2.2 Probability Base Classification Technique (PCT)
This section describes a classification procedure that includes the within-group probabilities in the classification coefficient. Based on Equation (3), the difference of the within-group mean vectors for the two groups is $d = \bar{x}_1 - \bar{x}_2$ and the sum of the within-group mean vectors is $\hat{d} = \bar{x}_1 + \bar{x}_2$. To formulate the coefficient of the new procedure, the following are obtained:

= d, d% = +, β= d /d, % ε= β. (8)

Based on the definitions in Equation (8), the following is obtained:

w e β β = + e /ε + p (9)

where $p_i = N_i / N$ is the within-group probability, $N_i$ is the sample size of group $i$, $N$ is the total sample size for the two groups and $p = \sum p_i$ is the total probability. The classification model is given as:

$$z = w^{\top} S_{pooled}^{-1} x = u^{\top} x, \quad u = S_{pooled}^{-1} w \tag{10}$$

The classification cutoff point is given as follows:

$$\bar{z} = u^{\top} \hat{d} / 2 \tag{11}$$

The classification rule is defined as:

$$z < \bar{z} \tag{12}$$

An observation is assigned to group one if Equation (12) is satisfied; otherwise it is classified to group two, that is, if the following holds:

$$z \ge \bar{z} \tag{13}$$

3. Result
The Monte Carlo simulation is designed to investigate the comparative classification performance of the above techniques for unequal and equal variance-covariance matrices based on contaminated normal data sets. The contaminated normal model used in this study for the respective groups is given as:

$$(1 - \varepsilon)\, N_{dp}(0, 1) + \varepsilon\, N_{dp}(\mu, \sigma^2 I_{dp}) \tag{14}$$

This model requires that the majority of the data come from the normal distribution while the rest come from the contaminating distribution (Cont. Dist.). In each case, the data set is randomly reshuffled and divided into two categories: a training set (60%) and a validation set (40%). To determine the performance of each procedure, the mean of the optimal probability (Opt.) is used as the performance benchmark. The comparative analyses are based on the comparison between the mean of the optimal probability computed from the uncontaminated normal data set and the mean probabilities of correct classification obtained from each technique. In the respective figures, the straight line is the performance benchmark. Figure 1 and Figure 2 show that Fisher's technique performed better than the probability base approach as the proportion of contamination increased for the unequal variance-covariance matrices. Figure 3 shows that the probability base approach performed better than Fisher's technique for the equal variance-covariance matrices, and the two performed comparably in Figure 4. Tables 1 and 2 report the performance of these techniques for heteroscedastic matrices, while Tables 3 and 4 report the performance of both techniques for homoscedastic matrices. The best procedure appears in bold.
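The simulation design described above (a contaminated normal mixture followed by a random 60%/40% split) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that each observation is drawn from the contaminating component independently with probability eps, written here as a fraction, and that both components have independent coordinates. The function names and the values n=30, d=2, eps=0.10, mu=3.0 and sigma=2.0 are hypothetical and chosen only for illustration.

```python
import random


def contaminated_sample(n, d, eps, mu, sigma, rng):
    """Draw n points in d dimensions from (1 - eps) N(0, I) + eps N(mu, sigma^2 I).

    Assumption: each point is contaminated independently with probability eps.
    """
    sample = []
    for _ in range(n):
        if rng.random() < eps:
            # Contaminating component: mean mu and standard deviation sigma per coordinate.
            point = [rng.gauss(mu, sigma) for _ in range(d)]
        else:
            # Clean component: standard normal coordinates.
            point = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sample.append(point)
    return sample


def train_validation_split(data, train_fraction, rng):
    """Shuffle a copy of the data and split it into training and validation parts."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)  # fixed seed so the run is repeatable
group = contaminated_sample(n=30, d=2, eps=0.10, mu=3.0, sigma=2.0, rng=rng)
train, valid = train_validation_split(group, 0.6, rng)
print(len(train), len(valid))  # 18 12
```

In a full Monte Carlo study this generation and split would be repeated many times for each group, and the correct classification rates on the validation sets would be averaged.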
The analysis reveals that Fisher's technique and the PCT are comparable in all cases investigated.

Figure 1. Effect of contamination on the mean probability of correct classification (vertical axis: mean probabilities of correct classification; horizontal axis: proportion of contamination; curves: mean of the optimal probability, Fisher's technique and PCT).
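For reference, Fisher's rule from Section 2.1 (Equations 3-7) and the proportion of correctly classified observations plotted in the figures can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation. The helper names (`mean_vector`, `covariance`, `solve`, `fit_fisher`, `classify`, `correct_rate`) and the toy data are hypothetical. The PCT coefficient is not sketched because it depends on Equations (8-9).

```python
def mean_vector(rows):
    """Equation (3): within-group sample mean vector."""
    n, d = len(rows), len(rows[0])
    return [sum(r[k] for r in rows) / n for k in range(d)]


def covariance(rows, mean):
    """Equation (4): within-group sample covariance matrix (divisor N_i - 1)."""
    n, d = len(rows), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        dev = [r[k] - mean[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b] / (n - 1)
    return cov


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit_fisher(g1, g2):
    """Return the coefficient q = S_pooled^{-1}(mean1 - mean2) and the cutoff point."""
    m1, m2 = mean_vector(g1), mean_vector(g2)
    s1, s2 = covariance(g1, m1), covariance(g2, m2)
    n1, n2, d = len(g1), len(g2), len(m1)
    # Equation (5): pooled covariance matrix.
    pooled = [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
               for b in range(d)] for a in range(d)]
    q = solve(pooled, [m1[k] - m2[k] for k in range(d)])
    cutoff = sum(q[k] * (m1[k] + m2[k]) / 2 for k in range(d))
    return q, cutoff


def classify(x, q, cutoff):
    """Equations (6-7): group one if the score reaches the cutoff, else group two."""
    score = sum(qk * xk for qk, xk in zip(q, x))
    return 1 if score >= cutoff else 2


def correct_rate(g1, g2, q, cutoff):
    """Proportion of observations allocated to their own group."""
    hits = sum(classify(x, q, cutoff) == 1 for x in g1)
    hits += sum(classify(x, q, cutoff) == 2 for x in g2)
    return hits / (len(g1) + len(g2))


# Toy data (hypothetical): two well separated groups in two dimensions.
g1 = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
g2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q, cutoff = fit_fisher(g1, g2)
print(classify([2.4, 2.0], q, cutoff))  # 1
print(classify([0.2, 0.9], q, cutoff))  # 2
```

In the simulation setting, `fit_fisher` would be applied to the training set and `correct_rate` to the validation set, and the rate would then be averaged over the Monte Carlo replications.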

Table 1. Mean probability of correct classification and standard deviation (in brackets), Optimal = 0.8340

Cont. Dist. | N | d | ε | Fisher | PCT | OPT-Fisher | OPT-PCT
εN(3,0) | 30 | | 10 | 0.834 (0.0055) | 0.8338 (0.00) | 0.006 | 0.000
εN(3,0) | 30 | | 20 | 0.807 (0.0065) | 0.806 (0.040) | 0.068 | 0.034
εN(3,0) | 30 | | 30 | 0.745 (0.000) | 0.7393 (0.0096) | 0.095 | 0.0947

Fisher: Fisher linear classification analysis
PCT: Probability base classification technique
OPT-Fisher: Difference between the mean of the optimal probability and the mean probability of Fisher's technique
OPT-PCT: Difference between the mean of the optimal probability and the mean probability of PCT

Figure 2. Effect of contamination on the mean probability of correct classification (vertical axis: mean probabilities of correct classification; horizontal axis: proportion of contamination; curves: mean of the optimal probability, Fisher's technique and PCT).

Table 2. Mean probability of correct classification and standard deviation (in brackets), Optimal = 0.8749

Cont. Dist. | N | d | ε | Fisher | PCT | OPT-Fisher | OPT-PCT
εN3(4.5) | 60 | 3 | 10 | 0.8553 (0.0068) | 0.8506 (0.0033) | 0.096 | 0.044
εN3(4.5) | 60 | 3 | 20 | 0.84 (0.0084) | 0.7997 (0.0030) | 0.0608 | 0.075
εN3(4.5) | 60 | 3 | 30 | 0.7570 (0.0096) | 0.76 (0.0070) | 0.79 | 0.488

Fisher: Fisher linear classification analysis
PCT: Probability base classification technique
OPT-Fisher: Difference between the mean of the optimal probability and the mean probability of Fisher's technique
OPT-PCT: Difference between the mean of the optimal probability and the mean probability of PCT

Figure 3. Effect of contamination on the mean probability of correct classification (vertical axis: mean probabilities of correct classification; horizontal axis: proportion of contamination; curves: mean of the optimal probability, Fisher's technique and PCT).

Table 3. Mean probability of correct classification and standard deviation (in brackets), Optimal = 0.9099

Cont. Dist. | N | d | ε | Fisher | PCT | OPT-Fisher | OPT-PCT
εN3(,9) | 30 | 3 | 10 | 0.8967 (0.006) | 0.9009 (0.0063) | 0.03 | 0.009
εN3(,9) | 30 | 3 | 20 | 0.8774 (0.000) | 0.879 (0.08) | 0.035 | 0.0308
εN3(,9) | 30 | 3 | 30 | 0.839 (0.046) | 0.8406 (0.08) | 0.0707 | 0.0694

Fisher: Fisher linear classification analysis
PCT: Probability base classification technique
OPT-Fisher: Difference between the mean of the optimal probability and the mean probability of Fisher's technique
OPT-PCT: Difference between the mean of the optimal probability and the mean probability of PCT

Figure 4. Effect of contamination on the mean probability of correct classification (vertical axis: mean probabilities of correct classification; horizontal axis: proportion of contamination; curves: mean of the optimal probability, Fisher's technique and PCT).

Table 4. Mean probability of correct classification and standard deviation (in brackets), Optimal = 0.9484

Cont. Dist. | N | d | ε | Fisher | PCT | OPT-Fisher | OPT-PCT
εN5(4,6) | 00 | 5 | 10 | 0.9383 (0.0085) | 0.936 (0.0074) | 0.00 | 0.0
εN5(4,6) | 00 | 5 | 20 | 0.897 (0.007) | 0.890 (0.00) | 0.0567 | 0.0564
εN5(4,6) | 00 | 5 | 30 | 0.8379 (0.004) | 0.8438 (0.004) | 0.05 | 0.046

Fisher: Fisher linear classification analysis
PCT: Probability base classification technique
OPT-Fisher: Difference between the mean of the optimal probability and the mean probability of Fisher's technique
OPT-PCT: Difference between the mean of the optimal probability and the mean probability of PCT

4. Discussion
The conventional technique for solving classification problems, based on Fisher's technique, does not incorporate the within-group probabilities in Fisher's classification coefficient; see Equations (1-2). A comparable classification technique that incorporates the within-group probabilities in the classification coefficient was proposed. The classification performance of these techniques was investigated by violating the homoscedastic and multivariate normality assumptions. The Monte Carlo simulations are based on the following controlled variables: the mean vector shift, the variance shift, the sample size, the dimension and the proportion of contamination. The comparative classification performance in the figures and tables revealed that these techniques performed comparably. Both techniques utilize all the information gleaned from the data set. The probability base approach provides more information to the end user than the conventional technique.

5. Conclusion
A comparable classification technique based on a probability concept for the two-group problem was compared with the conventional Fisher linear classification procedure.
The new technique based on the within-group probabilities is suitable for two-group classification problems where the probabilities of the respective groups are given. The comparative analyses revealed that both techniques performed comparably.

Acknowledgement
This research work was funded through the short term grant of the Universiti Sains Malaysia, Penang, Malaysia.

References
Bouveyron, C. & Brunet, C. (2012), "Probabilistic Fisher Discriminant Analysis: A Robust and Flexible Alternative to Fisher Discriminant Analysis", Neurocomputing 90, 12-22.
Hubert, M., Rousseeuw, P. J. & Verdonck, T. (2010), "A Deterministic Algorithm for the MCD", citeseerx.ist.psu.edu/viewdoc/summary?, -6.
Maronna, R., Martin, R. D. & Yohai, V. J. (2006), "Robust Statistics: Theory and Methods", John Wiley, New York.
Munoz-Pichardo, J. M., Enguix-Gonzalez, A., Munoz-Garcia, J. & Moreno-Rebollo, J. L. (2011), "Influence Analysis on Discriminant Coordinates", Communications in Statistics-Simulation and Computation, 40(6), 793-807.
Sajobi, T. T., Lix, L. M., Dansu, B. M., Laverty, W. & Li, L. (2012), "Robust Descriptive Discriminant Analysis for Repeated Measures Data", Computational Statistics and Data Analysis, 56(9), 2782-2794.

This academic article was published by The International Institute for Science, Technology and Education (IISTE).