COMPARISON OF DIMENSIONALITY REDUCTION METHODS APPLIED TO ORDINAL DATA


The 7th International Days of Statistics and Economics, Prague, September 19-21, 2013

Martin Prokop, Hana Řezanková

Abstract

The paper deals with the comparison of data dimensionality reduction methods with emphasis on ordinal data. Categorical, and especially ordinal, data are frequently obtained from questionnaire surveys. A questionnaire usually includes a large number of questions (variables). For applications of multivariate statistical methods, it is useful to reduce the number of these questions and create new latent variables that represent groups of the original questions. Some dimensionality reduction methods are directly applicable to ordinal data (latent class models), while others must be modified (categorical principal component analysis). Other methods are based on a distance matrix, so it is possible to use an appropriate distance measure for ordinal data (multidimensional scaling). In this paper, dimensionality reduction methods are applied to real datasets including ordinal data in the form of Likert scales. Various techniques are used for comparing such methods; they aim to assess how well the data structure of the original space is preserved in the reduced space. In this paper, this goodness is evaluated by the Spearman rank correlation coefficient.

Key words: dimension reduction, principal component analysis, multidimensional scaling, latent class models

JEL Code: C3, C6, C8

Introduction

The aim of this paper is the comparison of dimensionality reduction methods for ordinal variables. The reduction methods described in the following chapter were applied to ordinal datasets including values in the form of Likert scales. Inter-object distances in the original and reduced space were evaluated. With respect to the ordinal character of the data, the Kendall correlation coefficient was used as a similarity measure. Further, we measured how well the structure and structural relationships of the data were preserved by dimension reduction. For this purpose, the Spearman rank correlation coefficient between inter-object distances in the original and reduced space was used.

In current research, a similar problem, but for a different kind of data (continuous) or for different kinds of methods (nonlinear), was solved in (Li, 1995). Various procedures were used for the comparison, e.g. scatterplots or the Spearman rank correlation coefficient of inter-object distances in the original and reduced space, Procrustes analysis, or measuring the generalization error of k-nearest neighbor classifiers trained on the resulting data representations (Maaten, 2008). Similar reduction methods applied to ordinal data and evaluated by fuzzy cluster analysis were discussed in (Sobíšek, 2011).

1 Dimensionality reduction methods

Basic methods of data dimensionality reduction are principal component analysis (PCA), factor analysis (FA) and multidimensional scaling (MDS). Classical FA methods assume linear relations among the original variables; the new latent variables are continuous and normally distributed. Conventional factor analysis is usually based on analysis of the correlation matrix; for more details see (Hebák et al., 2007). Common methods for identifying latent variables are latent class models. Various such methods are available in statistical software packages, e.g. basic LCA models, latent class cluster models (LCC), discrete factor analysis models (DFactor), latent trait analysis (LTA), latent profile analysis (LPA), latent class regression models (LCR), etc.

1.1 Categorical principal component analysis

Some methods are based on projecting a multidimensional space into a space of lower dimension. A basic method is principal component analysis. The aim is to find the real dimension of the data. To find the real dimensionality, the original dataset X is transformed to a new coordinate system by an orthogonal linear transformation. Let $F_s$ (resp. $G_s$) be the vector of the row (resp. column) coordinates on the axis of the s-th rank. These two vectors are related by the transition formulae; e.g. in the case of PCA they are

$$F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{k} x_{ik}\, m_k\, G_s(k), \qquad (1)$$

$$G_s(k) = \frac{1}{\sqrt{\lambda_s}} \sum_{i} x_{ik}\, p_i\, F_s(i), \qquad (2)$$

where $F_s(i)$ denotes the coordinate of the i-th object on the s-th axis, $G_s(k)$ denotes the coordinate of the k-th variable on the s-th axis, $\lambda_s$ the eigenvalue associated with the s-th axis, $m_k$ the weight associated with the k-th variable, $p_i$ the weight associated with the i-th object, and $x_{ik}$ the entry of X for the i-th object and the k-th variable.

Instead of conventional principal component analysis for quantitative variables, it is possible to use categorical principal component analysis (CATPCA), which transforms categorical variables into quantitative variables and does not assume linear relations among variables. For more details see (Lê, 2008).

1.2 Multidimensional scaling

According to (Holland, 2008), this method starts with a data matrix X consisting of N rows of objects and J columns of variables. From this, a symmetric matrix of all pairwise distances among objects is calculated with an appropriate distance measure, such as Euclidean distance, Manhattan distance (city block distance), or Bray distance. The MDS ordination is performed on this distance matrix. Next, a desired number m of dimensions is chosen for the ordination. Distances among objects in the starting configuration are calculated, typically with the Euclidean metric. These distances are regressed against the original distance matrix, and the predicted ordination distance for each pair of objects is calculated. A variety of regression methods can be used, including linear, polynomial, and non-parametric approaches. In any case, the regression is fitted by least squares. The goodness of fit of the regression is measured by the sum of squared differences between the ordination-based distances and the distances predicted by the regression. This goodness of fit is called stress and can be calculated in several ways, one of the most common being Kruskal's Stress

$$\text{Stress} = \sqrt{\frac{\sum_{h,i} \left(d_{hi} - \hat{d}_{hi}\right)^2}{\sum_{h,i} d_{hi}^2}}, \qquad (3)$$

where $d_{hi}$ is the ordinated distance between the h-th and i-th objects, and $\hat{d}_{hi}$ is the distance predicted from the regression.

The basic similarity measure of two quantitative variables is the Pearson correlation coefficient. To measure the similarity of ordinal variables, it is possible to use e.g. the Spearman or Kendall rank correlation coefficient or the symmetric Somers coefficient. For details see e.g. (Hendl, 2006).
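As a rough illustration of Equation (3), the following Python sketch computes Kruskal's Stress from two lists of pairwise distances. The function name `kruskal_stress` and the example distances are hypothetical and not taken from the paper; the regression step that produces the predicted distances is assumed to have been carried out already.

```python
import math


def kruskal_stress(ordination_dist, predicted_dist):
    """Kruskal's Stress (Equation 3) for paired lists of distances.

    ordination_dist -- distances d_hi between objects in the ordination
    predicted_dist  -- distances predicted by the regression for the same pairs
    """
    if len(ordination_dist) != len(predicted_dist):
        raise ValueError("both lists must cover the same object pairs")
    numerator = sum((d - d_hat) ** 2
                    for d, d_hat in zip(ordination_dist, predicted_dist))
    denominator = sum(d ** 2 for d in ordination_dist)
    if denominator == 0:
        raise ValueError("all ordination distances are zero")
    return math.sqrt(numerator / denominator)


# Hypothetical distances for two object pairs (illustration only).
stress_example = kruskal_stress([3.0, 4.0], [3.0, 3.0])
# A perfect fit: predicted distances equal the ordination distances.
stress_perfect = kruskal_stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

A stress of 0 means the ordination distances are reproduced exactly by the regression; larger values indicate a poorer fit.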

1.3 Latent class models

According to (Linzer, 2011), the basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The latent class regression model further enables us to estimate the effects of covariates on predicting latent class membership. The estimation algorithm uses the expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the model parameters.

We observe J polytomous categorical variables (the manifest variables), each of which contains $K_j$ possible outcomes, for objects i = 1, ..., N. The manifest variables may have different numbers of outcomes, hence the indexing by j. We denote as $Y_{ijk}$ the observed values of the J manifest variables such that $Y_{ijk} = 1$ if respondent i gives the k-th response to the j-th variable, and $Y_{ijk} = 0$ otherwise, where j = 1, ..., J and k = 1, ..., $K_j$.

The latent class model approximates the observed joint distribution of the manifest variables as the weighted sum of a finite number R of constituent cross-classification tables. Let $\pi_{jrk}$ denote the class-conditional probability that an object in class r = 1, ..., R produces the k-th outcome on the j-th variable. Within each class, for each manifest variable, therefore, $\sum_{k=1}^{K_j} \pi_{jrk} = 1$. Further, we denote as $p_r$ the R mixing proportions that provide the weights in the weighted sum of the component tables, with $\sum_{r} p_r = 1$. The values of $p_r$ are also referred to as the prior probabilities of latent class membership, as they represent the unconditional probability that an object will belong to each class before taking into account the responses $Y_{ijk}$ provided on the manifest variables.
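The weighted-sum structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pattern_probability`, the nested-list layout of the parameters and all numerical values are hypothetical and do not come from the paper or from the poLCA package. Within each class the outcome probabilities of the variables are simply multiplied (mutual independence), and the class-specific results are weighted by the mixing proportions.

```python
def pattern_probability(responses, class_probs, mixing):
    """Probability of one response pattern under a basic latent class model.

    responses[j]         -- 0-based index of the observed outcome on variable j
    class_probs[r][j][k] -- probability of outcome k on variable j within class r
    mixing[r]            -- mixing proportion (prior probability) of class r
    """
    total = 0.0
    for r, p_r in enumerate(mixing):
        within_class = 1.0
        for j, k in enumerate(responses):
            # Variables are independent within a class, so probabilities multiply.
            within_class *= class_probs[r][j][k]
        total += p_r * within_class
    return total


# Hypothetical parameters: R = 2 classes, J = 2 variables, 2 outcomes each.
class_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 1
    [[0.2, 0.8], [0.3, 0.7]],  # class 2
]
mixing = [0.5, 0.5]

p_first = pattern_probability([0, 0], class_probs, mixing)
p_all = sum(pattern_probability([a, b], class_probs, mixing)
            for a in range(2) for b in range(2))
```

Because every set of class-conditional probabilities sums to one and the mixing proportions sum to one, the probabilities of all possible response patterns (`p_all`) add up to one as well.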
The probability that the i-th object in the r-th class produces a particular set of J outcomes on the manifest variables, assuming conditional independence of the outcomes Y given class memberships, is the product

$$f(Y_i; \pi_r) = \prod_{j=1}^{J} \prod_{k=1}^{K_j} (\pi_{jrk})^{Y_{ijk}}. \qquad (4)$$

The probability density function across all classes is the weighted sum

$$P(Y_i \mid \pi, p) = \sum_{r=1}^{R} p_r \prod_{j=1}^{J} \prod_{k=1}^{K_j} (\pi_{jrk})^{Y_{ijk}}. \qquad (5)$$

The parameters estimated by the latent class model are $p_r$ and $\pi_{jrk}$. Given estimates $\hat{p}_r$ and $\hat{\pi}_{jrk}$ of $p_r$ and $\pi_{jrk}$, respectively, the posterior probability that each object belongs to each class, conditional on the observed values of the manifest variables, can be calculated using Bayes' formula:

$$\hat{P}(r_i \mid Y_i) = \frac{\hat{p}_r\, f(Y_i; \hat{\pi}_r)}{\sum_{q=1}^{R} \hat{p}_q\, f(Y_i; \hat{\pi}_q)}, \qquad (6)$$

where $r_i \in \{1, \ldots, R\}$. Let us recall that the $\hat{\pi}_{jrk}$ are estimates of outcome probabilities conditional on the r-th class.

It is important to remain aware that the number of independent parameters estimated by the latent class model increases rapidly with R, J, and $K_j$. Given these values, the number of parameters is $R \sum_{j} (K_j - 1) + (R - 1)$. If this number exceeds either the total number of objects, or one fewer than the total number of cells in the cross-classification table of the manifest variables, then the latent class model will be unidentified.

The poLCA package estimates the latent class model by maximizing the log-likelihood function

$$\ln L = \sum_{i=1}^{N} \ln \sum_{r=1}^{R} p_r \prod_{j=1}^{J} \prod_{k=1}^{K_j} (\pi_{jrk})^{Y_{ijk}} \qquad (7)$$

with respect to $p_r$ and $\pi_{jrk}$, using the expectation-maximization (EM) algorithm. This log-likelihood function is identical in form to the standard finite mixture model log-likelihood. As with any finite mixture model, the EM algorithm is applicable because each object's class membership is unknown and may be treated as missing data.

The EM algorithm proceeds iteratively. We start with arbitrary initial values of $\hat{p}_r$ and $\hat{\pi}_{jrk}$, and label them $\hat{p}_r^{old}$ and $\hat{\pi}_{jrk}^{old}$. In the expectation step, we calculate the missing class membership probabilities using Equation 6, substituting in $\hat{p}_r^{old}$ and $\hat{\pi}_{jrk}^{old}$. In the maximization step, we update the parameter estimates by maximizing the log-likelihood function given these posterior probabilities $\hat{P}(r_i \mid Y_i)$, with

$$\hat{p}_r^{new} = \frac{1}{N} \sum_{i=1}^{N} \hat{P}(r_i \mid Y_i) \qquad (8)$$

as the new prior probabilities and

$$\hat{\pi}_{jr}^{new} = \frac{\sum_{i=1}^{N} Y_{ij}\, \hat{P}(r_i \mid Y_i)}{\sum_{i=1}^{N} \hat{P}(r_i \mid Y_i)} \qquad (9)$$

as the new class-conditional outcome probabilities. In Equation 9, $\hat{\pi}_{jr}^{new}$ is the vector of length $K_j$ of class-r conditional outcome probabilities for the j-th manifest variable, and $Y_{ij}$ is the $N \times K_j$ matrix of observed outcomes $Y_{ijk}$ on that variable. The algorithm repeats these steps, assigning the new to the old, until the overall log-likelihood reaches a maximum and ceases to increment beyond some arbitrarily small value.

2 Distance and similarity measures for ordinal data

Dimension reduction methods frequently require an appropriate similarity measure. For ordinal variables, it is possible to use some dependence intensity measure, e.g. an association measure in contingency tables. According to (Hendl, 2006), we generally measure the dependence intensity of two normally distributed variables by the Pearson correlation coefficient. If we do not know the distribution of the data, we can use the Spearman rank correlation coefficient (10) instead of the Pearson correlation coefficient; it can also be used for ordinal variables. Sometimes we also know only the ranks of the measured values. If the ranks in two samples X and Y are similar, this indicates dependence between the variables. The Spearman correlation coefficient is evaluated as a correlation coefficient applied to the ranks of the values from the ranked samples. We rank the values of the variables X and Y by size and obtain the sequences $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$ and $Y_{(1)} \le Y_{(2)} \le \ldots \le Y_{(n)}$. Let $R_i$ be the rank of $X_i$ and $Q_i$ the rank of $Y_i$ in the ranked sample. It holds

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} (R_i - Q_i)^2}{n(n^2 - 1)}. \qquad (10)$$

Under the hypothesis of independence, $r_S$ has mean value 0 and variance $1/(n-1)$, and for n > 30 it has an approximately (asymptotically) normal distribution with critical values

$$r_S(n, \alpha) = \frac{u(\alpha/2)}{\sqrt{n - 1}}, \qquad (11)$$

where $u(\cdot)$ denotes a critical value of the standard normal distribution.

If both variables are ordinal, we can use some nonparametric correlation coefficient, e.g. the Kendall correlation coefficient. One variant of the Kendall correlation coefficient is the Goodman-Kruskal coefficient γ. It is evaluated from the counts of concordances P and discordances Q. It is an appropriate coefficient to describe the association of ordinal variables in a contingency table. The coefficient γ can be evaluated from the formula

$$\gamma = \frac{P - Q}{P + Q}. \qquad (12)$$
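To make the counts P and Q in Equation (12) concrete, the following Python sketch counts concordant and discordant pairs directly from two ordinal variables and then evaluates γ. The function names and the example ratings are hypothetical; pairs tied on either variable are assumed to be counted in neither P nor Q, which is the usual convention for γ.

```python
from itertools import combinations


def concordance_counts(x, y):
    """Return (P, Q): numbers of concordant and discordant pairs of objects."""
    p = q = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            p += 1      # both variables order the pair the same way
        elif sign < 0:
            q += 1      # the variables order the pair in opposite ways
        # sign == 0: the pair is tied on x or y and is ignored
    return p, q


def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma (Equation 12)."""
    p, q = concordance_counts(x, y)
    if p + q == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (p - q) / (p + q)


# Hypothetical ordinal ratings of four respondents on two questions.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
counts = concordance_counts(x, y)
gamma = goodman_kruskal_gamma(x, y)
```

With these ratings only the second and third respondents are ordered differently by the two questions, so five of the six pairs are concordant and one is discordant.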

A further variant of the Kendall correlation coefficient is the Kendall coefficient $\tau_c$

$$\tau_c = \frac{2m(P - Q)}{n^2 (m - 1)}, \qquad (13)$$

where n is the number of observations and m is the smaller dimension of the contingency table. It is an appropriate coefficient to describe association in a table with different numbers of rows and columns. Another variant of the Kendall correlation coefficient is the coefficient $\tau_b$

$$\tau_b = \frac{P - Q}{\sqrt{(P + Q + T_X)(P + Q + T_Y)}}, \qquad (14)$$

where $T_X$ is the count of pairs with the same value of the variable X and different values of the variable Y, and $T_Y$ is the count of pairs with the same value of the variable Y and different values of the variable X. From the difference P − Q we can also calculate Somers correlation coefficients, which are appropriate for evaluating whether the row or column variable can predict values of the column or row variable.

3 Results

Two datasets were evaluated with the statistical software SPSS and R. The first dataset comes from the research "Public perception of the policeman". The questionnaire survey was held in the years 1995, 1999 and 2006 at the Police Academy of the Czech Republic and investigated how the public perceives a typical policeman. Data from 100 respondents from the last run (2006) are used in this article. The questionnaire included 24 questions (variables) with bipolar scales from 1 to 7, where 4 is the neutral level. Lower positions mean a positive, upper positions a negative evaluation of the typical policeman. The questions usually described moral characteristics (adjectives) of the policeman, e.g. good-bad, active-passive, fast-slow, etc. For more details see (Moulisová, 2009).

The second dataset comes from the research "University students' active lifestyle" (15 selected Likert-scaled variables from 100 respondents, scale 1-8). Students described their satisfaction with different aspects of student life. Respondents rated their satisfaction on a scale from 1 (no satisfaction) to 8 (very satisfied). For more details see (Valjent, 2010).

The compared methods are categorical principal component analysis (procedure CATPCA in SPSS), multidimensional scaling (procedures ALSCAL and PROXSCAL in SPSS) and latent class models (procedure poLCA in R). The number of latent variables was chosen according to the number of eigenvalues distinctly higher than one from categorical principal component analysis. This gives 4 latent variables for the first dataset and 5 latent variables for the second dataset. For the comparison, the same number of latent variables or classes was also selected for the remaining dimensionality reduction methods. The Kendall rank correlation coefficient was used as an inter-object distance measure, and finally the Spearman rank correlation coefficient between distances in the original and reduced space was evaluated, see Tab. 1. Graphs of component loadings (Fig. 1) and coordinates from multidimensional scaling (Fig. 2, 3) are attached as well.

Fig. 1: Component loadings (CATPCA in SPSS, datasets 1, 2)
Source: own research

Fig. 2: Coordinates (MDS Proxscal in SPSS, datasets 1, 2)
Source: own research
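The evaluation used in this section (Kendall-based distances between objects in the original space, followed by the Spearman coefficient between original and reduced-space distances) can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the paper does not state how the Kendall coefficient was turned into a distance, so $1 - \tau_b$ is assumed here; Euclidean distance is assumed in the reduced space; and the function names and the small example data are hypothetical, not the datasets from the paper.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b (Equation 14) between two response vectors."""
    p = q = t_x = t_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            p += 1
        elif dx * dy < 0:
            q += 1
        elif dx == 0 and dy != 0:
            t_x += 1    # tied on x only
        elif dy == 0 and dx != 0:
            t_y += 1    # tied on y only
    # Undefined (division by zero) if one of the vectors is constant.
    return (p - q) / math.sqrt((p + q + t_x) * (p + q + t_y))


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(ra, rb))
    var_a = sum((u - mean_a) ** 2 for u in ra)
    var_b = sum((v - mean_b) ** 2 for v in rb)
    return cov / math.sqrt(var_a * var_b)


def structure_preservation(original, reduced):
    """Spearman correlation between original and reduced inter-object distances."""
    pairs = list(combinations(range(len(original)), 2))
    # Assumption: distance in the original space is 1 - tau_b between objects.
    d_orig = [1.0 - kendall_tau_b(original[i], original[j]) for i, j in pairs]
    # Assumption: Euclidean distance between objects in the reduced space.
    d_red = [math.sqrt(sum((u - v) ** 2 for u, v in zip(reduced[i], reduced[j])))
             for i, j in pairs]
    return spearman(d_orig, d_red)


# Hypothetical data: 4 respondents, 5 Likert items, 2 reduced coordinates.
original = [[1, 2, 2, 4, 5], [2, 2, 3, 5, 6], [7, 6, 5, 2, 1], [6, 6, 4, 3, 1]]
reduced = [[-1.0, 0.2], [-0.8, 0.1], [1.1, -0.1], [0.9, 0.0]]
rho = structure_preservation(original, reduced)
```

Values of `rho` close to 1 would indicate that objects that are close (or distant) in the original space remain close (or distant) after the reduction.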

Fig. 3: Coordinates (MDS Alscal in SPSS, datasets 1, 2)
Source: own research

Tab. 1: Spearman rank correlation coefficients of inter-object distances in original and reduced space

Method          Dataset 1    Dataset 2
CATPCA
MDS PROXSCAL
MDS ALSCAL
LCA

Source: own research

Conclusion

For the comparison of dimensionality reduction methods applied to ordinal datasets, we used the Spearman rank correlation coefficient between distances in the original and reduced space. From the results of four dimensionality reduction methods applied to two ordinal datasets, we can see that the data structure was preserved satisfactorily by CATPCA and LCA, while weaker results were obtained from the multidimensional scaling methods. In further research, other comparison techniques will be applied, e.g. Procrustes analysis, and the results from all these techniques will be compared.

References

1. Hebák, P. et al. (2007). Vícerozměrné statistické metody 3. Praha: Informatorium.
2. Hendl, J. (2006). Přehled statistických metod: analýza a metaanalýza dat. Praha: Portál.
3. Holland, S. (2008). Non-metric multidimensional scaling (MDS). Athens: R forge.
4. Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1).
5. Li, S., Vel, O., & Coomans, D. (1995). Comparative performance analysis of nonlinear dimensionality reduction methods. North Australia: James Cook University.
6. Linzer, D., & Lewis, J. (2011). poLCA: An R package for polytomous variable latent class analysis. Journal of Statistical Software, 42(10).
7. Maaten, L. P. J., Postma, L. O., & Herik, H. J. (2008). Dimensionality reduction: a comparative review. Elsevier.
8. Moulisová, M. (2009). Výzkum percepce policisty. Kriminalistika, 42(1).
9. Sobíšek, L., & Řezanková, H. (2011). Srovnání metod pro redukci dimenzionality aplikovaných na ordinální proměnné. Acta Oeconomica Pragensia, 1.
10. Valjent, Z. (2010). Aktivní životní styl vysokoškoláků. Praha: FEL ČVUT.

Contact

Martin Prokop
College of Polytechnics Jihlava
Tolstého, Jihlava
University of Economics, Prague
W. Churchill Square, Praha 3

Hana Řezanková
University of Economics, Prague
W. Churchill Square, Praha