A Two-Level Approach to Making Class Predictions


Adrian Costea
Turku Centre for Computer Science and IAMSR / Åbo Akademi University, Turku, Finland, Adrian.Costea@abo.fi

Tomas Eklund
Turku Centre for Computer Science and IAMSR / Åbo Akademi University, Turku, Finland, Tomas.Eklund@abo.fi

Abstract

In this paper we propose a new two-level methodology for assessing countries'/companies' economic/financial performance. The methodology is based on two major techniques of grouping data: cluster analysis and predictive classification models. First we use cluster analysis in terms of self-organizing maps to find possible clusters in data in terms of economic/financial performance. We then interpret the maps and define outcome values (classes) for each data row. Lastly we build classifiers using two different predictive models (multinomial logistic regression and decision trees) and compare the accuracy of these models. Our findings indicate that the results of the two classification techniques are similar in terms of accuracy rate and class predictions. Furthermore, we focus our efforts on understanding the decision process corresponding to the two predictive models. Moreover, we claim that our methodology, if correctly implemented, extends the applicability of the self-organizing map for clustering of financial data, and thereby, for financial analysis.

1. Introduction

In this study, we are interested in the relationship between a number of macro/microeconomic indicators of countries/companies and different economic/financial performance classifications. We have based our research on two previous studies [2] and [3]. In [2] we compared two different methods of clustering central-east European countries' economic data (self-organizing maps and statistical clustering) and presented the advantages and disadvantages of each method. In [3], the self-organizing map (SOM) was used for benchmarking international pulp and paper companies. In both previous studies we were mainly concerned with finding patterns in economic/financial data and presenting this multidimensional data in an easy-to-read format (using SOM maps).
However, we have not addressed the problem of class prediction as new cases are added to our datasets. From our previous results we cannot directly infer a procedure with which a new data row could be fit into our maps. As we obtain new data, depending upon the standardization technique used, we may be forced to retrain the maps, and repeat the entire clustering process. This is very time consuming, and requires the effort of an experienced SOM user. As Witten & Frank say in their book on data mining: "The success of clustering is measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning where rules are learned that give an intelligible description of how new instances should be placed into the clusters." [17, p.39]

Here we propose a methodology that enables us to model the relationship between economic/financial variables and different classifications of countries/companies in terms of their performances. Defining the model permits us to predict the class (cluster) to which a new case belongs. In other words, we insert new data into our model and identify where they fit in the previously constructed map.

Choosing the best technique for these two phases of our analysis (clustering/benchmarking/visualization and class prediction) is not a trivial task. In the literature there is a large number of techniques for both clustering and class prediction. In this study, we use SOM as the clustering technique due to the advantages of good visualization and reduced computational cost. Even with a relatively small number of samples, many clustering algorithms, especially hierarchical ones (for example, Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Ward's, or other bottom-up hierarchical clustering methods), become intractably heavy [16]. Descriptive techniques, such as clustering, simply summarize data in convenient ways, or in ways that we hope will lead to increased understanding.
In contrast, predictive techniques, such as multinomial logistic regression and decision trees, allow us to predict the probability that data rows will be clustered in a specific class in the trained SOM model. In order to find the predictive technique that is most suitable in our particular case, we conduct two experiments using multinomial logistic regression and decision tree techniques. When building real classifiers one can use three different fundamental approaches: the discriminative approach, the regression approach, and the class-conditional approach [6, p.335]. We chose to compare two regression approach methods: multinomial logistic regression and decision trees.

The rest of the paper is structured as follows. In Section two we present our methodology. In Section three, the datasets are presented and SOM clustering is performed. In Sections four and five, the multinomial regression and decision tree models are built and validated, and in Section six the models are compared. Finally, in Section seven, we present our conclusions.

2. Methodology

In our two-level approach we add another level (class prediction phase) to SOM clustering, as is depicted in Figure 1 (the arrows are the levels):

[Figure 1. Two-level methodology: Initial dataset -> (1) -> Data in form of SOM -> (2) -> Data prediction model]

(1) consists of several stages: preprocessing of initial data, training using the SOM algorithm, choosing the best maps, identifying the clusters, and attaching outcome values to each data row [1];

(2) depending on the technique that we apply, there can be different stages for this methodology level. When applying statistical techniques, such as multinomial logistic regression, we follow these steps: developing the analysis plan, estimation of logistic regression, assessing model fit (accuracy), interpreting the results, and validating the model. When applying the decision tree algorithm, the steps are: constructing a decision tree step by step, including one attribute at a time in the model, assessing model accuracy, interpreting the results, and validating the model.

After the predictive models for classification were constructed we compared them, based on their accuracy measures. Quinlan [10] states that there are different ways of comparing models besides their accuracy, e.g. the insight provided by the predictive model. However, we will use the accuracy measure, since insight is a subjective measure.

3. Clustering Using SOM

The SOM algorithm stands for self-organizing map algorithm, and is based on a two-layer neural network using the unsupervised learning method. The self-organizing map technique creates a two-dimensional map from n-dimensional input data. This map resembles a landscape in which it is possible to identify borders that define different clusters [8]. These clusters consist of input variables with similar characteristics, i.e. in this report, of countries/companies with similar economic/financial performance.

The methodology used when applying the self-organizing map is as follows [1]. First, we choose the data material. It is often advisable to standardize the input data so that the learning task of the network becomes easier [8]. After this, we choose the network topology, learning rate, and neighborhood radius. Then, the network is constructed. The construction process takes place by showing the input data to the network iteratively, using the same input vector many times, the so-called training length. The process ends when the average quantization error is small enough. The best map is chosen for further analysis. Finally, we identify the clusters using the U-matrix and interpret the clusters (assign labels to them) using the feature planes. From the feature planes we can read, per input variable and per neuron, the value of the variable associated with each neuron.

The network topology refers to the form of the lattice. There are two commonly used lattices, rectangular and hexagonal. The hexagonal lattice is preferable for visualization purposes, as each neuron has six neighbors, as opposed to four for the rectangular lattice [8]. The learning rate refers to how much the winning input data vector affects the surrounding network. The neighborhood radius refers to how much of the surrounding network is affected. The average quantization error indicates the average distance between the best matching units and the input data vectors. Generally speaking, a lower quantization error indicates a better-trained map. The sample data size is not of major concern when using the SOM algorithm.
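The average quantization error described above can be made concrete with a short sketch. It assumes a trained map is available as a plain list of neuron weight vectors (the "codebook"); the function names and the toy values are hypothetical and are not taken from the software used in this study.

```python
import math

def best_matching_unit(vector, codebook):
    """Return (index, distance) of the neuron whose weights are closest to `vector`."""
    best_index, best_distance = -1, math.inf
    for index, weights in enumerate(codebook):
        distance = math.dist(vector, weights)  # Euclidean distance
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

def average_quantization_error(data, codebook):
    """Mean distance between each input vector and its best matching unit."""
    return sum(best_matching_unit(v, codebook)[1] for v in data) / len(data)

# Hypothetical two-neuron map and three input vectors (illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0)]
data = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
aqe = average_quantization_error(data, codebook)
```

The same best-matching-unit search is what places a data row on an already trained map, which is why a lower average distance is read as a better fit between map and data.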
In [15] the author claims that SOM is easily applicable to small data sets (less than records) but can also be applied in the case of medium-sized data sets.

To visualize the final self-organizing map we use the unified distance matrix method (U-matrix). The U-matrix method can be used to discover otherwise invisible relationships in a high-dimensional data space. It also makes it possible to classify data sets into clusters of similar values. The simplest U-matrix method is to calculate the distances between neighboring neurons, and store them in a matrix, i.e. the output map, which can then be interpreted. If there are "walls" between the neurons, the neighboring weights are distant, i.e. the values differ significantly. The distance values can also be displayed in color when the U-matrix is visualized. Hence, dark colors represent great distances while brighter colors indicate similarities amongst the neurons [14].

3.1. Datasets

In this study we have used two datasets from our previous papers: one dataset on the general economic performance (EconomicPerf) of the central-east European countries [2] and another (FinancialPerf) on the financial performance of international pulp and paper companies [3].

The variables for the first dataset are: Currency Value, or how much money one can buy with 1000 USD, which depicts the purchasing power of each country's currency (the greater the better); Domestic Prime Rate (Refinancing Rate), which shows financial performance and level of investment opportunities (the smaller the better); Industrial Output in percentages to the previous periods, to depict industrial economic development (the greater the better); Unemployment Rate, which characterizes the social situation in the country (the smaller the better); and Foreign Trade in millions of US dollars, to reveal the deficit/surplus of the trade budget (the greater the better). In [2] there were two more variables in the dataset: import and export in million USD, as intermediary measures to calculate the foreign trade. We did not take them into account here, since they are strongly correlated with the foreign trade variable. Also, we have replaced the first variable (Foreign Exchange Rate) from the previous study [2] with Currency Value, which is calculated from the Foreign Exchange Rate variable by inverting it and multiplying the result with . We have changed this variable to ensure comparability among different countries' currencies. Our dataset contains monthly/annual data for six countries (Russia, Ukraine, Romania, Poland, Slovenia and Latvia) during , in total 225 cases with five variables each. We have in some cases encountered a lack of data, which we have completed using means of existing values. However, the self-organizing map algorithm can treat the problem of missing data simply by considering at each learning step only those indicators that are available [7].

The second dataset consisted of financial data on international pulp and paper companies. The dataset covered the period , and consisted of seven financial ratios per year for each company. The ratios were chosen from an empirical study by Lehtinen [9], in which a number of financial ratios were evaluated concerning their validity and reliability in an international context.
The ratios chosen were: Operating Margin, a profitability ratio; Return on Equity, a profitability ratio; Return on Total Assets, a profitability ratio; Quick Ratio, a liquidity ratio; Equity to Capital, a solvency ratio; Interest Coverage, a solvency ratio; and Receivables Turnover, an efficiency ratio. The ratios were calculated based on information from the companies' annual reports. The dataset consisted of 77 companies and 7 regional averages. The companies were chosen from Pulp and Paper International's annual ranking of pulp and paper companies according to net sales [12]. In total, the dataset consisted of 474 rows of data.

3.2. Choosing the Best Maps

The two datasets were standardized according to different methods. In [2] the authors used the standard deviations of each variable to standardize the data (Equations 1, 2), while in [3] the data have been scaled using histogram equalization [4]. It is not our intention to describe different methods for the standardization of datasets; however, in the literature there are examples of both standardization techniques used on similar datasets.

\bar{x}_i = \frac{\sum_{j=1}^{n} x_{ij}}{n}   [Eq. 1]

\sigma_i = \sqrt{\frac{\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2}{n}}   [Eq. 2]

We have trained different maps with different parameters. As is stated in [2], a good map is obtained after several different training sessions. The best maps have been chosen based on two measures: one objective measure (the quantization error) and one subjective measure (ease of readability). However, map quality in terms of quantization error seems to be positively correlated with the dimensions of the maps, while ease of readability is negatively correlated. In other words, we can obtain very good maps in terms of their quantization error if we use large dimension parameters, while they are poor in terms of readability. Cluster analysis is often a trade-off between accuracy on the one hand and cluster clarity and manageability on the other: by creating small maps we force the data into larger clusters. Consequently, when we compared the maps we restricted the maps' dimensions to be constant.
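As an illustration of standardizing with the mean and standard deviation of Equations 1 and 2, the following sketch scales a single variable. The scaling step itself (subtracting the mean and dividing by the standard deviation) and the use of n as the denominator are assumptions made for this example, and the values are hypothetical rather than taken from the EconomicPerf data.

```python
import math

def mean(values):
    """Eq. 1: arithmetic mean of one variable."""
    return sum(values) / len(values)

def std_dev(values):
    """Eq. 2: standard deviation of one variable (denominator n, an assumption)."""
    centre = mean(values)
    return math.sqrt(sum((x - centre) ** 2 for x in values) / len(values))

def standardize(values, centre, spread):
    """Assumed scaling step: subtract the mean, then divide by the standard deviation."""
    return [(x - centre) / spread for x in values]

# Hypothetical observations of one variable (illustration only).
observations = [2.0, 4.0, 6.0]
centre, spread = mean(observations), std_dev(observations)
scaled = standardize(observations, centre, spread)

# A new observation is scaled with the statistics of the original data,
# so it stays comparable with the rows the map was trained on.
new_scaled = standardize([8.0], centre, spread)
```

Reusing the original mean and standard deviation for new rows is one way to avoid retraining when data arrive later; whether it is appropriate depends on how far the new values drift from the original range.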
The chosen maps and their clusters are presented in Figure 2.

3.3. Identifying the Clusters

We identify the clusters on the maps by studying the final U-matrix maps (Figure 2), the feature planes, and at the same time, by looking at the row data. Actually, the title of this paragraph, "identifying the clusters", should be "identifying the clusters of clusters". What we are saying is that we already have the clusters identified by SOM on the map (from now on we will refer to these clusters as row clusters). For example, in case we are using a 7x5 map, we have 35 row clusters. Next we have to identify the real clusters by grouping the row clusters. SOM helps us in this respect by drawing darker lines between two clusters that are far from each other (in terms of the Euclidean distance). The results for both datasets were very similar in terms of the number, and characteristics, of clusters (7 in each case).

[Figure 2. (a) The final U-matrix maps and (b) identified clusters on the maps for the EconomicPerf and FinancialPerf data sets]

3.4. Defining the Outcome Values for Each Data Row

Roughly speaking, we can state that the outcome values (the classes) in terms of economic/financial performance were the same in both cases (Figure 2), so the classes are as follows: A: best performance, B: slightly below best performance, C: slightly above average performance, D: average, E: slightly below average performance, F: slightly above poorest performance, and G: poorest performance.

Defining the outcome values for each data row is a straightforward process. Once we figure out which cluster each row cluster belongs to, the next step is to check which row data vectors are associated with each row cluster, and to associate the class code with those vectors. Consequently, in terms of methodology, we can divide the clustering process into two parts: (1) creating the row clusters, a part done entirely by the SOM algorithm, the output being the U-matrix; and (2) creating the real clusters, a part done by the map reader with the help of the SOM algorithm in terms of visualization characteristics.

This kind of multi-level clustering approach is not new. A two-level SOM clustering approach has been suggested before, in [16]. There, the row clusters are "protoclusters" and our real clusters are the "actual clusters". However, sometimes it is difficult to find good real clusters, since the second part of the clustering process is highly subjective. Also, the standardization method has an important role, since for different standardization techniques we obtain different maps in terms of quantization error and ease of readability.

4. Applying Multinomial Logistic Regression

In general, when multinomial logistic regression is applied as a predictive modeling technique for classification, there are some steps that have to be followed:

1. Check the requirements regarding the data sample: size, missing data, etc.,
2. Compute the multinomial logistic regression using an available software program (e.g. SPSS),
3. Assess the model fit (accuracy),
4. Interpret the results, and
5. Validate the model.

Below, we follow this methodology when applying logistic regression on our datasets.

4.1. Requirements

In the EconomicPerf dataset, the problem of missing data was overcome by using monthly means for each year. Averages were also used for missing data in the FinancialPerf dataset. The requirement of size, cases for each independent variable, was exceeded for each dataset.

4.2. Computing the Multinomial Regression Model

We use SPSS to perform multinomial regression analysis, selecting as dependent variables the class variables and as covariates the variables presented in Section 3.1.

4.3. Assessing the Model Fit

From the Model Fitting Information output table of SPSS we observe that the chi-square value has a significance of < 0,0001, so we state that there is a strong relationship between dependent and independent variables (see Table 2). Next, we study the Pseudo R-Square table in SPSS, which also indicates the strength of the relationship between dependent and independent variables. A good model fit is indicated by higher values. We will base our analysis on the Nagelkerke R2 indicator (see Table 2). According to this, 74.5% of the output variation for the EconomicPerf dataset, and 97.8% for the FinancialPerf dataset, can be explained by variations in input variables. Consequently, we would assess the relationships as very strong.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and the maximum by chance accuracy rate. The proportional chance criterion for assessing model fit is calculated by summing the squared proportion of each group in the sample, and the maximum chance criterion is the proportion of cases in the largest group. We obtained the following indicators (Table 1):

Table 1. Evaluating the model's accuracy

Dataset         Model accuracy   Proportional by chance criterion   Maximum by chance criterion
EconomicPerf    61,3%            29,92%                             49,8%
FinancialPerf   88%              15,62%                             20,46%

We interpret these numbers as follows: for example, in the case of the EconomicPerf dataset, based on the requirement that the model accuracy should be 25% better than the chance criteria [5, p ], the standard to use for comparing the model's accuracy is 1.25 x 29.92% = 37.4%. Our model accuracy rate of 61.3% exceeds this standard. The maximum chance criterion accuracy rate is 49.8% for this dataset. Based on the same requirement, the standard to use for comparing the model's accuracy is 1.25 x 49.8% = 62.22%. Our model accuracy rate of 61.3% is slightly below this standard. The FinancialPerf dataset accuracy rate exceeds both standards.

4.4. Interpreting the Results

To interpret the results of our analysis, we study the Likelihood Ratio Test and Parameter Estimates outputs of SPSS. We find that the independent variables are all significant; in other words, they contribute significantly to explaining differences in performance classification (for both datasets). However, not all variables play an important role in all regression equations (e.g. for the first regression equation, CurrencyValue is not statistically significant: 0,125 > p = 0,05). Next, we can determine the direction of the relationship and the contribution to performance classification of each independent variable by looking at columns B and Exp(B) from the Parameter Estimates output of SPSS. For example, a higher industrial output rate increases the likelihood that the country will be classified as a best country (B = +24,027) and decreases the likelihood that the country will be classified among the poorest countries (B = -11,137). It seems that the results for the EconomicPerf dataset are poorer, in the sense that for the FinancialPerf dataset we have more coefficient estimates that are statistically significant.
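To illustrate how the sign of a B coefficient moves the class probabilities, here is a minimal sketch of a multinomial logit model with one reference class. The coefficient values, the single-variable input, and the class labels used as dictionary keys are hypothetical; they are not the estimates reported by SPSS for our datasets.

```python
import math

def class_probabilities(coefficients, x):
    """Multinomial logit with a reference class.

    `coefficients` maps each non-reference class to (intercept, [B values]).
    The reference class has all coefficients fixed at zero.
    """
    scores = {
        label: intercept + sum(b * value for b, value in zip(b_values, x))
        for label, (intercept, b_values) in coefficients.items()
    }
    denominator = 1.0 + sum(math.exp(score) for score in scores.values())
    probabilities = {label: math.exp(score) / denominator
                     for label, score in scores.items()}
    probabilities["reference"] = 1.0 / denominator
    return probabilities

def odds_ratio(b):
    """Exp(B): factor by which the odds against the reference class change
    when the variable increases by one unit."""
    return math.exp(b)

# Hypothetical coefficients: positive B for class "A", negative B for class "G".
coefficients = {"A": (0.0, [2.0]), "G": (0.0, [-1.0])}
p_low = class_probabilities(coefficients, [1.0])
p_high = class_probabilities(coefficients, [2.0])
```

In this toy setting, raising the input increases the probability of class "A" and lowers that of class "G", which mirrors the reading of positive and negative B values given above.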
In the Parameter Estimates output of SPSS ("Sig." column), the EconomicPerf dataset has 33% significant coefficients, while the FinancialPerf dataset has 62.5%.

4.5. Validating the Model

In order to validate the model, we split the datasets in two parts of approximately the same length. Our findings are illustrated in Table 2:

Table 2. Datasets' accuracy rates and accuracy rate estimators when applying multinomial logistic regression

EconomicPerf                          Main dataset     Part1 (split=0)          Part2 (split=1)
Model Chi-Square (p < 0,0001)         291,             ,                        ,852
Nagelkerke R2                         0,745            0,855                    0,721
Learning                              61,3%            67%                      58,4%
Test                                  no test sample   57,6%                    67,1%
Significant coefficients (p<0,05)     ALL              ALL except: CURRENCY 1   ALL

FinancialPerf                         Main dataset     Part1 (split=0)          Part2 (split=1)
Model Chi-Square (p < 0,0001)         1479,72          792,06                   752,85
Nagelkerke R2                         0,978            0,986                    0,981
Learning                              88%              89%                      89,5%
Test                                  no test sample   76,1%                    82,4%
Significant coefficients (p<0,001)    ALL              ALL                      ALL

1 This coefficient is significant for p < 0,153.

With one exception, we obtained significant coefficients for the logistic regression equations. In both cases, the accuracy rates of the two split datasets were close to the accuracy rate of the entire dataset. For example, 89% and 89,5% are close to the entire FinancialPerf dataset accuracy rate of 88%. Again, the second dataset outperformed the first one, in the sense that for the FinancialPerf dataset, the accuracy rates for the test samples are closer to the learning sample accuracy rate.

However, more investigations should be done to find problems that arise due to insignificant coefficients of each regression equation. Large standard errors for B coefficients can be caused by multicollinearity among independent variables, which is not directly handled by SPSS or other statistical packages. Moreover, the problem of outliers and variable selection should be carefully addressed. Also, the discrepancies between learning and test accuracy rates can arise due to the small sizes of the datasets. The larger the dataset is, the better the chance that we have correctly clustered data and, consequently, correct outcome values for each data row. We construct the outcome values based on SOM clustering. There is, of course, a chance that there are misclustered data, which can affect the accuracy of the model.

4.6. Predicting the Classes

The finished model was then used to test the classification of three new data rows for the FinancialPerf dataset. These consisted of data for three Finnish pulp and paper companies: M-Real (no. 3), Stora Enso (no. 4), and UPM-Kymmene (no. 5), for the year . These were used since they were among the first to publish their financial results. The results are illustrated in Table 3.

Table 3. Predictions using multinomial logistic regression (columns: Operating Margin, ROE, ROTA, Equity to Capital, Quick Ratio, Interest Coverage, Receivables Turnover, Company no., Predicted Cluster)

Company no.   Predicted Cluster
3             D
4             B
5             A

5. Applying the Decision Tree Algorithm

For comparison reasons, a See5 decision tree builder system was applied on both datasets. The system was developed by a research team headed by Quinlan. The algorithm behind the program is based on one of the most popular decision tree algorithms, ID3 [11], which was developed in the late 70's, also by Quinlan. The main idea is that, at each step, the algorithm tries to select a variable and a value associated with it that best discriminate the dataset, and does this recursively for each subset until all the cases from all subsets belong to a certain class. The method is called Top-Down Induction Of Decision Trees (TDIDT), and C4.5 and C5.0/See5 represent different implementations of this method. The best discriminating pair (variable-value) is chosen based on the so-called gain ratio criterion:

\text{gain ratio}(X) = \frac{\text{gain}(X)}{\text{split info}(X)}   [Eq. 3]

where gain(X) means the information gained by splitting the data using the test X and:

\text{split info}(X) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}   [Eq. 4]

represents the potential information generated by dividing S into n subsets. The See5 system implements these formulas along with some other features that are described in [11] and on the web page.

5.1. Computing the Decision Tree

For both datasets, we performed three runs of the See5 software, exactly as we did when applying logistic regression: one for the whole dataset, another using the first split dataset ("split=0"), and the other using the second half of the data ("split=1"). When validating the entire dataset accuracy rate, we have used cross-validation, while when validating one split dataset accuracy rate we have used the other one as the test sample. The results are summarized in Table 4. The first line, for each dataset, represents the accuracy rates obtained using training datasets. The next two lines show us the validation accuracy rates, calculated as follows: for the main dataset a 10-fold cross-validation was conducted (64% being the average accuracy rate of 10 decision trees), for the split=0 dataset we used split=1 as the test dataset (46,9% is the accuracy rate on the second dataset, based on the decision tree built with the first dataset), and the last accuracy rate was calculated by considering split=1 as the training dataset and split=0 as the test dataset (changing the roles).

Table 4. Dataset accuracy rates and accuracy rate estimators when applying the decision tree algorithm

EconomicPerf      Main dataset     Part1                Part2
Learning          79,1%            77,7%                78,86%
Test              no test sample   46,9%                54,5%
Crossvalidation   64%              no crossvalidation   no crossvalidation

FinancialPerf     Main dataset     Part1                Part2
Learning          84,8%            86,5%                86,5%
Test              74,6%            76,8%                74,4%
Crossvalidation   71,7%            no crossvalidation   no crossvalidation

When constructing the trees, we kept the two most important parameters constant: m = 5, which measures the minimum number of cases each leaf node should have, and c = 25% (default value), which is a confidence factor used in pruning the tree.

5.2. Assessing the Model Fit

For the EconomicPerf dataset, it seems that our trees were not consistent, due to poor accuracy rates and big discrepancies between learning and test accuracy rates, so further comparison with regression analysis cannot be performed in this case. There is at least a 10% difference between the accuracy rates for each split dataset used. For the FinancialPerf dataset, the differences between accuracy rates are smaller. Therefore, we used this dataset for further investigation. The chosen decision tree is presented in the Appendix. Reading it, we can state that the main attribute used to discriminate the data was ROE. The lower that we go down in the decision tree, the less important the attributes become.
At each step the algorithm calculates the information gain for each attribute, choosing the split attribute with the largest information gain; we call it the most important attribute.
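The gain ratio criterion of Equations 3 and 4 can be sketched in a few lines. This is an illustration of the formulas only, not of the See5 implementation; the helper names and the toy class labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information (in bits) needed to identify the class of a case in `labels`."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Eq. 3: gain(X) / split info(X) for a test X that divides `labels` into `subsets`."""
    total = len(labels)
    # gain(X): reduction in entropy achieved by the split.
    gain = entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets)
    # Eq. 4: potential information generated by the split itself.
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets if s)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical class labels for four cases and two candidate splits.
labels = ["A", "A", "B", "B"]
perfect_split = [["A", "A"], ["B", "B"]]   # separates the classes completely
useless_split = [["A", "B"], ["A", "B"]]   # leaves the class mix unchanged

perfect_ratio = gain_ratio(labels, perfect_split)
useless_ratio = gain_ratio(labels, useless_split)
```

Dividing by split info penalizes tests that fragment the data into many small subsets, which is the reason the ratio is used instead of the raw gain.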

5.3. Interpreting the Results

As we can see from the decision tree (Appendix), the second most important variable depends upon the values of ROE: if our ROE is greater than or equal to , it is Equity to Capital, while if ROE is less than or equal to , it is Receivables Turnover. We must note that we have used fuzzy thresholds, which allows for a much more flexible decision tree: the algorithm (C5.0) assigns a lower value (lv) and an upper value (uv) for each attribute chosen to split the data. Then a membership function (trapezoidal) is used to decide which branch of the tree will be followed when a new case has to be classified. If the value of the splitting attribute for the new case is lower than lv, the left branch will be followed, and if it is greater than uv, then we will further use the right branch. If the value lies between lv and uv, both branches of the tree are investigated and the results combined probabilistically: the branch with the highest probability will be followed.

5.4. Validating the Model

Notice the asymmetric threshold values for almost every splitting attribute. In this case (FinancialPerf), the accuracy rate of the test sample is comparable with the accuracy rate of the learning sample. There is no specification on how close these two values should be; consequently, we conclude that the tree is validated. The only way to really validate the assumption that the two accuracy rates are not far from one another is to consider the two accuracy rates as random variables and then use a statistical test to see if their means differ significantly. This new step in validating the decision tree model would require splitting the dataset in different ways to obtain different training and test datasets, and then, under the assumption that the accuracy rates are random variables that follow a normal distribution, which is not always the case, we would test whether their means are statistically different. After training the decision tree, we tested it on the same data rows used in Section 4.6.

5.5. Predicting the Classes

The results are illustrated in Table 5.
As can be seen in the table, the results are somewhat different from those obtained using logistic regression.

Table 5. Prediction using the decision tree (columns: Operating Margin, ROE, ROTA, Equity to Capital, Quick Ratio, Interest Coverage, Receivables Turnover, Company no., Predicted Cluster)

Company no.   Predicted Cluster
3             B
4             B
5             A

M-Real (no. 3) was classified as a D company in Table 3, while it is a B company in Table 5. The data rows of Stora Enso and M-Real are generally similar, but the decision tree has placed more emphasis on ROE, while logistic regression seems to have emphasized Equity to Capital. Also, we can see from Table 6 that the decision tree has not quite correctly learned the pattern associated with Group D, only being able to correctly classify 58% of the cases in this group. The logistic regression model was much more successful, and we therefore consider its prediction the more reliable of the two. More study will be needed to judge why this happened.

6. Comparing the Classification Models' Accuracy

While this is not the only way to compare two classification techniques, comparing them using accuracy rates is the most common. In [10] the author compared five predictive models from areas of both machine learning and statistics. A comparison similar to ours was made in [13]. The authors compared logistic regression and decision tree induction in the diagnosis of carpal tunnel syndrome. Their findings indicate that there is no significant difference between the two methods in terms of model accuracy rates. Also, they suggest that the classification accuracy of the bivariate models (two independent variables) is slightly higher than that of multivariate ones. It is not our goal to compare bivariate and multivariate models, although this can be a subject for further investigations using the datasets presented in this paper.

As we stated in Section 5, we will consider only the second dataset when comparing the two methods, since for the first dataset the results were very poor in terms of the accuracy rate. In the last section, we will try to explain why we obtained such poor results using the EconomicPerf dataset. Conversely, in the case of the second dataset (FinancialPerf), both logistic regression and decision tree models were validated against the split datasets. The differences between accuracy rates were smaller in this case, and the learning dataset accuracy rates were very good (88% and 84,8%). Also, both models performed similarly on the split datasets (89%, 89,5% and 86,5%, 86,5%). The bigger difference for the training datasets could be caused by the fact that when applying the decision tree algorithm, we split the data in two parts using 75% of the rows for the learning dataset. The remaining 25% was used as a test dataset. This was due to a number-of-rows restriction in the See5 demo software (max 400 rows of data).

Using logistic regression, changes in accuracy rates can occur when including/excluding some variables in/from the model. In the case of the decision tree, the accuracy rate of the model can be tuned using model parameters, e.g. the minimum number of cases in each leaf (m) or the pruning confidence factor (c). The accuracy rates for the two methods are illustrated in Table 6.

Table 6. The observed accuracy rates of the two methods (axes: Observed, Predicted; classes a to g). Each row lists its non-empty cells in column order; the largest value in each row is the class's own accuracy rate.

Logistic Regression
a   88%, 6%, 2%, 4%
b   5%, 89%, 3%, 2%
c   6%, 6%, 77%, 4%, 4%, 2%
d   6%, 2%, 84%, 8%
e   7%, 1%, 88%, 4%
f   11%, 89%
g   3%, 97%

Decision Tree
a   86%, 10%, 4%
b   4%, 87%, 5%, 1%, 3%
c   3%, 8%, 76%, 5%, 8%
d   0%, 18%, 6%, 58%, 12%, 6%
e   2%, 93%, 2%, 4%
f   3%, 94%, 3%
g   2%, 4%, 4%, 90%

7. Discussion and Conclusions

In this study, we have proposed a new two-level approach for making class predictions about countries'/companies' economic/financial performance. We have applied our methodology on two datasets: the EconomicPerf dataset, which includes variables describing the economic performance of central-east European countries during , and the FinancialPerf dataset, which includes financial ratios describing the financial performance of international pulp and paper companies during .

Firstly, SOM clustering was applied on both datasets in order to identify clusters in terms of economic/financial performance, and the optimal number of clusters to consider. By reading the SOM output (U-matrix maps), we have considered seven to be the most appropriate number of clusters for both datasets. Consequently, we construct the outcome values for each data row based on the SOM maps and the corresponding seven classes: best, slightly below best, slightly above average, average, slightly below average, slightly above poor, and poorest. Secondly, based on the new datasets (updated with the outcome values), we have predicted to which class a new input belongs. We chose and compared two predictive models for classification: logistic regression and decision tree induction.

Why is this approach important? Why combine clustering and classification techniques? Why not directly construct the outcome values and apply the predictive models without performing any clustering?
We could perform surveys, asking experts how their company/country performed in different months or years, and then directly apply the classification technique to develop prediction models as new cases are to be classified. First of all, this kind of information (outcome values for each data row) is not easy to get (it is costly), and secondly, even if we have it, in order for it to be useful, it has to be "true" and "comparable". What we mean by "true" is that, when performing surveys, the respondents can be subjective, giving higher rankings for their country/company (not giving true answers). The outcome values fail to be "comparable" if, for example, one person has different criteria for the term "best performance" than another. In the best case, when answering our questions about their country/company performance, the respondents would, most probably, classify their country/company using their knowledge and internal aggregate information. We think our methodology is an objective way of making class predictions about countries'/companies' performance since, using it, we can choose the correct number of clusters, define the outcome values for each data row, and construct the predictive model.

Also, the problem of inserting new data into an existing model is solved using this method. The problem is that we normally have to train new maps every time, or standardize the new data according to the variance of the old dataset, in order to add new labels to the maps. Inserting new data into an existing SOM model becomes a problem when the data have been standardized, for example, within an interval like [0,1]. Also, the retraining of maps requires considerable time and expertise. We propose that our methodology solves these problems associated with adding new data to an existing SOM cluster model. The results show that our methodology can be successful, if it is correctly implemented. Clustering is very important in our methodology, since we define the outcome values for each data row based on it. Our U-matrix maps clearly show seven identifiable clusters.
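The difficulty with data standardized to an interval like [0,1] can be illustrated with a small sketch. It assumes plain min-max scaling fitted on the old dataset; the function names and the numbers are hypothetical and are not taken from the study. A new row whose value exceeds the old maximum lands outside [0,1], which is why the old map cannot simply be reused.

```python
def fit_min_max(rows):
    """Return per-column minima and maxima of the old (training) rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, mins, maxs):
    """Scale one row with the bounds learned from the old dataset."""
    return [
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(row, mins, maxs)
    ]


# Hypothetical old dataset with two variables per row.
old_rows = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
mins, maxs = fit_min_max(old_rows)

# Hypothetical new row: the first variable exceeds the old maximum.
new_row = [8.0, 15.0]
scaled_new = scale(new_row, mins, maxs)

# Flag every variable that falls outside the [0, 1] interval.
outside = [v < 0.0 or v > 1.0 for v in scaled_new]
print(scaled_new, outside)
```

A classifier trained on the cluster labels takes the raw variables of a new row as input, so in this sketch's terms no rescaling of the map is needed.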
More investigation should be performed on finding the utility of each clustering or, in other words, on defining "how well" we clustered the data. To evaluate the maps we used two criteria: the average quantization error and the ease of readability of each map. As a further research problem, we would try to develop a new measure, or use an existing one, to validate the clustering. When applying logistic regression, we obtained models with acceptable accuracy rates. All the coefficients of all regression equations were statistically significant except one (CURRENCY for the EconomicPerf dataset). The accuracy rates were evaluated using two criteria: the proportional by chance criterion and the maximum by chance criterion.
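The two chance criteria can be sketched as follows. The sketch assumes the usual definitions from the classification literature (e.g. Hair et al. [4]): the proportional chance criterion is the sum of the squared class proportions, and the maximum chance criterion is the proportion of the largest class. The function name, the class counts, and the accuracy value are hypothetical and are not taken from either dataset.

```python
from collections import Counter


def chance_criteria(labels):
    """Return (proportional, maximum) chance criteria for observed class labels."""
    counts = Counter(labels)
    total = len(labels)
    proportions = [count / total for count in counts.values()]
    proportional = sum(p * p for p in proportions)  # sum of squared proportions
    maximum = max(proportions)                      # share of the largest class
    return proportional, maximum


# Hypothetical observed classes: 50 rows in "a", 30 in "b", 20 in "c".
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
proportional, maximum = chance_criteria(labels)

# Hypothetical model accuracy, compared against both baselines.
accuracy = 0.45
beats_proportional = accuracy > proportional
beats_maximum = accuracy > maximum
print(proportional, maximum, beats_proportional, beats_maximum)
```

In this hypothetical case the model beats the proportional criterion but not the maximum criterion, which is stricter whenever one class dominates.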


The first dataset's accuracy rate did not satisfy the second criterion. When comparing the two classification techniques, we therefore only took into consideration the results of the second dataset. However, as in [13], our findings indicate that the results of the two classification techniques are similar in terms of accuracy rate. Also, when making predictions using the two models, we used data for the FinancialPerf dataset from year . Two out of three new data rows were classified in the same class using both predictive models (Stora Enso and UPM-Kymmene to classes 2 and 1 respectively).

An improvement to our methodology would be to tackle the problem of variable selection for both the clustering and the classification phases, finding a new way to measure clustering utility, and generalizing the methodology. As further research, we will investigate different methods of improving our classification models.

Acknowledgements

The authors would like to thank Professor Barbro Back for her constructive comments on the article.

References

[1] B. Back, K. Sere, and H. Vanharanta, Managing Complexity in Large Data Bases Using Self-Organizing Maps, Accounting Management and Information Technologies 8 (4), Elsevier Science Ltd, Oxford, 1998.
[2] A. Costea, A. Kloptchenko, and B. Back, Analyzing Economical Performance of Central-East-European Countries Using Neural Networks and Cluster Analysis, in Proceedings of the Fifth International Symposium on Economic Informatics, I. Ivan and I. Rosca (eds), Bucharest, Romania, May 2001.
[3] T. Eklund, B. Back, H. Vanharanta, and A. Visa, Assessing the Feasibility of Self-Organizing Maps for Data Mining Financial Information, in Proceedings of the Xth European Conference on Information Systems (ECIS 2002), Gdansk, Poland, June 6-8, 2002.
[4] J. F. Hair, Jr, R. Anderson, and R. L. Tatham, Multivariate Data Analysis with Readings, Second Edition, Macmillan Publishing Company, New York, New York.
[5] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, Cambridge.
[6] S. Kaski and T. Kohonen, Exploratory Data Analysis by the Self-Organizing Map: Structures of Welfare and Poverty in the World, in Neural Networks in Financial Engineering, N. Apostolos, N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend (eds), World Scientific, Singapore, 1996.
[7] J. P. Guiver and C. C. Klimasauskas, Applying Neural Networks, Part IV: Improving Performance, PC/AI Magazine 5 (4), Phoenix, Arizona, 1991.
[8] T. Kohonen, Self-Organizing Maps, 2nd edition, Springer-Verlag, Heidelberg.
[9] J. Lehtinen, Financial Ratios in an International Comparison, Acta Wasaensia 49, Vasa.
[10] J. R. Quinlan, A Case Study in Machine Learning, in Proceedings of ACSC-16 Sixteenth Australian Computer Science Conference, Brisbane, Jan. 1993.
[11] J. R. Quinlan, C4.5 Programs for Machine Learning, Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann Publishers, San Mateo.
[12] J. Rhiannon, C. Jewitt, L. Galasso, and G. Fortemps, Consolidation Changes the Shape of the Top 150, Pulp and Paper International 43 (9), Paperloop, San Francisco, California, 2001.
[13] S. Rudolfer, G. Paliouras, and I. Peers, A Comparison of Logistic Regression to Decision Tree Induction in the Diagnosis of Carpal Tunnel Syndrome, Computers and Biomedical Research 32, Academic Press, 1999.
[14] A. Ultsch, Self organized feature planes for monitoring and knowledge acquisition of a chemical process, in Proceedings of the International Conference on Artificial Neural Networks, Springer-Verlag, London, 1993.
[15] J. Vesanto, Neural Network Tool for Data Mining: SOM Toolbox, in Proceedings of Symposium on Tool Environments and Development Methods for Intelligent Systems (TOOLMET2000), Oulun yliopistopaino, Oulu, Finland, 2000.
[16] J. Vesanto and E. Alhoniemi, Clustering of the Self-Organizing Map, IEEE Transactions on Neural Networks 11 (3), IEEE Neural Networks Society, Piscataway, New Jersey, 2000.
[17] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Academic Press, San Diego.

Appendix: the decision tree
