4.2.1 Bayesian Principal Component Analysis Weighted K Nearest Neighbor Regularized Expectation Maximization


Jocelyn Parks

4 DATA PREPROCESSING

4.1 Data Normalization
    Min-Max
    Z-Score
    Decimal Scaling
4.2 Data Imputation
    Bayesian Principal Component Analysis
    K Nearest Neighbor
    Weighted K Nearest Neighbor
    Local Least Square
    Iterated Local Least Square
    Regularized Expectation Maximization
4.3 Experimental Results
4.4 Chapter Summary

A Framework for Admissible Kernel Function in Support Vector Machines using Lévy Distribution

Data preprocessing is a fundamental building block of the KDD process. It prepares the data by removing outliers, smoothing noisy data and imputing the missing values in the dataset. Though most data mining techniques have predefined noise handling and data imputation mechanisms, preprocessing reduces the confusion during the learning process. In addition, datasets acquired from different data sources may undergo several data preprocessing techniques to produce a final result. The simplified and specialized data preprocessing techniques in the knowledge discovery process are listed as follows:

- Data cleaning
- Data integration
- Data transformation
- Data reduction
- Data discrimination

Data cleaning identifies the origin of the errors that are detected in the dataset and uses that information to prevent the errors from recurring. Thus, the inconsistency in the dataset is removed and data quality is improved. This preprocessing technique is extensively used in data warehouses.

Data integration is a crucial problem in designing decision support systems and data warehouses. Data from different data sources are merged together into an appropriate form that is suitable for mining the patterns. It is used to create a coherent data repository from data sources that include multiple databases, flat files or data cubes.

Data transformation consolidates the data into a specific format that helps to mine the feasible patterns easily. Data transformation can be performed using different techniques such as smoothing, generalization, normalization and feature construction. This is depicted in Figure 4.1.

The data reduction technique reduces the representation of an original dataset into a smaller subset. Usually data reduction techniques can be applied to multidimensional data, where the data must be cubed and given as an input to the reduction algorithms. The input given to the reduction algorithms should be non-empty samples to reduce the approximation error. The reduced dataset should retain the integrity of the original dataset and produce almost the same experimental results.

[Figure 4.1 Taxonomy of Data Transformation techniques. Data Transformation: Smoothing, Generalization, Normalization, Feature Construction. Normalization: Logarithmic, Sigmoid, Statistical Column, Median, Min-Max, Z-Score, Decimal Scaling.]

Data discrimination generates discriminant rules that compare the feature values of the dataset between two classes, referred to as the target class and the contrasting class. In discriminant analysis, multivariate instances with different classes are observed together to form the training data sample. For the instances of the training data the class label is known, and it is used to classify new data instances into one of the predefined classes.

The following are the reasons why the different data preprocessing techniques are often applied to multiple data sources:

- To apply data mining algorithms easily
- To enhance the performance and effectiveness of data mining algorithms
- To represent the data in an understandable format
- To retrieve the data from databases and warehouses quickly and
- To make the datasets suitable for an explicit data analysis

The above listed data preprocessing techniques help in improving the accuracy and efficiency of the classification process. From the data analysis, the two techniques that are required to preprocess the considered datasets in this research work are data normalization and data imputation.

4.1 Data Normalization

Data normalization is a preprocessing technique that brings the given data into a well refined format. The success of a machine learning algorithm largely depends on the quality of the datasets chosen. Thus, data normalization is an important transformation technique that can improve the accuracy and accomplish better performance on the considered datasets. Realizing the significance of transformation techniques in data mining algorithms, the normalization technique is used here to improve the generalization process and learning capability with minimum error.

Normally, the feature values in the dataset are in different scales of measurement. Some features may be integer values while others may be decimal values. The data normalization technique is used to manage and organize the feature values in the dataset. It also scales the feature values to the same specified range. Normalization is used in classification and clustering techniques, since the input data should not be overwhelmed by other data points in terms of the distance metric. It minimizes bias and speeds up the training time in the classification process because each feature value starts in the same range.

From the literature, it is evident that the different types of normalization techniques are logarithmic, sigmoid, statistical column, median, min-max, z-score and decimal scaling.

Logarithmic normalization (Zavadskas and Turskis, 2008) normalizes datasets where the vector component is skewed and distributed exponentially. This normalization technique is based on a non-linear transformation that best represents the data values. If the input values in the dataset are clustered around minimum values with few maximum values, then this transformation can be applied to give better results.

The sigmoid normalization technique (Jayalakshmi and Santhakumaran, 2011) scales the dataset into the range 0 to 1 or -1 to +1. There are different kinds of non-linear sigmoid based normalization techniques. Among these, the tan sigmoid normalization technique is feasible since it estimates the parameters from the noisy data.

The statistical column normalization technique (Jayalakshmi and Santhakumaran, 2011) normalizes each data value by normalizing its column value. In median based normalization (Jayalakshmi and Santhakumaran, 2011), each sample is normalized by the median of the input values in the dataset. It can be applied when there is a requirement to ascertain the ratio between two samples. It is also used in datasets that perform the distribution between the input samples.

In this classification framework, three kinds of data normalization techniques that can enhance support vector machines are applied to the binary and multiclass datasets. By applying and comparing these techniques, the best one is identified. The three data normalization techniques that are used in the classification framework are as follows:

Min-Max

The min-max normalization technique (Kotsiantis et al. 2006) normalizes the dataset using a linear transformation and transforms the input data into a new fixed range. The min-max technique preserves the relationships between the original input values and the scaled values. However, an out-of-bound error is encountered when the values to be normalized fall outside the original data range. This technique ensures that extreme input values are constrained within a specific range. Min-max normalization transforms a value $X_0$ to $X'$, which fits in the specified range, and it is given by the equation (4.1)

$$X' = \frac{X_0 - X_{\min}}{X_{\max} - X_{\min}} \qquad (4.1)$$

where $X'$ is the new value for variable X, $X_0$ is the current value for variable X, $X_{\min}$ is the minimum data point in the dataset and $X_{\max}$ is the maximum data point in the dataset.

Z-Score

Z-score normalization (Kotsiantis et al. 2006) is also known as zero-mean normalization. The z-score technique normalizes the input values in the dataset using the mean and standard deviation. The mean and standard deviation for each feature vector are calculated across the training dataset. This normalization technique determines whether an input value is below or above the average value. It is very useful for normalizing the dataset when the attribute's maximum or minimum values are unknown and outliers dominate the input values. This technique transforms a value $v$ to $v'$ by the equation (4.2)

$$v' = \frac{v - \bar{A}}{\sigma_A} \qquad (4.2)$$

where $v'$ is the new value of an attribute, $v$ is the old value of the attribute, $\bar{A}$ is the mean of attribute A and $\sigma_A$ is the standard deviation of attribute A.

Decimal Scaling

Decimal scaling normalization (Jayalakshmi and Santhakumaran, 2011) is the simplest transformation technique; it normalizes an attribute by moving the decimal point of the input values. The maximum absolute value of an input attribute decides the number of decimal places by which a value is moved. It is shown in the equation (4.3)

$$v' = \frac{v}{10^{j}} \qquad (4.3)$$

where $v'$ is the new value, $v$ is the old value and $j$ is the smallest integer such that $\max(|v'|) < 1$.
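The three normalization techniques of equations (4.1) to (4.3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, the z-score uses the population standard deviation (the text does not say which estimator is meant), and `decimal_scaling` assumes that j is restricted to non-negative integers.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Equation (4.1), optionally rescaled to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Equation (4.2): subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-score is undefined for a constant feature")
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Equation (4.3): divide by 10**j so that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0  # assumption: j is the smallest non-negative integer
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


feature = [200, 300, 400, 600, 1000]
print(min_max(feature))              # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score(feature))
print(decimal_scaling([-986, 917]))  # [-0.986, 0.917]
```

In practice the minimum, maximum, mean and standard deviation would be estimated on the training data only and then reused for new data, which is where the out-of-bound behaviour of min-max normalization shows up.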

4.2 Data Imputation

Missing data is an unrelenting problem in all areas of recent empirical research. This problem should be treated carefully since data plays a key role in every domain analysis. If the missing data problem is handled improperly, then it will produce biased results and distort the data analysis. Even though there are various techniques available in the literature to overcome the missing data problem, data imputation is a technique that imputes the missing data approximately and reduces the estimation error. The main objective of the data imputation technique is to create an inclusive dataset that can be analyzed by an inferential method. Data imputation is broadly categorized into two types: single imputation and multiple imputation. However, choosing the most reliable imputation technique to fill the missing data is a challenging issue for researchers. Figure 4.2 depicts the different techniques that are used to overcome the missing data problem.

[Figure 4.2 Taxonomy of Missing Data techniques. Missing Data: Acquire Missing Data; Reduce Feature Models; Event Covering; Discard Instances (Pairwise Deletion, Listwise Deletion, Non-response Weighting); Data Imputation. Data Imputation: Global Based (PLS, SVD, BPCA); Neighbor Based (LSA, KNN, Wt. KNN, LLS, It. LLS); Model Based (ML, EM, Reg. EM).]

Single value imputation is a simple technique which imputes a single value for each missing datum. Single value based imputation has the disadvantage that it introduces additional uncertainty into the dataset. This disadvantage is overcome by a newer technique, multiple imputation, proposed by Rubin (1976). In this technique, imputation takes place repeatedly to create multiple imputed datasets. Each imputed dataset is analyzed statistically, and the resulting multiple results are combined to present an overall result. Multiple imputation is an attractive choice for researchers who deal with real time problems. It also performs favorably by producing unbiased results.

Single and multiple imputation techniques are classified into three types: global based imputation, neighbor based imputation and model based imputation. The global based imputation technique imputes the missing data using eigenvectors, and the techniques related to global imputation are partial least squares, singular value decomposition and Bayesian Principal Component Analysis (BPCA). The neighbor based imputation technique uses a distance measure to impute missing data. Least square analysis, K Nearest Neighbor (KNN), Weighted K Nearest Neighbor (Wt. KNN), Local Least Square (LLS) and Iterated Local Least Square (It. LLS) are some of the methods in this category. In model based imputation, a predictive model is created to estimate a missing value. The techniques are maximum likelihood, Expectation Maximization and Regularized Expectation Maximization (Reg. EM).

The data imputation technique helps to fill the missing data with a feasible value, but before substituting the missing value the type of missingness should be identified. There are two reasons to distinguish the type of missingness in datasets. First, it helps to check how well the relations between the attribute values are represented (Schafer and Graham, 2002). Next, it identifies the missing data patterns that need to be imputed.

There are three different kinds of missingness (Little and Rubin, 1987) and they are as follows

- Missing completely at random (MCAR)
- Missing at random (MAR) and
- Missing not at random (MNAR)

Missing completely at random

Missing completely at random is a type of missingness where the probability of missing data is entirely due to unrelated events and not because of the attributes in the dataset (Schafer and Graham, 2002; Streiner, 2002). This type of missingness occurs rarely, so it is better to categorize the type of missing data and then impute the values.

Missing at random

In missing at random, the missingness occurs by removing data that may be interrelated to the other attribute values in the dataset (Schafer and Graham, 2002; Streiner, 2002).

Missing not at random

Missing not at random is a missingness that often arises in datasets. The reason for MNAR missingness is removing the outcome of one or more attribute values, and it has an organized pattern (Pigott, 2001; Schafer and Graham, 2002). Usually MCAR and MAR based missingness can be ignored, but MNAR cannot be ignored because missing values due to MNAR are not recoverable.

The missing data problem has a major impact on the feature selection and classification process, so the data imputation technique is used here to make the datasets reliable for the classification framework. Based on the literature, six different data imputation techniques are considered and examined using the binary and multiclass datasets. These techniques can also improve the accuracy and robustness of the kernel based classifier framework. Following are the imputation techniques that are used in this framework

Bayesian Principal Component Analysis

Bayesian principal component analysis (Oba et al. 2003) uses a statistical procedure to impute arbitrary missing data. BPCA imputation presents an accurate and suitable estimation for missing values. Basically, BPCA is dependent on probabilistic principal components and it uses a Bayes technique that iteratively estimates the posterior distribution for the missing data until it converges. The three primary processes that are involved in BPCA are

- Principal component regression
- Bayesian estimation and
- Expectation maximization like repetitive algorithm

K Nearest Neighbor

The KNN imputation technique (Su et al. 2009) is used to estimate and fill the missing values in the dataset. The key factor of the KNN imputation technique is the distance metric, and it is a lazy learner. In KNN imputation, missing values are imputed by combining the columns of the K nearest attribute values in a dataset based on the similarity metric. Here, the similarity metric calculates the distance between a complete record and an incomplete record. The three requirements for KNN imputation are as follows

- Value of K should be decided
- Need training data with labeled classes
- Metric that measures closeness property

Weighted K Nearest Neighbor

Imputing the dataset using K nearest neighbor sometimes leads to loss of information, so weighted K nearest neighbor is introduced (Troyanskaya et al. 2001). The only difference between K nearest neighbor and weighted K nearest neighbor is that Wt. KNN imputes the dataset using a dynamically assigned K value.

Local Least Square

In local least square imputation (Kim et al. 2004), the absolute value of the Pearson correlation coefficient is defined as the similarity metric to select the k attribute values, which results in a local least square Pearson correlation based imputation. When the L2 norm is used as the similarity metric instead of the Pearson correlation, it improves the results. Also, the missing data is imputed as a linear combination of missing value attributes. After defining the similarity metric, the missing value is imputed as a linear combination of the consequent values of the attribute.

Iterated Local Least Square

Iterated Local Least Square imputation (Cai et al. 2005) is used to impute the missing data more accurately. It is often used to impute microarray gene expression data. The Iterated Local Least Square based imputation technique consists of three steps. They are

- A similarity threshold value is used to estimate the known attribute value
- Next, the threshold value is used in local least square based imputation
- Several iterations are performed to obtain an estimated value for the missing data

Regularized Expectation Maximization

The regularized expectation maximization imputation technique (Schneider, 2001) has the same steps as expectation maximization. However, the expectation maximization algorithm cannot be applied to datasets where the number of variables exceeds the input size. Due to this shortcoming, the expectation maximization imputation technique was revised as regularized expectation maximization to impute the missing data. The three steps that are involved in the regularized expectation maximization algorithm are as follows

- Compute the regression parameters from the estimates of the mean and covariance
- Impute the missing values with their conditional expectation values
- Iterate the EM algorithm until it imputes all the missing values

4.3 Experimental Results

The experiments are carried out using binary and multiclass datasets that are taken from the UCI machine learning repository. The dataset description is given in full in the previous chapter. The performance of the data normalization and data imputation techniques is examined and recorded for evaluation. The performance metrics that are used to evaluate the data normalization techniques are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Squared Error with Regularization (MSEREG) and time. They are given by the equations (4.4) to (4.6). Tables 4.1 and 4.2 depict the performance of data normalization techniques for binary and multiclass datasets.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \qquad (4.4)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2} \qquad (4.5)$$

$$\mathrm{MSEREG} = \gamma \cdot \mathrm{MSE} + (1-\gamma)\cdot \mathrm{MSW}, \quad \text{where } \mathrm{MSW} = \frac{1}{n}\sum_{j=1}^{n} w_j^2 \qquad (4.6)$$

where $Y_i$ is a true value and $\hat{Y}_i$ is an estimated value of an attribute.
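As a rough illustration of the metrics in equations (4.4) to (4.6), the sketch below computes MSE, RMSE and MSEREG in plain Python. The symbols lost from equation (4.6) are assumed here to be a ratio `gamma` between 0 and 1 and a list of model weights `weights`, with MSW taken as the mean of the squared weights; these names and the default `gamma=0.5` are assumptions made for the example, not values stated in the chapter.

```python
import math


def mse(y_true, y_pred):
    """Equation (4.4): mean of the squared differences."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Equation (4.5): square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def msereg(y_true, y_pred, weights, gamma=0.5):
    """Equation (4.6), assuming MSEREG = gamma*MSE + (1 - gamma)*MSW."""
    msw = sum(w ** 2 for w in weights) / len(weights)  # mean squared weight
    return gamma * mse(y_true, y_pred) + (1 - gamma) * msw


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
print(mse(y_true, y_pred))                              # 4.0
print(rmse(y_true, y_pred))                             # 2.0
print(msereg(y_true, y_pred, [1.0, -1.0, 2.0, 0.0]))    # 2.75
```

Because RMSE is expressed in the same units as the attribute, it is often easier to read than MSE; the regularized variant additionally penalizes large weights.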

Table 4.1 Performance of Normalization techniques for Binary datasets
Columns: Data Sets, Normalization Technique, MSE, RMSE, MSEREG, Time(s)
Data sets: Iris, Liver, Heart, Diabetes, Breast Cancer, Hepatitis, Ripley
Techniques (for each data set): Min-Max, Z-Score, Decimal Scaling

The metrics that are used to evaluate the data imputation techniques are MSE, RMSE, MSEREG, Mean Absolute Error (MAE) and time. They are given by the equations (4.7) to (4.10). Tables 4.3 and 4.4 represent the performance analysis of data imputation techniques for binary and multiclass datasets.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \qquad (4.7)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2} \qquad (4.8)$$

$$\mathrm{MSEREG} = \gamma \cdot \mathrm{MSE} + (1-\gamma)\cdot \mathrm{MSW}, \quad \text{where } \mathrm{MSW} = \frac{1}{n}\sum_{j=1}^{n} w_j^2 \qquad (4.9)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right| \qquad (4.10)$$

where $Y_i$ is a true value and $\hat{Y}_i$ is an estimated value of an attribute.

Though the different data normalization techniques minimize the estimation error, the empirical results from Tables 4.1 and 4.2 indicate that decimal scaling based normalization produces the best result, with minimum mean squared error, root mean squared error, mean squared error with regularization and time for the considered binary and multiclass datasets. From Tables 4.3 and 4.4, it is known that K nearest neighbor decreases the mean squared error, root mean squared error, mean squared error with regularization, mean absolute error and time when compared to the other techniques for the binary and multiclass datasets used in the experiments. The data preprocessing techniques that refine the results and improve the reliability of the datasets are used in this classification framework. Also, the experimental results have shown that the performance of the classification framework depends on the data preprocessing techniques.
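Since K nearest neighbor imputation is reported as the best performing technique, a minimal sketch of the idea from Section 4.2 may help. It is an assumption-laden illustration rather than the procedure used in the experiments: the function name `knn_impute` is hypothetical, missing values are marked with `None`, neighbours are drawn only from complete records, the distance is Euclidean over the features the incomplete record actually has, and the imputed value is the unweighted mean of the K neighbours.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete records."""
    complete = [r for r in rows if None not in r]
    if len(complete) < k:
        raise ValueError("need at least k complete records")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue

        # Indices of the features that are observed in this incomplete record.
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            # Euclidean distance restricted to the observed features.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(n[j] for n in neighbours) / k
            for j, v in enumerate(row)
        ])
    return imputed


data = [
    [1.0, 2.0],
    [2.0, 4.0],
    [10.0, 20.0],
    [1.5, None],   # second feature is missing
]
print(knn_impute(data, k=2)[-1])  # [1.5, 3.0]
```

With a held-out copy of the true values, the imputed entries can be scored with the MSE, RMSE and MAE definitions given above. Features on very different scales should be normalized first, otherwise the distance is dominated by the largest one.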

Table 4.2 Performance of Normalization techniques for Multiclass datasets
Columns: Data Sets, Technique, MSE, RMSE, MSEREG, Time(s)
Data sets: Iris, Glass, E-Coli, Wine, Balance Scale, Lenses, Pentagon
Techniques (for each data set): Min-Max, Z-Score, Decimal Scaling

Table 4.3 Performance of Imputation techniques for Binary datasets
Columns: Data Sets, Technique, MSE, RMSE, MSEREG, MAE, Time(s)
Data sets: Iris, Liver, Heart, Diabetes, Breast Cancer, Hepatitis, Ripley
Techniques (for each data set): BPCA, LLS, It. LLS, KNN, Wt. KNN, Reg. EM

Table 4.4 Performance of Imputation techniques for Multiclass datasets
Columns: Data Sets, Technique, MSE, RMSE, MSEREG, MAE, Time(s)
Data sets: Iris, Glass, E-Coli, Wine, Balance Scale, Lenses, Pentagon
Techniques (for each data set): BPCA, LLS, It. LLS, KNN, Wt. KNN, Reg. EM

4.4 Chapter Summary

This chapter discusses the experimental results of the data normalization and imputation techniques used for data preprocessing. Though all the techniques have their own merits and demerits, the assessment proposes a few techniques for data preprocessing that best suit the considered binary and multiclass datasets in the classification framework. For data normalization, decimal scaling shows better results. In the case of data imputation, KNN outperforms the other techniques.
