EM algorithm with GMM and Naive Bayesian to Implement Missing Values

Xi-Yu Zhou 1, Joon S. Lim 2

1 I.T. College, Gachon University, Seongnam, South Korea, chinazhouxiyu@gamil.com
2 I.T. College, Gachon University, Seongnam, South Korea, jslim@gachon.ac.kr

Abstract. In practical applications, datasets with missing values are quite common. Handling these missing values is an important preparation step for most data discrimination or mining tasks. This paper proposes a method to impute the missing values based on the EM algorithm. Further, the proposed method uses Naive Bayesian to improve the stability. We conclude by classifying two example datasets and comparing the results to those obtained by applying two other missing value handling methods: the traditional EM method and the missing value ignorance (non-substitution) method.

Keywords: missing values, EM algorithm, GMM, Naive Bayesian.

1 Introduction

Missing values arise in many situations in which no value is recorded for some variable in an experiment or observation [1]. Although missing values are a common occurrence, they can nonetheless have a significant effect on the processing of data and on the results derived from it. Therefore, it is important to impute missing values during the data preprocessing phase [2], and doing so is a key step for most data classification or data mining tasks. There are many methods for accomplishing this, such as the approximation, stochastic regression, and neural network methods. Among these approaches, the EM (expectation-maximization) algorithm can reliably use its expectation step and maximization step to find the optimal values for imputing the missing values. However, the EM algorithm converges quite slowly and easily falls into local optima. If we give the EM algorithm fixed initial values, we can increase both its speed of convergence and its stability.
Below, we describe both the traditional EM algorithm and the NB-EM algorithm.

2 Corresponding author: Joon S. Lim, Information Technology College, Gachon University, Seongnam, South Korea. jslim@gachon.ac.kr

ISSN: ASTL Copyright 2014 SERSC

1.1. Traditional EM algorithm

The EM algorithm is a popular method of iterative refinement [3]. Each iteration consists of an Expectation Step, which estimates the missing values, and a Maximization Step, which updates the model parameters. In more detail:

(1) First estimate initial values for the missing values and obtain the values of the model parameters.
(2) Iteratively repeat the Expectation Step and Maximization Step, updating the estimated values, until the function converges [4].

1.2. EM algorithm with Naive Bayesian

The traditional EM algorithm randomly chooses samples as the center of each class, which easily affects the clustering result. Further, marginal values have a high probability of affecting the entire algorithm, thereby decreasing the accuracy of the imputed missing values. Because of these problems, this paper proposes an improved EM algorithm based on Naive Bayesian, which we call the NB-EM algorithm. In this method, we use Naive Bayesian to classify the dataset, and then use the classification results in place of the randomly selected center of each class before repeating the Expectation Step and Maximization Step. This algorithm works as follows:

(1) Use the Naive Bayesian classifier in Weka, a collection of classification tools developed at the University of Waikato, to obtain the classification results [5].
(2) Use the classification result from (1) in place of the random initial classes, and repeat the Expectation Step and Maximization Step to obtain the optimal values and update the model parameters.

To handle high-dimensional problems, we can combine the GMM (Gaussian Mixture Model) with the EM algorithm. GMM can transform the high-dimensional model to the low-dimensional model.
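As an illustration of step (2) of the NB-EM procedure, the following Python sketch starts EM from class labels instead of random centers. It is a minimal single-attribute example under stated assumptions, not the authors' implementation: the labels are assumed to come from any classifier (the paper uses Naive Bayesian in Weka), missing entries are marked with None, each class is assumed to have at least one observed value, the standard deviation is held at its initial estimate, and each gap is filled with the mean of the sample's labeled class. All function names are hypothetical. The density, membership, and update formulas follow the Expectation and Maximization steps described next.

```python
import math


def gaussian_pdf(y, mean, std):
    """One-dimensional Gaussian density, floored so a ratio never becomes 0/0."""
    z = (y - mean) / std
    density = math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)
    return max(density, 1e-300)


def init_from_labels(values, labels, n_classes):
    """Initial weight, mean, and std per class, taken from classifier labels.

    Assumption: every class has at least one observed (non-None) value.
    """
    n_observed = sum(v is not None for v in values)
    params = []
    for k in range(n_classes):
        obs = [v for v, c in zip(values, labels) if c == k and v is not None]
        mean = sum(obs) / len(obs)
        var = sum((v - mean) ** 2 for v in obs) / len(obs)
        params.append({
            "weight": len(obs) / n_observed,
            "mean": mean,
            "std": max(math.sqrt(var), 1e-6),  # avoid a zero-width component
        })
    return params


def em_fit(observed, params, n_iter=50):
    """Alternate the Expectation and Maximization steps on the observed values."""
    n = len(observed)
    for _ in range(n_iter):
        # Expectation Step: probability that sample j belongs to class k.
        resp = []
        for y in observed:
            joint = [p["weight"] * gaussian_pdf(y, p["mean"], p["std"])
                     for p in params]
            total = sum(joint)
            resp.append([value / total for value in joint])
        # Maximization Step: update each class mean and mixing coefficient.
        for k, p in enumerate(params):
            mass = sum(r[k] for r in resp)
            p["mean"] = sum(r[k] * y for r, y in zip(resp, observed)) / mass
            p["weight"] = mass / n
    return params


def impute_attribute(values, labels, n_classes):
    """Fill each None with the fitted mean of the sample's labeled class."""
    params = init_from_labels(values, labels, n_classes)
    observed = [v for v in values if v is not None]
    params = em_fit(observed, params)
    filled = [params[c]["mean"] if v is None else v
              for v, c in zip(values, labels)]
    return filled, params
```

Calling `impute_attribute([1.0, 1.2, None, 0.8, 5.0, None, 5.2, 4.8], [0, 0, 0, 0, 1, 1, 1, 1], 2)` fills the two gaps with values close to 1.0 and 5.0, the means of the two well-separated classes. Because the starting point is deterministic given the labels, repeated runs give the same result, which is the stability argument made for NB-EM above.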
Expectation Step: We use the mean and standard deviation to obtain the Gaussian distribution density function (1), which describes the distribution of values.

\[ \phi(y \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \tag{1} \]

Here \theta = (\mu, \sigma) denotes the mean and standard deviation. At the same time, we want to obtain the classification to assist in replacing the missing values. Here, (2) describes the probability that Sample j belongs to Class k, where \alpha_k is the coefficient of Class k, N is the number of samples, and K is the number of classes.

\[ \hat{\gamma}_{jk} = \frac{\alpha_k \, \phi(y_j \mid \theta_k)}{\sum_{k=1}^{K} \alpha_k \, \phi(y_j \mid \theta_k)}, \quad j = 1, 2, \ldots, N; \; k = 1, 2, \ldots, K \tag{2} \]

Maximization Step: The main task in this step is to update the expectation of each attribute (3), which is used to impute the missing values, and the coefficient \alpha_k of the distribution density function (4), which describes the probability of each category.

\[ \hat{\mu}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk} \, y_j}{\sum_{j=1}^{N} \hat{\gamma}_{jk}} \tag{3} \]

\[ \hat{\alpha}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N} \tag{4} \]

2 Data Implement and Classification Result

2.1 Data Implementation

In this experiment, we selected two datasets, both of which were downloaded from the UCI Machine Learning Repository. The first dataset describes kernels belonging to two different varieties of wheat, Kama and Rosa, with 70 randomly selected samples each [6]. The second dataset describes vertebral columns divided into two categories: Normal (100 patients) and Abnormal (210 patients) [7]. To make the results more pronounced, we used the MCAR (missing completely at random) method to increase the rate of missing values up to 30%, and compared the results of both the traditional EM algorithm and our NB-EM algorithm to the datasets prior to values being removed.

2.2 Classification Results

Tables 1 and 2 show the results of the different methods of imputing the missing values, using the Multilayer Perceptron in Weka as the classifier [8]. The accuracy rate indicates which method is more effective.

Table 1. Classification Results of Seeds.

Dataset                      Correctly Classified Instances
Original Dataset             %
Dataset with EM algorithm    %
Dataset with NB-EM           %
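The MCAR masking described in Section 2.1 can be sketched as follows. This is an assumed procedure rather than the authors' exact one: the paper does not say how cells were selected, so the sketch simply removes a fixed fraction of attribute cells uniformly at random, independent of their values. The function name `mask_mcar`, the use of None as the missing marker, and the `seed` parameter are all hypothetical, and the rows are assumed to hold attribute values only, with class labels kept separately.

```python
import random


def mask_mcar(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of cells set to None.

    Every cell is equally likely to be removed regardless of its value,
    which is what "missing completely at random" means.
    """
    rng = random.Random(seed)  # seeded so the masking is reproducible
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(rate * len(cells))
    masked = [list(row) for row in rows]  # leave the input untouched
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked
```

With `rate=0.3`, a dataset of 10 samples and 4 attributes loses 12 of its 40 cells. Nothing in this sketch prevents a single sample from losing all of its attributes; a real experiment may want to resample in that case.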

Table 2. Classification Results of Column.

Dataset                      Correctly Classified Instances
Original Dataset             %
Dataset with EM algorithm    %
Dataset with NB-EM           %

In both tables, the first row is the result of classifying the MCAR dataset without any processing (Original Dataset). The second row is the result of using the traditional EM algorithm to substitute the missing values (Dataset with EM algorithm). The third row is the result of using the NB-EM algorithm to substitute the missing values (Dataset with NB-EM). In each case, the resulting dataset was then classified with the Multilayer Perceptron.

3 Experimental Results

In this paper, we studied a new method, the NB-EM algorithm, for handling missing values in preparation of datasets for data discrimination and mining applications [9]. The classification accuracies make it easy to determine which method is most effective. Compared with the traditional EM algorithm, the NB-EM algorithm has a higher accuracy rate, which suggests that the NB-EM algorithm handles missing values better in practice. Applying these results to data mining and knowledge discovery could help in selecting a method for handling missing values during the data preprocessing phase for different data structures, and could enable a more reliable and efficient decision-making process given the uncertainty and incompleteness of the data collections presented.

Acknowledgments. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology. (2012R1A1A )

References

1. Vach, W.: Missing values: statistical theory and computational practise, in Computational Statistics, edited by P. Dirschedl and R. Ostermann, Heidelberg: Physica-Verlag
2. Lakshminarayan, K.: Imputation of missing data in industrial databases. Applied Intelligence 11 (1999)
3. Pilla, R.S., Lindsay, B.G.: Alternative EM methods for nonparametric finite mixture models. Biometrika 88
4. Celeux, G., Chrétien, S., Forbes, F., Mkhadri, A.: A component-wise EM algorithm for mixtures. J. Comput. Graph. Stat. 10
5. Bilmes, J. A.: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, U.C. Berkeley, TR-97-021, April
6. UCI Repository of Machine Learning. datasets/seeds
7. UCI Repository of Machine Learning. /datasets/vertebral+column
8. Porter, B.W., Bareiss, R., Holte, R.C.: Concept learning and heuristic classification in weak-theory domains. Artificial Intelligence 45
9. Huang, X. L.: A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets. Pattern Recognition Letters, 2002(23)
