Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Research Paper

C. Sudarsana Reddy (1), S. Aquter Babu (2), Dr. V. Vasu (3)

(1) Department of Computer Science and Engineering, S.V. University College of Engineering, S.V. University, Tirupati, Andhra Pradesh, India.
(2) Assistant Professor of Computer Science, Department of Computer Science, Dravidian University, Kuppam - 517425, Chittoor District, Andhra Pradesh, India.
(3) Department of Mathematics, S.V. University, Tirupati, Andhra Pradesh, India.

Abstract

Classical decision tree classifiers are constructed from certain (point) data only, but in many real-life applications data is inherently uncertain. Attribute (value) uncertainty is introduced during the data collection process. Attributes in training data sets are of two types, numerical (continuous) and categorical (discrete), and data uncertainty can exist in both. Uncertainty in a numerical attribute means the value is a range of values; uncertainty in a categorical attribute means the value is a set of possible values. In this paper we propose a method for handling data uncertainty in numerical attributes. One of the simplest ways of handling such uncertainty is to replace the set of original values behind each recorded attribute value with their mean (representative) value. Under uncertainty, however, the value of an attribute is better represented by a set of values, and decision tree classification accuracy improves considerably when attribute values are represented by sets of values rather than by a single representative value. A probability density function with equal probabilities is an effective uncertainty modelling technique for representing each attribute value as a set of values. The main assumption is that the values provided in the training data set are averaged (representative) values of the data originally collected. For each representative value of each numerical attribute, approximations of the originally collected values are generated using a probability density function with equal probabilities, and these newly generated sets of values are used to construct a new decision tree classifier.

Keywords: probability density function with equal probabilities; uncertain data; decision tree; classification; data mining; machine learning

1. INTRODUCTION

Classification is a data analysis technique. The decision tree is a powerful and popular tool for classification and prediction, although decision trees are mainly used for classification [1]. The main advantage of a decision tree is its interpretability: the tree can easily be converted into a set of IF-THEN rules that are readily understandable [2]. Example sources of data uncertainty include measurement and quantization errors, data staleness, and multiple repeated measurements [3]. Data mining tasks on uncertain data include classification, clustering, frequent pattern mining, and outlier detection. Examples of certain data are the locations of universities, buildings, schools, colleges, restaurants, railway stations, and bus stands.

Data uncertainty arises naturally in a large number of real-life applications, including scientific data, web data integration, machine learning, and information retrieval. Data uncertainty in databases is broadly classified into three types:

1. Attribute or value uncertainty
2. Correlated uncertainty
3. Tuple or existential uncertainty

In attribute or value uncertainty, the value of each attribute is represented by an independent probability distribution. In correlated uncertainty, the values of multiple attributes are described by a joint probability distribution. Tuple or existential uncertainty exists when it is uncertain whether a data tuple is present in the relational database at all; for example, a tuple may be associated with a probability that represents the confidence of its presence [3]. Assume we have the tuple [(a, 0.4), (b, 0.5)] from a probabilistic database: the alternatives account for a total probability of 0.9, so the tuple has a 10% chance of not existing in the database.

Constructing a decision tree classifier directly from the given representative (certain) values, without any modification, is called the certain decision tree (CDT) construction approach. Constructing a decision tree classifier by modelling uncertain data with a probability density function with equal probabilities is called the decision tree on uncertain data (DTU) approach. Decision tree classifiers constructed from single representative values are less accurate than those constructed from the approximated probability density function values generated for each representative value of each numerical attribute. The originally collected values are approximately regenerated by the probability density function modelling technique with equal probabilities; hence this technique models the data uncertainty appropriately. When data mining is performed on uncertain data, different uncertainty modelling techniques have to be considered in order to obtain high-quality mining results. One of the current challenges in data mining is therefore to develop techniques that can manage and analyse uncertain data.

2. INTRODUCTION TO UNCERTAIN DATA

[Fig 1.1 Taxonomy of Uncertain Data Mining: data mining is divided into mining on precise data and mining on uncertain data, covering association rule mining, classification, clustering (hard and fuzzy), and other data mining methods.]

Many real-life applications contain uncertain data. Under data uncertainty, data values are no longer atomic or certain. Data is often associated with uncertainty because of measurement errors, sampling errors, repeated measurements, and outdated data sources [3]. Applying data mining techniques to uncertain data is called uncertain data mining (UDM). Sometimes certain data values are deliberately transformed into ranges of values to preserve privacy; for example, the true age of a person may be reported as the range [16, 26].

Tuple No. | Marks (Numerical) | Result (Categorical)  | Class Label (Categorical)
1         | 550 - 600         | (0.8, 0.1, 0.1, 0.0)  | (0.8, 0.2)
2         | 222 - 444         | (0.6, 0.2, 0.1, 0.1)  | (0.5, 0.5)
3         | 470 - 580         | (0.7, 0.2, 0.1, 0.0)  | (0.9, 0.1)
4         | 123 - 290         | (0.4, 0.2, 0.3, 0.1)  | (0.7, 0.3)
5         | 345 - 456         | (0.6, 0.2, 0.1, 0.1)  | (0.8, 0.2)
6         | 111 - 333         | (0.3, 0.3, 0.2, 0.2)  | (0.9, 0.1)
7         | 200 - 280         | (0.3, 0.3, 0.2, 0.2)  | (0.7, 0.3)
8         | 500 - 580         | (0.7, 0.2, 0.1, 0.0)  | (0.5, 0.5)
9         | 530 - 590         | (0.7, 0.3, 0.0, 0.0)  | (0.6, 0.4)
10        | 450 - 550         | (0.7, 0.2, 0.1, 0.0)  | (0.4, 0.6)
11        | 150 - 250         | (0.3, 0.3, 0.2, 0.2)  | (0.2, 0.8)
12        | 180 - 260         | (0.4, 0.2, 0.2, 0.2)  | (0.1, 0.9)

Table 1.1 Example of numerical uncertain and categorical uncertain attributes

The Marks attribute is a numerical uncertain attribute (NUA) and the Result attribute is a categorical uncertain attribute (CUA). The class label can also be either numerical or categorical.
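As a small illustration of how one row of Table 1.1 might be represented in a program, the following Java sketch stores the Marks attribute as an interval and the Result and class-label attributes as probability distributions. The class and field names are assumptions made for this example; they are not taken from the paper.

    // Minimal sketch (assumed names, not from the paper): one training tuple with an
    // uncertain numerical attribute (an interval) and uncertain categorical attributes
    // (probability distributions over their possible values).
    public class UncertainTuple {
        final double marksLow;      // lower end of the Marks interval
        final double marksHigh;     // upper end of the Marks interval
        final double[] resultDist;  // probabilities over the possible Result categories
        final double[] classDist;   // probabilities over the possible class labels

        public UncertainTuple(double marksLow, double marksHigh,
                              double[] resultDist, double[] classDist) {
            this.marksLow = marksLow;
            this.marksHigh = marksHigh;
            this.resultDist = resultDist;
            this.classDist = classDist;
        }

        public static void main(String[] args) {
            // Tuple 1 of Table 1.1: Marks in [550, 600], Result distribution
            // (0.8, 0.1, 0.1, 0.0), class label distribution (0.8, 0.2).
            UncertainTuple t1 = new UncertainTuple(550, 600,
                    new double[] {0.8, 0.1, 0.1, 0.0},
                    new double[] {0.8, 0.2});
            System.out.println("Marks interval width: " + (t1.marksHigh - t1.marksLow));
        }
    }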

3. PROBLEM DEFINITION

In many real-life applications, information cannot be ideally represented by point data alone. The decision tree algorithms developed so far are based on the certain values present in the numerical attributes of the training data sets, and these values are only representatives of the data originally collected. Data uncertainty was not considered during the development of many data mining algorithms, including decision tree classification; classical (certain) decision tree classifiers therefore have no mechanism for handling uncertain data. Data uncertainty is usually modelled by a probability density function, which is represented by a set of values rather than a single representative (average, aggregate) value. In uncertain data management, training tuples are thus typically represented by probability distributions rather than deterministic values.

Existing decision tree classifiers assume that the attribute values in the tuples are known, precise point values. In real life, data values inherently suffer from value (attribute) uncertainty, so certain (traditional) decision tree classifiers produce less accurate mining results. Decision tree classifiers constructed from representative (average) values are less accurate than those constructed from the approximated values generated, with equal probabilities, by a probability density function for each representative value of each numerical attribute. The originally collected values are approximately regenerated by the probability density function with equal probabilities; hence the probability density function models the data uncertainty. A training data set can contain both uncertain numerical attributes (UNAs) and uncertain categorical attributes (UCAs), and both training and test tuples may contain uncertain data. As data uncertainty is widespread in real life, it is important to develop accurate and efficient data mining techniques for uncertain data.

The present study proposes an algorithm called Decision Tree classifier construction on Uncertain Data (DTU) to improve on the performance of the Certain Decision Tree (CDT). DTU uses a probability density function with equal probabilities to model the data uncertainty in the values of the numerical attributes of the training data sets. The performance of the two algorithms is compared experimentally through simulation, and DTU proves to be the more accurate.

4. EXISTING ALGORITHM

4.1 Certain Decision Tree (CDT) Algorithm Description

In existing decision tree classifier construction, each tuple t_i is associated with a set of attribute values from the training data set. The i-th tuple is represented as t_i = (t_i,1, t_i,2, t_i,3, ..., t_i,k, classLabel), where i is the tuple number and k is the number of attributes in the training data set; t_i,1 is the value of the first attribute of the i-th tuple, t_i,2 is the value of the second attribute, and so on. To find the class label of an unseen (new) test tuple t_test = (a_1, a_2, a_3, ..., a_k, ?), the decision tree is traversed from the root node to a specific leaf node.
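This traversal can be sketched as follows. The sketch is a minimal illustration, not the paper's implementation; the node structure and identifiers (TreeNode, splitAttribute, splitPoint, majorityClass) are assumed.

    // Minimal sketch of classifying a test tuple by traversing a built decision tree.
    class TreeNode {
        boolean isLeaf;
        int splitAttribute;      // index of the attribute tested at this internal node
        double splitPoint;       // best split point z for that attribute
        String majorityClass;    // class label stored at a leaf
        TreeNode left, right;
    }

    class Classify {
        // Follow the crisp tests t[splitAttribute] <= splitPoint until a leaf is reached.
        static String classify(TreeNode node, double[] testTuple) {
            while (!node.isLeaf) {
                node = (testTuple[node.splitAttribute] <= node.splitPoint) ? node.left : node.right;
            }
            return node.majorityClass;
        }

        public static void main(String[] args) {
            // Tiny hand-built tree: split on attribute 0 at 5.0; left leaf "low", right leaf "high".
            TreeNode leftLeaf = new TreeNode();  leftLeaf.isLeaf = true;  leftLeaf.majorityClass = "low";
            TreeNode rightLeaf = new TreeNode(); rightLeaf.isLeaf = true; rightLeaf.majorityClass = "high";
            TreeNode root = new TreeNode();
            root.splitAttribute = 0; root.splitPoint = 5.0; root.left = leftLeaf; root.right = rightLeaf;
            System.out.println(classify(root, new double[] {7.3}));  // prints "high"
        }
    }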

The certain decision tree (CDT) algorithm constructs a decision tree classifier by splitting each node into left and right child nodes. Initially, the root node contains all the training tuples. The process of partitioning the training tuples in a node into two groups, based on the best split point value z of the best split attribute, and storing the resulting tuples in its left and right child nodes is referred to as splitting. Whenever no further split of a node is required, it becomes a leaf node, also referred to as an external node. The splitting process is carried out recursively at each internal node until no further split is required. Continuous-valued attributes must be discretized prior to attribute selection [7]. Further splitting of an internal node is stopped if all the tuples in the node have the same class label, or if splitting does not produce non-empty left and right child nodes. During decision tree construction, only crisp and deterministic tests are applied within each internal node.

Entropy is a function used to measure the degree of dispersion of the training tuples in a node. In decision tree construction the goodness of a split is quantified by an impurity measure [2], and one possible impurity measure is entropy [2]. Entropy is an information-based measure that depends only on the proportions of tuples of each class in the node. Entropy is used here as the dispersion measure because it is predominantly used for constructing decision trees; in most cases it finds the best split and produces balanced child nodes that are each as pure as possible. The accuracy and execution time of the CDT algorithm on 9 data sets are shown in Table 6.2.

The entropy of a node S is calculated as

$Entropy(S) = -\sum_{c} p_c \log_2 p_c$

where $p_c$ is the proportion of tuples in S that belong to class c. For a candidate split of node S on attribute $A_j$ at split point z, the entropy of the split is

$Entropy(A_j, z) = \frac{L}{S}\left(-\sum_{c} \frac{L_c}{L} \log_2 \frac{L_c}{L}\right) + \frac{R}{S}\left(-\sum_{c} \frac{R_c}{R} \log_2 \frac{R_c}{R}\right)$

where $A_j$ is the splitting attribute, L is the total number of tuples to the left of the split point z, R is the total number of tuples to the right of z, $L_c$ and $R_c$ are the numbers of tuples with class label c to the left and right of z respectively, and S is the total number of tuples in the node.

4.2 Pseudo code for Certain Decision Tree (CDT) Algorithm

CERTAIN_DECISION_TREE(T)
    if all the training tuples in node T have the same class label then
        set T as a leaf node
        return T
    if the tuples in node T have more than one class then
        Find_Best_Split(T)
        for i = 1 to datasize[T] do
            if split_attribute_value[t_i] <= split_point[T] then
                add tuple t_i to left[T]
            else
                add tuple t_i to right[T]
        if left[T] = NIL or right[T] = NIL then
            create the empirical probability distribution of node T
            return T
        if left[T] != NIL and right[T] != NIL then
            CERTAIN_DECISION_TREE(left[T])
            CERTAIN_DECISION_TREE(right[T])
        return T
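To make the split measure concrete, the following Java sketch evaluates candidate split points of a single numerical attribute using the entropy formulas above, in the spirit of Find_Best_Split. It is an illustrative reconstruction, not the paper's code; the identifiers and the choice of midpoints between consecutive values as candidate split points are assumptions.

    import java.util.Arrays;

    // Illustrative sketch: evaluate candidate split points of one numerical attribute
    // by the entropy-based split measure described in Section 4.1.
    public class BestSplit {

        // Entropy of a set of class labels: -sum_c p_c * log2(p_c).
        static double entropy(int[] classCounts, int total) {
            double e = 0.0;
            for (int count : classCounts) {
                if (count == 0) continue;
                double p = (double) count / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
            return e;
        }

        // Weighted entropy of splitting at point z: (L/S)*Entropy(L) + (R/S)*Entropy(R).
        static double splitEntropy(double[] values, int[] labels, int numClasses, double z) {
            int[] leftCounts = new int[numClasses];
            int[] rightCounts = new int[numClasses];
            int left = 0, right = 0;
            for (int i = 0; i < values.length; i++) {
                if (values[i] <= z) { leftCounts[labels[i]]++; left++; }
                else                { rightCounts[labels[i]]++; right++; }
            }
            int total = left + right;
            double e = 0.0;
            if (left > 0)  e += (double) left / total * entropy(leftCounts, left);
            if (right > 0) e += (double) right / total * entropy(rightCounts, right);
            return e;
        }

        public static void main(String[] args) {
            // Toy data: one numerical attribute and binary class labels.
            double[] values = {1.0, 2.0, 3.0, 8.0, 9.0, 10.0};
            int[] labels    = {0,   0,   0,   1,   1,   1};

            // Candidate split points taken midway between consecutive values, one common
            // convention; DTU instead evaluates PDF-generated candidates (Section 5).
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            double bestZ = Double.NaN, bestE = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < sorted.length; i++) {
                double z = (sorted[i] + sorted[i + 1]) / 2.0;
                double e = splitEntropy(values, labels, 2, z);
                if (e < bestE) { bestE = e; bestZ = z; }
            }
            System.out.println("best split point = " + bestZ + ", split entropy = " + bestE);
        }
    }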

5. PROPOSED ALGORITHM

5.1 Proposed Decision Tree Classification on Uncertain Data (DTU) Algorithm Description

The procedure for constructing the Decision Tree classifier on Uncertain Data (DTU) is the same as that for the Certain Decision Tree (CDT) classifier, except that DTU calculates entropies for all the modelled data values of the numerical attributes of the training data set, generated by a probability density function with equal probabilities. For each value of each numerical attribute, an interval is constructed and a set of n sample values is generated within the interval using a Gaussian probability density function, with the attribute value as the mean and a standard deviation equal to the length of the interval divided by 6. Entropies are then computed at all n sample points within the interval, and the point with the minimum entropy is selected. If the training data set contains m tuples, each attribute has m values and m - 1 intervals are generated per attribute; within each interval n probability density function values with equal probabilities are generated, the entropy is calculated at all of them, and one best split point is selected per interval. One optimal split point is then selected from the best points of all the intervals of a particular attribute, and the same process is repeated for every attribute. Finally, one optimal split attribute and one optimal split point are selected from the k attributes and the k·n·(m - 1) potential split points. The optimal split attribute together with the optimal split point constitutes the optimal split pair.

The DTU algorithm constructs a decision tree classifier by splitting each node into left and right child nodes. Initially, the root node contains all the training tuples, and the n sample values generated by the probability density function model for each value of each numerical attribute are stored in the root node. Entropy values are computed for the k·n·(m - 1) candidate split points, where k is the number of attributes of the training data set, m is the number of training tuples at the current node T, and n is the number of probability density function values generated for each numerical attribute value. The process of partitioning the training tuples of a node into two subsets based on the best split point value z of the best split attribute, and storing the resulting tuples in its left and right child nodes, is referred to as splitting. After the root node is split into left and right child nodes, the same process is applied to both children. The recursion stops when all the tuples in a node have the same class or when a node cannot be split into two non-empty child nodes.

5.2 Pseudo code for Decision Tree Classification on Uncertain Data (DTU) Algorithm

UNCERTAIN_DATA_DECISION_TREE(T)
    if all the training tuples in node T have the same class label then
        set T as a leaf node
        return T
    if the tuples in node T have more than one class then
        for each value of each numerical attribute, construct an interval, compute the
            entropy at the n values generated within the interval by the probability
            density function with equal probabilities, and select the minimum-entropy
            point as the best point of that interval (m - 1 intervals per attribute,
            giving k·n·(m - 1) potential split points in total)
        Find_Best_Split(T)
        for i = 1 to datasize[T] do
            if split_attribute_value[t_i] <= split_point[T] then
                add tuple t_i to left[T]
            else
                add tuple t_i to right[T]
        if left[T] = NIL or right[T] = NIL then
            create the empirical probability distribution of node T
            return T
        if left[T] != NIL and right[T] != NIL then
            UNCERTAIN_DATA_DECISION_TREE(left[T])
            UNCERTAIN_DATA_DECISION_TREE(right[T])
        return T

DTU builds more accurate decision tree classifiers, but its computational complexity is roughly n times that of CDT, so DTU is not as efficient as CDT. With respect to accuracy the DTU classifier is better, but with respect to efficiency the CDT classifier is better. To reduce the computational cost of the DTU classifier we have proposed a pruning technique in which the entropy is calculated at only one best point per interval. The resulting pruned version, PDTU, is more accurate while having approximately the same computational complexity as CDT.
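The candidate-generation step of DTU can be pictured with the Java sketch below, which draws n equally weighted sample values around each representative attribute value from a Gaussian whose mean is the value itself and whose standard deviation is the interval length divided by 6, as described in Section 5.1. The interval construction (the gap to the next larger sorted value) and all identifiers are assumptions made for this sketch, not details taken from the paper's implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch of DTU's candidate-generation step (Section 5.1).
    public class PdfCandidates {

        static List<Double> candidateSplitPoints(double[] attributeValues, int n, long seed) {
            double[] sorted = attributeValues.clone();
            Arrays.sort(sorted);
            Random rng = new Random(seed);
            List<Double> candidates = new ArrayList<>();
            // m values give m - 1 intervals; n candidates are generated per interval.
            for (int i = 0; i + 1 < sorted.length; i++) {
                double intervalLength = sorted[i + 1] - sorted[i];
                if (intervalLength == 0) continue;        // skip duplicate values
                double mean = sorted[i];                  // representative value as the mean
                double sigma = intervalLength / 6.0;      // interval length divided by 6
                for (int j = 0; j < n; j++) {
                    candidates.add(mean + sigma * rng.nextGaussian());
                }
            }
            return candidates;
        }

        public static void main(String[] args) {
            double[] marks = {550, 222, 470, 123, 345};   // representative values of one attribute
            List<Double> candidates = candidateSplitPoints(marks, 4, 42L);
            // Each candidate carries equal weight; in DTU every candidate is then scored
            // with the split entropy of Section 4.1 and the minimum-entropy point retained.
            System.out.println(candidates);
        }
    }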

The accuracy and execution time of the certain decision tree (CDT) classifier on 9 training data sets are shown in Table 6.2, the accuracy and execution time of the decision tree classification on uncertain data (DTU) classifier on the same data sets are shown in Table 6.3, and the two algorithms are compared in Table 6.4 and charted in Figure 6.1 and Figure 6.2.

6. EXPERIMENTAL RESULTS

A simulation model was developed to evaluate the performance of the two algorithms, the Certain Decision Tree (CDT) classifier and the Decision Tree classification on Uncertain Data (DTU) classifier, experimentally. The training data sets listed in Table 6.1, taken from the University of California Irvine (UCI) Machine Learning Repository, are used to evaluate the accuracy and performance of the two algorithms.

No | Data Set Name | Training Tuples | No. of Attributes | No. of Classes
1  | Iris          | 150   | 4  | 3
2  | Glass         | 214   | 9  | 6
3  | Ionosphere    | 351   | 32 | 2
4  | Breast        | 569   | 30 | 2
5  | Vehicle       | 846   | 18 | 4
6  | Segment       | 2310  | 14 | 7
7  | Satellite     | 4435  | 36 | 6
8  | Page          | 5473  | 10 | 5
9  | Pen Digits    | 7494  | 16 | 10

Table 6.1 Data sets from the UCI Machine Learning Repository

In all our experiments we used training data sets from the UCI Machine Learning Repository [6]. The simulation model is implemented in Java 1.7 on a personal computer with a 3.22 GHz Pentium Dual Core processor and 2 GB of main memory. The performance measures, accuracy and execution time, for the two algorithms are presented in Table 6.2 to Table 6.4 and Figure 6.1 to Figure 6.2.

No | Data Set Name | Total Tuples | Accuracy | Execution Time
1  | Iris          | 150  | 97.4422 | 1.1
2  | Glass         | 214  | 88.4215 | 1.3
3  | Ionosphere    | 351  | 84.4529 | 1.47
4  | Breast        | 569  | 96.9614 | 2.5678
5  | Vehicle       | 846  | 78.9476 | 6.9
6  | Segment       | 2310 | 97.0121 | 29.4567
7  | Satellite     | 4435 | 83.94   | 153.234
8  | Page          | 5473 | 97.8762 | 36.4526
9  | Pen Digits    | 7494 | 90.2496 | 656.164

Table 6.2 Certain Decision Tree (CDT) accuracy and execution time

No | Data Set Name | Total Tuples | Accuracy  | Execution Time
1  | Iris          | 150  | 98.5666  | 1.1
2  | Glass         | 214  | 95.96    | 1.2
3  | Ionosphere    | 351  | 98.128   | 16.504
4  | Breast        | 569  | 97.345   | 24.223
5  | Vehicle       | 846  | 96.01281 | 35.365
6  | Segment       | 2310 | 98.122   | 212.879
7  | Satellite     | 4435 | 85.891   | 294.96
8  | Page          | 5473 | 98.8765  | 289.232
9  | Pen Digits    | 7494 | 91.996   | 899.3491

Table 6.3 Decision Tree classification on Uncertain Data (DTU) accuracy and execution time

No | Data Set Name | CDT Accuracy | DTU Accuracy | CDT Execution | DTU Execution
1  | Iris          | 97.4422 | 98.566  | 1.1      | 1.1
2  | Glass         | 88.4215 | 95.96   | 1.3      | 1.2
3  | Ionosphere    | 84.4529 | 98.128  | 1.47     | 16.504
4  | Breast        | 96.9614 | 97.345  | 2.5678   | 24.223
5  | Vehicle       | 78.9476 | 96.281  | 6.9      | 35.365
6  | Segment       | 97.0121 | 98.122  | 29.4567  | 212.879
7  | Satellite     | 83.94   | 85.891  | 153.234  | 294.96
8  | Page          | 97.8762 | 98.847  | 36.4526  | 289.232
9  | Pen Digits    | 90.2496 | 91.996  | 656.164  | 899.434

Table 6.4 Comparison of accuracy and execution times of CDT and DTU

Figure 6.1 Comparison of execution times of CDT and DTU

Figure 6.2 Comparison of classification accuracies of CDT and DTU

7. CONCLUSIONS

7.1 Contributions

The performance of the existing traditional (certain) decision tree classifier (CDT) is verified experimentally through simulation. A new decision tree classifier construction algorithm called Decision Tree Classification on Uncertain Data (DTU) is proposed and compared with the existing Certain Decision Tree (CDT) classifier. It is found experimentally that the classification accuracy of the proposed DTU algorithm is considerably better than that of the CDT algorithm.

7.2 Limitations

The proposed DTU classifier construction algorithm handles only the data uncertainty present in the values of the numerical (continuous) attributes of the training data sets, not the uncertainty present in the categorical (discrete) attributes. In addition, the computational complexity of DTU is high, and its execution time is greater than that of CDT for many of the training data sets.

7.3 Suggestions for future work

Special techniques are needed to handle the different types of data uncertainty present in training data sets, including data uncertainty in categorical attributes. Special pruning techniques are needed to reduce the execution time of DTU, and additional techniques are needed to detect and correct random noise and other errors in categorical attributes.

REFERENCES

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, second edition, Morgan Kaufmann, 2006, pp. 285-292.
[2] Ethem Alpaydin, Introduction to Machine Learning, second edition, MIT Press / PHI, pp. 185-188.
[3] Smith Tsang, Ben Kao, Kevin Y. Yip, Wai-Shing Ho, and Sau Dan Lee, "Decision Trees for Uncertain Data," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
[4] Hsiao-Wei Hu, Yen-Liang Chen, and Kwei Tang, "A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, November 2009.
[5] R.E. Walpole and R.H. Myers, Probability and Statistics for Engineers and Scientists, Macmillan Publishing Company, 1993.
[6] A. Asuncion and D. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/mlrepository.html, 2007.
[8] U.M. Fayyad and K.B. Irani, "On the Handling of Continuous-Valued Attributes in Decision Tree Generation," Machine Learning, vol. 8, pp. 87-102, 1996.