Improved Discretization Based Decision Tree for Continuous Attributes
S. Jyothsna, Gudlavalleru Engineering College, Gudlavalleru.
G. Bharthi, Asst. Professor, Gudlavalleru Engineering College, Gudlavalleru.

Abstract: The majority of machine learning and data mining algorithms are directly applicable only to discrete features. Data in the real world, however, are often continuous by nature. Even for algorithms that can handle continuous features directly, learning is frequently less efficient and less effective. Discretization addresses this problem by finding intervals of numbers that are more concise to represent and specify. Discretization of continuous attributes is one of the important data preprocessing steps of knowledge extraction. The proposed improved discretization approach significantly reduces I/O cost and requires only one-time sorting of numerical attributes, which leads to better time performance of rule mining algorithms. According to the experimental results, our algorithm requires less execution time than the entropy-based algorithm and is adaptable to any attribute selection method, by which the accuracy of rule mining is improved.

Keywords: Discretization, Preprocessing, Data Mining, Machine Learning

I. INTRODUCTION

Discretization of continuous attributes not only broadens the scope of data mining algorithms able to analyze data in discrete form, but can also dramatically increase the speed at which these tasks are carried out. A discrete feature, also known as a qualitative feature (for example, sex or level of education), can take only a limited number of values. Continuous features can be ranked in order and admit meaningful arithmetic operations. Discrete features can sometimes also be arranged in a meaningful order, but no arithmetic operations can be applied to them.
Data discretization is a multipurpose pre-processing method that reduces the number of distinct values of a given continuous variable by dividing its range into a finite set of disjoint intervals, and then associates these intervals with meaningful labels [2]. Data are subsequently analyzed or reported at this higher level of representation instead of at the level of individual values, which results in a simplified data representation for data exploration and data mining. Discretization of continuous attributes plays an important role in knowledge discovery. Many data mining algorithms require that the training examples contain only discrete values, and rules with discrete values are normally shorter and more understandable. Suitable discretization helps to increase the generalization and accuracy of discovered knowledge. Discretization algorithms can be categorized as unsupervised or supervised according to whether the class label information is used. Equal Width and Equal Frequency are two representative unsupervised discretization algorithms. Compared to supervised discretization, previous research [6][9] has indicated that unsupervised discretization algorithms have lower computational complexity but usually lead to poorer classification performance. When classification performance is the main concern, supervised discretization should be adopted. There are several benefits to using discrete values over continuous ones: (1) Discretization reduces the number of values of continuous features, which places smaller demands on system storage. (2) Discrete features are closer to a knowledge-level representation than continuous ones. (3) Data can be reduced and simplified through discretization; for both users and experts, discrete features are easier to understand, use, and explain. (4) Discretization makes learning faster and more accurate [5].
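The two unsupervised schemes named above, equal-width and equal-frequency binning, can be sketched in a few lines. This is a minimal illustration; the function names and the pure-Python style are my own, not from the paper:

```python
def equal_width(values, k):
    """Split the range of a continuous attribute into k equal-width bins.

    Returns the k-1 interior cut points (interval borders)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency(values, k):
    """Choose cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value at each 1/k quantile position as a border.
    return [ordered[(n * i) // k] for i in range(1, k)]
```

Neither scheme looks at class labels, which is exactly why they are cheap but can split a region of uniform class into several intervals, or merge two classes into one.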
(5) Beyond the many advantages of discrete data over continuous data, a number of classification learning algorithms can only cope with discrete data, so successful discretization can significantly extend their application range. One of the supervised discretization methods, introduced by Fayyad and Irani, is referred to as entropy-based discretization. An entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information that will be
needed to specify the class to which an instance belongs. The method considers one big interval containing all known values of a feature, then recursively partitions this interval into smaller subintervals until a stopping criterion, such as the Minimum Description Length (MDL) principle or an optimal number of intervals, is reached, thus creating multiple intervals for the feature [11]. Discretization methods can be supervised or unsupervised depending on whether they use the class information in the data set. Supervised methods make use of the class label when partitioning the continuous features, while unsupervised discretization methods do not require the class information to discretize continuous attributes. Supervised discretization can be further characterized as error-based, entropy-based, or statistics-based. Unsupervised discretization is seen in earlier methods such as equal-width and equal-frequency. Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5, while in the static approach discretization is done prior to the classification task.

II. LITERATURE SURVEY

[1] presents a discretization method that is supervised, static, and global. Its discretization measure takes account of the distribution of the class probability vector by applying the Gini criterion, and its stopping criterion involves a trade-off between simplicity and predictive accuracy by incorporating the number of partition intervals.
ADVANTAGES: A nonparametric test determines whether significant differences exist between two populations. Effective data classification using a decision tree with discretization. Reduces the number of partitioning iterations.
DISADVANTAGES: Cut points are selected by recursively applying the same binary discretization method. Does not discretize binary data. Problems discretizing small numbers of instances.
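As a rough illustration of the entropy-based idea described earlier, the following sketch selects a single binary cut-point by minimizing the class-information entropy of the two induced intervals. The recursion over subintervals and the MDL stopping test are omitted, and all names are mine rather than the paper's:

```python
import math
from collections import Counter

def entropy(labels):
    """Class information entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the boundary midpoint that minimizes the weighted entropy
    of the left/right intervals (the core step of Fayyad-Irani discretization)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_e, best_cut_point = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if e < best_e:
            best_e = e
            best_cut_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut_point
```

The full method would apply `best_cut` recursively to each induced interval and stop when the MDL criterion rejects the candidate.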
The Multivariate Discretization (MVD) method [2] is based on the idea of transforming the problem of unsupervised discretization for association rules into a supervised problem. Within the support-confidence framework, the authors observe that a rule with high confidence usually makes the corresponding data space have a high density. Thus they first use a density-based clustering technique to identify the regions with high densities. Regarding every region as a class, they then develop a genetic algorithm to discretize multiple attributes simultaneously according to an entropy criterion.
ADVANTAGES: Generates quality rules. Generates highly frequent association rules with the proposed discretization approach. MVD-CG discretizes variables based on the high-density regions (HDRs), where patterns with relatively high confidences are hidden.
DISADVANTAGES: MVD actually discretizes the attributes one at a time instead of discretizing them simultaneously. For association rules this system uses the basic Apriori algorithm, which generates large candidate sets.

[8] proposes a new and effective supervised discretization algorithm based on correlation maximization (CM), employing multiple correspondence analysis (MCA). MCA is an effective technique to capture the correlations between multiple variables. Two main questions must be answered when designing a discretization algorithm: where to cut and how to cut. Many discretization algorithms are based on information entropy, for instance maximum entropy, which discretizes numeric attributes using the criterion of minimum information loss. IEM is an often-used one on account of its efficiency and good performance in the classification stage. IEM selects the first cut-point that minimizes the entropy function over all possible candidate cut-points and recursively applies this strategy to both induced intervals.
The Minimum Description Length (MDL) principle is employed to decide whether to accept a selected candidate cut-point, and thus to stop the recursion when the cut-point does not satisfy a pre-defined condition. For each candidate cut-point, MCA is used to measure the correlation between intervals/items and classes; the candidate that yields the highest correlation with the classes is selected as a cut-point. The geometrical representation of MCA not only visualizes the correlation relationship between intervals/items and classes, but also presents an elegant way to decide the cut-points. For a numeric feature, the candidate cut-point that maximizes
the correlation between feature intervals and classes is chosen as the first cut-point; the strategy is then applied recursively to the left and right intervals to partition them further. Empirical comparisons with the IEM, IEMV, CAIM, and CACC supervised discretization algorithms are conducted using six well-known classifiers. Currently, CM focuses on discretizing a dataset with two classes and shows promising results; the authors plan to extend it to datasets with more than two classes in future work. Discretization algorithms are mainly categorized as supervised or unsupervised. Popular unsupervised top-down algorithms are Equal Width, Equal Frequency [10], and standard-deviation-based partitioning, while supervised top-down algorithms include maximum entropy [11], Paterson-Niblett (which uses dynamic discretization), Information Entropy Maximization (IEM), and Class-Attribute Interdependence Maximization (CAIM). Kurgan and Cios have shown that the CAIM discretization algorithm outperforms the other algorithms. Because CAIM considers the largest interdependence between classes and attributes, it improves classification accuracy. Unlike other discretization algorithms, CAIM automatically generates the intervals and interval boundaries for the given data without any user input. In the next section, C4.5, a tree-based classification algorithm, is discussed. C4.5 builds decision trees from training data in the same fashion as ID3, using the information gain ratio. At each node of the tree, C4.5 chooses the attribute that most effectively splits its set of samples into subsets enriched in one class or the other: it calculates the information gain for the attributes, and the attribute with the highest information gain is chosen to make the decision. Then, on the basis of that attribute, the given training set is divided into subsets.
The algorithm is then applied recursively to each subset until the subset contains instances of a single class, in which case that class is returned.

III. PROPOSED APPROACH

Algorithm: Improved Discretization method
Input: N, the number of examples; Ai, the continuous attributes; Cj, the class values in the training set; a global threshold value.
Output: Interval borders in Ai.
Procedure:
1. for each continuous attribute Ai in the training dataset do
2. normalize the attribute to the 0-1 range
3. sort the values of the continuous attribute Ai in ascending order
4. for each class Cj in the training dataset do
5. find the minimum (Min) attribute value of Ai for Cj, using the standard deviation of the attribute
6. find the maximum (Max) attribute value of Ai for Cj
7. endfor
8. find cut points in the continuous attribute values based on the Min and Max values of each class Cj
Best cut-point range measure:
9. compute the conditional probability P(Cj|A) at each cut point and select the cut point with the maximum probability value
Stopping criterion:
10. if the cut point with the maximum probability value exists and satisfies the global threshold value, take it as an interval border; otherwise consider the next cut point at which the information gain value and the global threshold value are satisfied at the same point
11. endfor
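One plausible reading of the procedure above can be sketched as follows. The paper does not fully specify how the standard deviation enters step 5 or how candidate cuts are tested, so this sketch uses plain per-class minima/maxima as candidates and a majority-class probability test against the global threshold; the function name, threshold default, and tie handling are all assumptions:

```python
def improved_discretize(values, labels, threshold=0.5):
    """Sketch of the proposed method: normalize, sort once, derive candidate
    cut points from per-class min/max, and keep a cut when the majority-class
    conditional probability on its left side meets the global threshold."""
    lo, hi = min(values), max(values)
    norm = [(v - lo) / (hi - lo) for v in values]  # step 2: 0-1 normalization
    pairs = sorted(zip(norm, labels))              # step 3: one-time sort

    # Steps 4-7: per-class minimum and maximum of the normalized attribute.
    cuts = set()
    for c in set(labels):
        vals_c = [v for v, y in pairs if y == c]
        cuts.add(min(vals_c))
        cuts.add(max(vals_c))

    # Steps 9-10: accept a cut as an interval border if the most probable
    # class among the values up to the cut reaches the threshold.
    borders = []
    for cut in sorted(cuts):
        left = [y for v, y in pairs if v <= cut]
        if not left:
            continue
        best_p = max(left.count(c) / len(left) for c in set(left))
        if best_p >= threshold:
            borders.append(cut)
    return borders
```

Because the candidate set comes only from class minima and maxima, at most 2·|classes| cuts are examined per attribute, which is where the claimed I/O and sorting savings over recursive entropy splitting would come from.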
Improved decision tree measure:

The modified information (entropy) is given as

ModInfo(D) = - Σ_{i=1..m} S_i log_3(S_i), for m different classes.

For the two-class case this becomes

ModInfo(D) = - Σ_{i=1..2} S_i log_3(S_i) = - S_1 log_3(S_1) - S_2 log_3(S_2),

where S_1 denotes the proportion of samples belonging to the target class "anomaly" and S_2 the proportion belonging to the target class "normal". The information (entropy) with respect to an attribute A is calculated as

Info_A(D) = Σ_{i=1..v} (|D_i| / |D|) x ModInfo(D_i).

The term |D_i|/|D| acts as the weight of the ith partition, and Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.

IV. EXPERIMENTAL RESULTS

RULE-7 TECHNIQUE:
==================
(word_freq_your = '( ]') and (word_freq_money = '(0.02-INF)') and (word_freq_all = '( ]') => is_spam=1 (422.0/5.0)
(word_freq_free = '( INF)') and (char_freq_! = '( INF)') => is_spam=1 (372.0/15.0)
(word_freq_remove = '( INF)') and (word_freq_george = '(-INF ]') => is_spam=1 (440.0/23.0)
(char_freq_$ = '( INF)') and (word_freq_000 = '( INF)') => is_spam=1 (78.0/3.0)
(char_freq_$ = '( INF)') and (word_freq_hp = '(-INF ]') and (capital_run_length_total = '( ]') => is_spam=1 (28.0/2.0)
and (capital_run_length_total = '( ]') and (char_freq_$ = '( INF)') => is_spam=1 (31.0/0.0)
(char_freq_! = '( INF)') and (capital_run_length_average = '( INF)') => is_spam=1 (45.0/3.0)
(word_freq_internet = '( INF)') and (word_freq_order = '( INF)') => is_spam=1 (33.0/0.0)
(capital_run_length_average = '( ]') and (capital_run_length_longest = '( ]') => is_spam=1 (35.0/5.0)
and (char_freq_! = '( INF)') => is_spam=1 (31.0/2.0)
(word_freq_free = '( INF)') and (word_freq_re = '(-INF ]') and (capital_run_length_longest = '( ]') and (capital_run_length_average = '( ]') => is_spam=1 (21.0/2.0)
(word_freq_our = '( INF)') and (word_freq_your = '( ]') and (word_freq_george = '(-INF ]') => is_spam=1 (87.0/23.0)
(char_freq_( = '(-INF ]') and (char_freq_$ = '( INF)') => is_spam=1 (11.0/0.0)
(char_freq_$ = '( ]') and (char_freq_! = '( ]') => is_spam=1 (33.0/4.0)
and (char_freq_( = '( ]') and (capital_run_length_average = '( ]') => is_spam=1 (11.0/0.0)
(word_freq_over = '( INF)') and (word_freq_pm = '(-INF ]') and (word_freq_all = '(-INF ]') => is_spam=1 (18.0/2.0)
(char_freq_! = '( ]') and (word_freq_mail = '( ]') and (word_freq_credit = '( INF)') => is_spam=1 (7.0/0.0)
(word_freq_free = '( INF)') and (word_freq_edu = '(-INF ]') and (char_freq_$ = '( ]') => is_spam=1 (8.0/1.0)
and (word_freq_650 = '( INF)') and (word_freq_internet = '(-INF ]') => is_spam=1 (15.0/1.0)
(word_freq_business = '( INF)') => is_spam=1 (18.0/5.0)
(word_freq_re = '(-INF ]') and (capital_run_length_average = '( INF)') and (word_freq_our = '( ]') => is_spam=1 (7.0/0.0)
(word_freq_re = '(-INF ]') and (word_freq_font = '( INF)') and (char_freq_; = '(-INF ]') => is_spam=1 (14.0/1.0)
(word_freq_re = '(-INF ]') and (char_freq_! = '( INF)') and (word_freq_will = '(-INF ]') and (word_freq_meeting = '(-INF ]') => is_spam=1 (13.0/1.0)
(word_freq_free = '( INF)') and (char_freq_( = '(-INF ]') and (capital_run_length_average = '( INF)') and (char_freq_! = '( ]') => is_spam=1 (5.0/0.0)
(word_freq_your = '( ]') and (word_freq_business = '( ]') => is_spam=1 (7.0/1.0)
=> is_spam=0 (2811.0/122.0)

Number of Rules: 26

V. CONCLUSION AND FUTURE SCOPE

Discretization of continuous features plays an important role in data pre-processing. This paper briefly introduces the discretization problem and the many benefits discretization brings, including improving the efficiency of algorithms and expanding their application scope. There are drawbacks in the way the existing literature classifies discretization methods; the ideas and drawbacks of some typical methods are described in detail under the supervised and unsupervised categories. The proposed improved discretization approach significantly reduces I/O cost and requires only one-time sorting of numerical attributes, which leads to better time performance of rule mining algorithms. According to the experimental results, our algorithm requires less execution time than the entropy-based algorithm and is adaptable to any attribute selection method, by which the accuracy of rule mining is improved.
REFERENCES

[1] Xiao-Hang Zhang, Jun Wu, Ting-Jie Lu, Yuan Jiang, "A Discretization Algorithm Based on Gini Criterion," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, August.
[2] Hantian Wei, "A Novel Multivariate Discretization Method for Mining Association Rules," 2009 Asia-Pacific Conference on Information Processing.
[3] "A Rule-Based Classification Algorithm for Uncertain Data," IEEE International Conference on Data Engineering.
[4] M. C. Ludl, G. Widmer, "Relative unsupervised discretization for association rule mining," Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, Springer.
[5] Stephen D. Bay, "Multivariate discretization for set mining," Knowledge and Information Systems, 3(4), 2001.
[6] Stephen D. Bay, Michael J. Pazzani, "Detecting group differences: Mining contrast sets," Data Mining and Knowledge Discovery, 5(3), 2001.
[7] Lukasz A. Kurgan, "CAIM Discretization Algorithm."
[8] Qiusha Zhu, Lin Lin, Mei-Ling Shyu, "Effective Supervised Discretization for Classification based on Correlation Maximization."
[9] X. S. Li, D. Y. Li, "A New Method Based on Density Clustering for Discretization of Continuous Attributes," Journal of System Simulation, 15(6), 2005.
[10] R. Kass, L. Wasserman, "A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion," Journal of the American Statistical Association, Vol. 90.
[11] Rajashree Dash, "Comparative Analysis of Supervised and Unsupervised Discretization Techniques."
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationClustering Analysis for Malicious Network Traffic
Clustering Analysis for Malicious Network Traffic Jie Wang, Lili Yang, Jie Wu and Jemal H. Abawajy School of Information Science and Engineering, Central South University, Changsha, China Email: jwang,liliyang@csu.edu.cn
More informationIMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK
IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationInternational Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015
International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Privacy Preservation Data Mining Using GSlicing Approach Mr. Ghanshyam P. Dhomse
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationData Mining of Range-Based Classification Rules for Data Characterization
Data Mining of Range-Based Classification Rules for Data Characterization Achilleas Tziatzios A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationValue Added Association Rules
Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationDynamic Clustering of Data with Modified K-Means Algorithm
2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq
More informationA STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES
A STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES Narsaiah Putta Assistant professor Department of CSE, VASAVI College of Engineering, Hyderabad, Telangana, India Abstract Abstract An Classification
More informationDistance-based Outlier Detection: Consolidation and Renewed Bearing
Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction
More informationINTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationIMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING
IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING 1 SONALI SONKUSARE, 2 JAYESH SURANA 1,2 Information Technology, R.G.P.V., Bhopal Shri Vaishnav Institute
More informationSSV Criterion Based Discretization for Naive Bayes Classifiers
SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,
More informationEfficient Voting Prediction for Pairwise Multilabel Classification
Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany
More informationREMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationClassification/Regression Trees and Random Forests
Classification/Regression Trees and Random Forests Fabio G. Cozman - fgcozman@usp.br November 6, 2018 Classification tree Consider binary class variable Y and features X 1,..., X n. Decide Ŷ after a series
More informationAn Efficient Approach for Color Pattern Matching Using Image Mining
An Efficient Approach for Color Pattern Matching Using Image Mining * Manjot Kaur Navjot Kaur Master of Technology in Computer Science & Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib,
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining
ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationPerformance Based Study of Association Rule Algorithms On Voter DB
Performance Based Study of Association Rule Algorithms On Voter DB K.Padmavathi 1, R.Aruna Kirithika 2 1 Department of BCA, St.Joseph s College, Thiruvalluvar University, Cuddalore, Tamil Nadu, India,
More information2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.
Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss
More informationAccelerating Unique Strategy for Centroid Priming in K-Means Clustering
IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering
More informationAN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE
AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3
More informationAn Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve
An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application Fouad Khan Central European University-Environmental
More informationData Mining - Motivation
Data Mining - Motivation "Computers have promised us a fountain of wisdom but delivered a flood of data." "It has been estimated that the amount of information in the world doubles every 20 months." (Frawley,
More informationEfficient SQL-Querying Method for Data Mining in Large Data Bases
Efficient SQL-Querying Method for Data Mining in Large Data Bases Nguyen Hung Son Institute of Mathematics Warsaw University Banacha 2, 02095, Warsaw, Poland Abstract Data mining can be understood as a
More informationCategorization of Sequential Data using Associative Classifiers
Categorization of Sequential Data using Associative Classifiers Mrs. R. Meenakshi, MCA., MPhil., Research Scholar, Mrs. J.S. Subhashini, MCA., M.Phil., Assistant Professor, Department of Computer Science,
More informationClassification: Basic Concepts, Decision Trees, and Model Evaluation
Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Warehousing and Mining Lecture 4 by Hossen Asiful Mustafa Classification: Definition Given a collection of records (training set
More informationAn Improved Apriori Algorithm for Association Rules
Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan
More informationCMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)
CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification
More informationPublished by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1
Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant
More informationGraph Based Approach for Finding Frequent Itemsets to Discover Association Rules
Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery
More informationImproved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning
Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Timothy Glennan, Christopher Leckie, Sarah M. Erfani Department of Computing and Information Systems,
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationSCHEME OF COURSE WORK. Data Warehousing and Data mining
SCHEME OF COURSE WORK Course Details: Course Title Course Code Program: Specialization: Semester Prerequisites Department of Information Technology Data Warehousing and Data mining : 15CT1132 : B.TECH
More informationComparative Study of Subspace Clustering Algorithms
Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More information