Improved Discretization Based Decision Tree for Continuous Attributes

S. Jyothsna, Gudlavalleru Engineering College, Gudlavalleru. G. Bharthi, Asst. Professor, Gudlavalleru Engineering College, Gudlavalleru.

Abstract:- The majority of machine learning and data mining applications can be applied directly only to discrete features. However, data in the real world are often continuous by nature. Even for algorithms that can handle continuous features directly, learning is often less efficient and less effective. Discretization addresses this problem by finding intervals of numbers that are more concise to represent and specify. Discretization of continuous attributes is one of the important data preprocessing steps of knowledge extraction. The proposed improved discretization approach significantly reduces the I/O cost and requires only a single sort of the numerical attributes, which leads to better runtime performance of rule mining algorithms. According to the experimental results, our algorithm requires less execution time than the entropy-based algorithm and is adaptable to any attribute selection method, by which the accuracy of rule mining is improved.

Keywords: Discretization, Preprocessing, Data Mining, Machine Learning

I. INTRODUCTION

Discretization of continuous attributes not only broadens the scope of data mining algorithms able to analyze data in discrete form, but can also dramatically increase the speed at which these tasks are carried out. A discrete (qualitative) feature, such as sex or level of education, can take only a limited number of values. Continuous (quantitative) features can be ranked in order and admit meaningful arithmetic operations. Discrete features can sometimes also be arranged in a meaningful order, but no arithmetic operations can be applied to them.

Data discretization is a multipurpose preprocessing method that reduces the number of distinct values of a given continuous variable by dividing its range into a finite set of disjoint intervals, and then associates these intervals with meaningful labels [2]. Subsequently, data are analyzed or reported at this higher level of representation instead of at the level of the individual values, which results in a simplified data representation in data exploration and data mining.

Discretization of continuous attributes plays an important role in knowledge discovery. Many data mining algorithms require that the training examples contain only discrete values, and rules with discrete values are normally shorter and more understandable. Suitable discretization is useful for increasing the generalization and accuracy of discovered knowledge. Discretization algorithms can be categorized into unsupervised and supervised depending on whether the class label information is used. Equal Width and Equal Frequency are two representative unsupervised discretization algorithms. Compared to supervised discretization, previous research [6][9] has indicated that unsupervised discretization algorithms have lower computational complexity but usually yield poorer classification performance. When classification performance is the main concern, supervised discretization should be adopted.
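To make the two unsupervised schemes named above concrete, here is a minimal sketch in Python (an illustrative example, not part of the paper; the function names, the use of NumPy, and the choice of four bins are assumptions):

```python
import numpy as np

def equal_width_cuts(values, k=4):
    """Equal Width: split the observed range into k intervals of identical width."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, k + 1)[1:-1]          # k - 1 interior cut points

def equal_frequency_cuts(values, k=4):
    """Equal Frequency: choose cut points so each interval holds roughly the same number of examples."""
    return np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])

if __name__ == "__main__":
    x = np.array([0.1, 0.2, 0.25, 0.4, 1.1, 1.2, 5.0, 9.7])
    print("equal width cuts:    ", equal_width_cuts(x))
    print("equal frequency cuts:", equal_frequency_cuts(x))
```

Neither scheme looks at the class label, which is exactly why such methods are cheap but can cut straight through class boundaries.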
There are several benefits of using discrete values over continuous ones: (1) Discretization reduces the number of values of a continuous feature, which places smaller demands on system storage. (2) Discrete features are closer to a knowledge-level representation than continuous ones. (3) Data can be reduced and simplified through discretization; for both users and experts, discrete features are easier to understand, use, and explain. (4) Discretization makes learning faster and more accurate [5]. (5) Besides these advantages, a number of classification learning algorithms can only cope with discrete data, so successful discretization can significantly extend their application range.

One of the supervised discretization methods, introduced by Fayyad and Irani, is entropy-based discretization. An entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information needed to specify the class to which an instance belongs.
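As a concrete illustration of that quantity, the sketch below (an assumed example, not the paper's code) computes the class information entropy of the two subintervals induced by a candidate cut point, which is the value an entropy-based discretizer minimizes over all candidates:

```python
import numpy as np

def class_entropy(labels):
    """Class information entropy (base 2) of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def partition_entropy(values, labels, cut):
    """Weighted class entropy of the partition {A <= cut, A > cut}."""
    left, right = labels[values <= cut], labels[values > cut]
    n = len(labels)
    return len(left) / n * class_entropy(left) + len(right) / n * class_entropy(right)

# Midpoints between consecutive sorted values are the usual candidate cut points;
# the candidate with the lowest partition entropy gives the purest split.
values = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
labels = np.array(["a", "a", "a", "b", "b", "b"])
candidates = (values[:-1] + values[1:]) / 2
print(min(candidates, key=lambda c: partition_entropy(values, labels, c)))  # 5.0
```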

The method considers one large interval containing all the known values of a feature and then recursively partitions this interval into smaller subintervals until some stopping criterion, for instance the Minimum Description Length (MDL) principle or an optimal number of intervals, is reached, thus creating multiple intervals for the feature [11].

Discretization methods can be supervised or unsupervised depending on whether class information is used. Supervised methods make use of the class label when partitioning the continuous features, whereas unsupervised discretization methods do not require class information to discretize continuous attributes. Supervised discretization can be further characterized as error-based, entropy-based, or statistics-based. Unsupervised discretization is seen in earlier methods such as equal-width and equal-frequency. Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while the classifier is being built, as in C4.5, whereas in the static approach discretization is done prior to the classification task.

II. LITERATURE SURVEY

[1] presents a discretization method that is supervised, static, and global. Its discretization measure takes account of the distribution of the class probability vector by applying the Gini criterion, and its stopping criterion involves a trade-off between simplicity and predictive accuracy by incorporating the number of partition intervals. ADVANTAGES: a nonparametric test is used to determine whether significant differences exist between two populations; effective data classification using a decision tree with discretization; reduces the number of partitioning iterations. DISADVANTAGES: cut points are selected by recursively applying the same binary discretization method; it does not discretize binary data; it has problems discretizing small numbers of instances.

The Multivariate Discretization (MVD) method [2] is based on the idea of transforming the problem of unsupervised discretization for association rules into a supervised problem. Within the support-confidence framework, the authors observe that a rule with high confidence usually makes the corresponding data space have a high density. They therefore first use a density-based clustering technique to identify the regions with high densities. Regarding every region as a class, they then develop a genetic algorithm to discretize multiple attributes simultaneously according to an entropy criterion. ADVANTAGES: generates quality rules; generates highly frequent association rules with the proposed discretization approach; MVD-CG discretizes variables based on the high-density regions (HDRs) where patterns with relatively high confidence are hidden. DISADVANTAGES: MVD actually discretizes the attributes one at a time instead of discretizing them simultaneously, and for association rules the system uses the basic Apriori algorithm, which generates large candidate sets.

[8] proposes a new and effective supervised discretization algorithm based on correlation maximization (CM), employing multiple correspondence analysis (MCA). MCA is an effective technique for capturing the correlations between multiple variables. Two main questions must be answered when designing a discretization algorithm: when to cut and how to cut.
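For the Gini-criterion measure used in [1], a rough sketch of how a candidate cut point might be scored is shown below (a simplified illustration under assumed names, not the authors' implementation):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def weighted_gini(values, labels, cut):
    """Class impurity of the two intervals induced by a candidate cut, weighted by size."""
    mask = values <= cut
    n = len(labels)
    return mask.sum() / n * gini(labels[mask]) + (~mask).sum() / n * gini(labels[~mask])

# A binary discretizer in this style keeps the candidate with the smallest weighted
# impurity and recurses on the two resulting intervals until its stopping rule fires.
```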
Many discretization algorithms are based on information entropy, for instance maximum entropy, which discretizes the numeric attributes using the criterion of minimum information loss. IEM is a frequently used one on account of its efficiency and its good performance in the classification stage. IEM selects the cut-point that minimizes the entropy function over all possible candidate cut-points and recursively applies this strategy to both induced intervals. The Minimum Description Length (MDL) principle is employed to decide whether to accept a selected candidate cut-point, and the recursion stops when the cut-point does not satisfy a pre-defined condition.

For a candidate cut-point, MCA is used to measure the correlation between intervals/items and classes. The candidate that yields the highest correlation with the classes is selected as a cut-point. The geometrical representation of MCA not only visualizes the correlation relationship between intervals/items and classes, but also presents an elegant way to decide the cut-points. For a numeric feature, the candidate cut-point that maximizes the correlation between feature intervals and classes is chosen as the first cut-point; the strategy is then applied recursively to the left and right intervals to further partition them.
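For completeness, the MDL-based stopping rule referred to above for IEM is usually stated as follows (this is the standard Fayyad-Irani formulation, quoted from the wider literature rather than from this paper): a cut point $T$ splitting a set $S$ of $N$ examples on attribute $A$ into $S_1$ and $S_2$ is accepted only if

$$\mathrm{Gain}(A,T;S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N},$$

$$\Delta(A,T;S) = \log_2(3^k - 2) - \bigl[\,k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)\,\bigr],$$

where $\mathrm{Ent}(\cdot)$ is the class information entropy and $k$, $k_1$, $k_2$ are the numbers of classes present in $S$, $S_1$, and $S_2$; otherwise the recursion stops.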

Empirical comparisons with the IEM, IEMV, CAIM, and CACC supervised discretization algorithms are conducted using six well-known classifiers. Currently, CM focuses on discretizing datasets with two classes and shows promising results; the authors plan to extend it to handle datasets with more than two classes in future work.

Discretization algorithms are mainly categorized as supervised and unsupervised. Popular unsupervised top-down algorithms are Equal Width, Equal Frequency [10], and standard-deviation-based discretization, while supervised top-down algorithms include maximum entropy [11], Paterson-Niblett (which uses dynamic discretization), Information Entropy Maximization (IEM), and Class-Attribute Interdependence Maximization (CAIM). Kurgan and Cios have shown that the CAIM discretization algorithm outperforms the other algorithms. Because CAIM considers the largest interdependence between classes and attributes, it improves classification accuracy. Unlike other discretization algorithms, CAIM automatically generates the intervals and interval boundaries for the given data without any user input.

Next, C4.5, a tree-based classifier, is discussed. C4.5 builds decision trees from training data in the same fashion as ID3, using the information gain ratio. At each node of the tree, C4.5 chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. It calculates the information gain for the attributes, and the attribute with the highest information gain is chosen to make the decision. The training set is then divided into subsets based on that attribute, and the algorithm is applied recursively to each subset until a subset contains instances of only one class, at which point that class is returned.

III. PROPOSED APPROACH

Algorithm: Improved Discretization method
Input: N, the number of examples; Ai, the continuous attributes; Cj, the class values in the training set; a global threshold value.
Output: interval borders in Ai.
Procedure:
1. for each continuous attribute Ai in the training dataset do
2.   normalize the attribute to the 0-1 range
3.   sort the values of the continuous attribute Ai in ascending order
4.   for each class Cj in the training dataset do
5.     find the minimum value (Min) of Ai for Cj using the standard deviation
6.     find the maximum value (Max) of Ai for Cj
7.   end for
8.   find the candidate cut points in the values of Ai based on the Min and Max values of each class Cj
   Best cut-point range measure:
9.   compute the conditional probability P(Cj | A) at each cut point and select the cut point with the maximum probability value
   Stopping criterion:
10.  if the cut point with the maximum probability value exists and satisfies the global threshold value, take it as an interval border; otherwise consider the next cut point at which the information gain value and the global threshold value are satisfied at the same point
12. end for
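A minimal Python sketch of how this procedure might be realized is given below; the function name, the NumPy implementation, the interpretation of step 5 (tightening each class minimum by one standard deviation), and the reading of P(Cj | A) as the dominant class probability below a cut are assumptions made for illustration, not the authors' code:

```python
import numpy as np

def improved_discretize(values, labels, threshold=0.8):
    """Sketch of the improved discretization procedure (steps 1-12 above)."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)

    # Steps 2-3: min-max normalize to the 0-1 range and sort once.
    v = (values - values.min()) / (values.max() - values.min())
    order = np.argsort(v)
    v, y = v[order], labels[order]

    # Steps 4-7: per-class Min/Max boundaries; the minimum is tightened using the
    # class standard deviation (assumed reading of "minimum using StdDev").
    candidates = set()
    for c in np.unique(y):
        vc = v[y == c]
        candidates.add(float(min(vc.min() + vc.std(), vc.max())))
        candidates.add(float(vc.max()))

    # Steps 8-10: keep a candidate as an interval border only when the dominant class
    # probability P(Cj | A <= cut) reaches the global threshold.
    borders = []
    for cut in sorted(candidates):
        side = y[v <= cut]
        _, counts = np.unique(side, return_counts=True)
        if counts.size and counts.max() / counts.sum() >= threshold:
            borders.append(cut)
    return borders

# Example: borders are kept only where a single class dominates the interval below the cut.
print(improved_discretize([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"]))
```

The global threshold plays the role of the stopping criterion in step 10: raising it produces fewer, purer intervals.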

Improved decision tree measure: the modified information (entropy) is given as

$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{m} S_i \log_3 S_i, \quad \text{for } m \text{ different classes.}$$

For the two-class case,

$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{2} S_i \log_3 S_i = -S_1 \log_3 S_1 - S_2 \log_3 S_2,$$

where S_1 denotes the fraction of samples belonging to the target class "anomaly" and S_2 the fraction belonging to the target class "normal". The information (entropy) with respect to an attribute A is calculated as

$$\mathrm{Info}_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} \times \mathrm{ModInfo}(D_i).$$

The term |D_i|/|D| acts as the weight of the i-th partition, and Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.

IV. EXPERIMENTAL RESULTS

RULE-7 TECHNIQUE:
==================
(word_freq_your = '(0.28698-0.770745]') and (word_freq_money = '(0.02-INF)') and (word_freq_all = '(0.214647-0.615166]') => is_spam=1 (422.0/5.0)
(word_freq_free = '(0.068896-INF)') and (char_freq_! = '(0.107811-INF)') => is_spam=1 (372.0/15.0)
(word_freq_remove = '(0.026225-INF)') and (word_freq_george = '(-INF-0.008661]') => is_spam=1 (440.0/23.0)
(char_freq_$ = '(0.156751-INF)') and (word_freq_000 = '(0.218378-INF)') => is_spam=1 (78.0/3.0)
(char_freq_$ = '(0.156751-INF)') and (word_freq_hp = '(-INF-0.075835]') and (capital_run_length_total = '(0.090418-0.211566]') => is_spam=1 (28.0/2.0)
and (capital_run_length_total = '(0.066714-0.090418]') and (char_freq_$ = '(0.156751-INF)') => is_spam=1 (31.0/0.0)
(char_freq_! = '(0.107811-INF)') and (capital_run_length_average = '(0.058836-INF)') => is_spam=1 (45.0/3.0)
(word_freq_internet = '(0.036215-INF)') and (word_freq_order = '(0.092351-INF)') => is_spam=1 (33.0/0.0)
(capital_run_length_average = '(0.046493-0.058836]') and (capital_run_length_longest = '(0.02916-0.041854]') => is_spam=1 (35.0/5.0)
and (char_freq_! = '(0.107811-INF)') => is_spam=1 (31.0/2.0)
(word_freq_free = '(0.068896-INF)') and (word_freq_re = '(-INF-0.026082]') and (capital_run_length_longest = '(0.041854-0.073868]') and (capital_run_length_average = '(0.030341-0.046493]') => is_spam=1 (21.0/2.0)
(word_freq_our = '(0.185737-INF)') and (word_freq_your = '(0.28698-0.770745]') and (word_freq_george = '(-INF-0.008661]') => is_spam=1 (87.0/23.0)
(char_freq_( = '(-INF-0.010126]') and (char_freq_$ = '(0.156751-INF)') => is_spam=1 (11.0/0.0)
(char_freq_$ = '(0.096152-0.156751]') and (char_freq_! = '(0.049475-0.107811]') => is_spam=1 (33.0/4.0)
and (char_freq_( = '(0.010126-0.106447]') and (capital_run_length_average = '(0.030341-0.046493]') => is_spam=1 (11.0/0.0)
(word_freq_over = '(0.212283-INF)') and (word_freq_pm = '(-INF-0.101716]') and (word_freq_all = '(-INF-0.214647]') => is_spam=1 (18.0/2.0)
(char_freq_! = '(0.049475-0.107811]') and (word_freq_mail = '(0.049675-0.327926]') and (word_freq_credit = '(0.064194-INF)') => is_spam=1 (7.0/0.0)
(word_freq_free = '(0.068896-INF)') and (word_freq_edu = '(-INF-0.047378]') and (char_freq_$ = '(0.045623-0.096152]') => is_spam=1 (8.0/1.0)
and (word_freq_650 = '(0.023453-INF)') and (word_freq_internet = '(-INF-0.036215]') => is_spam=1 (15.0/1.0)
(word_freq_business = '(0.362835-INF)') => is_spam=1 (18.0/5.0)

(word_freq_re = '(-INF-0.026082]') and (capital_run_length_average = '(0.058836-INF)') and (word_freq_our = '(0.022361-0.185737]') => is_spam=1 (7.0/0.0)
(word_freq_re = '(-INF-0.026082]') and (word_freq_font = '(0.081988-INF)') and (char_freq_; = '(-INF-0.128582]') => is_spam=1 (14.0/1.0)
(word_freq_re = '(-INF-0.026082]') and (char_freq_! = '(0.107811-INF)') and (word_freq_will = '(-INF-0.159165]') and (word_freq_meeting = '(-INF-0.178499]') => is_spam=1 (13.0/1.0)
(word_freq_free = '(0.068896-INF)') and (char_freq_( = '(-INF-0.010126]') and (capital_run_length_average = '(0.058836-INF)') and (char_freq_! = '(0.049475-0.107811]') => is_spam=1 (5.0/0.0)
(word_freq_your = '(0.28698-0.770745]') and (word_freq_business = '(0.095342-0.362835]') => is_spam=1 (7.0/1.0)
=> is_spam=0 (2811.0/122.0)

Number of Rules: 26

V. CONCLUSION AND FUTURE SCOPE

Discretization of continuous features plays an important role in data pre-processing. This paper briefly introduces the discretization problem and the many benefits discretization brings, including improving the efficiency of algorithms and expanding their application scope. There are drawbacks in the existing literature on classifying discretization methods; the ideas and drawbacks of some typical methods have been discussed in detail under the supervised and unsupervised categories. The proposed improved discretization approach significantly reduces the I/O cost and requires only a single sort of the numerical attributes, which leads to better runtime performance of rule mining algorithms. According to the experimental results, our algorithm requires less execution time than the entropy-based algorithm and is adaptable to any attribute selection method, by which the accuracy of rule mining is improved.

REFERENCES

[1] Xiao-Hang Zhang, Jun Wu, Ting-Jie Lu, Yuan Jiang, "A Discretization Algorithm Based on Gini Criterion," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[2] Hantian Wei, "A Novel Multivariate Discretization Method for Mining Association Rules," 2009 Asia-Pacific Conference on Information Processing.
[3] "A Rule-Based Classification Algorithm for Uncertain Data," IEEE International Conference on Data Engineering.
[4] M. C. Ludl, G. Widmer, "Relative Unsupervised Discretization for Association Rule Mining," Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, Springer, 2000.
[5] Stephen D. Bay, "Multivariate Discretization for Set Mining," Knowledge and Information Systems, 2001, 3(4): 491-512.
[6] Stephen D. Bay and Michael J. Pazzani, "Detecting Group Differences: Mining Contrast Sets," Data Mining and Knowledge Discovery, 2001, 5(3): 213-246.
[7] Lukasz A. Kurgan, "CAIM Discretization Algorithm."
[8] Qiusha Zhu, Lin Lin, Mei-Ling Shyu, "Effective Supervised Discretization for Classification Based on Correlation Maximization."
[9] X. S. Li, D. Y. Li, "A New Method Based on Density Clustering for Discretization of Continuous Attributes," Journal of System Simulation, 15(6): 804-806, 813, 2005.
[10] R. Kass, L. Wasserman, "A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, Vol. 90: 928-935, 1995.
[11] Rajashree Dash, "Comparative Analysis of Supervised and Unsupervised Discretization Techniques."