Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861
International Conference on Emerging Trends in IOT & Machine Learning, 2018

STUDY PAPER ON CLASSIFICATION TECHNIQUE IN DATA MINING

P. Aarthy 1, M. Mounitha 2
Department of Computer Application, Nadar Saraswathi College of Arts & Science, Theni

ABSTRACT: Data mining techniques are used to analyze historical databases and discover useful patterns in them. Classification is one of the most useful and important of these techniques; it handles large volumes of data and predicts class labels. Classification is the process of finding a model that describes and distinguishes data classes or concepts. This paper presents a study of classification techniques such as the decision tree, k-nearest neighbor, support vector machine, naive Bayesian classifier, and neural network.

Keywords: Decision Tree, Support Vector Machine, K-Nearest Neighbor, Neural Network.

[1] INTRODUCTION
In data mining, classification is a major technique and is used in various fields. It categorizes data into a given number of classes. Its main goal is to identify the category or class to which a new data item belongs. It is a two-step process: first, a classifier (model) is constructed from a training data set; second, the classifier is used to assign a class label to an unknown tuple.

[Figure: Model construction step. Training data set -> Classification -> Classifier (model)]
[2] CHARACTERISTICS OF A CLASSIFIER
Every classifier has qualities that distinguish it from others; these properties are known as the characteristics of the classifier. The characteristics are:
- Correctness
- Time
- Strength
- Data size
- Extendibility

[Figure: Characteristics of a classifier (correctness, time, strength, data size, extendibility)]

[2.1] Correctness
The ability of the classifier to classify tuples accurately. Accuracy is measured numerically from the number of tuples classified correctly and the number classified wrongly.

[2.2] Time
The time required to construct the model.

[2.3] Strength
The ability to classify a tuple correctly whether or not it contains noise. Missing values and wrong values are examples of noise.

[2.4] Data Size
The classifier should be scalable: the performance of the model should not depend on the size of the database.

[2.5] Extendibility
New features can be added whenever required. This characteristic is difficult to implement.
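The correctness measure of [2.1] can be illustrated with a short sketch. The function name `accuracy` and the sample labels are assumptions made for this illustration; the paper does not fix a particular formula, and the sketch uses the common definition of accuracy as the fraction of tuples classified correctly.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of tuples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one tuple is required")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical labels for five test tuples (not taken from the paper).
true_labels = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no", "yes", "no"]

# Four of the five tuples are classified correctly.
print(accuracy(true_labels, predicted_labels))
```

The number of wrongly classified tuples follows directly as the total minus the correct count, so the error rate is one minus the accuracy.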
[3] CLASSIFICATION MODEL
The main goal of classification is to maximize the accuracy obtained by the model. There are several techniques for classification:
- Decision tree
- k-nearest neighbor
- Support vector machines
- Bayesian classifiers
- Neural network

[3.1] Decision tree
A decision tree is a classifier with a flowchart-like tree structure. It consists of a root node, internal nodes, and leaf nodes. The root has no incoming edges, and every other node has exactly one incoming edge. Nodes without outgoing edges are called leaves, and each leaf denotes a class label.

ADVANTAGE:
- Easy to explain and interpret.
- Easy to understand, and rules are easy to generate from it.
- Fast and robust.
- Requires very little experimentation.
DISADVANTAGE:
- May perform poorly when variables are correlated, because it classifies by rectangular (axis-parallel) partitioning.
- May suffer from over-fitting.
- Does not easily handle non-numeric data.
- Can become quite large, so pruning is necessary.

[3.2] K-Nearest Neighbor
k-nearest neighbor classification finds the group of k objects in the training set that are closest to the test object, and assigns a label based on the predominant class in this neighborhood. It has three key elements:
- A set of labeled objects.
- A distance metric to compute the distance between objects.
- The value of k, the number of nearest neighbors.
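The three elements of k-nearest neighbor classification can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_classify`, the choice of Euclidean distance, and the sample points are all assumptions.

```python
import math
from collections import Counter


def knn_classify(training_set, test_object, k):
    """Label test_object by majority vote among its k nearest neighbors.

    training_set is a list of (feature_vector, label) pairs,
    i.e. the set of labeled objects.
    """
    if not 1 <= k <= len(training_set):
        raise ValueError("k must be between 1 and the training set size")
    # Distance metric: Euclidean distance (an assumption; others are possible).
    by_distance = sorted(
        training_set,
        key=lambda pair: math.dist(pair[0], test_object),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The predominant class in the neighborhood becomes the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data with two classes, "A" and "B".
training_set = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(training_set, (1.2, 1.4), k=3))
print(knn_classify(training_set, (6.4, 6.2), k=3))
```

Note that no model is built in advance: every call sorts the whole training set, so all the work happens when a test object is classified.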
ADVANTAGE:
- Effective if the training set is large.
- Very simple and intuitive.
- Can be applied to data from any distribution.
- Gives good classification if the number of samples is large enough.
DISADVANTAGE:
- The value of the parameter k must be determined, and the result depends on it.
- There is no training stage; all the work is done during the test stage.
- Needs a large number of samples for accuracy.

[3.3] Bayesian classifiers
Bayesian classifiers are statistical classifiers founded on Bayes' theorem. They predict class membership probabilities, and their performance is comparable with that of decision tree and selected neural network classifiers.

$P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)}$

Here $P(x)$ is constant for all classes and $P(c_i)$ is the prior probability of class $c_i$. The class $c_i$ for which $P(c_i \mid x)$ is maximized is called the maximum a posteriori hypothesis.

ADVANTAGE:
- Handles real and discrete data.
- Easy to implement.
- Requires only a small amount of training data to estimate the parameters.
- Good results are obtained in most cases.
DISADVANTAGE:
- It assumes class conditional independence, which causes a loss of accuracy.
- In practice, dependencies exist among variables. Example: for hospital patients, the profile (age, family history, etc.), the symptoms (fever, cough, etc.), and the disease (lung cancer, diabetes, etc.) are related. Dependencies among these cannot be modeled by the naive Bayesian classifier.

[3.4] Neural Network
A neural network is a mathematical model inspired by biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. Neural networks are used for classification and pattern recognition.
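A single artificial neuron, the building block of such a network, can be sketched as a perceptron. This is a deliberate simplification under stated assumptions: the paper does not describe a specific architecture or learning rule, and the function names, the step activation, the learning rate, and the AND data set below are all hypothetical choices for illustration.

```python
def step(value):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if value >= 0 else 0


def predict(weights, bias, inputs):
    """Output of one artificial neuron for a given input vector."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: shift the weights by the prediction error."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [
                w + learning_rate * error * x for w, x in zip(weights, inputs)
            ]
            bias += learning_rate * error
    return weights, bias


# Hypothetical two-class data: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)

# Predicted class for each training input after learning.
print([predict(weights, bias, inputs) for inputs, _ in samples])
```

The repeated passes over the data show why learning is a trial-and-error process, and the learned weights are plain numbers that do not read as classification rules.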
ADVANTAGE:
- It is a non-parametric method.
- High accuracy and noise tolerance.
- Ease of maintenance.
- Data driven and self-adaptive.
- It is a universal function approximator.
DISADVANTAGE:
- Extracting knowledge from the trained network is difficult.
- Lack of transparency (black box).
- Learning time is long (trial and error).
- Defining classification rules is difficult.

[3.5] Support Vector Machine (SVM)
SVM is a very effective method for regression, classification, and general pattern recognition. It is considered a good classifier because of its high generalization performance. The aim of SVM is to find the best classification function to distinguish between members of two classes in the training data.

ADVANTAGE:
- Useful for non-linearly separable data.
- It has a regularization parameter, which makes the user think about avoiding over-fitting.
- It uses the kernel trick, so expert knowledge about the problem can be built in by engineering the kernel.
- It is defined by a convex optimization problem (no local minima), for which efficient methods exist.
- It is an approximation to a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.
DISADVANTAGE:
- The theory only really covers the determination of the parameters for given values of the regularization and kernel parameters and a given choice of kernel.
- In a way, the support vector machine moves the problem of over-fitting from optimizing the parameters to model selection.

[4] CONCLUSION
In this paper we discussed five classification algorithms. In discussing them, we sought to identify which algorithm is the best among them. We conclude that it is the support vector machine, which is one of the important concepts in classification techniques.