Using Decision Boundary to Analyze Classifiers
Zhiyong Yan, Congfu Xu
College of Computer Science, Zhejiang University, Hangzhou, China

Abstract

In this paper we propose to use the decision boundary to analyze classifiers. Two algorithms, decision boundary point set (DBPS) and decision boundary neuron set (DBNS), are proposed to obtain data on the decision boundary. Based on DBNS, a visualization algorithm called SOM-based decision boundary visualization (SOMDBV) is proposed to visualize high-dimensional classifiers. The decision boundary gives an insight into classifiers that accuracy alone cannot supply: it can be applied to select a proper classifier, to analyze the tradeoff between accuracy and comprehensibility, to detect over-fitting, and to calculate the similarity of models generated by different classifiers. Experimental results demonstrate the usefulness of the method.

1. Introduction

Classification is an important problem in machine learning with many real-world applications. Many classifiers exist [1], and their performance is usually estimated by accuracy, the proportion of correct predictions among all predictions [1]. But accuracy is a raw performance score and gives little insight into a classifier [2]. It cannot tell which data are classified correctly and which are not, and it cannot reveal the relative positions of the correctly and incorrectly predicted data. Real-world data sets are mostly high-dimensional. Users usually gain intuition through high-dimensional data visualization algorithms, which are unsupervised; the class boundary cannot be clearly visualized by these algorithms [3]. Lacking powerful tools, users cannot understand classifiers very well. [2] proposes decision region connectivity analysis for high-dimensional classifiers, which can be used to analyze the convexity of decision regions.
That algorithm is independent of the dimension of the data set. [3] proposes an algorithm named SVMV to visualize the classification results of the Support Vector Machine (SVM) [4] using the self-organizing map (SOM) [5]. The algorithm can clearly visualize the SVM classification boundary, and the distance between the data and the classification boundary, in a 2-D map. But it substitutes the weight matrix of the SOM for the input of the SVM decision function, which limits its application to other classifiers.

In this paper we propose a method for using the decision boundary to analyze classifiers. The decision boundary is the boundary a classifier uses to partition the data, so the predicted labels on the two sides of the boundary differ. Two algorithms are provided to obtain data on a classifier's decision boundary. The first, decision boundary point set (DBPS), obtains points near the decision boundary. The second, decision boundary neuron set (DBNS), obtains the SOM neurons near the decision boundary. Based on DBNS, an algorithm named SOM-based decision boundary visualization (SOMDBV) is proposed to visualize the decision boundary of high-dimensional classifiers in a 2-D SOM map.

In the next section, the procedures of DBPS, DBNS and SOMDBV are described. In section 3, analysis of classifiers using the decision boundary is given. In section 4, experiments are performed to demonstrate the usefulness of the proposed algorithms and analysis. The conclusion is drawn in section 5. We assume the output of a classifier is a discrete class label rather than the probability of the input belonging to some class, although the latter can easily be transformed into the former.

2. Decision boundary algorithms

In this section, we describe the details of the three decision boundary algorithms: DBPS, DBNS and SOMDBV.
A model is obtained after a classifier is trained on the training data set. When new data arrive, the model is used to predict their labels; that is the normal usage of classifiers. Some classifiers behave like a white box and provide users with comprehensible results. For example, RIPPER [6], a well-known rule-based algorithm, learns a set of rules, and the obtained rules give users a good understanding. But other classifiers behave like a black box, and users are unable to understand what they have learned. SVM is an example of this kind of classifier: the knowledge obtained by a trained SVM model is hidden in the decision function, which is complicated and abstract for users to understand. In the latter case users do not even know what has happened.

However, every classifier predicts the labels of data according to some guidelines. For example, RIPPER predicts labels according to the rule set it has learned, while SVM predicts labels according to the decision function it has trained. These guidelines, whatever their form, define decision boundaries in the input data space, and the procedure of prediction can be seen as finding the relation between the input data and the boundaries. Using a trained classifier to classify data is equivalent to using that classifier's decision boundary to partition the data. If the decision boundary of the classifier is obtained and visualized, users gain insight into the classifier, which helps them select a proper one. The forms of knowledge which classifiers adopt to construct decision boundaries are diverse, so acquiring analytical equations of the decision boundary is an exhausting task. Instead, we obtain sample points on the decision boundary to analyze classifiers.
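The equivalence between prediction and boundary partition holds regardless of a classifier's internal form of knowledge. A minimal illustration (both toy classifiers below are hypothetical examples, not models from this paper):

```python
# Two "classifiers" with very different internal forms of knowledge.
def rule_classifier(x):
    # RIPPER-style rule: IF x0 > 0.5 AND x1 > 0.5 THEN class 1 ELSE class 0.
    return int(x[0] > 0.5 and x[1] > 0.5)

def linear_classifier(x):
    # SVM-style decision function: sign of w.x + b.
    return int(0.7 * x[0] + 0.7 * x[1] - 0.9 > 0)

# Both induce a boundary that partitions the plane: a prediction flips
# exactly when a point crosses that classifier's decision boundary.
p_in, p_out = (0.9, 0.9), (0.1, 0.1)
assert rule_classifier(p_in) != rule_classifier(p_out)
assert linear_classifier(p_in) != linear_classifier(p_out)
```

Probing such black-box predictions, rather than the internal knowledge, is what allows one boundary-extraction procedure to serve many classifiers.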
The DBPS algorithm is used to obtain the sample point set near the decision boundary in the input space, the DBNS algorithm is used to obtain the sample neurons near the decision boundary on the 2-D SOM map, and SOMDBV makes use of DBNS to visualize the decision boundary on the 2-D SOM map.

2.1. DBPS algorithm

There are two methods for obtaining points on the decision boundary. The first is the internal method, which uses the classifier's internal form of knowledge to obtain points on the boundary; for example, using the decision function of SVM we can compute points on the boundary. The second is the external method, which uses approximation to get points near the boundary. The internal method's advantage is that it generates accurate points on the boundary; its disadvantage is that every classifier needs its own implementation, because classifiers' forms of knowledge are diverse. The external method can be applied to more classifiers, but the points generated are not as accurate. DBPS adopts the external method and generates points near the boundary. It uses binary search to find the intersection of the decision boundary with the line segment connecting two data points that the classifier predicts differently. The details of DBPS are given in Algorithm 1. Users can control the precision of the points by adjusting iter_no and toler.

Algorithm 1* The decision boundary point set algorithm generates the point set near the boundary.
X is the set of sample points. B is the set of decision boundary points. c(x) is the classifier function. iter_no is the limit on the number of iterations. toler is the tolerance of the boundary.
for all x ∈ X do
    if c(x) == a then Xa ← Xa ∪ {x}
    else Xb ← Xb ∪ {x}
for all xa ∈ Xa do
    for all xb ∈ Xb do
        B ← B ∪ {DBP(xa, xb, c, iter_no, toler)}

function DBP(x1, x2, c, iter_no, toler)
    x_bound ← (x1 + x2) / 2
    for i = 1 : iter_no do
        if distance(x1, x2) / 2 < toler then break
        if c(x_bound) == c(x1) then x1 ← x_bound
        else x2 ← x_bound
        x_bound ← (x1 + x2) / 2
    return x_bound

* The function distance(x1, x2) is trivial, so we do not describe its procedure here.
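Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, assuming a classifier exposed as a plain function c(x) returning a discrete label; the toy classifier at the end is an illustrative assumption, not one of the paper's models:

```python
import numpy as np

def dbp(x1, x2, c, iter_no=30, toler=1e-6):
    """Binary search for a point near the decision boundary on the
    segment between x1 and x2, which c predicts differently."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    x_bound = (x1 + x2) / 2
    for _ in range(iter_no):
        if np.linalg.norm(x1 - x2) / 2 < toler:
            break
        # Keep the half of the segment on which the prediction flips.
        if c(x_bound) == c(x1):
            x1 = x_bound
        else:
            x2 = x_bound
        x_bound = (x1 + x2) / 2
    return x_bound

def dbps(X, c, iter_no=30, toler=1e-6):
    """Collect a boundary point for every cross-class pair in X."""
    labels = [c(x) for x in X]
    a = labels[0]
    Xa = [x for x, l in zip(X, labels) if l == a]
    Xb = [x for x, l in zip(X, labels) if l != a]
    return [dbp(xa, xb, c, iter_no, toler) for xa in Xa for xb in Xb]

# Toy black-box classifier whose true boundary is the line x[0] = 0.
c = lambda x: int(x[0] > 0)
B = dbps([np.array([-1.0, 0.0]), np.array([2.0, 1.0])], c)
# The recovered point lies within toler of x[0] = 0.
```

Because each binary-search step halves the bracketing segment, iter_no on the order of a few tens already brings the point within a tiny tolerance of the boundary.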
If the classifier is high-dimensional, a high-dimensional data visualization algorithm is needed to visualize the obtained decision boundary point set.

2.2. DBNS algorithm

There are two methods for visualizing the decision boundary of a high-dimensional classifier. The first uses the DBPS algorithm to obtain the decision boundary point set in the input space, which is then visualized by some high-dimensional data visualization method. The second projects the input data onto a low-dimensional map and calculates the point set on the decision boundary in the map space. SOMDBV adopts the second method, and the DBNS algorithm is used to obtain the neurons near the decision boundary on the 2-D SOM map.

SVMV uses the decision function of SVM to calculate the distance between a neuron and the classification boundary [3]. In DBNS, the classifier takes the weights of the neurons as input to predict the labels of the data projected onto those neurons. [7] adopts interpolation to obtain an extended weight matrix, which avoids high computational complexity; we adopt the same process to obtain the neurons near the decision boundary. The method used to get the neurons near the boundary is the external method of section 2.1, the same as in DBPS. The topology of the SOM used in this paper is a rectangular grid; the algorithm can easily be applied to other topologies, however.

As seen in Figure 1(a), if the four neurons of a rectangle are predicted the same label, we suppose there are no neurons inside the rectangle near the boundary. Otherwise we use interpolation to obtain neurons e, f, g, h, i, and partition the rectangle into 4 smaller ones. We continue partitioning the small rectangles whose labels are not all the same until the number of partitions reaches the user-given limit (Figure 1(b)). At last the center neuron of the rectangle is selected as the one near the decision boundary (Figure 1(c)).

Figure 1. Three cases of finding the neurons near the decision boundary: (a) predictions are the same; (b) predictions are not the same; (c) the last step.

The detailed procedure of DBNS is given in Algorithm 2.

Algorithm 2* The decision boundary neuron set algorithm finds the neuron set near the boundary.
N is the set of SOM neurons, whose size is m × n. B is the set of neurons near the decision boundary. c(x) is the classification model. iter_no is the limit on the number of iterations.

for i = 1 : m-1 do
    for j = 1 : n-1 do
        N[] ← {N(i,j), N(i+1,j), N(i+1,j+1), N(i,j+1)}
        B ← B ∪ GetDBNeuron(N[], c, iter_no)

function GetDBNeuron(N[], c, iter_no)
    dbn ← {}
    if c(N[]) are not all the same then
        if iter_no == 1 then
            dbn ← {GetCenterNeuron(N[])}
        else
            N2[][] ← Partition(N[])
            for i = 1 : 4 do
                dbn ← dbn ∪ GetDBNeuron(N2[i], c, iter_no-1)
    return dbn

* The functions GetCenterNeuron(N[]) and Partition(N[]) are trivial, so we do not describe their procedures here.

2.3. SOMDBV algorithm

The SOMDBV algorithm adopts the second method of section 2.2. It first projects the data onto the 2-D SOM map, then uses DBNS to obtain the neurons near the decision boundary, and at last displays the labels of the data, the classifier's prediction at each neuron, and the neurons near the decision boundary. The procedure of SOMDBV is as follows:
1) A classifier is trained on the data set X to get the classification model C.
2) The SOM algorithm is trained on the same data set X to get the weights W.
3) C is used to classify W, giving predictions L.
4) The DBNS algorithm is used to get the neuron set N near the decision boundary.
5) The input data set X, the classifier's predictions L and the decision boundary neuron set N are displayed on the 2-D SOM map.

3. Applications of decision boundary

The decision boundary can be used for the following analyses:

1) The distance between the data and the decision boundary is made clear to users, which accuracy cannot provide. This helps users select a proper classifier: a classifier whose boundary lies in the middle between the data of different classes is usually better than one whose boundary is near the data of one class and far from the data of the other. It can also tell users in which region of the data space the classifier makes incorrect predictions. If users know which region the new data is likely to fall into and there are several classifiers, they may be able to choose the proper one.

2) There is a tradeoff between accuracy and comprehensibility in data mining models [8]. Visualizing the decision boundary gives insight into highly accurate classifiers, whose accuracy usually comes at the cost of comprehensibility, and thereby helps users analyze the tradeoff.

3) Classifier users struggle to avoid over-fitting, and visualization of the decision boundary can give insight into it. Given the same accuracy, classifiers with complicated decision boundaries usually generalize worse than ones with simpler decision boundaries. This helps users select the classifier with higher generalization, or set proper parameters to obtain a more general model.

4) The decision boundary can be adopted to define the similarity of two models obtained by different classifiers. For example, the proportion of the region in which two classifiers predict the same labels, relative to the whole region the data fall into, may be a measure of model similarity. We can then conclude that two models are the same once some given similarity is reached.
Given such a similarity, one model may be transformed into the model trained by another classifier, which can overcome the drawbacks of some classifiers. For example, a trained artificial neural network (ANN) can be transformed into a rule set by extracting rules from the ANN, which improves the comprehensibility of a trained ANN with high accuracy [9]. The method for calculating similarity can also be used to calculate the fidelity of rule extraction from an ANN [9].

5) Diversity among the base classifiers is important when constructing a classifier ensemble [10]. The decision boundary can be used to calculate this diversity. For example, the integral of the difference between two classifiers' decision boundaries may be a measure of diversity, reflecting the difference between the two classifiers' partitions of the data space.

4. Experimental results

In this section, two experiments are performed to demonstrate the usefulness of the proposed algorithms. The classifiers used are RIPPER and SVM, with WEKA [11] as the implementation of both. A Gaussian kernel with parameter gamma is used as the kernel function of SVM. The SOM implementation is from the MATLAB SOM Toolbox; the total number of iterations is 1000, and the topology is a rectangular grid. The size of the SOM is … .

4.1. Experimental results of DBPS

The DBPS algorithm is used to generate the decision boundary point sets of RIPPER and SVM for the diamond data. The diamond data is a two-class simulated data set with 2 dimensions whose class boundary is a diamond; the length of its two diagonals is … . Each class has 100 data points generated randomly. The results are shown in Figure 2, where cross symbols denote the data inside the diamond, star symbols denote the data outside the diamond, and the line between data of different classes is the decision boundary. The decision boundary generated by RIPPER is like a cross, while the one generated by SVM is like a diamond, as seen in Figure 2.
The decision boundary generated by SVM is almost in the middle between the data of the two classes, while the decision boundary of RIPPER is close to the data of one class and far from the data of the other. The position of the SVM boundary is more proper than RIPPER's, so SVM is the proper choice for the diamond data. At the same time, the shape of the decision boundary generated by RIPPER is more regular, and users can understand it better; the decision boundary of SVM is more complicated and harder for users to understand. In this experiment, there is a tradeoff between a powerful model with high accuracy and a transparent model with high comprehensibility.
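The diamond-data experiment can be reproduced in outline with black-box boundary probing. This is a sketch, not the paper's setup: the diamond size is an assumption (the actual diagonal length did not survive transcription), and a 1-nearest-neighbor classifier stands in for the RIPPER/SVM models trained in WEKA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated diamond data: classes separated by |x| + |y| = 1 (size assumed).
X = rng.uniform(-2, 2, size=(200, 2))
y = (np.abs(X[:, 0]) + np.abs(X[:, 1]) < 1).astype(int)

# Stand-in black-box classifier: 1-nearest-neighbor over the training data.
def c(p):
    return y[np.argmin(np.linalg.norm(X - p, axis=1))]

def boundary_point(x1, x2, iters=30):
    """Binary search between two points the classifier predicts differently."""
    for _ in range(iters):
        mid = (x1 + x2) / 2
        if c(mid) == c(x1):
            x1 = mid
        else:
            x2 = mid
    return (x1 + x2) / 2

inside, outside = X[y == 1], X[y == 0]
B = np.array([boundary_point(a, b) for a in inside[:10] for b in outside[:10]])
# For a reasonable classifier, B traces a curve near |x| + |y| = 1;
# plotting B next to X reproduces the qualitative picture of Figure 2.
```

Comparing how tightly B hugs the true diamond for different classifiers is exactly the boundary-position comparison made above for RIPPER versus SVM.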
Figure 2. Decision boundary point set: (a) by RIPPER; (b) by SVM using a Gaussian kernel with gamma = … .

4.2. Experimental results of SOMDBV

The data set for the SOMDBV algorithm is the Johns Hopkins University Ionosphere database from the UCI machine learning repository [12]. The data set contains 351 records with 34 dimensions, of which 225 records are labeled Good and 126 are labeled Bad. The results are shown in Figure 3, where square symbols denote data of class Bad and triangle symbols denote data of class Good. Cross symbols denote neurons predicted Bad, and dot symbols denote neurons predicted Good. The line in Figure 3 denotes the decision boundary. By the analysis of section 4.1, the SVM in Figure 3(b) is more proper than RIPPER.

Figure 3. Visualization of the Ionosphere data set: (a) by RIPPER; (b) by SVM using a Gaussian kernel with gamma = 2; (c) by SVM using a Gaussian kernel with gamma = 20.
As seen in Figures 3(b) and 3(c), the decision boundary generated by SVM using a Gaussian kernel with gamma = 20 is more complicated than that generated by SVM with gamma = 2, so the SVM with gamma = 20 is more likely to over-fit the data. This conclusion agrees with experience and common sense. The number of neurons predicted the same labels by RIPPER and SVM with gamma = 2 is larger than that by the two SVMs with gamma = 20 and gamma = 2. So although the two SVM models are generated by the same classifier with different parameters, their similarity is less than that of SVM with gamma = 2 and RIPPER.

5. Conclusion and future work

In this paper, a novel method for using the decision boundary to analyze classifiers is proposed. Two algorithms are proposed to obtain data on the decision boundary in different spaces: the DBPS algorithm obtains the point set on the decision boundary in the input data space, while the DBNS algorithm obtains the neuron set on the decision boundary on the 2-D SOM map. The SOMDBV algorithm, built on DBNS, is proposed to visualize the decision boundary of high-dimensional classifiers. With the help of the decision boundary, users can gain insight into classifiers. The decision boundary can be used to select a proper classifier, to reveal the tradeoff between accuracy and comprehensibility, to detect over-fitting, to calculate the similarity of classifiers, and to calculate diversity in ensemble learning. This paper has not supplied a calculation method for similarity and diversity; this will be done in future work, and the decision boundary will be used to analyze rule extraction from ANNs and ensemble learning.

Acknowledgements

This paper is supported by the 863 plan (No. 2007AA01Z197) and the National Natural Science Foundation of China (No. …).

References

[1] S.B. Kotsiantis, I.D. Zaharakis, and P.E. Pintelas, "Machine learning: a review of classification and combining techniques", Artificial Intelligence Review, Springer, Berlin Heidelberg, 2006.
[2] O. Melnik, "Decision region connectivity analysis: a method for analyzing high-dimensional classifiers", Machine Learning, Kluwer, Netherlands, 2002.
[3] X. Wang, S. Wu, and Q. Li, "SVMV - a novel algorithm for the visualization of SVM classification results", Advances in Neural Networks - ISNN 2006, Springer-Verlag, Berlin Heidelberg, 2006.
[4] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin Heidelberg.
[5] T. Kohonen, Self-Organizing Maps, Springer, Berlin Heidelberg.
[6] W. Cohen, "Fast effective rule induction", Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann, Tahoe City, CA, 1995.
[7] S. Wu and W.S. Chow, "Support vector visualization and clustering using self-organizing map and support vector one-class classification", Proceedings of the IEEE International Joint Conference on Neural Networks, Portland, USA, 2003.
[8] U. Johansson, L. Niklasson, and R. König, "Accuracy vs. comprehensibility in data mining models", Proceedings of the 7th International Conference on Information Fusion, Stockholm, Sweden, 2004.
[9] R. Andrews, J. Diederich, and A.B. Tickle, "Survey and critique of techniques for extracting rules from trained artificial neural networks", Knowledge-Based Systems, Elsevier, Amsterdam, 1995.
[10] E.K. Tang, P.N. Suganthan, and X. Yao, "An analysis of diversity measures", Machine Learning, Springer, Berlin Heidelberg, 2006.
[11] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco.
[12] P.M. Murphy and D.W. Aha, UCI Repository of Machine Learning Databases, Irvine, CA: University of California, Department of Information and Computer Science.
More informationEfficient Pairwise Classification
Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany {park,juffi}@ke.informatik.tu-darmstadt.de Abstract. Pairwise
More informationA *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-"&"3 -"(' ( +-" " " % '.+ % ' -0(+$,
The structure is a very important aspect in neural network design, it is not only impossible to determine an optimal structure for a given problem, it is even impossible to prove that a given structure
More informationIndividualized Error Estimation for Classification and Regression Models
Individualized Error Estimation for Classification and Regression Models Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme Abstract Estimating the error of classification and regression models
More informationPractice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
Practice EXAM: SPRING 0 CS 6375 INSTRUCTOR: VIBHAV GOGATE The exam is closed book. You are allowed four pages of double sided cheat sheets. Answer the questions in the spaces provided on the question sheets.
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationUsing Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions
Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of
More informationAdvanced visualization techniques for Self-Organizing Maps with graph-based methods
Advanced visualization techniques for Self-Organizing Maps with graph-based methods Georg Pölzlbauer 1, Andreas Rauber 1, and Michael Dittenbach 2 1 Department of Software Technology Vienna University
More informationCluster analysis of 3D seismic data for oil and gas exploration
Data Mining VII: Data, Text and Web Mining and their Business Applications 63 Cluster analysis of 3D seismic data for oil and gas exploration D. R. S. Moraes, R. P. Espíndola, A. G. Evsukoff & N. F. F.
More informationLinear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines
Linear Models Lecture Outline: Numeric Prediction: Linear Regression Linear Classification The Perceptron Support Vector Machines Reading: Chapter 4.6 Witten and Frank, 2nd ed. Chapter 4 of Mitchell Solving
More informationANALYZING AND OPTIMIZING ANT-CLUSTERING ALGORITHM BY USING NUMERICAL METHODS FOR EFFICIENT DATA MINING
ANALYZING AND OPTIMIZING ANT-CLUSTERING ALGORITHM BY USING NUMERICAL METHODS FOR EFFICIENT DATA MINING Md. Asikur Rahman 1, Md. Mustafizur Rahman 2, Md. Mustafa Kamal Bhuiyan 3, and S. M. Shahnewaz 4 1
More informationCombination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran
More informationCartographic Selection Using Self-Organizing Maps
1 Cartographic Selection Using Self-Organizing Maps Bin Jiang 1 and Lars Harrie 2 1 Division of Geomatics, Institutionen för Teknik University of Gävle, SE-801 76 Gävle, Sweden e-mail: bin.jiang@hig.se
More informationWEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1
WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 H. Altay Güvenir and Aynur Akkuş Department of Computer Engineering and Information Science Bilkent University, 06533, Ankara, Turkey
More informationSome questions of consensus building using co-association
Some questions of consensus building using co-association VITALIY TAYANOV Polish-Japanese High School of Computer Technics Aleja Legionow, 4190, Bytom POLAND vtayanov@yahoo.com Abstract: In this paper
More informationImage Classification Using Wavelet Coefficients in Low-pass Bands
Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August -7, 007 Image Classification Using Wavelet Coefficients in Low-pass Bands Weibao Zou, Member, IEEE, and Yan
More informationNon-linear gating network for the large scale classification model CombNET-II
Non-linear gating network for the large scale classification model CombNET-II Mauricio Kugler, Toshiyuki Miyatani Susumu Kuroyanagi, Anto Satriyo Nugroho and Akira Iwata Department of Computer Science
More informationMachine Learning for NLP
Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs
More informationA Modular Reduction Method for k-nn Algorithm with Self-recombination Learning
A Modular Reduction Method for k-nn Algorithm with Self-recombination Learning Hai Zhao and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd.,
More informationMachine Learning : Clustering, Self-Organizing Maps
Machine Learning Clustering, Self-Organizing Maps 12/12/2013 Machine Learning : Clustering, Self-Organizing Maps Clustering The task: partition a set of objects into meaningful subsets (clusters). The
More information6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION
6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm
More informationEfficient Pairwise Classification
Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization
More informationSOMSN: An Effective Self Organizing Map for Clustering of Social Networks
SOMSN: An Effective Self Organizing Map for Clustering of Social Networks Fatemeh Ghaemmaghami Research Scholar, CSE and IT Dept. Shiraz University, Shiraz, Iran Reza Manouchehri Sarhadi Research Scholar,
More informationAn Empirical Study on feature selection for Data Classification
An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of
More informationCloNI: clustering of JN -interval discretization
CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically
More informationA Comparative Study of SVM Kernel Functions Based on Polynomial Coefficients and V-Transform Coefficients
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 6 Issue 3 March 2017, Page No. 20765-20769 Index Copernicus value (2015): 58.10 DOI: 18535/ijecs/v6i3.65 A Comparative
More informationImproving Classifier Performance by Imputing Missing Values using Discretization Method
Improving Classifier Performance by Imputing Missing Values using Discretization Method E. CHANDRA BLESSIE Assistant Professor, Department of Computer Science, D.J.Academy for Managerial Excellence, Coimbatore,
More informationSimulation of Zhang Suen Algorithm using Feed- Forward Neural Networks
Simulation of Zhang Suen Algorithm using Feed- Forward Neural Networks Ritika Luthra Research Scholar Chandigarh University Gulshan Goyal Associate Professor Chandigarh University ABSTRACT Image Skeletonization
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationAKA: Logistic Regression Implementation
AKA: Logistic Regression Implementation 1 Supervised classification is the problem of predicting to which category a new observation belongs. A category is chosen from a list of predefined categories.
More informationDiscrete Particle Swarm Optimization With Local Search Strategy for Rule Classification
Discrete Particle Swarm Optimization With Local Search Strategy for Rule Classification Min Chen and Simone A. Ludwig Department of Computer Science North Dakota State University Fargo, ND, USA min.chen@my.ndsu.edu,
More informationAutomatic Group-Outlier Detection
Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More informationA Boosting-Based Framework for Self-Similar and Non-linear Internet Traffic Prediction
A Boosting-Based Framework for Self-Similar and Non-linear Internet Traffic Prediction Hanghang Tong 1, Chongrong Li 2, and Jingrui He 1 1 Department of Automation, Tsinghua University, Beijing 100084,
More informationEvolving SQL Queries for Data Mining
Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationImproving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets
Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)
More informationPerformance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms
Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,
More informationWell Analysis: Program psvm_welllogs
Proximal Support Vector Machine Classification on Well Logs Overview Support vector machine (SVM) is a recent supervised machine learning technique that is widely used in text detection, image recognition
More informationNeural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer
More informationVersion Space Support Vector Machines: An Extended Paper
Version Space Support Vector Machines: An Extended Paper E.N. Smirnov, I.G. Sprinkhuizen-Kuyper, G.I. Nalbantov 2, and S. Vanderlooy Abstract. We argue to use version spaces as an approach to reliable
More informationSOM+EOF for Finding Missing Values
SOM+EOF for Finding Missing Values Antti Sorjamaa 1, Paul Merlin 2, Bertrand Maillet 2 and Amaury Lendasse 1 1- Helsinki University of Technology - CIS P.O. Box 5400, 02015 HUT - Finland 2- Variances and
More informationORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"
R E S E A R C H R E P O R T IDIAP A Parallel Mixture of SVMs for Very Large Scale Problems Ronan Collobert a b Yoshua Bengio b IDIAP RR 01-12 April 26, 2002 Samy Bengio a published in Neural Computation,
More informationInstantaneously trained neural networks with complex inputs
Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural
More informationDiscretizing Continuous Attributes Using Information Theory
Discretizing Continuous Attributes Using Information Theory Chang-Hwan Lee Department of Information and Communications, DongGuk University, Seoul, Korea 100-715 chlee@dgu.ac.kr Abstract. Many classification
More informationLocal Linear Approximation for Kernel Methods: The Railway Kernel
Local Linear Approximation for Kernel Methods: The Railway Kernel Alberto Muñoz 1,JavierGonzález 1, and Isaac Martín de Diego 1 University Carlos III de Madrid, c/ Madrid 16, 890 Getafe, Spain {alberto.munoz,
More informationAN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING
AN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING Irina Bernst, Patrick Bouillon, Jörg Frochte *, Christof Kaufmann Dept. of Electrical Engineering
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationEfficient Object Tracking Using K means and Radial Basis Function
Efficient Object Tracing Using K means and Radial Basis Function Mr. Pradeep K. Deshmuh, Ms. Yogini Gholap University of Pune Department of Post Graduate Computer Engineering, JSPM S Rajarshi Shahu College
More information5.6 Self-organizing maps (SOM) [Book, Sect. 10.3]
Ch.5 Classification and Clustering 5.6 Self-organizing maps (SOM) [Book, Sect. 10.3] The self-organizing map (SOM) method, introduced by Kohonen (1982, 2001), approximates a dataset in multidimensional
More informationTable of Contents. Recognition of Facial Gestures... 1 Attila Fazekas
Table of Contents Recognition of Facial Gestures...................................... 1 Attila Fazekas II Recognition of Facial Gestures Attila Fazekas University of Debrecen, Institute of Informatics
More informationRobust PDF Table Locator
Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records
More informationRecent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery
Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Annie Chen ANNIEC@CSE.UNSW.EDU.AU Gary Donovan GARYD@CSE.UNSW.EDU.AU
More informationCOMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS
COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS Toomas Kirt Supervisor: Leo Võhandu Tallinn Technical University Toomas.Kirt@mail.ee Abstract: Key words: For the visualisation
More information