Evaluating the SVM Component in Oracle 10g Beta

Department of Computer Science and Statistics
University of Rhode Island
Technical Report TR04-299

Lutz Hamel and Angela Uvarov
Department of Computer Science and Statistics, University of Rhode Island

Susie Stephens
Oracle Corporation

April 15, 2004

Table of Contents

Overview
Test I: Wisconsin Breast Cancer Data
    Results for SVM using a Linear Kernel
    Results for SVM using a Gaussian Kernel
    Results for C4.5
Test II: The Spam Database
    Results for SVM using a Linear Kernel
    Results for SVM using a Gaussian Kernel
    Results for C4.5
Conclusions
References

Overview

The objective of this beta test was to evaluate the data mining functionality of the Oracle 10g release. Particular attention was given to the newly included Support Vector Machine (SVM) component (Vapnik, 1998). We used two datasets from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn) to test this functionality:

- Wisconsin breast cancer data (Mangasarian & Wolberg, 1990).
- Email spam database (created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304).

Both are binary classification problems, chosen specifically so that the Oracle SVM implementation could be compared with a more traditional data mining algorithm. Here we compare the performance of the Oracle SVM implementation with that of the C4.5 decision tree algorithm (Quinlan, 1993). Details about the individual datasets appear in the corresponding sections of this document. We found that the Oracle SVM implementation compares very favorably to the traditional C4.5 decision tree algorithm.

Test I: Wisconsin Breast Cancer Data

This dataset consists of 645 records once duplicates have been eliminated. Of these, 512 records are reserved for training and 133 for testing. The attributes are defined as follows:

- Sample code number: id number
- Clump Thickness: 1-10
- Uniformity of Cell Size: 1-10
- Uniformity of Cell Shape: 1-10
- Marginal Adhesion: 1-10
- Single Epithelial Cell Size: 1-10
- Bare Nuclei: 1-10
- Bland Chromatin: 1-10
- Normal Nucleoli: 1-10
- Mitoses: 1-10
- Class: 2 = benign, 4 = malignant

For details on these attributes please refer to the literature. The class distribution in the dataset is approximately 65% benign and 35% malignant. This is a binary classification problem in which all the independent attributes are categorical.
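In Oracle 10g, models such as the ones evaluated below are built through the DBMS_DATA_MINING PL/SQL package: algorithm choices and parameters are written to a settings table, and CREATE_MODEL is pointed at a training table. The following is a minimal sketch of a linear-kernel SVM build under these assumptions; the package, procedure, and setting-name constants are those documented for Oracle Data Mining, while the table and column names (BC_TRAIN, SAMPLE_ID, CLASS, BC_SVM_LINEAR) are hypothetical and not taken from the report.

    -- Settings table: one (name, value) pair per model setting.
    CREATE TABLE svm_settings (
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(128)
    );

    -- Request the SVM algorithm with a linear kernel.
    INSERT INTO svm_settings VALUES ('ALGO_NAME', 'ALGO_SUPPORT_VECTOR_MACHINES');
    INSERT INTO svm_settings VALUES ('SVMS_KERNEL_FUNCTION', 'SVMS_LINEAR');
    COMMIT;

    -- Build a classification model from the hypothetical training table
    -- BC_TRAIN, keyed by SAMPLE_ID, predicting the target column CLASS.
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'BC_SVM_LINEAR',
        mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
        data_table_name     => 'BC_TRAIN',
        case_id_column_name => 'SAMPLE_ID',
        target_column_name  => 'CLASS',
        settings_table_name => 'SVM_SETTINGS');
    END;
    /

When no explicit values are supplied, Oracle Data Mining derives default parameter values from the data itself, which would be consistent with the very precise complexity-factor and kernel-width values reported in the following sections.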

Results for SVM using a Linear Kernel

Model settings:

    SVMS_COMPLEXITY_FACTOR = 0.16666666666666699

Confusion matrix (actual class in rows, predicted class in columns):

              2 (B)   4 (M)
    2 (B)        93       2
    4 (M)         2      37

Error Rate: 3%
Accuracy: 97%

Results for SVM using a Gaussian Kernel

Model settings:

    SVMS_STD_DEV = 3.7416573867739413
    SVMS_COMPLEXITY_FACTOR = 1.1959376673823801

Confusion matrix:

              2 (B)   4 (M)
    2 (B)        94       1
    4 (M)         0      39

Error Rate: 0.7%
Accuracy: 99.3%

Results for C4.5

Model settings:

    Pruning Confidence: 25%
    Minimum Support: 5

Confusion matrix:

              2 (B)   4 (M)
    2 (B)        86       9
    4 (M)         1      38

Error Rate: 7.5%
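For comparison, the Gaussian-kernel model above corresponds to a settings table along the following lines. This sketch reuses the hypothetical SVM_SETTINGS table from the Overview; only the setting values are copied verbatim from the model settings reported above.

    -- Gaussian kernel with the width and complexity factor reported above.
    DELETE FROM svm_settings;
    INSERT INTO svm_settings VALUES ('ALGO_NAME', 'ALGO_SUPPORT_VECTOR_MACHINES');
    INSERT INTO svm_settings VALUES ('SVMS_KERNEL_FUNCTION', 'SVMS_GAUSSIAN');
    INSERT INTO svm_settings VALUES ('SVMS_STD_DEV', '3.7416573867739413');
    INSERT INTO svm_settings VALUES ('SVMS_COMPLEXITY_FACTOR', '1.1959376673823801');
    COMMIT;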

Accuracy: 92.5%

Simplified Decision Tree:

    Uniformity of Cell Size = 1:  2 (268.0/6.2)
    Uniformity of Cell Size = 2:  2 (34.0/9.3)
    Uniformity of Cell Size = 3:  4 (37.0/19.6)
    Uniformity of Cell Size = 4:  4 (31.0/10.3)
    Uniformity of Cell Size = 5:  4 (27.0/1.4)
    Uniformity of Cell Size = 6:  4 (19.0/1.3)
    Uniformity of Cell Size = 7:  4 (16.0/2.5)
    Uniformity of Cell Size = 8:  4 (22.0/2.5)
    Uniformity of Cell Size = 9:  4 (4.0/2.2)
    Uniformity of Cell Size = 10: 4 (55.0/1.4)

Remarks: Both SVM models are more accurate than the decision tree model. It is interesting to note that the SVM with a Gaussian kernel models this dataset almost perfectly. As the pruned decision tree shows, most of the predictive power in this dataset is due to a single attribute: the uniformity of cell size.

Test II: The Spam Database

Each row in this dataset represents an email message that is either considered spam (Cranor & LaMacchia, 1998) or not. The 57 continuous attributes of the dataset describe word and character frequencies in the email messages. The dataset has 4601 records with a class distribution of approximately 39% spam and 61% non-spam. This is a binary classification problem in which all the independent attributes are continuous. The dataset was divided into 3520 training records and 811 test records.

Results for SVM using a Linear Kernel

Model settings:

    SVMS_COMPLEXITY_FACTOR = 0.11417207662774299

Confusion matrix (actual class in rows, predicted class in columns; 1 = spam):

            0       1
    0     516      26
    1       9     260

Error Rate: 4.3%
Accuracy: 95.7%
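Confusion matrices like the one above can be produced inside the database once a model exists. The sketch below assumes the documented behavior of DBMS_DATA_MINING.APPLY, which writes one row per case and target class with the case id column plus PREDICTION and PROBABILITY; the model, table, and column names (SPAM_SVM_LINEAR, SPAM_TEST, MSG_ID, SPAM_CLASS) are hypothetical.

    -- Score the holdout table SPAM_TEST with the model; the result table
    -- gets one row per (message, class) pair with its probability.
    BEGIN
      DBMS_DATA_MINING.APPLY(
        model_name          => 'SPAM_SVM_LINEAR',
        data_table_name     => 'SPAM_TEST',
        case_id_column_name => 'MSG_ID',
        result_table_name   => 'SPAM_SVM_RESULTS');
    END;
    /

    -- Keep only the top-ranked prediction per message and cross-tabulate
    -- it against the true label to obtain the confusion matrix.
    SELECT actual, predicted, COUNT(*) AS n
    FROM (
      SELECT t.spam_class AS actual,
             r.prediction AS predicted,
             RANK() OVER (PARTITION BY r.msg_id
                          ORDER BY r.probability DESC) AS rnk
      FROM   spam_test t
      JOIN   spam_svm_results r ON r.msg_id = t.msg_id
    )
    WHERE  rnk = 1
    GROUP  BY actual, predicted
    ORDER  BY actual, predicted;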

Results for SVM using a Gaussian Kernel

Model settings:

    SVMS_STD_DEV = 4.812661641473027
    SVMS_COMPLEXITY_FACTOR = 0.75904342468903196

Confusion matrix:

            0       1
    0     522      20
    1      15     254

Error Rate: 4.3%
Accuracy: 95.7%

Results for C4.5

Model settings:

    Pruning Confidence: 25%
    Minimum Support: 10

Confusion matrix:

            0       1
    0     441      33
    1      28     309

Error Rate: 7.5%
Accuracy: 92.5%

Remarks: Again we observe that the SVM models are more accurate, cutting the error rate almost in half. It is also interesting to note a tradeoff between the linear and the Gaussian models at the same error rate: the linear model produces more false positives, while the Gaussian model produces more false negatives.

Conclusions

The SVM component of the Oracle 10g database allowed us to perform several data mining exercises without major problems. The resulting models compared very favorably with the models constructed by the C4.5 decision tree algorithm; in fact, they appear to be superior to the C4.5 models. Of course, a more detailed statistical analysis is necessary to determine whether the difference in model accuracy is statistically significant and how much it depends on the particular split into training and test sets.
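One standard way to carry out such an analysis on a shared test set is McNemar's test, which needs only the counts of test cases on which exactly one of the two classifiers errs. As a sketch, assuming the per-case predictions of both models have been joined into a hypothetical table PAIRED_PREDS with columns ACTUAL, SVM_PRED, and C45_PRED, the discordant counts could be collected as follows; the statistic (|b - c| - 1)^2 / (b + c) is then compared against a chi-square distribution with one degree of freedom.

    -- Count cases where exactly one classifier is wrong (discordant pairs).
    -- b: SVM correct, C4.5 wrong; c: SVM wrong, C4.5 correct.
    SELECT SUM(CASE WHEN svm_pred =  actual AND c45_pred <> actual
                    THEN 1 ELSE 0 END) AS b,
           SUM(CASE WHEN svm_pred <> actual AND c45_pred =  actual
                    THEN 1 ELSE 0 END) AS c
    FROM   paired_preds;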

References

Cranor, L. F., & LaMacchia, B. A. (1998). Spam! Communications of the ACM, 41(8), 74-83.

Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1-18.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.