On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

CAMCOS Report Day, December 9th, 2015, San Jose State University. Project Theme: Classification

The Kaggle Competition. Kaggle is an international platform that hosts data prediction competitions, where students and experts in data science compete. Our CAMCOS team entered two competitions. Team 1: Digit Recognizer (ends December 31st). Team 2: Springleaf Marketing Response (ended October 19th).

Overview of This CAMCOS. Team 1: Wilson A. Florero-Salinas, Carson Sprook, Dan Li, Abhirupa Sen. Problem: Given an image of a handwritten digit, determine which digit it is. Team 2: Xiaoyan Chong, Minglu Ma, Yue Wang, Sha Li. Problem: Identify potential customers for direct marketing. Project supervisor: Dr. Guangliang Chen

Presentation Outline (Team 1): 1. The Digit Recognition Problem 2. Classification in our Data Set 3. Data Preprocessing 4. Classification Algorithms 5. Summary

Classification in our data set. (Diagram: training points and training labels are fed to an algorithm, which produces a model; the model then assigns predicted labels to new points.) Goal: use the training set to predict the labels of new data. MNIST Data

Team 1: The MNIST data set. 28x28 images of handwritten digits 0, 1, ..., 9, size-normalized and centered. 60,000 images are used for training and 10,000 for testing. A subset of data collected by NIST, the US National Institute of Standards and Technology.

Potential Applications. Banking: check deposits. Surveillance: license plates. Shipping: envelopes/packages.

Initial Challenges and Solutions. Challenges: the data set is high dimensional (images stored as 784x1 vectors), which is computationally expensive, and digits are written differently by different people (e.g., left-handed vs. right-handed). Solutions: preprocess the data set, reduce the dimension to increase computation speed, and apply transformations that enhance features important for classification.

Data Preprocessing Methods. In our experiments we have used the following methods: Deskewing, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), 2D LDA, Nonparametric Discriminant Analysis (NDA), Kernel PCA, t-distributed Stochastic Neighbor Embedding (t-SNE), parametric t-SNE, and kernel t-SNE. PCA & LDA

Principal Component Analysis (PCA). Using too many dimensions (784) can be computationally expensive. PCA uses variance as its dimensionality reduction criterion: throw away the directions with the lowest variance.
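Below is a minimal sketch of this step, assuming scikit-learn and placeholder data; the choice of 50 components and all variable names are illustrative, not the team's actual settings.

```python
# Minimal PCA sketch (assumptions: scikit-learn, random placeholder data,
# 50 retained components chosen only for illustration).
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(1000, 784)        # stand-in for the 784-dimensional images
X_test = np.random.rand(100, 784)

pca = PCA(n_components=50)                 # keep the 50 highest-variance directions
X_train_red = pca.fit_transform(X_train)   # fit on training data, then project
X_test_red = pca.transform(X_test)         # project test data onto the same basis

print(pca.explained_variance_ratio_.sum()) # fraction of total variance retained
```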

Linear Discriminant Analysis: reduce dimensionality while preserving as much class-discriminatory information as possible. Methods

Classification Methods. In our experiments we have used the following methods: Nearest Neighbor methods (instance-based), Naive Bayes (NB), Maximum a Posteriori (MAP), Logistic Regression, Support Vector Machines (linear classifier), Neural Networks, Random Forests, and XGBoost.

K nearest neighbors (kNN): a new data point is assigned to the class of the majority of its k nearest neighbors. (Illustration with k=5 and k=9: the test point is closer to class 1, but the majority of its neighbors are from class 2.)
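A minimal kNN sketch of the rule just described, assuming scikit-learn; the data and k=5 are placeholders.

```python
# k-nearest-neighbors sketch (assumptions: scikit-learn, placeholder data, k=5).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(1000, 50)          # e.g., PCA-reduced training vectors
y_train = np.random.randint(0, 10, 1000)    # digit labels 0-9
X_test = np.random.rand(10, 50)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
labels = knn.predict(X_test)                # majority vote among the 5 nearest neighbors
```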

K means: each class is represented by a centroid/average, and a test point is assigned to the class with the nearest centroid. Situation 1: the data is well separated; the test point is predicted to be from class 3, its nearest centroid. Situation 2: the data has non-convex clusters; a test point that belongs to class 2 is misclassified to class 1.

Solution: local k-means. For every class, local centroids are calculated around the test data (see the sketch below). SVMs
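Before moving on to SVMs, here is a hedged sketch of the local k-means rule just described: for each class, average the k training points of that class nearest to the test point, and assign the class whose local centroid is closest. The function name and k=10 are illustrative assumptions.

```python
# Local k-means sketch (assumption: this mirrors the rule described in the slides;
# names and k=10 are illustrative).
import numpy as np

def local_kmeans_predict(x, X_train, y_train, k=10):
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                     # training points of class c
        d = np.linalg.norm(Xc - x, axis=1)             # distances from x to those points
        centroid = Xc[np.argsort(d)[:k]].mean(axis=0)  # local centroid near x
        dist = np.linalg.norm(centroid - x)
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class
```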

Support Vector Machines (SVM) classify new observations by constructing a linear decision boundary

Support Vector Machines (SVM) Decision boundary chosen to maximize the separation m between classes

SVM with multiple classes. SVM is a binary classifier; what if there are more than two classes? Two methods: 1) One vs. Rest, 2) Pairs (one vs. one). One vs. Rest: construct one SVM model for each class; each SVM separates one class from the rest.
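A minimal one-vs-rest sketch with scikit-learn, offered as an illustrative reconstruction with placeholder data rather than the team's code; OneVsRestClassifier trains one binary SVM per class, as described above.

```python
# One-vs-rest sketch (assumptions: scikit-learn, placeholder data).
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X_train = np.random.rand(1000, 50)
y_train = np.random.randint(0, 10, 1000)

ovr = OneVsRestClassifier(LinearSVC())   # ten models: "0 vs rest", ..., "9 vs rest"
ovr.fit(X_train, y_train)
```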

Support Vector Machines (SVM). What if the data cannot be separated by a line? Kernel SVM: separation may be easier in higher dimensions.
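A sketch of a Gaussian (RBF) kernel SVM, assuming scikit-learn; the C and gamma values here are arbitrary placeholders rather than the parameters actually used.

```python
# RBF-kernel SVM sketch (assumptions: scikit-learn, placeholder data and parameters).
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(1000, 50)
y_train = np.random.randint(0, 10, 1000)

svm_rbf = SVC(kernel='rbf', C=10, gamma=0.05)   # the kernel implicitly lifts the data
svm_rbf.fit(X_train, y_train)                   # to a space where separation is easier
```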

Combining PCA with SVM. Traditionally, PCA is applied globally to the entire data set. Our approach: separately apply PCA to each digit class. This extracts the patterns from each digit class, and we can use different parameters for each digit group. (Diagram: ten per-class PCA models feed ten binary SVMs, "0 or not 0" through "9 or not 9"; a test point is assigned the class with the largest positive distance from the boundary.)
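The pipeline in the diagram might be reconstructed roughly as follows, assuming scikit-learn; the 50-component PCA, the RBF parameters, and all names are assumptions for illustration only.

```python
# Per-class ("local") PCA + one-vs-rest SVM sketch (illustrative reconstruction).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X_train = np.random.rand(1000, 784)
y_train = np.random.randint(0, 10, 1000)
X_test = np.random.rand(5, 784)

models = []
for d in range(10):
    pca = PCA(n_components=50).fit(X_train[y_train == d])    # basis for digit d only
    svm = SVC(kernel='rbf').fit(pca.transform(X_train),
                                (y_train == d).astype(int))  # "d or not d"
    models.append((pca, svm))

# Assign each test point to the digit whose SVM gives the largest positive
# distance from its decision boundary.
scores = np.column_stack([svm.decision_function(pca.transform(X_test))
                          for pca, svm in models])
predicted = scores.argmax(axis=1)
```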

Some Challenges for Kernel SVM. It is not obvious what parameters to use when training multiple models: how do we obtain an approximate range of parameters for training? Within each class we compute a corresponding sigma, which gives a starting point for parameter selection.

Parameter selection for SVMs. Using kNN with k=5, we estimate sigma values for each class. An error of 1.25% is obtained using a different kNN-based sigma on each model; an error of 1.2% is achieved with the averaged sigma = 3.9235.
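One plausible reading of this slide, offered only as an assumption, is that each class's sigma is estimated from average k=5 nearest-neighbor distances within that digit class, roughly as sketched below.

```python
# Hedged sketch: estimate a per-class RBF bandwidth (sigma) from k=5 nearest-
# neighbor distances within each digit class, then average the ten sigmas.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_train = np.random.rand(1000, 50)
y_train = np.random.randint(0, 10, 1000)

sigmas = []
for d in range(10):
    Xd = X_train[y_train == d]
    dists, _ = NearestNeighbors(n_neighbors=6).fit(Xd).kneighbors(Xd)  # self + 5
    sigmas.append(dists[:, 1:].mean())        # drop the zero self-distance
sigma_avg = np.mean(sigmas)                   # single sigma for an averaged model
```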

SVM + PCA results

Some misclassified digits (based on local PCA + SVM, deskewed data): 6 to 0, 3 to 5, 8 to 2, 2 to 7, 7 to 3, 8 to 9, 7 to 2, 9 to 4, 4 to 9, 4 to 9. Neural Nets

Neural Networks: Artificial Neuron

Neural Networks: Learning

Neural Networks: Results. Classification rule for ensembles: majority voting (sketched below). Conclusions
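A minimal sketch of the majority-voting rule for an ensemble of networks; the "networks" here are just placeholder label arrays.

```python
# Majority-voting sketch (assumption: each row holds one network's predicted labels).
import numpy as np

preds = np.array([[3, 3, 5],    # network 1's labels for three test images
                  [3, 8, 5],    # network 2
                  [3, 3, 6]])   # network 3

ensemble = np.array([np.bincount(col).argmax() for col in preds.T])
print(ensemble)                 # -> [3 3 5], the most frequent label per image
```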

Summary and Results. Linear methods are not sufficient, as the data is nonlinear; LDA did not work well for our data. Principal Component Analysis worked better than the other dimensionality reduction methods. Best results were obtained with PCA dimensions between 50 and 200 (55 being best). Deskewing improved results in general. The best classifier for this data set is SVM.

Results for MNIST

Questions?

Directions to Lunch

Choosing the optimum k. The optimum k should be chosen with cross-validation: the data set is split into training and held-out validation sets, the algorithm is run on the validation set with different k values, and the k that gives the least misclassification is chosen (see the sketch below). Local PCA: for each class of digits a basis is found by PCA, so local PCA has ten bases instead of one global basis; each test data point is projected onto each of these ten bases.
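A minimal sketch of choosing k by cross-validation as described above, assuming scikit-learn; the candidate k values and the placeholder data are assumptions.

```python
# Cross-validation sketch for choosing k (assumptions: scikit-learn, placeholder data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(500, 50)
y = np.random.randint(0, 10, 500)

best_k, best_acc = None, -np.inf
for k in [1, 3, 5, 7, 9]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k)   # k with the highest cross-validated accuracy (lowest misclassification)
```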

(Appendix figure: local variance; "nice" parametric data with a single center vs. "messy" non-parametric data handled with local k-centers.)