Exercise 3. AMTH/CPSC 445a/545a - Fall Semester 2017. October 7, 2017


Compress your solutions into a single zip file titled <lastname and initials>_assignment3.zip, e.g., for a student named Tom Marvolo Riddle, riddletm_assignment3.zip. Include a single PDF titled <lastname>_assignment3.pdf and any MATLAB or Python scripts specified. Please include your name in the header of all submitted files (on every page of the PDF) and in a comment header in each of your scripts. Your homework should be submitted to Canvas before Friday, October 27, 2017 at 5:00 PM.

Programming assignments should use built-in functions in MATLAB or Python. In general, Python implementations may use the SciPy stack [1]; however, the exercises are designed to emphasize the nuances of data mining algorithms, so if a function exists that solves an entire problem (either in the MATLAB standard library or in the SciPy stack), please consult with the TA before using it.

Problem 1

1. Given N data points and a real-valued numerical attribute, suggest an algorithm with complexity O(N log N) for finding the best binary split of this attribute w.r.t. the Gini index. Justify the complexity of the proposed algorithm.

2. Sketch a full decision tree for the parity function over four binary attributes A, B, C, and D (i.e., the class is + when the number of attributes that are 1 is even, and it is - when this number is odd). The tree should consider a uniform distribution over all possible combinations of these attributes (i.e., 0000, 0001, 0010, 0011, ...). Can this tree be pruned without degrading its classification performance?

3. Consider the following dataset:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Build a decision tree for deciding whether to play tennis or not based on the weather today.

- Your tree construction should be based on the Gini index and multiway splits.
- Provide a sketch of the constructed tree in the submitted PDF.
- Provide justifications for the choice of each attribute in the construction process.

Problem 2

1. Give an example of binary classification with two binary attributes where the Gini index prefers one attribute for a (binary) split while information gain prefers the other attribute (hint: you can use duplicate entries of each combination of values). Explain how such an example can occur, even though we saw in class that both entropy and the Gini index increase and decrease monotonically in the same regions (i.e., [0, 0.5] and [0.5, 1], respectively) for binary classification.

2. Consider the setting of k >= 2 classes, nominal attributes, and multiway splits. Prove that the information gain of a split is always positive (i.e., entropy never increases by splitting a node). Hint #1: you can use Jensen's inequality for the log function to get sum_k a_k log(b_k) / sum_k a_k <= log( sum_k a_k b_k / sum_k a_k ). Hint #2: think in terms of joint and conditional probabilities.

Problem 3

1. Consider the following aggregated dataset with binary attributes:

Mileage  Engine  Air Conditioner  frequency(Car Value = Hi)  frequency(Car Value = Low)
Hi       Good    Working          3                           4
Hi       Good    Broken           1                           2
Hi       Bad     Working          1                           5
Hi       Bad     Broken           0                           4
Lo       Good    Working          9                           0
Lo       Good    Broken           5                           1
Lo       Bad     Working          1                           2
Lo       Bad     Broken           0                           2

Complete the CDTs for the following Bayesian belief network and add them to the submitted PDF:

Using this network, compute Pr[E = Bad, AC = Broken] and add the detailed computation to the submitted PDF.

2. Consider the following Bayesian belief network and add detailed computations of the following probabilities to the submitted PDF:

- Pr[B = good, F = empty, G = empty, S = yes]
- Pr[B = bad, F = empty, G = not empty, S = no]
- The probability that the car starts given that the battery is bad

Problem 4

For this problem you will implement Hunt's decision tree algorithm for categorical data. For each split, your implementation should select the best attribute based on Information Gain, and perform a multi-way split with one child node for each of the possible attribute values (see the lecture slides for a description of Hunt's algorithm and Information Gain). If a leaf node contains more than one class, use majority voting to set the class of the leaf node. If there is a tie in majority voting, choose the smallest class value from those in the tie. If multiple attributes have the same Information Gain, split based on the attribute that appears first in the attribute data matrix.
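For reference, the following is a minimal sketch (assuming NumPy from the allowed SciPy stack) of how the entropy and the Information Gain of a multi-way split on one categorical attribute could be computed. The function and variable names are illustrative only, not a required interface.

import numpy as np

def entropy(labels):
    """Entropy (base 2) of a 1-D NumPy array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Information Gain of a multi-way split: parent entropy minus the
    weighted entropy of the children, one child per attribute value.
    Both arguments are 1-D NumPy arrays of equal length."""
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain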

Your data comes from a curated manifest of the passengers of the Titanic, a British passenger liner that collided with an iceberg in the wee hours of the morning on 15 April 1912. 1,502 people died out of 2,224 total passengers and crewmen. Can you predict who will live and who will die using Hunt's decision tree algorithm?

Two files are given to you. The first, train.csv, is a comma-separated values document that contains 891 entries, each row corresponding to an individual on the Titanic. The columns in this training set are as follows:

Variable     Definition
PassengerId  Passenger ID
survival     Survival (0 = died, 1 = survived)
pclass       Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name         Passenger Name
sex          Sex
Age          Age in years
sibsp        # of siblings / spouses aboard the Titanic
parch        # of parents / children aboard the Titanic
ticket       Ticket number
fare         Passenger fare
cabin        Cabin number
embarked     Port of Embarkation

Upon opening this table you'll first see that some attributes are missing for all or part of the data. You'll also notice that some columns are not in integer form. You'll want to do some preprocessing to clean the data:

- Consider dropping columns with greater than 50% of their entries missing (e.g., cabin number). Rationalize why this could be reasonable given the other variables you have and their relationship with these columns.
- Drop rows that have missing elements. You should have around 700 entries after dropping these rows.
- Come up with a reasonable integer coding scheme for columns that are not categorical. For instance, you may encode age and fare by creating number-coded ranges and assigning values to each respective range.
- Code strings as integers. For instance, sex could be coded as 0 (male), 1 (female). Port of embarkation could be coded as 1, 2, or 3.
- Ignore the name column. Any correlation names have with death rate is spurious; if you find one in your training, you'll be overfitting.

We've prepared preprocessing files that take care of most of this for you in preprocess4.m and preprocess4.py. You'll have to take what we've done as an example and do a similar discretization for the ticket numbers. You can also probably improve upon our preprocessing to improve your accuracy.

The second file you receive is test.csv. This file is similar to train.csv, but the survival column has been removed. We will use your predictive performance over the 418 individuals in test.csv in part to determine your final grade. To get a reasonable benchmark of where your model stands, we have created a performance accuracy assessment that will test your performance predicting over test.csv. This may be found linked as accuracy.html on the website. On this website, you must submit a two-column CSV with a first-row header. The first column should be labeled PassengerId and contain the PassengerId of your prediction. The second column should be labeled Survived and contain the aforementioned binary coding scheme for survival. Sort your CSV on the first column such that the 418 test cases are ordered by their PassengerId.
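As an illustration of the required submission format only, here is a hedged sketch of writing such a file with Python's built-in csv module. The variable predictions and the output file name submission.csv are assumptions made for this example.

import csv

# predictions: list of (passenger_id, survived) pairs, one per row of test.csv (assumed to exist)
predictions = sorted(predictions)                   # sort by PassengerId (first column)

with open("submission.csv", "w", newline="") as f:  # file name is your choice
    writer = csv.writer(f)
    writer.writerow(["PassengerId", "Survived"])    # required first-row header
    writer.writerows(predictions)                   # one row per test case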

Remark. Failure to follow these guidelines will result in an inaccurate or bugged classification metric. Do not download the HTML file; it will not work.

Remark. Do not be discouraged by seemingly modest classification results of 70%. This is a difficult dataset for a decision tree to classify well. World-class models score 95% on this dataset, but they are much more advanced and overfit to this problem than the decision tree you will implement.

Remark. Consider using programming techniques such as recursion or object-oriented programming. MATLAB users should become intimate with struct; Pythonistas can seek similar respite with class.

All of your code for this problem should be in a single file named either lastname_script4.py or lastname_script4.m depending on whether you're using Python or MATLAB. Your solution will be graded by testing your code against the test set. Additionally, it will be tested against testing and training data sets from disparate sources, possibly with different numbers of classes and different numbers of attributes than the Titanic data set. Your code should therefore work in general, not just on the Titanic data set.

For MATLAB or Python, write functions with the following calling sequences:

tr = tree_train(c, nc, x, nx)

where

- c is an m x 1 vector of integers corresponding to the class labels for each of the m samples,
- nc is the number of classes, that is, c(i) is in {1, ..., nc} for i = 1, ..., m,
- x is an m x n matrix of integers corresponding to the n categorical attributes for each of the m training samples,
- nx is a 1 x n vector corresponding to the range of each of the n attributes, that is, x(i, j) is in {1, ..., nx(j)} for all i = 1, ..., m and j = 1, ..., n, and
- tr encodes the tree structure for usage in tree_classify. You may choose the structure of tr.

Second, you should implement a classification algorithm with calling sequence

b = tree_classify(y, tr)

where

- y is a k x n matrix of integers corresponding to the n categorical attributes for each of the k testing samples,
- tr is the tree structure output by your function tree_train, and
- b is a k x 1 vector of integers corresponding to the predicted classes of the k testing samples according to your decision tree.

If you're using MATLAB, complete the exercise only using the core MATLAB language. If you're using Python, complete the exercise only using the core Python language and the SciPy stack. If you have concerns about a particular function or package, please contact the TA.
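Since the structure of tr is left up to you, one possible, purely illustrative Python representation is a nested dictionary. Every field name below is an assumption of this sketch, not a required format.

# One possible encoding of a decision-tree node as a nested dictionary.
leaf = {"is_leaf": True, "label": 1}                            # a leaf predicting class 1
node = {
    "is_leaf": False,
    "attribute": 2,                                             # column index used for the split
    "children": {1: leaf, 2: {"is_leaf": True, "label": 2}},    # one child per attribute value
    "default": 1,                                               # majority class, for unseen values
}

def classify_one(tr, sample):
    """Walk the tree for a single sample (a 1-D sequence of attribute values)."""
    while not tr["is_leaf"]:
        value = sample[tr["attribute"]]
        tr = tr["children"].get(value, {"is_leaf": True, "label": tr["default"]})
    return tr["label"]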

Problem 5

For this problem you will use an implementation of CART to build a decision tree model for a leaf data set. You will evaluate your model using several different methods of cross validation. In particular, you should evaluate your model using leave-one-out cross validation, 2-fold cross validation, and 17-fold cross validation. Each cross validation method will make a total of 340 predictions (one for each sample). For example, for leave-one-out cross validation you should make 340 splits (corresponding to the number of samples) such that each sample is excluded once from training, and its predicted value is determined by this split. For each cross validation method, summarize your predictions in a 30 x 30 confusion matrix (where 30 is the number of classes). Plot each of the resulting confusion matrices (using imagesc in MATLAB or imshow in Python) and include the plots in your assignment PDF. Additionally, the classification accuracy for each method should be computed and included in your PDF. Finally, construct a decision tree on the entire dataset and visualize the resulting tree. The method of visualization is detailed below. Save the resulting image as a PNG, and include the PNG file in your assignment ZIP file (because the resulting tree may be very large, there is no need to include this image in your PDF).

Your code should be contained in a single file based on the template lastname_script5.m or lastname_script5.py depending on whether you're using MATLAB or Python. The template script will load the preprocessed data leaf.mat. The data consists of an m-dimensional vector c of classes represented as positive integers (1-30), and an m x n floating point attribute matrix. A key for the classes and a description of the attributes is provided in leaf_key.txt.

If you're using MATLAB, complete the exercise only using the core MATLAB language. The function fitctree fits a decision tree based on the CART algorithm. The model is trained by

M = fitctree(x, c)

and classification is performed by

c_hat = predict(M, y)

The decision tree M can be visualized in MATLAB by

view(M, 'mode', 'graph')

If you're using Python, you will need to install several packages. First, install scikit-learn for your Python distribution; for example, the package can be installed using pip by pip install -U scikit-learn. Second, install pydotplus for your Python distribution; for example, the package can be installed using pip by pip install -U pydotplus. Third, install the GraphViz binaries; for example, the binaries can be installed using yum by yum install graphviz. You will need to determine the appropriate way to install these packages on your system. For additional information, see scikit-learn.org/stable/install.html.

After installing the appropriate packages, complete the exercise only using the core Python programming language and the following functions as needed:

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from scipy.io import loadmat as load
from numpy import argsort, reshape, transpose, array, zeros
from matplotlib.pyplot import imshow, xlabel, ylabel, title, figure, savefig
from numpy.random import permutation, seed
from pydotplus import graph_from_dot_data

The function DecisionTreeClassifier can be used to train a CART decision tree by

M = DecisionTreeClassifier()
M = M.fit(x, c)

and then classification can be performed by

c_hat = M.predict(y)
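As a rough illustration of the cross-validation bookkeeping described above, here is a hedged sketch of leave-one-out cross validation accumulating a confusion matrix. The names x and c follow the problem statement; the helper name and everything else are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def loo_confusion(x, c, n_classes=30):
    """Leave-one-out cross validation: hold out each sample once and
    accumulate predictions in an n_classes x n_classes confusion matrix."""
    m = len(c)
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for i in range(m):
        train_idx = np.delete(np.arange(m), i)              # every sample except i
        model = DecisionTreeClassifier().fit(x[train_idx], c[train_idx])
        pred = model.predict(x[i:i + 1])[0]
        confusion[c[i] - 1, pred - 1] += 1                   # classes are 1..n_classes
    return confusion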

The tree can be visualized by

dot_data = export_graphviz(M, out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('cart_tree.png')

For additional information see scikit-learn.org/stable/modules/tree.html.

Problem 6

For this problem, you will implement a Naïve Bayesian classifier on a seed data set. Your implementation should use the Laplace Correction to avoid issues with zero probabilities. See the lecture slides for a description of Naïve Bayesian classification and the Laplace Correction.

All your code should be in a single file based off the template lastname_script6.m or lastname_script6.py depending on whether you're using MATLAB or Python. The template will load the preprocessed data seed_data.mat, which consists of an m-dimensional vector of positive integers c (class labels), an integer nc (the number of classes), and an m x n floating point attribute matrix. The file seed_data.txt documents the classes and attributes of this data set.

For both MATLAB and Python your functions should have calling sequence

pr = naivebayes_train(c, nc, x, nk)

where

- c is an m x 1 vector of positive integer class labels,
- nc is the integer number of classes, that is, c(i) is in {1, ..., nc} for all i = 1, ..., m,
- x is an m x n attribute matrix with real values, and
- nk is the number of histogram bins to use.

Moreover, your implementation should represent the probability distributions as histograms with nk bins of equal width. Second, you should implement a classification function

c_hat = naivebayes_classify(pr, y)

which takes your trained Naïve Bayes model pr as well as a k x n attribute matrix y, and returns a k x 1 vector c_hat with the predicted class labels represented as positive integers.

After implementing the Naïve Bayes model, use nk = 6 bins to perform leave-one-out cross validation on the data with m splits (one for each data point). Compute the resulting 3 x 3 confusion matrix. Since the matrix is small, rather than creating an image plot, simply include a printout of the numerical matrix in your PDF. Also compute the classification accuracy of your model, and include this number in your PDF.

Finally, train your Naïve Bayes model on the entire data set (c, x) using bin parameter nk = 6, and perform the following visualization. Select two attributes x_i and compute P(x_i | c_j) P(c_j), which is proportional to the posterior distribution of c_j given x_i, for j = 1, ..., nc. Since nk = 6, your posterior distribution (up to normalization) will consist of 6 numbers corresponding to the 6 bins in your histogram representation. To visualize the distributions, create a line plot with the center of each of your nk bins on the x-axis and the posterior distribution P(x_i | c_j) P(c_j) on the y-axis. Add a legend labeling the class c_j of each line. Altogether, you should create two figures (one per chosen attribute), each with three lines approximating the posterior distribution (up to normalization) of that attribute given each of the three classes.

If you're using MATLAB, complete the exercise only using the core MATLAB language. If you're using Python, complete the exercise only using the core Python language and the SciPy stack. If you have concerns about a particular function or package, please contact the TA.
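To make the equal-width histogram representation and the Laplace Correction concrete, here is a minimal sketch assuming NumPy; the helper name, its arguments, and its return format are illustrative only, not a required interface.

import numpy as np

def laplace_histogram(values, lo, hi, nk):
    """Estimate P(bin) from 1-D attribute values using nk equal-width bins
    over [lo, hi] and the Laplace Correction (add one to every bin count)."""
    edges = np.linspace(lo, hi, nk + 1)
    counts, _ = np.histogram(values, bins=edges)
    probs = (counts + 1.0) / (counts.sum() + nk)    # Laplace Correction
    centers = (edges[:-1] + edges[1:]) / 2.0         # useful for the line plots
    return probs, centers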

References

[1] The SciPy stack specification. [Online]. Available: https://www.scipy.org/stackspec.html