Introducing Categorical Data/Variables


Notation: [pencil icon] means a pencil-and-paper QUIZ; [keyboard icon] means a coding QUIZ.

Definition: Feature Engineering (FE) = the process of transforming the data into an optimal representation for a given application. Scaling (see Chs. 2, 3) is an example of an FE technique. The library of choice here is Pandas, not Numpy.

Introducing Categorical Data/Variables (pp. 213-5)

Note: If the Pandas option display.expand_frame_repr is True, Pandas (version 0.20.3 was used here) will not correctly detect the size of IPython's console, and the output will look like this instead:
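A minimal setup sketch, assuming the adult census file lives at data/adult.data (the path, column names, and column subset follow the textbook and may need adjusting):

    import pandas as pd

    # make the display behavior explicit; toggle this option to reproduce the output discussed above
    pd.set_option('display.expand_frame_repr', False)

    # load the adult census data (the raw file has no header row, so we supply column names)
    data = pd.read_csv(
        'data/adult.data', header=None, index_col=False,
        names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
               'marital-status', 'occupation', 'relationship', 'race', 'gender',
               'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
               'income'])

    # keep only the columns used in the textbook example
    data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
                 'occupation', 'income']]
    print(data.head())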

What happens if we set the Pandas option display.expand_frame_repr to True, but the IPython console window is too narrow for the frame?

How do we find the number of rows and columns in a Pandas dataframe? Apply this to the data above.

Pandas includes useful functions for checking and pre-processing the data, e.g.: find all the different values in the column education.
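Possible answers, assuming the data frame loaded in the sketch above:

    # number of rows and columns
    print(data.shape)

    # all distinct values in the column 'education'
    print(data['education'].unique())
    # value_counts() also shows how often each value occurs
    print(data['education'].value_counts())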

One-Hot Encoding for Categorical Data (pp. 215-9)

The categorical feature workclass will be replaced by four numerical features. (In this case, the numerical features are binary, so you can guess we're going to use logistic regression.) What happened to the non-categorical features? Try the sketch below to see some of the new data.

We can slice Pandas frames similarly to Numpy arrays, provided we use the indexer loc. There is one difference, however: in a Pandas slice, the end of the range is included in the range!
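A sketch of the encoding and the loc slicing. Note that column labels such as 'occupation_ Transport-moving' keep the leading space present in the raw adult data; adjust to your actual labels:

    # one-hot encode all string-valued columns; numerical columns pass through unchanged
    data_dummies = pd.get_dummies(data)
    print(list(data_dummies.columns))

    # label-based slicing with .loc: BOTH endpoints of the label range are included
    print(data_dummies.loc[:, 'age':'occupation_ Transport-moving'].head())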

What if we need a non-contiguous range of columns from the dataframe? For example, how would we restrict the columns to the first 3 ('age', 'hours-per-week', 'workclass_ ?') and the last 2 ('income_ <=50K', 'income_ >50K')? Hint: Use the Pandas function concat(), which is similar to Numpy's concatenate(). Finally, we can apply logistic regression classification (see the sketch below).

Read the scorpion note on p. 219 on why we must perform the one-hot encoding on the entire dataset before splitting it into train and test subsets.

-------------------------------------------
Read the following article: https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
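One possible solution for the non-contiguous selection, plus the logistic regression step; this again assumes the data_dummies frame from above, and the exact column labels depend on the raw data:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # non-contiguous columns: concatenate two label slices along the column axis
    subset = pd.concat([data_dummies.loc[:, 'age':'workclass_ ?'],
                        data_dummies.loc[:, 'income_ <=50K':'income_ >50K']],
                       axis=1)
    print(subset.head())

    # logistic regression on the one-hot encoded features, with the binary income as the target
    features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
    X = features.values
    y = data_dummies['income_ >50K'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Test score: {:.3f}".format(logreg.score(X_test, y_test)))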

Solutions:

What happens if we set the Pandas option display.expand_frame_repr to True, but the IPython console window is too narrow for the frame? A: The rows roll over individually, making the table hard to read.

Find all the different values in the column education.

What if we need a non-contiguous range of columns from the dataframe? For example, how would we restrict the columns to the first 3 ('age', 'hours-per-week', 'workclass_ ?') and the last 2 ('income_ <=50K', 'income_ >50K')? Hint: Use the Pandas function concat(), which is similar to Numpy's concatenate().

Numbers Can Encode Categoricals

The problem: By default, the Pandas function get_dummies() only applies the one-hot encoding to string features, not to numerical ones. Why is this sometimes not accurate? Categorical features may be encoded as integers instead of strings, for simplicity, e.g. male=0, female=1; single=0, married=1, divorced=2, widowed=3. (A small demonstration follows below.)

How do we (humans) know if a feature is categorical or numerical? A: If the numbers have an underlying relation of order, the feature is numerical; if not, it is categorical.

Classify the following features as either categorical or (truly) numerical:
Race: African American=1, Asian=2, white=3, Pacific Islander=4, American Indian=5
Rating: one star=1, ..., five stars=5
Pain level: no pain=0, mild=1-3, moderate=4-6, severe=7-10
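A small demonstration of the problem, along the lines of the textbook's toy example (the frame and its values are illustrative only):

    import pandas as pd

    # a toy frame with one integer-coded categorical and one string categorical
    demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                            'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

    # by default, only the string column gets one-hot encoded; the integer column is left as-is
    print(pd.get_dummies(demo_df))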

Possible solutions (sketches follow below):

A. Force get_dummies() to encode the numerical column by providing a list of column names in the parameter columns.

B. Convert the numerical column to string, and then call get_dummies(). Note: In the code example on p. 221 of our text, solutions A and B above are mashed together, which is misleading.

C. Use sklearn.preprocessing.OneHotEncoder().

-------------------------------------------
Read the following article: https://www.oreilly.com/ideas/what-are-machine-learning-engineers
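Sketches of the three solutions, continuing the demo_df example above; the OneHotEncoder call assumes a scikit-learn version that accepts string columns (0.20 or later):

    # A. force get_dummies() to also encode the integer column via the columns= parameter
    print(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))

    # B. convert the integer column to strings first; then get_dummies() encodes it too
    demo_str = demo_df.copy()
    demo_str['Integer Feature'] = demo_str['Integer Feature'].astype(str)
    print(pd.get_dummies(demo_str))

    # C. scikit-learn's OneHotEncoder encodes every column it is given, regardless of dtype
    from sklearn.preprocessing import OneHotEncoder
    encoder = OneHotEncoder()
    print(encoder.fit_transform(demo_df).toarray())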

Binning

We are illustrating binning for two regression algorithms: linear regression (LR) and decision tree (DT). The following code is found on p. 222 of our text (a sketch is given below).

Note: The value -1 as a parameter for reshape() means that the function needs to figure out that dimension based on the total number of elements in the array.

What do you think the shape of line is? Verify your answer with code. Do you know another way to expand the dimensions of a Numpy array? (There are at least two others.) Why is the reshaping necessary? Would the code run with line being just the array returned by linspace()?
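A sketch of the setup from p. 222; it assumes the mglearn helper package that accompanies the textbook:

    import numpy as np
    import mglearn
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    X, y = mglearn.datasets.make_wave(n_samples=100)

    # -1 lets reshape() infer that dimension; the result is a (1000, 1) column vector
    # (other ways to add the dimension: arr[:, np.newaxis] or np.expand_dims(arr, axis=1)
    #  applied to the 1-D array returned by linspace)
    line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

    reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
    dt_pred = reg.predict(line)
    reg = LinearRegression().fit(X, y)
    lr_pred = reg.predict(line)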

What is the easiest way to reduce the complexity of the DT above? Do you remember another way? (There are several!)

Bins are intervals used to group the data. The simplest way is to create bins of equal sizes. Numpy has a handy function that maps data points to bins (see the sketch below).

Is the bin membership information numerical or categorical? Explain!
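A sketch of equal-width bins and of the bin-membership function, following the textbook:

    # ten equal-width bins covering the interval [-3, 3]
    bins = np.linspace(-3, 3, 11)
    print("bins: {}".format(bins))

    # np.digitize maps each data point to the index of the bin it falls into
    which_bin = np.digitize(X, bins=bins)
    print(X[:5])
    print(which_bin[:5])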

Mircea's secret note (do not share with students!) Possible question for the final exam: do the one-hot encoding with the Pandas function get_dummies(), as shown in the previous session.

Let us repeat the two regressions on the binned data. Please note the plot parameters linewidth and dashes, which are missing from the text code!
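A sketch of the regressions on the binned data; the textbook uses scikit-learn's OneHotEncoder for the bin indicators (get_dummies would work too, provided the categories are kept consistent between X and line):

    import matplotlib.pyplot as plt
    from sklearn.preprocessing import OneHotEncoder

    # one-hot encode the bin membership for both the data and the plotting grid
    encoder = OneHotEncoder().fit(which_bin)
    X_binned = encoder.transform(which_bin).toarray()
    line_binned = encoder.transform(np.digitize(line, bins=bins)).toarray()

    # repeat the two regressions on the binned data; linewidth and dashes keep the curves readable
    reg = LinearRegression().fit(X_binned, y)
    plt.plot(line, reg.predict(line_binned), label='linear regression binned', linewidth=3)
    reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
    plt.plot(line, reg.predict(line_binned), dashes=[5, 2], label='decision tree binned')
    plt.plot(X[:, 0], y, 'o', c='k')
    plt.legend(loc='best')
    plt.show()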

Evaluate the R-squared scores for the non-binned and binned predictors (LR and DT) and compare. Hint: As learned in Ch. 2, all regressors have a score() method.

Conclusions: On the binned data, the two regressors now give the same predictions. (Because the data points in each bin have the same encoded value, the slope within a bin must be zero!) The LR model has clearly benefitted (pun intended!) from binning, as it is now much more flexible. The DT model hasn't benefitted much; in fact, a DT is able to find its own optimal bins, which are not necessarily equal in size.

Final conclusion: Some models (e.g. LR) benefit from binning, some (DT) don't.
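A quick comparison sketch using the score() method (R-squared), evaluated on the training data as in the textbook figures:

    # compare R^2 for the non-binned vs. binned versions of each regressor
    for name, feats in [('non-binned', X), ('binned', X_binned)]:
        lr = LinearRegression().fit(feats, y)
        dt = DecisionTreeRegressor(min_samples_split=3).fit(feats, y)
        print("{:>10}: LR R^2 = {:.3f}   DT R^2 = {:.3f}".format(
            name, lr.score(feats, y), dt.score(feats, y)))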

Solutions:

What is the easiest way to reduce the complexity of the DT above? Do you remember another way? (There are several!) A: Increase the hyper-parameter min_samples_split. Other parameters that control the complexity of a DT model are max_depth, max_leaf_nodes, min_samples_leaf, etc.

Is the bin membership information numerical or categorical? Explain! A: Although the bins themselves are categorical, they are naturally sorted (1-D, non-overlapping), so their integer representation actually reflects a relation of order. Conclusion: numerical!

Interactions and Polynomials

In the previous example of binned data, we can further increase the accuracy by adding back the original, continuous feature (see the sketch below).

Load the program used to generate the plot above (14_binning_plus_extra_feature.py) and calculate the R-squared score for the new predictor. How does it compare to the one before? It is marginally better than the 0.778 obtained before.
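A sketch of adding the original feature back next to the bin indicators:

    # stack the continuous feature next to the one-hot bin indicators
    X_combined = np.hstack([X, X_binned])
    line_combined = np.hstack([line, line_binned])

    reg = LinearRegression().fit(X_combined, y)
    print("R^2 with bins + original feature: {:.3f}".format(reg.score(X_combined, y)))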

Even better accuracy is obtained by multiplying the original feature with each column of the binned dataset. Instead of calculating the products manually, we can use a preprocessing tool. (Note that we're not binning anymore!) Now we can apply LR on the new, extended dataset (see the sketch below).
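Sketches of the product features and of the PolynomialFeatures preprocessor; the degree of 10 follows the textbook's example:

    from sklearn.preprocessing import PolynomialFeatures

    # interaction features: the original feature multiplied by each bin-indicator column
    X_product = np.hstack([X_binned, X * X_binned])
    line_product = np.hstack([line_binned, line * line_binned])
    reg = LinearRegression().fit(X_product, y)
    print("R^2 with products: {:.3f}".format(reg.score(X_product, y)))

    # no more binning: expand the original feature into polynomial features up to degree 10
    poly = PolynomialFeatures(degree=10, include_bias=False)
    X_poly = poly.fit_transform(X)
    line_poly = poly.transform(line)
    reg = LinearRegression().fit(X_poly, y)
    print("R^2 with degree-10 polynomial: {:.3f}".format(reg.score(X_poly, y)))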

The score with the LR is substantially better than the 0.624 obtained previously.

What happens when we increase the degree of the polynomials involved? First: read the textbook! Second: increase the number of points in the wave generator to 1000, and apply the usual split into train and test subsets. Use a random seed of 0 and a 75-25 partition. Experiment with polynomial degrees between 5 and 30 and report what happens. Conclusion? (A starting sketch follows the reading assignment below.)

Reading assignment:
Comparing polynomial regression with kernel SVM (p. 231)
Polynomial features and ridge regression for the Boston housing dataset (pp. 232-4)
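A starting sketch for the experiment above (1000 wave samples, 75-25 split with random seed 0; stepping through the degrees in increments of 5 is just a suggestion):

    from sklearn.model_selection import train_test_split

    X_big, y_big = mglearn.datasets.make_wave(n_samples=1000)
    X_train, X_test, y_train, y_test = train_test_split(
        X_big, y_big, test_size=0.25, random_state=0)

    for degree in range(5, 31, 5):
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)
        reg = LinearRegression().fit(X_train_poly, y_train)
        print("degree {:2d}: train R^2 = {:.3f}   test R^2 = {:.3f}".format(
            degree, reg.score(X_train_poly, y_train), reg.score(X_test_poly, y_test)))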

Univariate non-linear transformations

Univariate means that each feature is transformed in isolation (by itself). Note: There are bivariate and, in general, multivariate transformations!

We are using another synthetic dataset. To understand the data better, we aggregate the first column of X into bins of size 1 (see the sketch below).
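A sketch of the synthetic dataset and the aggregation; the Poisson-count construction and its constants follow the textbook:

    # synthetic count data: Poisson counts whose intensity depends on hidden Gaussian features
    rnd = np.random.RandomState(0)
    X_org = rnd.normal(size=(1000, 3))
    w = rnd.normal(size=3)
    X = rnd.poisson(10 * np.exp(X_org))
    y = np.dot(X_org, w)

    # bins of size 1: np.bincount counts how often each integer value 0, 1, 2, ... occurs
    print("Number of appearances of each value in the first feature:")
    print(np.bincount(X[:, 0]))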

This type of distribution, with many small values and fewer large ones, is very common in real-life datasets; the very large values can be considered outliers. Unfortunately, linear models do not handle these differences in point density (or the outliers) well.

Solution: Apply a (non-linear) transformation that makes the density more uniform, e.g. the logarithm. We then apply the same linear model on the engineered data, with much better results (see the sketch below).
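A sketch of the transformation and the refit; the textbook uses Ridge as the linear model here, and the +1 inside the log avoids log(0) for zero counts:

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # linear model on the raw counts: relatively poor fit
    print("Test R^2, raw counts: {:.3f}".format(
        Ridge().fit(X_train, y_train).score(X_test, y_test)))

    # log(x + 1) makes the value distribution far more symmetric
    X_train_log = np.log(X_train + 1)
    X_test_log = np.log(X_test + 1)
    print("Test R^2, log-transformed: {:.3f}".format(
        Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)))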

Conclusions on feature engineering methods (bins, interactions, polynomials, non-linear transformations): They can make a huge difference in linear and naive Bayes algorithms, but they have little relevance in tree-based algorithms. In kNN, SVMs and ANNs, they can be useful, but it is less clear how to discover the appropriate engineering method. Even in linear algorithms, it is rarely the case that the same engineering method is beneficial for all features; usually we engineer groups of similar features, or even individual features, separately. This is why understanding the data is so important: use visualization tools like plots, matrix plots, and binning/histograms, extract principal components, etc.

Skip Automatic Feature Selection.