Notation: a pencil icon marks a pencil-and-paper QUIZ; a keyboard icon marks a coding QUIZ.

Definition: Feature Engineering (FE) = the process of transforming the data into an optimal representation for a given application. Scaling (see Chs. 2, 3) is an example of an FE technique. The library of choice here is Pandas, not Numpy.

Introducing Categorical Data/Variables (pp. 213-215)

Note: If the Pandas option display.expand_frame_repr is True, Pandas [1] will not correctly detect the size of IPython's console, and the output will look like this instead:

[1] The Pandas version used here was 0.20.3.
What happens if we set the Pandas option display.expand_frame_repr to True, but the IPython console window is too narrow for the frame?

How do we find the number of rows and columns in a Pandas dataframe? Apply it to the data above.

Pandas includes useful functions for checking and pre-processing the data, e.g.: Find all the different values in the column education.
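A minimal sketch of these checks. The frame below is a tiny hand-made stand-in for the adult/census data used in the text (the real code loads the full dataset); the values are hypothetical:

```python
import pandas as pd

# Tiny stand-in for the adult data used in the text (hypothetical values).
data = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "hours-per-week": [40, 13, 40, 40],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Number of rows and columns:
print(data.shape)                    # (4, 5)

# All distinct values in the column 'education':
print(data["education"].unique())

# value_counts() additionally shows how often each value occurs:
print(data["education"].value_counts())
```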
One-Hot Encoding for Categorical Data (pp. 215-219)

The categorical feature workclass will be replaced by four numerical features. (In this case, the numerical features are binary, so you can guess we're going to use logistic regression.)

What happened to the non-categorical features? Try this, to see some of the new data:

As seen above, we can slice Pandas frames, similarly to Numpy arrays, provided we use the indexer loc. There is one difference, however: in a Pandas slice, the end of the range is included in the range!
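A small sketch of both points, on a stand-in frame (the real code works on the full adult data):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# get_dummies() replaces each string column with one binary (0/1) column
# per distinct value; numeric columns like 'age' pass through unchanged.
dummies = pd.get_dummies(df)
print(list(dummies.columns))
# ['age', 'workclass_Private', 'workclass_State-gov']

# Label-based slicing with .loc INCLUDES the end of the range:
print(dummies.loc[:, "age":"workclass_Private"].shape[1])  # 2 columns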
What if we need a non-contiguous range of columns from the dataframe? For example, how would we restrict the columns to the first 3 ('age', 'hours-per-week', 'workclass_?') and the last 2 ('income_ <=50K', 'income_ >50K')? Hint: Use the Pandas function concat(), which is similar to Numpy's concatenate().

Finally, we can apply logistic regression classification:

Read the scorpion note on p.219 on why we must perform the one-hot encoding on the entire dataset, before splitting it into test and train subsets.

-------------------------------------------
Read the following article: https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
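A sketch of the full encode-then-classify pipeline, again on a tiny hypothetical stand-in for the adult data (note that the one-hot encoding is done on the ENTIRE dataset first, per the scorpion note, and only then is the data split):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny hypothetical stand-in for the adult data.
data = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 30, 23],
    "hours-per-week": [40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 40, 30],
    "workclass": ["State-gov", "Self-emp", "Private", "Private", "Private",
                  "Private", "Private", "Self-emp", "Private", "Private",
                  "State-gov", "Private"],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K",
               "<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# One-hot encode the ENTIRE dataset first, THEN split:
encoded = pd.get_dummies(data)
X = encoded.drop(columns=["income_<=50K", "income_>50K"]).astype(float).values
y = encoded["income_>50K"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)
logreg = LogisticRegression().fit(X_train, y_train)
print("Test score:", logreg.score(X_test, y_test))
```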
Solutions:

What happens if we set the Pandas option display.expand_frame_repr to True, but the IPython console window is too narrow for the frame?
A: The rows roll over individually, making the table hard to read:

Find all the different values in the column education.

What if we need a non-contiguous range of columns from the dataframe? For example, how would we restrict the columns to the first 3 ('age', 'hours-per-week', 'workclass_?') and the last 2 ('income_ <=50K', 'income_ >50K')? Hint: Use the Pandas function concat(), which is similar to Numpy's concatenate().
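One possible solution to the concat() question, sketched on a one-row stand-in frame whose column names mimic the encoded adult data:

```python
import pandas as pd

# One-row stand-in frame mimicking the encoded adult data.
df = pd.DataFrame([[39, 40, 1, 0, 1, 0]],
                  columns=["age", "hours-per-week", "workclass_Private",
                           "workclass_State-gov", "income_<=50K",
                           "income_>50K"])

# Slice the first 3 and the last 2 columns, then glue them together with
# concat() along axis=1 (columns), much like numpy.concatenate():
subset = pd.concat([df.iloc[:, :3], df.iloc[:, -2:]], axis=1)
print(list(subset.columns))
```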
Numbers Can Encode Categoricals

The problem: By default, the Pandas function get_dummies() only applies the one-hot encoding to string features, not numerical ones. Why is this sometimes not accurate? Categorical features may be encoded as integers instead of strings, for simplicity, e.g. male=0, female=1; single=0, married=1, divorced=2, widowed=3.

How do we (humans) know if a feature is categorical or numerical?
A: If the numbers have an underlying relation of order, the feature is numerical; if not, it is categorical.

Classify the following features as either categorical or (truly) numerical:
Race: African American=1, Asian=2, White=3, Pacific Islander=4, American Indian=5
Rating: one star=1, ... five stars=5
Pain level: no pain=0, mild=1-3, moderate=4-6, severe=7-10
Possible solutions:

A. Force get_dummies() to encode the numerical column by providing a list of column names in the parameter columns:

B. Convert the numerical column to string, and then call get_dummies():

Note: In the code example on p.221 of our text, solutions A and B above are mashed together, which is misleading.

C. Use the function sklearn.preprocessing.OneHotEncoder()

-------------------------------------------
Read the following article: https://www.oreilly.com/ideas/what-are-machine-learning-engineers
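The three solutions, sketched side by side on a hypothetical integer-coded column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded categorical column (say, 0=male, 1=female).
df = pd.DataFrame({"sex": [0, 1, 1, 0], "age": [39, 50, 38, 53]})

# A. Force get_dummies() to encode the integer column via `columns=`:
a = pd.get_dummies(df, columns=["sex"])

# B. Convert the column to string first; get_dummies() then treats it
#    as categorical automatically:
df_b = df.copy()
df_b["sex"] = df_b["sex"].astype(str)
b = pd.get_dummies(df_b)

# C. scikit-learn's OneHotEncoder encodes regardless of dtype
#    (it returns a sparse matrix; .toarray() makes it dense):
c = OneHotEncoder().fit_transform(df[["sex"]]).toarray()

print(a.shape, b.shape, c.shape)  # (4, 3) (4, 3) (4, 2)
```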
Binning

We illustrate binning with two regression algorithms: linear regression (LR) and decision tree (DT). The following code is found on p.222 of our text:

Note: The value -1 as a parameter for reshape() means that the function needs to figure out that dimension, based on the total number of elements in the array.

What do you think the shape of line is? Verify your answer with code.

Do you know another way to expand the dimensions of a Numpy array? (There are at least two others.)

Why is the reshaping necessary? Would the code run with line being just the array returned by linspace()?
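A sketch answering the reshape questions (the 1000-point line follows the textbook's code):

```python
import numpy as np

# -1 tells reshape() to infer that dimension from the total element count:
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
print(line.shape)  # (1000, 1)

# Two other ways to add the second dimension:
a = np.linspace(-3, 3, 1000, endpoint=False)[:, np.newaxis]
b = np.expand_dims(np.linspace(-3, 3, 1000, endpoint=False), axis=1)
print(a.shape, b.shape)  # (1000, 1) (1000, 1)

# The reshaping is necessary because scikit-learn estimators expect a 2-D
# array of shape (n_samples, n_features); passing the 1-D array returned
# by linspace() straight into predict() raises an error.
```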
What is the easiest way to reduce the complexity of the DT above? Do you remember another way? (There are several!)

Bins are intervals used to group the data. The simplest way is to create bins of equal size:

Numpy has a handy function that maps data points to bins:

Is the bin membership information numerical or categorical? Explain!
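The handy function is numpy.digitize(); a minimal sketch with ten equal-size bins (the same bins the textbook uses):

```python
import numpy as np

# Ten equal-size bins covering the interval [-3, 3):
bins = np.linspace(-3, 3, 11)

# digitize() maps each data point to the index of the bin it falls into
# (index i means bins[i-1] <= x < bins[i]):
X = np.array([-2.7, -0.1, 0.5, 2.9])
which_bin = np.digitize(X, bins=bins)
print(which_bin)  # [ 1  5  6 10]
```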
Mircea's secret note (do not share with students!): Possible question for final exam: Do the one-hot encoding with the Pandas function get_dummies(), as shown in the previous session.

Let us repeat the two regressions on the binned data:

Please note the plot parameters linewidth and dashes, which are missing from the text code!
Evaluate the R-squared scores for the non-binned and binned predictors (LR and DT) and compare. Hint: As learned in Ch.2, all regressors have a score() method.

Conclusions: The two regressors now give the same predictions. (Because the data in each bin have the same encoded value, the slope within each bin must be zero!) The LR model has clearly benefited (pun intended!) from binning, as it is now much more flexible. The DT model hasn't benefited much. In fact, a DT is able to find its own optimal bins, which are not necessarily equal in size.

Final conclusion: Some models (e.g. LR) benefit from binning, some (e.g. DT) don't.
Solutions:

What is the easiest way to reduce the complexity of the DT above? Do you remember another way? (There are several!)
A: Increase the hyper-parameter min_samples_split. Other parameters that control the complexity of a DT model are max_depth, max_leaf_nodes, min_samples_leaf, etc.

Is the bin membership information numerical or categorical? Explain!
A: Although the bins themselves are categorical, they are naturally sorted (1D, non-overlapping), so their integer representation actually reflects a relation of order. Conclusion: numerical!
Interactions and Polynomials

In the previous example of binned data, we can further increase the accuracy by adding back the original, continuous feature:

Load the program used to generate the plot above [2], and calculate the R-squared score for the new predictor. How does it compare to the one before? It is marginally better than the 0.778 obtained before.

[2] 14_binning_plus_extra_feature.py
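A minimal sketch of "adding back" the continuous feature, again on stand-in wave data (assumption: sin(4x) + x + noise); adding a column to an OLS fit can only raise the training R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in wave data, as in the earlier binning examples (assumption).
rnd = np.random.RandomState(0)
X = rnd.uniform(-3, 3, size=(100, 1))
y = np.sin(4 * X[:, 0]) + X[:, 0] + rnd.normal(size=100) / 2

bins = np.linspace(-3, 3, 11)
# One-hot encode the bin index via row-indexing into an identity matrix:
X_binned = np.eye(10)[np.digitize(X[:, 0], bins=bins) - 1]

# Put the original, continuous feature next to the 10 bin indicators:
X_combined = np.hstack([X, X_binned])

score_binned = LinearRegression().fit(X_binned, y).score(X_binned, y)
score_combined = LinearRegression().fit(X_combined, y).score(X_combined, y)
print("binned only:       ", round(score_binned, 3))
print("binned + original: ", round(score_combined, 3))
```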
Even better accuracy is obtained by multiplying the original feature with each column of the binned dataset:

Instead of calculating the products manually, we can use a preprocessing tool. (Note that we're not binning anymore!):

Now we can apply LR on the new, extended dataset:
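Both ideas in miniature: the manual products via broadcasting, and the preprocessing tool, which is sklearn.preprocessing.PolynomialFeatures (degree=10, include_bias=False, as in the text):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])

# Products by hand: broadcasting multiplies the original feature into
# each one-hot bin column (toy 2-bin example):
X_binned = np.array([[1.0, 0.0], [0.0, 1.0]])
X_product = np.hstack([X_binned, x * X_binned])
print(X_product.shape)     # (2, 4): the bin columns, then x * each column

# The preprocessing tool, without binning: features x, x**2, ..., x**10.
# include_bias=False drops the constant-1 column.
poly = PolynomialFeatures(degree=10, include_bias=False)
X_poly = poly.fit_transform(x)
print(X_poly.shape)        # (2, 10)
```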
The score is substantially better than the 0.624 obtained previously with the LR.

What happens when we increase the degree of the polynomials involved? First: Read the textbook! Second: Increase the number of points in the wave generator to 1000, and apply the usual split into train and test subsets. Use a random seed of 0 and a 75-25 partition. Experiment with polynomial degrees between 5 and 30 and report what happens. Conclusion?

Reading assignment:
Comparing polynomial regression with kernel SVM (p.231)
Polynomial features and ridge regression for the Boston housing dataset (pp.232-234)
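A sketch of the degree experiment. The wave generator here is a stand-in (assumption: sin(4x) + x + noise); the seed-0, 75-25 split follows the exercise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Stand-in for the wave generator (assumption: sin(4x) + x + noise).
def make_wave(n_samples=1000):
    rnd = np.random.RandomState(0)
    x = rnd.uniform(-3, 3, size=n_samples)
    y = np.sin(4 * x) + x + rnd.normal(size=n_samples) / 2
    return x.reshape(-1, 1), y

X, y = make_wave(1000)
# Random seed 0 and a 75-25 partition, as the exercise requires:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.25)

scores = []
for degree in (5, 10, 20, 30):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    reg = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    scores.append(reg.score(poly.transform(X_test), y_test))
    print(f"degree {degree:2d}: test R^2 = {scores[-1]:.3f}")
```

At very high degrees, watch for numerical conditioning problems (x**30 spans many orders of magnitude over [-3, 3]) as well as overfitting.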
Univariate non-linear transformations

Univariate means: each feature is transformed in isolation (by itself). Note: There are bivariate and, in general, multivariate transformations!

We are using another synthetic dataset:

To understand the data better, we aggregate the first column of X into bins of size 1:
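A sketch of this section's whole experiment, assuming the construction described in the text (hidden Gaussian features driving Poisson counts): it generates the data, inspects the count distribution with bins of size 1, and compares Ridge before and after the log transform discussed next:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumed construction: hidden Gaussian features drive Poisson counts;
# the target depends on the hidden (Gaussian) values.
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)
X = rnd.poisson(10 * np.exp(X_org))  # integer counts: many small, few huge
y = np.dot(X_org, w)

# Aggregate the first column into bins of size 1 (one bin per integer):
print("appearances of the first few values:", np.bincount(X[:, 0])[:8])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score_raw = Ridge().fit(X_train, y_train).score(X_test, y_test)

# log(X + 1) avoids taking the logarithm of the zero counts:
X_train_log, X_test_log = np.log(X_train + 1), np.log(X_test + 1)
score_log = Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)

print("Ridge on raw counts:", round(score_raw, 3))
print("Ridge on log(X + 1):", round(score_log, 3))
```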
This type of distribution, with many small values and fewer large ones, is very common in real-life datasets. The very large values can be considered outliers. Unfortunately, linear models do not handle these differences in point density (or outliers) well:

Solution: Apply a (non-linear) transformation that makes the density more uniform, e.g. the logarithm:

Now we apply the same linear model on the engineered data, with much better results:

Conclusions on feature engineering methods (bins, interactions, polynomials, non-linear transformations): They can make a huge difference for linear and naive Bayes algorithms, but they have little relevance for tree-based algorithms. For kNN, SVMs and ANNs, they can be useful, but it is less clear how to discover the appropriate engineering method. Even for linear algorithms, it is rarely the case that the same engineering method is beneficial for all features - usually we engineer groups of similar features separately, or even individual features! This is why understanding the data is so important - use visualization tools, like plots, matrix plots, binning/histograms, extracting principal components, etc.

Skip Automatic Feature Selection