TITANIC: Predicting Survival Using Classification Algorithms


Nicholas King, IE, May 2016

PROJECT OVERVIEW

> Historical Background
> Project Intent
> Data: Target and Feature Variables
> Initial Exploration & Feature Engineering
> SVM Models
> Logistic Regression Model
> Decision Tree Model
> Random Forest Model
> k-Nearest Neighbors & Naive Bayes Models
> Summary: Important Variables and Future Steps


HISTORICAL BACKGROUND

Over a century ago, on April 15, 1912, one of the greatest shipwrecks in history occurred. The RMS Titanic was the largest ship afloat and was billed as unsinkable. It sank on its maiden voyage to the Port of New York, off the coast of Canada in the icy waters of the Atlantic. Despite several reports that Captain Edward John Smith had been warned to avoid the area because of icebergs, the Titanic plowed ahead (some say at excessive speed) until it was too late and an iceberg dealt a glancing blow to the ship's hull. The ship began taking on water just before midnight and sank rapidly; by 2:20 am, with hundreds of people still on board, it plunged beneath the waves. Despite repeated distress calls and flares, the first rescue ship, the RMS Carpathia, arrived nearly two hours later, pulling more than 700 people from the water. A lack of lifeboats further contributed to the disaster's death toll: amid the pandemonium and chaos, several lifeboats were launched at only half capacity, while others simply floated away. Women and children were saved first, so the greatest number of deaths were male. Of the roughly 2,223 passengers and crew aboard the Titanic, about 1,500 died. The wreck descended two miles to the ocean floor of the Atlantic and went undiscovered for decades, until ocean explorer Robert Ballard found it off the coast of Newfoundland in 1985.

PROJECT INTENT

This study was done with the purpose of using machine learning classification to analyze what kinds of people were more likely to survive the Titanic disaster. The classification methods evaluated were:
- Support Vector Machine (Linear and Radial)
- Logistic Regression
- Decision Tree
- Random Forest
- k-Nearest Neighbors
- Naive Bayes

THE DATA

Variable Descriptions:
- Survived: Survival (0 = No; 1 = Yes) [TARGET VARIABLE]
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name: Name
- Sex: Sex
- Age: Age
- SibSp: Number of Siblings/Spouses Aboard
- Parch: Number of Parents/Children Aboard
- Ticket: Ticket Number
- Fare: Passenger Fare
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Notes:
- Pclass is a proxy for socio-economic status (SES): 1st = Upper; 2nd = Middle; 3rd = Lower.
- With respect to the family-relation variables (SibSp and Parch), some relations were ignored. The definitions used are:
  - Sibling: brother, sister, stepbrother, or stepsister of a passenger aboard the Titanic
  - Spouse: husband or wife of a passenger aboard the Titanic (mistresses and fiancés ignored)
  - Parent: mother or father of a passenger aboard the Titanic
  - Child: son, daughter, stepson, or stepdaughter of a passenger aboard the Titanic

THE DATA

Survived is the target variable. Name, Ticket Number, and Cabin were ignored, since they are all unique identifiers. Missing values were imputed to make a complete dataset so that all of the models could be run.
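
As a rough illustration of this preparation step, here is a minimal R sketch; the file name and the conversion of categoricals to factors are assumptions, since the deck does not show the original code:

```r
# Minimal data-preparation sketch (assumed, not the original project code).
# Assumes the standard Kaggle "train.csv" file and column names.
train <- read.csv("train.csv", stringsAsFactors = FALSE)

# Drop the unique identifiers, which carry no predictive signal as-is.
train <- train[, !(names(train) %in% c("Name", "Ticket", "Cabin"))]

# Treat the categorical variables as factors; Survived is the target.
for (col in c("Survived", "Pclass", "Sex", "Embarked")) {
  train[[col]] <- factor(train[[col]])
}

str(train)  # verify the structure before imputing missing values
```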

John Jacob "Jack" Astor IV (July 13, 1864 - April 15, 1912) was an American businessman, real estate builder, investor, inventor, writer, lieutenant colonel in the Spanish-American War, and a prominent member of the Astor family. He was the richest passenger aboard the Titanic and was thought to be among the richest people in the world at that time, with a net worth of nearly $87 million when he died (equivalent to $2.13 billion in 2015).

INITIAL EXPLORATION & FEATURE ENGINEERING

SUPPORT VECTOR MACHINES
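
The original slides for this section show only plots. As a hedged sketch, the linear and radial SVMs could be fit in R with the e1071 package (an assumption; the deck does not show its code), using the imputed, complete dataset described earlier:

```r
# Sketch of the two SVM variants evaluated (assumes the e1071 package
# and the imputed, complete training set; not the original project code).
library(e1071)

svm_linear <- svm(Survived ~ ., data = train, kernel = "linear")
svm_radial <- svm(Survived ~ ., data = train, kernel = "radial")

# Compare training-set accuracy of the two kernels.
mean(predict(svm_linear, train) == train$Survived)
mean(predict(svm_radial, train) == train$Survived)
```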

LOGISTIC REGRESSION

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables (in our case, 10) determine an outcome. The outcome is measured by the dichotomous variable Survived, which is either 0 or 1. From the output of the logistic regression we can see which coefficients are significant by their p-values. Here, Pclass, Sex, Age, and SibSp all have extremely small p-values and thus play an important role in the predictions. A common way to represent the accuracy of a logistic regression is a receiver operating characteristic (ROC) curve, which illustrates the performance of a binary classifier. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is also known as sensitivity, or recall in machine learning; the false positive rate is also known as the fall-out and can be calculated as 1 - specificity. The area under the curve (AUC) is frequently used for model comparison: the higher the AUC, the better.
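
A minimal sketch of this model in R, assuming the imputed dataset and using the pROC package for the ROC curve (an assumed choice; the deck does not name its plotting code):

```r
# Logistic regression: a binomial GLM on the dichotomous target Survived.
logit_fit <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                 data = train, family = binomial)
summary(logit_fit)  # the p-values flag Pclass, Sex, Age, and SibSp

# ROC curve and AUC via the pROC package (an assumed choice).
library(pROC)
probs   <- predict(logit_fit, type = "response")
roc_obj <- roc(train$Survived, probs)
plot(roc_obj)  # TPR (sensitivity) against FPR (1 - specificity)
auc(roc_obj)   # higher is better; roughly 0.85 in this study
```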

LOGISTIC REGRESSION

[Figure: ROC curve for the logistic regression model; AUC ≈ 85%.]


DECISION TREE

Decision trees are great because they're intuitive and, after a brief explanation, can be read by people with little experience in machine learning. The algorithm starts with all of the data at the root node (the top box) and scans all of the variables to find the best one to split on. The completed decision tree is read from the root node down: the root reflects the fact that 62% of passengers died and 38% lived. Moving down the branches, we see that if the passenger was male (moving left at the "Sex = male" split), he had a 19% chance of survival; males represented 65% of the passengers.

DECISION TREE

The final nodes at the bottom of the decision tree are known as terminal nodes. After all the Boolean choices have been made for a given passenger, they end up in one of the terminal nodes, and the majority vote in that bucket determines the prediction for new passengers whose fate is unknown. Recall that the logistic regression found the most important variables to be Sex, Age, Pclass, and SibSp; the decision tree agrees.
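
Since the deck later refers to rpart's "class" and "anova" methods, the tree was presumably grown with the rpart package. A minimal sketch (the rpart.plot package and the test-set name are assumptions):

```r
# Decision tree sketch using rpart, implied by the "class"/"anova"
# method names cited later in this deck.
library(rpart)
library(rpart.plot)  # assumed: one common way to draw the fitted tree

tree_fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                  data = train, method = "class")

rpart.plot(tree_fit)  # root: ~62% died, ~38% lived; first split on Sex

# Each new passenger falls into a terminal node; the majority vote there
# determines the prediction ("test" is the assumed held-out frame).
pred <- predict(tree_fit, test, type = "class")
```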

RANDOM FOREST

Random forests get past the overfitting problems of single decision trees. If we take a large collection of individually imperfect models, one model's mistakes will generally not be repeated by the rest, so we can average out the results of these models to get a superior model; the combination typically ends up better than any individual model. Since the procedure for building a single decision tree is the same every time, some source of randomness is required to make the trees differ from one another. Through these sources of randomness, the ensemble contains a collection of unique trees which all make their classifications differently. Each tree is asked to classify a given passenger, the votes are tallied (possibly over hundreds or thousands of trees), and the majority decision is chosen. Since each tree is grown out fully, each one overfits, but in different ways, so the mistakes of one are averaged out over them all.

Missing values have to be cleaned up before a random forest can be used. For the couple of missing values in the Fare variable, we can safely fill in the median fare from the data. There is a substantial number of missing values in the Age variable, however. To fill in these missing values we can use a decision tree as we did previously, but this time specifying the method as "anova" instead of "class", since we are predicting a continuous variable rather than a categorical one. The following plot tells us which variables are important.
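
A hedged sketch of the cleanup and forest described above, assuming the train and test sets have been stacked into one frame called full (a hypothetical name) and using the randomForest package:

```r
# Sketch of the cleanup and random forest fit (assumed code; "full" is a
# hypothetical frame stacking the train and test sets).
library(rpart)
library(randomForest)

# A couple of Fare values are missing: fill them with the median fare.
full$Fare[is.na(full$Fare)] <- median(full$Fare, na.rm = TRUE)

# Age has many missing values: predict them with a regression tree,
# using method = "anova" (not "class") because Age is continuous.
age_tree <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
                  data = full[!is.na(full$Age), ], method = "anova")
full$Age[is.na(full$Age)] <- predict(age_tree, full[is.na(full$Age), ])

# Grow the forest; importance = TRUE stores the two measures discussed next.
rf_fit <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp +
                         Parch + Fare + Embarked,
                       data = full[!is.na(full$Survived), ],
                       ntree = 500, importance = TRUE)
```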

RANDOM FOREST

Two types of importance measures are shown. Mean Decrease Accuracy tests how much worse the model performs without each variable, so a large decrease in accuracy is expected for highly predictive variables. Mean Decrease Gini reflects the mathematics behind decision trees: it measures how pure the terminal nodes are, again testing the effect of removing each variable, with a high score indicating an important variable. Once again we see how important the variables Sex, Pclass, and Age are in determining survival.
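
Both measures are stored in the fitted forest when importance = TRUE; the plot described above is the standard randomForest importance plot:

```r
# Draw the two importance panels: Mean Decrease Accuracy and
# Mean Decrease Gini, one row per predictor.
varImpPlot(rf_fit)
importance(rf_fit)  # the same numbers in table form
```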


k-NEAREST NEIGHBORS & NAIVE BAYES

The kNN model is another widely used algorithm for object classification in data science and analytics, and it is much simpler and more intuitive than the previous models. To run the kNN model in R we also need complete training and testing datasets with no missing values, as with the Random Forest model. In R, the knn function classifies the test set from the training set: for each row of the test set, the k nearest training-set vectors (in Euclidean distance) are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote. Typically the value of k is determined by taking the square root of the number of features; in our case we are classifying on seven features, so I chose k = 3. The choice of k can be critical: a small value of k means that noise has a higher influence on the result, while a large value is computationally expensive and somewhat defeats the basic premise of kNN (that nearby points are likely to have similar densities or classes).

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. All naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. In R, the naiveBayes function computes the conditional a-posteriori probabilities of a categorical class variable given independent predictor variables using Bayes' rule.
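
A minimal sketch of both models, assuming numerically coded feature matrices train_x and test_x for knn (hypothetical names; knn requires numeric inputs) and the e1071 implementation of naive Bayes:

```r
# kNN and naive Bayes sketches (assumed code, not the original project's).
library(class)   # provides knn()
library(e1071)   # provides naiveBayes()

# knn() takes numeric feature matrices plus training labels; ties in the
# majority vote are broken at random. train_x / test_x are hypothetical
# matrices with factors (e.g. Sex) recoded numerically.
knn_pred <- knn(train = train_x, test = test_x, cl = train$Survived, k = 3)

# naiveBayes() computes the conditional a-posteriori probabilities of the
# class given the predictors, assuming independence between features.
nb_fit  <- naiveBayes(Survived ~ ., data = train)
nb_pred <- predict(nb_fit, test)
```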

SUMMARY

As previously shown, the three most important variables in predicting survival consistently appear to be Sex, Pclass, and Age. Each model also indicates that SibSp and Fare played a fairly important role. Based on results submitted to Kaggle, the original decision tree finished with the highest accuracy, 78.5%, placing 1,503rd out of 3,916 total entries, good enough to beat 62% of other competitors; a minor difference in accuracy can represent a major difference in ranking. Some of my model accuracies were reported higher within R than my Kaggle submissions indicated, most likely due to overfitting: the classifier fits the training data too closely and thus fails to achieve the same accuracy on the test data. My next goal would be to work further with the Random Forest model, as I feel it has the most potential for improving my submission results. As this study has already indicated, a Random Forest is an aggregation of hundreds or more decision trees, so with more feature engineering and model tuning one would expect it to give a higher accuracy. Within RStudio I was able to achieve an accuracy of 84% with a Random Forest, so clearly the model is overfitting the training data in some ways.

SUMMARY

[Figure: bar plot comparing the accuracies of each model/algorithm tested.] The original decision tree performed the best, while the k-Nearest Neighbors model performed the worst. The highest Kaggle ranking achieved was #1503 out of 3,916 total entries.
