Machine Learning Project Report


Predictive Crime Analytics
Madlen Ivanova, Mansi Dubey, Praneesh Jayaraj
University of North Carolina at Charlotte

The project aims to analyze the crime data provided by CMPD and design a predictive model implementing classification, clustering, and neural network algorithms to predict in how many days a case can be closed. Additionally, we tried to find out whether tweets that mention CMPD have any relation to our incident data. We built predictive models based on a series of inputs: historical geographic crime patterns, day of week and time of day, weather conditions, special events, Twitter data, and unemployment.

Contents
1. Project Objective
2. Data Sources
3. Data Retrieval
4. Data Cleansing and Preparation
5. Enriching the Incident Dataset with Third-Party Data Sets
6. Joining Data
7. Data Evaluation
8. Data Exploration
9. Feature Engineering
10. Modeling
11. References

1. Project Objective

The project aims to analyze the crime data provided by CMPD and design predictive models implementing classification and neural networks to predict:
- In how many days a case can be closed
- The number of crimes that will occur
- The case status of an incident
Our goal is to build predictive models based on different inputs, evaluate them, and choose the best one.

2. Data Sources

Our main dataset was provided by CMPD. The additional input data and its sources:
- Day of week and time of day: feature engineering
- Weather conditions
- Special events: data.gov
- Unemployment: United States Department of Labor
- Twitter data: IBM Watson Analytics for Social Media

3. Data Retrieval

It took many emails and a couple of meetings with CMPD (the Charlotte-Mecklenburg Police Department) to work through the process required to gain access to the CMPD incident data. The data was made available through the CMPD web site, and a username and password were assigned to us so we could retrieve it securely. The data was presented as plain text but was not available for download; it took us a week to find a way to extract it. Since none of the software we tried worked properly, we wrote a C# program that connects to the CMPD web site, establishes a secure connection, and downloads the data. The data was then imported into MS SQL Server for further analysis.

The data arrived in plain-text format and spanned the years 2011 to 2016, with 7 tables per year, for 42 tables overall (6 years x 7 tables). We used SQL Server Management Studio to merge the years so that we ended up with just 7 tables, and we designed proper database tables so the data would fit the appropriate data types. The Complaint_No column contained the date/time value as plain text, so we had to read the column and extract the date and time into separate columns. We received 7 tables altogether and managed to link them. Here is our entity relationship diagram:

4. Data Cleansing and Preparation

We found a lot of discrepancies in the data. The source database system appears to have allowed any text to be entered into any column (no client-side data validation and no back-end data-type enforcement): it was common to see a city name in the ZipCode column, and city names were frequently misspelled. We chose the fields we would use in our analysis and did basic data cleansing. A small number of records (less than 1%) contained no usable information (no city and no zip code, so they could not be matched to a proper city), and we removed them. We also deleted all columns with more than 50% missing values, and we used multiple imputation to analyze the completeness of our dataset.
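The Complaint_No parsing step can be sketched in Python. The actual CMPD encoding is not documented in this report, so the "YYYYMMDD-HHMM-NNNNN" layout below is purely a hypothetical stand-in for illustration:

```python
from datetime import datetime

def split_complaint_no(complaint_no):
    """Split a complaint number into date, time, and sequence parts.

    Assumes a hypothetical "YYYYMMDD-HHMM-NNNNN" layout; the real
    CMPD format may differ.
    """
    date_part, time_part, seq = complaint_no.split("-")
    dt = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return {
        "Date": dt.date().isoformat(),
        "Year": dt.year,
        "Month": dt.month,
        "Day": dt.day,
        "Hour": dt.hour,
        "Seq": int(seq),
    }

row = split_complaint_no("20150312-1430-00123")
```

The separate Date, Year, Month, and Day fields produced this way are what later allow joins against the weather, Twitter, and unemployment datasets.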

The data format was specified, and the type of each numeric field (ordinal or continuous) was adjusted appropriately. Outliers were replaced with the mean value of the field, with the outlier cutoff set to 3 standard deviations. All missing entries were replaced as follows:
- Continuous fields: replaced with the mean
- Nominal fields: replaced with the mode
- Ordinal fields: replaced with the median
Dates and times cannot be used directly by most algorithms, but durations can be computed and used as model features, so we estimated the duration period. Features with too many missing values (> 50%) were excluded, as were rows with too many missing values (> 50%), fields with too many unique categories (> 100), and categorical fields with too many values in a single category (> 90%). Sparse categories were merged to maximize association with the target, and input fields left with only one category after supervised merging were excluded. The dataset was partitioned into training (70%), testing (15%), and validation (15%) sets. We created several views that let us look at only the data we were interested in.

5. Enriching the Incident Dataset with Third-Party Data Sets

We merged and exported the most interesting features into a single Excel file for modeling. However, the data we had could not be used for predictive modeling on its own, as it contained few attributes useful for prediction. We therefore augmented it with multiple datasets to make it more insightful: weather, special events, Twitter, and unemployment data. The unemployment data contains the unemployment rate and labor force details for each month from 2011 to 2016 in the Charlotte area; we used the 1-month net change when downloading the data.
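The outlier and missing-value rules from section 4 can be sketched in pandas; the column names below are illustrative stand-ins, not the exact schema:

```python
import pandas as pd

def clean_missing(df, continuous, nominal, ordinal, sd_cutoff=3):
    """Apply the report's cleansing rules: replace outliers beyond
    sd_cutoff standard deviations with the mean, then fill missing
    values (continuous -> mean, nominal -> mode, ordinal -> median)."""
    out = df.copy()
    for col in continuous:
        mean, sd = out[col].mean(), out[col].std()
        out.loc[(out[col] - mean).abs() > sd_cutoff * sd, col] = mean
        out[col] = out[col].fillna(out[col].mean())
    for col in nominal:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in ordinal:
        out[col] = out[col].fillna(out[col].median())
    return out

df = pd.DataFrame({
    "MeanTemperatureF": [60.0, None, 70.0],    # continuous
    "City": ["Charlotte", None, "Charlotte"],  # nominal
    "Severity": [1, 3, None],                  # ordinal
})
cleaned = clean_missing(df, ["MeanTemperatureF"], ["City"], ["Severity"])
```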
The weather data contained the following features:
- Max TemperatureF, Mean TemperatureF, Min TemperatureF
- Max Dew PointF, MeanDew PointF, Min DewpointF
- Max Humidity, Mean Humidity, Min Humidity
- Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn
- Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles
- Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH
- PrecipitationIn, CloudCover, Events, WindDirDegrees

Since we worked mostly at the day level (not the hour level) and we did not have the correct time of each incident, we considered only the mean values. The special-event dataset contained the start and end date of each event, along with its location and description. The Twitter dataset contained the date, the number of tweets per day, and positive/negative sentiment analysis; it was collected manually using IBM Watson Analytics for Social Media.

6. Joining Data

We used many different tables for our model. The Charlotte-Mecklenburg Police Department provided multiple tables of incident-related information. The Incident table is the main table; it connects to the additional tables Offenses, Property, Stolen_Vehicle, Victim_Business, Victim_Person, and Weapons, all linked by the column Complaint_No. The Complaint_No column encodes the date, the time, and an incremental number that makes each record unique. Linking the remaining CMPD tables is easy, since they use the same format. To link the CMPD data to any other data, we broke the Complaint_No column down into a Date column and a DateTime column, and separately created Year, Month, and Day columns. The Unemployment, Weather, Twitter, and Special Events datasets were then linked to the CMPD data by date. The CMPD data includes details from outside the Charlotte area, but our research is limited to the city of Charlotte, so we created a ZipCodes table containing all of the zip codes for the Charlotte area; linking the data to this table lets us filter on the Charlotte zip codes. Additionally, some of the data came at the day level and some at the month level, so we had to aggregate data to the month level to link it properly by (year, month). We performed analysis at both levels.

7. Data Evaluation

We used Tableau, IBM Watson Analytics, and SPSS to explore the data, ensure its validity, and perform descriptive analysis. In SPSS, we used the Descriptives, Descriptive Statistics, and Frequencies commands to determine percentiles, quartiles, measures of dispersion, measures of central tendency (mean, median, and mode), and measures of skewness, and to create histograms. We used Tableau to better visualize the data; since we worked with a little more than half a million records, we had to live-stream the data from the MS SQL virtual cloud server, as the tool otherwise crashed constantly.
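The date-based joins described in section 6 can be sketched with pandas merges. The table and column names below are illustrative stand-ins, not the actual schema:

```python
import pandas as pd

# Hypothetical slices of the day-level tables.
incidents = pd.DataFrame({
    "Date": ["2015-03-12", "2015-03-13"],
    "NumberOfIncidents": [41, 37],
})
weather = pd.DataFrame({
    "Date": ["2015-03-12", "2015-03-13"],
    "MeanTemperatureF": [58, 61],
})
daily = incidents.merge(weather, on="Date", how="left")

# Month-level data (e.g. unemployment) joins on (year, month) instead.
daily["Month"] = pd.to_datetime(daily["Date"]).dt.strftime("%Y-%m")
unemployment = pd.DataFrame({"Month": ["2015-03"], "UnemploymentRate": [5.4]})
enriched = daily.merge(unemployment, on="Month", how="left")
```

A left join keeps every incident row even when a lookup table has no matching date, which mirrors how enrichment data was attached to the main CMPD records.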

IBM Watson Analytics was also used to create visualizations and to find dependencies between variables. This tool did not support live-streaming from another server, so we had to upload our dataset to IBM's cloud space. We performed a lot of frequency analysis to better understand the distribution of our data. Here are some of the more interesting data visualizations:
Type of incident distribution:
Day of week over vehicle theft:

Day of week over homicide:
Analysis of the number of distinct Complaint_No values for each table in the CMPD database:

Analysis of the vehicle body types stolen most frequently per zip code:
The trend of the number of tweets over week day by location type:

Number of incidents compared by year and day of the week:
Trend of the number of incidents over incident hour and case status:
Number of incidents over mean temperature by location type:

The time needed for a case to be resolved over incident hour and location type:

8. Data Exploration

We ran correlation analysis on the numeric features of the weather data. We examined each pair of features and removed one feature from every highly correlated pair, then ran the analysis again to confirm that no 1:1 correlations remained in our dataset. For instance, based on the correlations between the features below, we removed the Max Gust SpeedMPH and Mean Humidity features from our Excel file.
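Both checks used in this section and the next (pruning one feature of each highly correlated pair, and the variance inflation factor from an auxiliary regression) can be sketched on toy data rather than the actual weather features:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the second feature of each highly correlated pair, as was
    done for Max Gust SpeedMPH and Mean Humidity in the weather data."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in to_drop and corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) obtained by
    regressing that column on all the others (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        scores.append(np.inf if r2 >= 1 else 1.0 / (1.0 - r2))
    return scores

# "b" is a perfect linear function of "a", so it is dropped.
sample = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 2, 1]})
reduced = drop_correlated(sample)
vif_scores = vif(reduced.values)
```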

Then we ran linear regression in SPSS and used the VIF (variance inflation factor) to detect multicollinearity between variables. For example, in the left picture below you can see that Mean TemperatureF and MeanDew PointF strongly influence each other; after removing MeanDew PointF, the VIF score of Mean TemperatureF becomes normal and drops below our threshold.

9. Feature Engineering

Feature engineering is fundamental to the application of machine learning. To improve our initial results, we used Microsoft SQL Server Management Studio (SSMS) to create the following features:
- Day of week and time of day features
- The time interval for a case to be closed, calculated from the reported date and the clearance date

10. Modeling

Method #1: Classification

After the extensive data analysis, we started data modeling. For classification we chose decision tree modeling. A decision tree classifier is a supervised learning algorithm that builds a model to predict class labels by learning decision rules from the data features. We created the decision tree model using the IBM SPSS Modeler tool. To measure the quality of a split, we applied the Gini function for the information gain. Since the predictors are categorical, the model uses multi-way splits, and we set a minimum required change in Gini for a split to be made. To improve the model's accuracy we used boosting with 10 component models, and we chose to favor accuracy over stability in order to create a model that can accurately predict the target variable. The model aims at minimizing the misclassification cost. The stopping rule for building the tree is based on a minimum percentage of records in the parent (5%) and child (3%) branches.

There are many continuous features in the data, such as Time-Hour, mean temperature, and humidity, which increase the learning time of the model and decrease its accuracy and performance. We therefore converted these features into interval-scaled variables, analyzed their effectiveness on the classification task, and kept the intervals that showed significant patterns in predicting the target variable. Mean temperature was converted into categories, and the incident hour, extracted from the incident date feature, was converted into a categorical feature:
00:00-03:59 - Midnight
04:00-07:59 - EarlyMorning
08:00-11:59 - Morning
12:00-15:59 - Afternoon
16:00-19:59 - Evening
20:00-23:59 - Night

We dropped the highly correlated attributes and those without a significant predictor performance rating. After multiple iterations with different combinations of input features, we implemented classification using the decision tree algorithm to predict the Case_Status of an incident. We partitioned the data into training (70%) and testing (30%) datasets. The model was trained on the training data, and its performance was analyzed on the test data for each trial using the coincidence matrix and the error rate. If the percentage of correct predictions was below 50%, we discarded that model and refined it by dropping the features with low ratings on the predictor importance chart. We considered only features that are available at the time an incident is reported, such as the weekday, time, place where the crime occurred, the agency it was reported to, the location type (indoor/outdoor), and the temperature.
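The six-interval hour binning above can be expressed directly in Python:

```python
def time_frame(hour):
    """Map an incident hour (0-23) to the report's six time-of-day bins."""
    bins = [(4, "Midnight"), (8, "EarlyMorning"), (12, "Morning"),
            (16, "Afternoon"), (20, "Evening"), (24, "Night")]
    for upper, label in bins:
        if hour < upper:
            return label
    raise ValueError("hour must be in the range 0-23")

labels = [time_frame(h) for h in (2, 6, 10, 13, 18, 23)]
```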
Following are the details of our classification model:
Target: Case_Status - the status of the report (Closed/Cleared, Closed/Leads Exhausted, Further Investigation, Inactive) at the time of the last update (September 2016).
Predictors: WeekDay, Month, TimeFrame, Place1, Reporting_Agency, Location_Type, Temp_Range, Events

Partition: training dataset (70%) and test dataset (30%)
Rules: The predictor variables for Case_Status, ranked by importance:

We tried different settings, increasing the allowed depth of the decision tree and raising the number of component models used for boosting to 20. With these updated settings, the accuracy of the model jumped to almost 70%. In the future we plan to continue pruning the model to improve its performance.

Evaluation: The predictors, as well as the size of the training data, affect the performance of the model. We iterated over the model learning process with different feature sets and different partitions. Our final result achieves 52.6% accuracy, which is not a very good model, but we can use different feature selection techniques, along with bagging and boosting, to improve its accuracy.
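As a rough illustration (not the SPSS Modeler setup itself), the train/test/evaluate loop can be sketched in Python with scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for categorical predictors (weekday, time frame, ...).
X = rng.integers(0, 6, size=(500, 4))
y = (X[:, 0] + X[:, 1] > 5).astype(int)  # toy target, not the real case status

# 70/30 split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=0)
tree.fit(X_tr, y_tr)

acc = tree.score(X_te, y_te)                     # error rate is 1 - acc
cm = confusion_matrix(y_te, tree.predict(X_te))  # the "coincidence matrix"
```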

Future Work: We will focus on improving the performance of the decision tree model using different feature selections, and we will try different tools and classification algorithms and compare the results to our decision tree model.

Method #2: Neural Networks

Neural networks are a preferred tool for many predictive data mining applications because of their power and flexibility. To implement neural networks, we created dummy variables to transform categorical variables into numeric ones. After pre-processing the data fed into the neural network (removing redundant information, treating multicollinearity and outliers, and all the other steps described in section 4), we ran the model. We used one of the most standard algorithms for this type of supervised learning: the multilayer perceptron (MLP). An MLP is a function of predictors (independent variables) that minimizes the prediction error of the target variable(s). Connection weights are adjusted after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is how learning occurs in the perceptron: it is carried out through "backward propagation of errors", which attempts to minimize the loss function. To link the weighted sums of the units in one layer to the values of the units in the succeeding layer, we used the sigmoid activation function σ(x) = 1/(1 + e^(-x)), which takes a real-valued input (the signal strength after the sum) and squashes it into the range between 0 and 1. We partitioned the dataset into a training sample, a test sample, and a validation (holdout) sample with ratio 70:15:15. The training sample comprises the records used to train the neural network; the test sample is an independent set of records used to track errors during training and prevent overtraining; and the holdout sample is a second independent set of records used to assess the final network.
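The pieces described above (dummy coding, the sigmoid, and the 70:15:15 partition) can be sketched as follows, with illustrative column names:

```python
import numpy as np
import pandas as pd

# One-hot (dummy) encoding turns categorical predictors into numeric inputs.
df = pd.DataFrame({"WeekDay": ["Mon", "Tue", "Mon"],
                   "TimeFrame": ["Night", "Morning", "Night"]})
encoded = pd.get_dummies(df)

def sigmoid(x):
    """Squash a weighted sum into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

def partition(n, seed=0):
    """Shuffle row indices and split them 70:15:15 into train/test/holdout."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = round(0.70 * n), round(0.85 * n)
    return idx[:a], idx[a:b], idx[b:]

train, test, holdout = partition(1000)
```

Shuffling before splitting keeps the three samples independent, which is what makes the holdout error an "honest" estimate.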
The error on the holdout sample gives an "honest" estimate of the predictive ability of the model, because the holdout cases were not used to build it.

Models:
Model A - day level. The data used was at the day level. As the target variable, we used the time interval for a case to be resolved (Clearance_Timeframe). Initially, we set the random sample size to 20,000 (about 4% of our data). The stopping rule for building the neural network was the point at which the error could not be decreased further. We tried different combinations of inputs to compare them and find the best fit, and we also tried different sample sizes and partitioning ratios. We attempted to improve performance through feature engineering, since the MLP is sensitive to its parameters, but it was not very helpful; increasing the sample size and dropping variables were among the most useful ways to improve the model.

Results: The results were still not very good. The following list shows the features, by importance, of the model with the highest assessed accuracy:

Model B - month level. The unemployment data was given at the month level, so to match levels we had to aggregate our data up to the month level as well. As a result, we could not use the categorical variables, because aggregating them was not appropriate; all other values were either summed, averaged, or counted. After preparing the data, we fed it into our model. As target variables, we used the time interval for a case to be resolved and the number of incidents.

Results: The accuracy is better than the previous model's, and the UnemploymentRate feature appears to have a significant impact. The predictor variables for NumberOfIncidents, ranked by importance:


The results at this aggregation level were interesting. They show that the average monthly time to a case outcome is explained best by the number of incidents, the year, the average incident hour, and the number of tweets. The predictor variables for ClearanceTimeframe, ranked by importance:

Further Evaluation: The accuracy of the models depends on many factors, such as the sample size, the ratio between the sample size and the number of features used, the relationships between features, the initial weights and biases, the target variable, and the division of the data into training, validation, and test sets. Different conditions can lead to very different solutions for the same problem. For example, if we change the ratio of the training, validation, and test sets to 50:25:25, the accuracy becomes 93%; the result is similar when we include all the features we had excluded because of multicollinearity. Additionally, model accuracy can be enhanced by boosting, and model stability by bagging. In Model A, the relatively low performance can be explained by the number of missing values in the sample dataset, by noise, and by the selection of features.

When we ran the boosting option, an ensemble was created that generates a sequence of models to obtain more accurate predictions. The accuracy of Model A jumped from 63.3% to 77.5%, and the accuracy of Model B reached 97.8% for the NumberOfIncidents target variable and 99.9% for the ClearanceTimeFrame target variable. When we ran the bagging option, an ensemble was created using bagging (bootstrap aggregating), which generates multiple models to obtain more reliable predictions. The accuracy of the best generated prediction for Model A became 70%, and the accuracy of Model B reached 90.8% for NumberOfIncidents and 97.4% for ClearanceTimeFrame. The following table compares the accuracy of the algorithms used:

Model / Target                    MLP      + Bagging   + Boosting
Model A ClearanceTimeFrame        63.3%    70%         77.5%
Model B ClearanceTimeFrame        77.6%    97.4%       99.9%
Model B NumberOfIncidents         76%      90.8%       97.8%
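The effect of bagging and boosting relative to a single model can be sketched outside SPSS with scikit-learn ensembles on a toy dataset (tree-based learners here rather than the report's MLP, purely for illustration):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
# Toy nonlinear target standing in for the real incident data.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "single tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(n_estimators=25, random_state=1),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```

Bagging averages many models fit on bootstrap resamples (improving stability), while boosting fits models sequentially on the previous models' errors (improving accuracy), which matches the direction of the gains reported in the table above.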

Future work: We can consider adding new data to our existing models, such as the most recent criminal activity (within the past hours), recent calls-for-service activity, school data, and house and rent prices. We could also perform the analysis per zip code. A deep learning implementation is another option: since deep networks can be trained in an unsupervised or supervised manner for both unsupervised and supervised learning tasks, we could pre-train a deep network in an unsupervised manner before training it in a supervised one.

11. References
- Flach, P. Machine Learning.
- Why Does Unsupervised Pre-training Help Deep Learning?
- Neural network studies. 1. Comparison of overfitting and overtraining.
- Boosting.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
- IBM SPSS Neural Networks 22.
- Deep learning basics.
- Song, Y.-Y., and Lu, Y. Decision tree methods: applications for classification and prediction.


More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office) SAS (Base & Advanced) Analytics & Predictive Modeling Tableau BI 96 HOURS Practical Learning WEEKDAY & WEEKEND BATCHES CLASSROOM & LIVE ONLINE DexLab Certified BUSINESS ANALYTICS Training Module Gurgaon

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis Slides by: Joseph E. Gonzalez, Deb Nolan, & Joe Hellerstein jegonzal@berkeley.edu deborah_nolan@berkeley.edu hellerstein@berkeley.edu? Last

More information

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta Six Core Data Wrangling Activities An introductory guide to data wrangling with Trifacta Today s Data Driven Culture Are you inundated with data? Today, most organizations are collecting as much data in

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

CS 229 Project Report:

CS 229 Project Report: CS 229 Project Report: Machine learning to deliver blood more reliably: The Iron Man(drone) of Rwanda. Parikshit Deshpande (parikshd) [SU ID: 06122663] and Abhishek Akkur (abhakk01) [SU ID: 06325002] (CS

More information

Deep Model Compression

Deep Model Compression Deep Model Compression Xin Wang Oct.31.2016 Some of the contents are borrowed from Hinton s and Song s slides. Two papers Distilling the Knowledge in a Neural Network by Geoffrey Hinton et al What s the

More information

Lecture 20: Bagging, Random Forests, Boosting

Lecture 20: Bagging, Random Forests, Boosting Lecture 20: Bagging, Random Forests, Boosting Reading: Chapter 8 STATS 202: Data mining and analysis November 13, 2017 1 / 17 Classification and Regression trees, in a nut shell Grow the tree by recursively

More information

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Introduction to Automated Text Analysis. bit.ly/poir599

Introduction to Automated Text Analysis. bit.ly/poir599 Introduction to Automated Text Analysis Pablo Barberá School of International Relations University of Southern California pablobarbera.com Lecture materials: bit.ly/poir599 Today 1. Solutions for last

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree World Applied Sciences Journal 21 (8): 1207-1212, 2013 ISSN 1818-4952 IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.21.8.2913 Decision Making Procedure: Applications of IBM SPSS Cluster Analysis

More information

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis OrderNum ProdID Name OrderId Cust Name Date 1 42 Gum 1 Joe 8/21/2017 2 999 NullFood 2 Arthur 8/14/2017 2 42 Towel 2 Arthur 8/14/2017 1/31/18 Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

More information

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of

More information

Practical Guidance for Machine Learning Applications

Practical Guidance for Machine Learning Applications Practical Guidance for Machine Learning Applications Brett Wujek About the authors Material from SGF Paper SAS2360-2016 Brett Wujek Senior Data Scientist, Advanced Analytics R&D ~20 years developing engineering

More information

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC 2018 Storage Developer Conference. Dell EMC. All Rights Reserved. 1 Data Center

More information

Name Date Types of Graphs and Creating Graphs Notes

Name Date Types of Graphs and Creating Graphs Notes Name Date Types of Graphs and Creating Graphs Notes Graphs are helpful visual representations of data. Different graphs display data in different ways. Some graphs show individual data, but many do not.

More information

SAS Visual Analytics 8.1: Getting Started with Analytical Models

SAS Visual Analytics 8.1: Getting Started with Analytical Models SAS Visual Analytics 8.1: Getting Started with Analytical Models Using This Book Audience This book covers the basics of building, comparing, and exploring analytical models in SAS Visual Analytics. The

More information

1 Topic. Image classification using Knime.

1 Topic. Image classification using Knime. 1 Topic Image classification using Knime. The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to assign automatically a

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

Data Analytics Training Program

Data Analytics Training Program Data Analytics Training Program In exclusive association with 1200+ Trainings 20,000+ Participants 10,000+ Brands 45+ Countries [Since 2009] Training partner for Who Is This Course For? Programers Willing

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Engineering the input and output Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Attribute selection z Scheme-independent, scheme-specific

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

CHAPTER 3: Data Description

CHAPTER 3: Data Description CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a

More information

SAS Enterprise Miner : Tutorials and Examples

SAS Enterprise Miner : Tutorials and Examples SAS Enterprise Miner : Tutorials and Examples SAS Documentation February 13, 2018 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Enterprise Miner : Tutorials

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Presentation Overview - Background - Preprocessing - Data Mining Methods to Determine Outliers - Finding Outliers - Outlier Validation -Summary

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc.

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc. ABSTRACT Paper SAS2620-2016 Taming the Rule Charlotte Crain, Chris Upton, SAS Institute Inc. When business rules are deployed and executed--whether a rule is fired or not if the rule-fire outcomes are

More information

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY 1 USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY Leo Breiman Statistics Department University of California Berkeley, CA 94720 leo@stat.berkeley.edu ABSTRACT A prediction algorithm is consistent

More information

Data warehouses Decision support The multidimensional model OLAP queries

Data warehouses Decision support The multidimensional model OLAP queries Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Measures of Central Tendency

Measures of Central Tendency Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. Population: Census: Biased: Sample: The entire group of objects or individuals considered

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

JMP Clinical. Release Notes. Version 5.0

JMP Clinical. Release Notes. Version 5.0 JMP Clinical Version 5.0 Release Notes Creativity involves breaking out of established patterns in order to look at things in a different way. Edward de Bono JMP, A Business Unit of SAS SAS Campus Drive

More information

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

Big Data Analytics The Data Mining process. Roger Bohn March. 2017 Big Data Analytics The Data Mining process Roger Bohn March. 2017 Office hours RB Tuesday + Thursday 5:10 to 6:15. Tuesday = office rm 1315; Thursday = Peet s Sai Kolasani =? 1

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

7. Boosting and Bagging Bagging

7. Boosting and Bagging Bagging Group Prof. Daniel Cremers 7. Boosting and Bagging Bagging Bagging So far: Boosting as an ensemble learning method, i.e.: a combination of (weak) learners A different way to combine classifiers is known

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information