Texas Death Row Last Statements: Data Warehousing and Data Mart
By Group 16: Irving Rodriguez, Joseph Lai, Joe Martinez
Introduction

For our data warehousing and data mart project we chose the Texas death row data set. The data sparked our curiosity, and building a data warehouse from it would give us more insight into the data, and therefore into the minds of the inmates who were on death row. We wanted to find the most common words they used in their last statements, as well as use some visualization tools to see the data in a new light. The data set consists of 536 inmates from the Texas Department of Criminal Justice who were on death row. To build our data warehouse we used a star schema consisting of five tables, with a LAMP-stack back end driven by PHP to query the tables and output the appropriate data. This querying tool can be seen on our website for the CSC 177 class under the Data Mart link.

Gathering and Cleaning Our Data

To gather our data we had to crawl the Texas Department of Criminal Justice website. Luckily, we found a crawler written in Python that could obtain the list of all inmates; however, to get the more detailed information we had to extend it considerably and test it thoroughly. This brought up a few challenges, most notably directing it to follow the appropriate link for each inmate and getting it to grab the correct HTML element. Another big issue was that many of the links to the more detailed information led only to a PDF, so that information could not be gathered, which left holes in our data. Eventually we got it all together in a CSV file and tried to load it into Weka. This did not go well at all. We naively thought that our data would not need to be cleaned, and we were very wrong. There were quite a few issues: there were non-ASCII characters, the pound sign threw off RapidMiner, and there were many quotation marks that were also unacceptable. Finally we got our data cleaned and RapidMiner would accept it.
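The link-following step of the crawler can be sketched with only the standard library. The HTML below is a simplified stand-in for the TDCJ executed-offenders table (the real pages differ, and the actual crawler also fetched each linked page over HTTP):

```python
from html.parser import HTMLParser

# Simplified stand-in for one row of the TDCJ executed-offenders table.
SAMPLE_ROW = """
<table>
  <tr>
    <td>536</td>
    <td><a href="dr_info/doelast.html">Offender Information</a></td>
    <td><a href="dr_info/doelaststatement.html">Last Statement</a></td>
    <td>Doe, John</td>
  </tr>
</table>
"""

class InmateLinkParser(HTMLParser):
    """Collect the href of every anchor so the crawler can follow it."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

parser = InmateLinkParser()
parser.feed(SAMPLE_ROW)
print(parser.links)  # the hrefs the crawler would visit next
```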
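The cleaning pass amounted to a few per-field transformations; a hedged sketch (the sample value is invented, not a real statement) looks like this:

```python
# Sketch of the cleaning needed before RapidMiner would accept the CSV:
# drop non-ASCII bytes, remove '#' (which threw off RapidMiner), and
# neutralize the problematic double quotes.
def clean_field(value: str) -> str:
    value = value.encode("ascii", "ignore").decode("ascii")  # strip non-ASCII
    value = value.replace("#", "")                           # pound sign
    value = value.replace('"', "'")                          # stray quotes
    return value.strip()

raw = 'I\u2019m sorry. Execution #536, "last words"'
print(clean_field(raw))  # Im sorry. Execution 536, 'last words'
```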
Data Mart

The design of our data mart was primarily a MySQL database hosted in a virtual private cloud. The database was an RDS instance built by Amazon, located in the US West 2 (Oregon) region. We used the Command Line Interface to create our database, which we decided to name TexasDB.

Our database up and running

The tables in our star schema were built on the attributes produced with RapidMiner: the most common words from the inmates' last statements, occupations, summaries of the incident, and the victims' information. The main table has an execution number, and this number is used to reference all the most-used words in the other four tables. In the picture below you can see the four tables and many, if not all, of the attributes in each table.

Our star schema
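The shape of the schema can be sketched in SQLite rather than MySQL/RDS; the table and column names below are illustrative stand-ins for our actual DDL. The fact table is keyed by the execution number, and each dimension table references it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TexasData (          -- fact table, from the web scraper
    Execution INTEGER PRIMARY KEY,
    LastName TEXT, FirstName TEXT, Age INTEGER, Race TEXT, County TEXT
);
CREATE TABLE LastStatementWords ( -- one of the four RapidMiner word tables
    Execution INTEGER REFERENCES TexasData(Execution),
    Word TEXT, Occurrences INTEGER
);
""")
conn.execute("INSERT INTO TexasData VALUES (536, 'Doe', 'John', 39, 'White', 'Harris')")
conn.execute("INSERT INTO LastStatementWords VALUES (536, 'love', 3)")

# Join a dimension table back to the fact table on the execution number.
row = conn.execute("""
    SELECT t.LastName, w.Word, w.Occurrences
    FROM TexasData t JOIN LastStatementWords w ON t.Execution = w.Execution
""").fetchone()
print(row)  # ('Doe', 'love', 3)
```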
To use this data mart we created a web page that allows the user to display and sort the data based on 17 different attributes in our main table. We wanted to add much more functionality, but we ran out of time, mostly because we did not include the data mart in the original scope of our project. Luckily, our professor gave us one more week to build the data mart and create this reporting tool. This way anyone will be able to explore our data and find interesting facts that we may have missed.

The querying tool written in PHP

The querying tool is hosted on Athena's Apache server via CSUS and was built using PHP and some JavaScript. This was a bit of a challenge because PHP is not always the easiest language to deal with, but after connecting to the database it was simply a matter of getting our checkboxes and drop-down menus to correlate with the query we were sending. This took quite a bit of time and was all done within one week for the reason stated previously. The database was loaded successfully from a variety of CSV files created either by our hand-made web scraper or by RapidMiner. Our fact table, TexasData, contains the information gathered via the web scraper; it comes directly from the Texas Department of Criminal Justice website. The other four tables were generated by RapidMiner and contain the most commonly used words from their corresponding attribute in the fact table.
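At its core, the reporting page builds a SELECT from the user-chosen columns and a sort attribute. Our tool does this in PHP; this Python sketch shows the same whitelist idea (the column names are illustrative) so that unchecked input never reaches the SQL string:

```python
# Whitelist of displayable columns; anything else from the form is dropped.
ALLOWED_COLUMNS = {"Execution", "LastName", "Age", "Race", "County"}

def build_query(selected, sort_by):
    """Build a SELECT over TexasData from checkbox/drop-down input."""
    cols = [c for c in selected if c in ALLOWED_COLUMNS]
    if not cols:
        cols = sorted(ALLOWED_COLUMNS)  # default: show everything
    sql = "SELECT {} FROM TexasData".format(", ".join(cols))
    if sort_by in ALLOWED_COLUMNS:
        sql += " ORDER BY " + sort_by
    return sql

# Unknown column names (including injection attempts) are silently ignored.
print(build_query(["Age", "Race", "DROP TABLE"], "Age"))
# SELECT Age, Race FROM TexasData ORDER BY Age
```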
The five tables in our database

The creation of each table was a bit challenging because we had to clean the data and make sure it would be read correctly by MySQL Workbench. Setting up the primary key and foreign keys was not a difficult task; however, it was very time-consuming. Once all the data was imported and the schema was created, we were able to query the database successfully and connect the front end of our website to get some crude results, which we fine-tuned into what you see now on the website.

Learning Experience

The data warehousing part of our project was much more challenging and therefore more educational. Other than setting up a basic database, we had no prior experience, so turning that simple database/dataset into something valuable that could be mined was a bit of a challenge. We did a lot of hands-on learning, though; much of it involved being frustrated and running into multiple problems, but, as previously stated, those are the things that stick with you and help you do it again in the future.

Summary

Overall, the most difficult part of the project was the data warehousing. We had to come up with a schema in a short amount of time and build it. Loading the database also posed somewhat of a challenge: cleaning up the data and making sure it would be accepted correctly into its respective rows and tables. Some characters aren't accepted by MySQL, and characters such as spaces and quotes often caused problems. Setting up a MySQL server was also a bit of a challenge; getting it set up and being able to connect to it sometimes caused problems. Overall this was a great learning experience and a great way to dive into some of the concepts we learned in class.

Bibliography

Data Source: http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
Web Scraper Base: https://github.com/zmjones/deathpenalty
Our website: http://athena.ecs.csus.edu/~martinj/#/overview
Texas Death Row Last Statements: Data Classification and Data Mining
By Group 16: Irving Rodriguez, Joseph Lai, Joe Martinez
Introduction

When we first sat down and met as a group, we went exploring to find a data set that would be interesting. We searched for quite a while and nothing really piqued our interest. Then we found the Texas Department of Criminal Justice website and couldn't find anything else that seemed more interesting. This posed a problem, though, because we wouldn't necessarily be solving a real problem, so we decided to solve a virtual one. What intrigued us most were the last statements; we thought we could find the most common words, and therefore themes, within those statements, which could give us some insight into the inmates' minds and experience. We ended up with 536 rows and about 20 columns, and then generated quite a few more columns and tables from that. We wanted to approach this in a technical sense, but we also wanted to let the data guide us and reveal interesting correlations between the inmates. That is why we have included so many visuals in our project: we wanted the data mining to be interesting because the subject matter is innately interesting.

Data Mining and Classification Results

When we ran KNN, we came to the realization that our data set was smaller than most, which meant our accuracy wasn't going to be that great. However, even with our small data set, the KNN algorithm in RapidMiner was still 68% accurate. This can be seen in the pie chart to the right. The two pie charts below give a better representation of our KNN classification. We were
trying to predict the race of the inmate, and this is the result that RapidMiner gave us. You can see that the Hispanic prediction matches up almost exactly with our actual data set, but the White and Black predictions were not as good: the model predicted many more White inmates than the actual set, which had a few more Black inmates. On the left we have our actual data, and our predictions are on the right.

KNN actual vs. predicted

We also performed Naive Bayes classification on our data. We wanted to classify, or predict, the age of offense based on education level. This would show the correlation between education level and the age at which the crime was committed. We were also able to break this down by race. You can see the results below; generally, the inmates had a tenth-grade education level and committed their crime at age 26.

Results

So, in general, we wanted to find any interesting or insightful correlations within our data. One interesting thing came up immediately after loading our data into RapidMiner: it gives you the minimum, maximum, and average values for each attribute. So immediately we
got the average executed inmate. The average offender would be a white male named James Johnson with black hair and brown eyes, from Harris County. He would be 39 years old with a 9th-grade education level, standing 5'6" tall and weighing 186 pounds. The next step was to get the most-used words from the inmates' last statements, occupations, summaries of the incident, and the victims' information. We used some text mining modules within RapidMiner to do this: first you select the attribute to use, convert it to text, tokenize it, transform it to lower case, and lastly filter stop words. This worked surprisingly well and was relatively easy. You can see in the images below the top five most common words, as well as a larger subset in word-cloud form.

Last statements common words

Last statements word cloud
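Returning to the KNN run described earlier: a toy pure-Python sketch of the k-nearest-neighbor idea we used (with invented attribute values, not rows from our dataset) looks like this:

```python
from collections import Counter

# Invented (age, education-level) -> race pairs, purely for illustration.
train = [((39, 9), "White"), ((27, 10), "Black"),
         ((33, 8), "Hispanic"), ((45, 12), "White"), ((29, 9), "Black")]

def knn_predict(point, k=3):
    """Majority label among the k training points closest to `point`."""
    dist = lambda a: sum((x - y) ** 2 for x, y in zip(a, point))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((40, 10)))  # White
```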
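The Naive Bayes prediction of age of offense from education level can likewise be sketched; the training pairs below are invented for illustration, not drawn from the real 536-row dataset:

```python
from collections import Counter, defaultdict

# Invented (education level, age bracket at offense) pairs.
train = [(9, "20s"), (10, "20s"), (10, "20s"), (12, "30s"),
         (8, "20s"), (12, "30s"), (11, "30s"), (10, "20s")]

def nb_predict(edu):
    """Pick the age bracket maximizing P(age) * P(edu | age)."""
    priors = Counter(age for _, age in train)
    likelihood = defaultdict(Counter)
    for e, age in train:
        likelihood[age][e] += 1
    def score(age):
        # add-one smoothing so an unseen education level never zeroes out
        return (priors[age] / len(train)) * \
               ((likelihood[age][edu] + 1) / (priors[age] + 1))
    return max(priors, key=score)

print(nb_predict(10))  # 20s
```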
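The RapidMiner text-processing chain just described (select attribute, convert to text, tokenize, lower case, filter stop words, count) maps onto a few lines of Python; the sample statements and the stop-word list here are illustrative, not the real data:

```python
from collections import Counter

# Tiny illustrative stop-word list; RapidMiner's filter is far larger.
STOP_WORDS = {"i", "to", "the", "a", "and", "my", "is", "you", "of"}

def top_words(statements, n=5):
    """Tokenize on non-letters, lower-case, drop stop words, count."""
    counts = Counter()
    for text in statements:
        tokens = "".join(c if c.isalpha() else " " for c in text.lower()).split()
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

sample = ["I love you all, tell my family I love them.",
          "Love to my family and thank you."]
print(top_words(sample, 2))  # [('love', 3), ('family', 2)]
```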
We performed this same text mining on the summary of offense, occupations, and victim information, as well as on the most common phrases from the last statements, though that did not yield a very interesting result. We were also able to do some other cool visualizations. For instance, we plotted the number of executions by year from 1982 until today; the other chart below shows the number of inmates by the county they were from.

Executions by year
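The aggregation behind the executions-by-year plot is a simple group-and-count; the dates below are invented placeholders in the dataset's M/D/YYYY style:

```python
from collections import Counter

execution_dates = ["12/7/1982", "3/14/1995", "6/2/1995", "1/9/2014"]

# Take the year (the part after the last '/') and count occurrences.
by_year = Counter(date.rsplit("/", 1)[1] for date in execution_dates)
print(sorted(by_year.items()))  # [('1982', 1), ('1995', 2), ('2014', 1)]
```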
Number of inmates by county

Learning Experience

The first major speed bump we hit was gathering all of the data. We had a main webpage that contained links to the individual profiles and statements of the executed inmates. There were no pre-cleaned, downloadable CSV files to use. Our group managed to overcome this by creating a web crawler with Python and exporting all the data into a CSV. A tip we would give to future classes is to find a dataset that is exportable to a CSV, because it would allow for a more complete set of data. Expanding on that, our data did not fully represent what was online, because some of the entries online were PDFs, which could not be exported via the web crawler. A major resource that contributed to our success was RapidMinerTutorial's channel on YouTube. His KNN and Naive Bayes videos were the compass that gave us direction when we
were lost. None of our team had experience with RapidMiner, and we only had limited knowledge of Weka. In fact, nothing helped us with Weka, not even the volunteer tutor. We spent a few hours trying to load our dataset into Weka, but it kept giving us an error along the lines of the number of columns not matching the number of data points. After trying to load the data into RapidMiner and having it work on the second try, we decided to stick with RapidMiner; it also has a lot of functionality, more than we were able to explore for this project.

Summary

The results of our project were better than we expected. We managed to create nearly 70%-accurate predictions with our KNN algorithm, we were able to predict age of offense given highest education level using the Naive Bayes algorithm, and we produced a generic profile of a typical executed inmate given our semi-comprehensive dataset. We also successfully mined the text and accomplished our original goal, which was to find the most common words. This has been an overall success in that we were able to apply classroom concepts to real-life data, and because we did not use previously gathered data, our group got to experience data mining at a lower and deeper level, despite it being more problematic and sometimes more frustrating.

Bibliography

Data Source: http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
Text Mining Walkthrough: https://www.youtube.com/watch?v=ejd2m4r4mbm
RapidMiner Tutorials: https://www.youtube.com/user/rapidminertutorial/videos
RapidMiner: https://rapidminer.com
Tableau: http://www.tableau.com/
Word Cloud Creation: https://tagul.com
Our website: http://athena.ecs.csus.edu/~martinj/#/overview