
Texas Death Row Last Statements
Data Warehousing and Data Mart
By Group 16: Irving Rodriguez, Joseph Lai, Joe Martinez

Introduction

For our data warehousing and data mart project we chose the Texas death row data set. The data sparked our curiosity, and building a data warehouse from it would give us more insight into the data, and therefore into the minds of the inmates who were on death row. We wanted to find the most common words used in their last statements, and to apply some visualization tools to see the data in a new light. The data set consists of 536 inmates from the Texas Department of Criminal Justice who were on death row. To build our data warehouse we used a star schema consisting of five tables, with a LAMP-stack back end driven by PHP to query the tables and output the appropriate data. This querying tool can be seen on our website for the CSC 177 class under the Data Mart link.

Gathering and Cleaning Our Data

To gather our data we had to crawl the Texas Department of Criminal Justice website. Luckily, we found a crawler written in Python that could obtain the list of all inmates; however, to get the more detailed information we had to extend it considerably and test it thoroughly. This brought up a few challenges, most notably directing it to follow the appropriate link for each inmate and getting it to grab the correct HTML element (the approach is sketched below). Another big issue was that many of the links to the more detailed information led only to a PDF, so that information could not be gathered, which left holes in our data. Eventually we got it all together in a CSV file and tried to load it into Weka. This did not go well at all. We naively thought that our data would not need to be cleaned, and we were very wrong. There were quite a few issues: there were non-ASCII characters, the pound sign threw off RapidMiner, and there were many quotation marks that were also unacceptable. Finally we got our data cleaned and RapidMiner would accept it.
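A minimal sketch of this crawl, using requests and BeautifulSoup. The link text, table layout, and cleaning rules below are illustrative assumptions, not our exact crawler (which was based on the scraper listed in the Bibliography):

```python
# Sketch: fetch the executed-offenders index, follow each inmate's
# detail link, clean the text, and append a row to a CSV.
import csv
import requests
from bs4 import BeautifulSoup

BASE = "http://www.tdcj.state.tx.us/death_row/"

index = BeautifulSoup(requests.get(BASE + "dr_executed_offenders.html").text,
                      "html.parser")

rows = []
for tr in index.select("table tr")[1:]:            # skip the header row
    link = tr.find("a", string="Last Statement")   # assumed link text
    if link is None or link["href"].lower().endswith(".pdf"):
        continue                # PDF-only pages left holes in our data
    detail = BeautifulSoup(requests.get(BASE + link["href"]).text,
                           "html.parser")
    statement = detail.get_text(" ", strip=True)
    # Cleaning: non-ASCII characters, pound signs, and quotes all caused
    # problems downstream in RapidMiner and MySQL.
    statement = statement.encode("ascii", "ignore").decode()
    statement = statement.replace("#", "").replace('"', "")
    execution_no = tr.find("td").get_text(strip=True)  # assumed first column
    rows.append([execution_no, statement])

with open("inmates.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```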

Data Mart

The design of our data mart was primarily a MySQL database hosted in a virtual private cloud. The database was an RDS instance built on Amazon, located in the US West 2 (Oregon) region. We used the command line interface to create our database, which we decided to name TexasDB.

[Figure: Our database up and running]

The tables in our star schema were built on the attributes produced with RapidMiner: the most common words from the inmate's last statement, occupation, summary of the incident, and victim information. The main table has an execution number, and this number is used to reference the most-used words in the other four tables. In the figure below you can see the four tables and many, if not all, of their attributes; illustrative DDL for such a schema follows the figure.

[Figure: Our star schema]
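To make the schema concrete, here is a sketch of DDL for TexasDB issued through mysql-connector-python. The endpoint, credentials, and column names are assumptions based on the description above, not our exact schema:

```python
# Sketch: create the fact table and the four word-dimension tables.
# Endpoint, credentials, and column names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="texasdb.example.us-west-2.rds.amazonaws.com",  # hypothetical endpoint
    user="admin", password="secret", database="TexasDB")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS TexasData (
        execution_number INT PRIMARY KEY,  -- referenced by the word tables
        last_name        VARCHAR(64),
        first_name       VARCHAR(64),
        age_at_execution INT,
        county           VARCHAR(64)
        -- ...the remaining scraped attributes
    )""")

# One dimension table per text attribute mined with RapidMiner.
for table in ("LastStatementWords", "OccupationWords",
              "IncidentSummaryWords", "VictimInfoWords"):
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            execution_number INT,
            word             VARCHAR(64),
            frequency        INT,
            FOREIGN KEY (execution_number)
                REFERENCES TexasData (execution_number)
        )""")
conn.commit()
```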

To use this data mart we created a web page that lets the user display and sort the data based on 17 different attributes in our main table. We wanted to add much more functionality, but we ran out of time, mostly because we did not include the data mart in the original scope of our project. Luckily, our professor gave us one more week to build the data mart and create this reporting tool, so anyone can explore our data and find interesting facts that we may have missed.

[Figure: The querying tool written in PHP]

The querying tool is hosted on Athena's Apache server via CSUS and was built using PHP and some JavaScript. This was a bit of a challenge because PHP is not always the easiest language to deal with, but after connecting to the database it was simply a matter of getting our checkboxes and drop-down menus to correlate with the query we were sending (the pattern is sketched below). This took quite a bit of time and was all done within one week, for the reason stated previously. The database was loaded successfully from a variety of CSV files created either by our hand-made web scraper or by RapidMiner. Our fact table, TexasData, contains the information gathered via the web scraper; it comes directly from the Texas Department of Criminal Justice website. The other four tables were generated by RapidMiner and contain the most commonly used words from their corresponding attributes in the fact table.
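Our tool is written in PHP, but the pattern it implements, mapping checkbox and drop-down selections onto a whitelisted SELECT, can be sketched in a few lines of Python. The attribute names here are hypothetical:

```python
# Sketch: build the report query from the user's selections, accepting
# only whitelisted column names (identifiers can't be bound parameters).
ALLOWED = {"execution_number", "last_name", "first_name", "race",
           "county", "age_at_execution", "education_level"}  # 17 in the real tool

def build_query(selected, sort_by=None, descending=False):
    cols = [c for c in selected if c in ALLOWED] or ["execution_number"]
    query = f"SELECT {', '.join(cols)} FROM TexasData"
    if sort_by in ALLOWED:
        query += f" ORDER BY {sort_by}" + (" DESC" if descending else "")
    return query

print(build_query(["last_name", "county"], sort_by="age_at_execution"))
# SELECT last_name, county FROM TexasData ORDER BY age_at_execution
```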

[Figure: The five tables in our database]

Creating each table was a bit challenging because we had to clean the data and make sure it would be read correctly by MySQL Workbench. Setting up the primary key and foreign keys was not difficult, but it was very time consuming. Once all the data was imported and the schema was created, we were able to query the database and connect the front end of our website, getting some crude results that we then fine-tuned into what you see now on the website.

Learning Experience

The data warehousing part of our project was the more challenging part, and therefore the more educational. Beyond setting up a basic database we had no prior experience, so turning that simple database/dataset into something valuable that could be mined was a real challenge. We did a lot of hands-on learning; much of it involved frustration and running into one problem after another, but, as stated previously, those are the things that stick with you and help you do it again in the future.

Summary

Overall, the most difficult part of the project was the data warehousing. We had to come up with a schema in a short amount of time and build it. Loading the database also posed somewhat of a challenge: cleaning up the data and making sure it would be accepted correctly into its respective rows and tables.

Some characters are not accepted by MySQL, and characters such as spaces and quotes often caused problems. Setting up a MySQL server was also a bit of a challenge; getting it configured and being able to connect to it sometimes caused problems. Overall this was a great learning experience and a great way to dive into some of the concepts we learned in class.

Bibliography

Data Source: http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
Web Scraper Base: https://github.com/zmjones/deathpenalty
Our website: http://athena.ecs.csus.edu/~martinj/#/overview

Texas Death Row Last Statements
Data Classification and Data Mining
By Group 16: Irving Rodriguez, Joseph Lai, Joe Martinez

Introduction

When we first sat down and met as a group, we went exploring for a data set that would be interesting. We searched for quite a while and nothing really piqued our interest. Then we found the Texas Department of Criminal Justice website and couldn't find anything else that seemed more interesting. This posed a problem, though, because we wouldn't necessarily be solving a real problem, so we decided to solve a virtual one. What intrigued us most were the last statements: we thought we could find the most common words, and therefore themes, within those statements, which could give us some insight into the inmates' minds and experiences. We ended up with 536 rows and about 20 columns, and then generated quite a few more columns and tables from those. We wanted to approach this in a technical sense, but we also wanted to let the data guide us and reveal interesting correlations between the inmates. That is why we included so many visuals in our project: we wanted the data mining to be interesting, because the subject matter is innately interesting.

Data Mining and Classification Results

When we ran KNN, we realized that our data set was smaller than most, which meant our accuracy was not going to be great. However, even with our small data set, the KNN algorithm run in RapidMiner was still 68% accurate. This can be seen in the pie chart to the right, and the two pie charts below give a better representation of our KNN classification.

We were trying to predict the race of the inmate, and this is the result RapidMiner gave us. You can see that the Hispanic prediction matches up almost exactly with our actual data set, but the White and Black predictions were not as good: the model predicted many more White people than the actual set, which had a few more Black people. On the left we have our actual data; our predictions are on the right.

[Figure: KNN actual vs. predicted]

We also performed Naive Bayes classification on our data. We wanted to classify, or predict, the age of offense based on education level, which would show the correlation between education level and the age at which the crime was committed. We were also able to break this down by race. You can see the results below; in general, the inmates had a tenth-grade education level and committed their crime at age 26. A rough Python equivalent of both experiments is sketched below.
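We ran both classifiers inside RapidMiner; as a rough equivalent, here is a minimal scikit-learn sketch of the two experiments. The column names (and the numeric encoding of education level) are assumptions for illustration:

```python
# Sketch: KNN to predict race, Naive Bayes to predict age of offense
# from education level. Column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("inmates.csv")

# Experiment 1: KNN on race (RapidMiner gave us roughly 68% accuracy).
X = df[["age_at_offense", "education_level", "height_inches", "weight"]]
y = df["race"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))

# Experiment 2: Naive Bayes, predicting a coarse age-of-offense class
# from education level (treated as a grade number).
df["age_class"] = pd.cut(df["age_at_offense"], bins=[0, 25, 35, 100],
                         labels=["<=25", "26-35", ">35"])
nb = GaussianNB().fit(df[["education_level"]], df["age_class"])
print(nb.predict(pd.DataFrame({"education_level": [10]})))  # e.g. 10th grade
```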

Results

So, in general, we wanted to find any interesting or insightful correlations within our data. One interesting thing came up immediately after loading our data into RapidMiner: it reports the minimum, maximum, and average for each attribute, so we immediately got a profile of the average inmate who had been executed. The average offender would be a white male named James Johnson with black hair and brown eyes, from Harris County. He would be 39 years old with a 9th-grade education, standing 5'6" tall and weighing 186 pounds.

The next step was to get the most-used words from the inmates' last statements, occupations, incident summaries, and victim information. We used text mining modules within RapidMiner to do this: first you select the attribute to use, convert it to text, tokenize it, transform it to lower case, and finally filter stop words (an equivalent pipeline in Python is sketched below). This worked surprisingly well and was relatively easy. The images below show the top five most common words as well as a larger subset in word cloud form.

[Figure: Last statements common words]
[Figure: Last statements word cloud]
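The RapidMiner operator chain maps almost one-to-one onto a few lines of Python. This sketch uses a small hand-picked stop word list and a hypothetical column name; RapidMiner ships its own stop word list:

```python
# Sketch of the text pipeline: select attribute -> convert to text ->
# tokenize -> lower case -> filter stop words -> count.
import re
from collections import Counter

import pandas as pd

STOP_WORDS = {"the", "i", "to", "you", "and", "a", "my", "of", "is", "that"}

df = pd.read_csv("inmates.csv")
counts = Counter()
for statement in df["last_statement"].dropna():        # hypothetical column
    tokens = re.findall(r"[a-z']+", statement.lower()) # tokenize + lower case
    counts.update(t for t in tokens if t not in STOP_WORDS)

print(counts.most_common(5))  # the top five most common words
```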

We performed this same text mining on the summary of offense, the occupations, and the victim information, as well as on the most common phrases from the last statements; the phrases, however, did not yield a very interesting result. We were also able to produce some other interesting visualizations. For instance, we plotted the number of executions by year from 1982 until today (a code sketch of this chart follows the figure), and the other plot below shows the number of inmates by the county they came from.

[Figure: Executions by year]
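We built our charts with visualization tools such as Tableau (see the Bibliography), but the executions-by-year chart can also be sketched in a few lines of pandas and matplotlib, assuming a parseable execution_date column (a hypothetical name):

```python
# Sketch: count executions per year and plot them as a bar chart.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("inmates.csv", parse_dates=["execution_date"])
by_year = df["execution_date"].dt.year.value_counts().sort_index()

by_year.plot(kind="bar", xlabel="Year", ylabel="Executions",
             title="Executions by year")
plt.tight_layout()
plt.savefig("executions_by_year.png")
```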

[Figure: Number of inmates by county]

Learning Experience

The first major speed bump we hit was gathering all of the data. We had a main webpage that contained links to the individual profiles and statements of the deceased inmates, and there were no pre-cleaned, downloadable CSV files to use. Our group overcame this by creating a web crawler with Python and exporting all the data into a CSV. A tip we would give to future classes is to find a dataset that is exportable to CSV, because that allows for a more complete set of data. Expanding on that, our data did not fully represent what was online, because some of the entries online were PDFs that could not be exported via the web crawler.

A major resource that contributed to our success was the RapidMinerTutorial channel on YouTube. Its KNN and Naive Bayes videos were the compass that gave us direction when we were lost.

None of our team had experience with RapidMiner, and we had only limited knowledge of Weka. In fact, nothing helped us with Weka, not even the volunteer tutor. We spent a few hours trying to load our dataset into Weka, but it kept giving us an error along the lines of the number of columns not matching the number of data points. After loading the data into RapidMiner and having it work on the second try, we decided to stick with RapidMiner; it also has a lot of functionality, more than we were able to explore for this project.

Summary

The results of our project were better than we expected. We managed to create nearly 70%-accurate predictions with our KNN algorithm, we were able to predict age of offense given highest education level using the Naive Bayes algorithm, and we produced a generic profile of a typical executed inmate from our semi-comprehensive dataset. We also successfully mined the text and accomplished our original goal of finding the most common words. Overall, this was a success: we were able to apply classroom concepts to real-life data, and because we did not use previously gathered data, our group experienced data mining at a lower, deeper level, even though that was more problematic and sometimes more frustrating.

Bibliography

Data Source: http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
Text Mining Walkthrough: https://www.youtube.com/watch?v=ejd2m4r4mbm
RapidMiner Tutorials: https://www.youtube.com/user/rapidminertutorial/videos
RapidMiner: https://rapidminer.com
Tableau: http://www.tableau.com/
Word Cloud Creation: https://tagul.com
Our website: http://athena.ecs.csus.edu/~martinj/#/overview