Crime - Based Predictive Analysis and Warning System

1 Crime - Based Predictive Analysis and Warning System Sahil Puri, Parul Verma

2 Outline Motivation Goal Dataset details Architecture Modelling and Approach Progress Future work

3 Motivation and Goal

4 Motivation Crime-based predictive analysis approaches have been proposed widely in the literature, but they mostly focus on policing and government services (e.g. PredPol). There is no smart warning system that alerts users about anticipated crime at their location. Geo-spatial open data available on the internet can be utilized: data on Chicago.gov.in; the Canadian government has made 200,000 datasets available.

5 Goal To design an architecture and implement a real-time system that warns users about anticipated crime in their neighbourhood. Benefits of such a system: safety of citizens; a reduced crime rate in the city; a deterrent for criminals; can be used by authorities to track crime.

6 Dataset information

7 Dataset Details Crime data made available by the Department of Police in Chicago. URL - Data provides details of: Crime Type, Crime Description, Date and Time of Crime, Latitude and Longitude.

9 Architecture

10 Architecture Modelling
Preprocessing layer: The amount of data is very large (nearly 200,000 rows). This layer converts it into a format consumable by the next layer.
Clustering layer: The approach compares the user's location with crime point locations. How can the computation time be reduced on big data? Data mining techniques come to the rescue.
Categorizer layer: Which features should contribute to anticipating the crime?
User Interface

11 Preprocessing Data Unstructured data is converted into a system-consumable format. Noise is removed from the raw data. Columns not relevant to the study (Ward Number, Case ID used by the police) are removed. Preprocessing can also be used to join multiple datasets relevant to the application and provide more insight. The Canadian government has 200,000 open datasets available; some contain crime information, some contain 911 call information. These datasets can be aggregated and processed.
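The preprocessing step described above could look roughly like the following sketch. The column names ("Primary Type", "Description", "Date", "Latitude", "Longitude", "Case ID", "Ward") and the sample rows are assumptions made for illustration; the real dataset's headers may differ, and "noise" is interpreted here simply as rows without usable coordinates.

```python
import csv
import io

# Hypothetical column names; the real dataset's headers may differ.
KEEP = ["Primary Type", "Description", "Date", "Latitude", "Longitude"]

def preprocess(rows):
    """Keep only the relevant columns and drop rows without valid coordinates."""
    cleaned = []
    for row in rows:
        try:
            lat = float(row["Latitude"])
            lon = float(row["Longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # rows with missing or malformed coordinates count as noise
        record = {name: row.get(name, "") for name in KEEP}
        record["Latitude"], record["Longitude"] = lat, lon
        cleaned.append(record)
    return cleaned

# Tiny in-memory sample standing in for the raw CSV export (made-up rows).
SAMPLE = """Case ID,Ward,Primary Type,Description,Date,Latitude,Longitude
A1,12,THEFT,OVER $500,01/05/2015 10:30,41.88,-87.63
A2,7,BATTERY,SIMPLE,01/06/2015 22:10,,
"""

records = preprocess(csv.DictReader(io.StringIO(SAMPLE)))
print(records)
```

The second sample row has no coordinates and is dropped; the Case ID and Ward columns are not carried into the cleaned records.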

12 Clustering

13 Clustering The processed data is clustered using the algorithms discussed ahead. Clustering is performed on the basis of geospatial location to group the crime data into multiple clusters. A cluster is defined by its centroid. The centroid is defined on the basis of the density distribution of the points contained inside the cluster.
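One way to see why centroids reduce computation: the user's location only needs to be compared against one point per cluster instead of every crime point. The sketch below assumes the centroid is the plain mean of latitude and longitude, which is only a stand-in for the density-based definition mentioned on the slide; the function names are hypothetical.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def centroid(points):
    """Assumed centroid: the mean latitude and longitude of the cluster."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def nearest_cluster(user_loc, clusters):
    """Return the index of the cluster whose centroid is closest to the user."""
    return min(range(len(clusters)),
               key=lambda i: haversine_km(user_loc, centroid(clusters[i])))

# Made-up coordinates for illustration.
clusters = [[(41.88, -87.63), (41.89, -87.62)], [(41.70, -87.60)]]
print(nearest_cluster((41.885, -87.625), clusters))
```

Only the crimes in the selected cluster then need to be examined in detail.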

14 Clustering Algorithms
K-means: A method of vector quantization popular for cluster analysis in data mining. Groups data of size 'n' into exactly 'k' clusters. Originated in signal processing.
DBScan: Groups points based on density, i.e. points that are closely packed. The number of clusters is dynamic and not limited to a fixed number as in the case of K-means. More suitable for geo-spatial points.

15 DBScan Algorithm Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups together the most closely related points based on three factors:
Epsilon: the maximum distance between two points that belong to the same cluster.
minnumpoints: the minimum number of points that should be in a cluster (0 in our case).
Noise: points that cannot be put into any cluster are tagged as noise (again, 0 in our case).
Steps in DBSCAN
STEP 1 - For each point p in the dataset which is not visited,
STEP 2 - Mark p as visited.
STEP 3 - For each point q in the dataset which is not visited,
STEP 4 - Calculate distance = HaversineDistance(p, q);
STEP 5 - If distance < epsilon, put q in p's cluster. Mark q as visited.
STEP 6 (MERGE) - For each point k in the dataset which is not visited, repeat STEP 4.
Cluster centroid: the point which represents the whole cluster, calculated on the basis of the density distribution of points in the cluster.
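The steps above can be sketched as follows. This is one reading of the slide, not the authors' code: with the minimum point count set to 0, the merge step is interpreted as repeatedly pulling in any unvisited point within epsilon of a point already in the cluster. Epsilon is assumed to be in kilometres, and the function names are hypothetical.

```python
import math
from collections import deque

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def cluster_with_merge(points, epsilon_km):
    """Group point indices; a point joins a cluster if it is within
    epsilon of any point already in that cluster (the MERGE step)."""
    visited = [False] * len(points)
    clusters = []
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        members, frontier = [i], deque([i])
        while frontier:
            p = frontier.popleft()
            for q in range(len(points)):
                if not visited[q] and haversine_km(points[p], points[q]) < epsilon_km:
                    visited[q] = True
                    members.append(q)
                    frontier.append(q)  # q's neighbours are merged in later
        clusters.append(members)
    return clusters

# Made-up points: three in a chain about 0.56 km apart, one far away.
points = [(41.880, -87.630), (41.885, -87.630), (41.890, -87.630), (41.700, -87.600)]
print(cluster_with_merge(points, 0.6))
```

Because of the merge, the first and third points end up in the same cluster even though they are more than epsilon apart, which illustrates how a few very large clusters can form.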

16 Deciding Epsilon? Deciding epsilon is tricky. A low epsilon gives a large number of clusters with few points per cluster. A large epsilon gives a small number of clusters with many points per cluster. An optimum value is needed to get evenly distributed clusters. Cluster Error Coefficient: the deviation in the number of points per cluster for a specific value of epsilon, i.e. the degree of uneven distribution of points across clusters.
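The slide does not give a formula for the cluster error coefficient. A minimal sketch, assuming it is the population standard deviation of the cluster sizes (an assumption, not the authors' stated definition):

```python
from statistics import pstdev

def cluster_error_coefficient(cluster_sizes):
    """Assumed definition: population standard deviation of cluster sizes.
    A value of 0 means every cluster holds the same number of points."""
    return pstdev(cluster_sizes)

uneven = cluster_error_coefficient([97, 1])       # one dominant cluster
even = cluster_error_coefficient([25, 25, 24, 24])  # hypothetical even split
print(uneven, even)
```

Under this definition, a lower value for a given epsilon indicates more evenly distributed clusters, so candidate epsilon values could be compared by this number.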

17 Limitations of DBSCAN With epsilon value = 2, data is not evenly distributed in clusters, and there is not much improvement in the number of point comparisons. The table columns are Number of Entries, Number of Clusters and Points in each cluster; the points in each cluster, row by row, are:
(97, 1)
(406, 1, 1)
(856, 3, 2, 1, 3, 1, 1, 1, 1, 1, 1)
(4334, 11, 7, 1)
(8348, 19, 13, 1)
(34544, 29, 1)

18 DBSCAN_Extended It is a simplified version of DBScan which does not implement the MERGE process.
Steps in DBSCAN_Extended
STEP 1 - For each point p which is not visited,
STEP 2 - For each point q in the dataset,
STEP 3 - Calculate distance = HaversineDistance(p, q);
STEP 4 - If distance < epsilon, put q in p's cluster. Mark q as visited.
Because the merge step is removed, we should get more evenly distributed clusters. Also, a point can belong to multiple clusters.
Cluster centroid: the point which represents the whole cluster, calculated on the basis of the density distribution of points in the cluster.
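A minimal sketch of these four steps, assuming epsilon is in kilometres and using hypothetical function names. Each unvisited point seeds a cluster containing every point within epsilon of it, with no further expansion:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def cluster_without_merge(points, epsilon_km):
    """Each unvisited point p seeds a cluster of all points within epsilon of p.
    There is no merge step, so a point may appear in several clusters."""
    visited = [False] * len(points)
    clusters = []
    for p in range(len(points)):
        if visited[p]:
            continue
        members = []
        for q in range(len(points)):  # every point, visited or not
            if haversine_km(points[p], points[q]) < epsilon_km:
                members.append(q)
                visited[q] = True
        clusters.append(members)
    return clusters

# Made-up points: three in a chain about 0.56 km apart, one far away.
points = [(41.880, -87.630), (41.885, -87.630), (41.890, -87.630), (41.700, -87.600)]
print(cluster_without_merge(points, 0.6))
```

On this sample the middle point of the chain lands in two clusters, and the chain is split instead of collapsing into a single cluster.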

19 Deciding Epsilon for DBScan_Extended? The time required is maximum for E = 1. The cluster error coefficient is maximum for E = 5. For now, we use E = 2.

20 Comparison of DBScan vs DBScan_Extended
DBScan points in each cluster | DBScan_Extended number of clusters | DBScan_Extended points in each cluster
(97, 1) | 44 | (12, 10, 8, 4...)
(406, 1, 1) | 69 | (47, 31, 24, 10)
(856, 3, 2, 1, 3, 1, 1, 1, 1, 1, 1) | 78 | (35, 7, 24)
(4334, 11, 7, 1) | 99 | (371, 204, 47)
(8348, 19, 13, 1) | 101 | (789, 382, 158..)
(34544, 29, 1) | 109 | (3903, )

21 Clustering Visualization The visualization contains the clusters formed using DBScan_Extended algorithm on 1000 points.

22 Categorization

23 Categorization The categorizer function is given a crime point from the user's cluster generated by DBScan_Extended, together with the user's location. It generates a measure of how important each crime in the cluster is with respect to the user's location. Input: a crime point from a cluster and the user's location. Output: a ranked list of crimes with details.

24 Categorization Weightage Parameters
Categorizer Function(pointLoc, userloc)
Distance(pointLoc, usersloc): distance between the crime point and the user's location.
DateOldness(pointDate, userdate): number of days passed between the crime and the user's date.
timedifference(pointtime, usertime): time difference between the crime point's time and the user's time.
Frequency(pointCrime, Cluster): frequency of the crime in the cluster list.

25 Categorization Weightage Distribution The final weightage of a point with respect to the user's location is calculated from the following terms:
W Distance = weightage to the distance between the crime point and the user's location.
W time = weightage to the difference between the times of the points.
W date = weightage to the difference between the dates of the points.
Frequency of Crime = number of instances of the crime in the cluster.
Weight(Point, UserLocation) = 0.35 WDistance Wtime Wdate Frequency of Crime
The crime points in the cluster are ranked in decreasing order of weightage, and the ranked crime list is displayed to the user.
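The ranking idea can be illustrated with a small sketch. Only the 0.35 coefficient for distance comes from the slide. The weighted-sum form, the other three coefficients, the 1 / (1 + x) scoring of each factor, and the dictionary field names are all assumptions made for illustration.

```python
W_DISTANCE = 0.35   # from the slide
W_TIME = 0.25       # hypothetical
W_DATE = 0.25       # hypothetical
W_FREQUENCY = 0.15  # hypothetical

def weightage(point, cluster):
    """Assumed weighted sum: nearer, more recent and more frequent crimes score higher."""
    same_type = sum(1 for c in cluster if c["type"] == point["type"])
    frequency = same_type / len(cluster)
    return (W_DISTANCE / (1 + point["distance_km"])
            + W_TIME / (1 + point["hours_apart"])
            + W_DATE / (1 + point["days_old"])
            + W_FREQUENCY * frequency)

def rank(cluster):
    """Crime points in decreasing order of weightage."""
    return sorted(cluster, key=lambda p: weightage(p, cluster), reverse=True)

# Made-up crime points in the user's cluster.
cluster = [
    {"type": "THEFT", "distance_km": 0.2, "days_old": 3, "hours_apart": 1.0},
    {"type": "THEFT", "distance_km": 1.5, "days_old": 30, "hours_apart": 6.0},
    {"type": "BATTERY", "distance_km": 0.2, "days_old": 3, "hours_apart": 1.0},
]
for crime in rank(cluster):
    print(crime["type"], round(weightage(crime, cluster), 3))
```

With these assumed weights, a nearby recent theft outranks an equally near and recent battery only because theft occurs more often in the cluster.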

26 Comparison of DBScan vs DBScan_Extended for categorization

27 Categorization Visualization The visualization shows the anticipated crime at the location of the user. User can click on Next to see the crime rankings.

28 Schedule and Progress

29 Tasks: Knowledge gathering and literature reading; Categorisation design and prototype implementation; App creation, project report; Clustering algorithms analysis; Fine tuning categoriser and research on more methods. Statuses: Completed, Completed, Completed, In Progress, TODO.

30 Future Work and References

31 Future Work The categorizer needs to be fine-tuned to improve its accuracy. A better user interface, such as an Android / iOS application, would allow the project to be used by the general public. Research can be extended to predict crime based on user attributes: the user's location, which changes continuously, and the user's mode of transportation, e.g. driving a car or walking. A flexible system where users can also contribute and provide data. Research into other possible sources of data.

34 Questions??
