American Football Route Identification Using Supervised Machine Learning

Similar documents
Supervised Learning Classification Algorithms Comparison

Coach s Office Playbook Tutorial Playbook i

Business Club. Decision Trees

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Network Traffic Measurements and Analysis

Robust PDF Table Locator

Supervised vs unsupervised clustering

ADVANCED CLASSIFICATION TECHNIQUES

ESERCITAZIONE PIATTAFORMA WEKA. Croce Danilo Web Mining & Retrieval 2015/2016

Classifiers and Detection. D.A. Forsyth

Fast Scout. Desktop User Guide

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su

Notes and Announcements

Performance Evaluation of Various Classification Algorithms

Classification and Detection in Images. D.A. Forsyth

Classification. 1 o Semestre 2007/2008

AN ABSTRACT OF THE THESIS OF

Applying Supervised Learning

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Data Science Course Content

DEFENSIVE COMPACTNESS AS A PERFORMANCE INDICATOR FOR GAME ANNOTATION

Introduction to Automated Text Analysis. bit.ly/poir599

Semi-supervised Learning

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

MATLAB COMPUTATIONAL FINANCE CONFERENCE Quantitative Sports Analytics using MATLAB

Certified Data Science with Python Professional VS-1442

Mathematics of Data. INFO-4604, Applied Machine Learning University of Colorado Boulder. September 5, 2017 Prof. Michael Paul

Supplemental Material: Multi-Class Open Set Recognition Using Probability of Inclusion

Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University

Week 7 Picturing Network. Vahe and Bethany

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

Naïve Bayes for text classification

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015

Slice Intelligence!

Python With Data Science

Understanding Tracking and StroMotion of Soccer Ball

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Classifying Building Energy Consumption Behavior Using an Ensemble of Machine Learning Methods

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Data Mining and Knowledge Discovery: Practice Notes

Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression

Boolean Classification

A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Function Algorithms: Linear Regression, Logistic Regression

BASEBALL TRAJECTORY EXTRACTION FROM

Classification and Regression

CS 340 Lec. 4: K-Nearest Neighbors

Random Forest A. Fornaser

CS273 Midterm Exam Introduction to Machine Learning: Winter 2015 Tuesday February 10th, 2014

R (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.

DIGITAL IMAGE ANALYSIS. Image Classification: Object-based Classification

Lecture 25: Review I

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

RESAMPLING METHODS. Chapter 05

Character Recognition from Google Street View Images

CS229: Machine Learning

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

Classification and K-Nearest Neighbors

7. Nearest neighbors. Learning objectives. Foundations of Machine Learning École Centrale Paris Fall 2015

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

Evaluating Classifiers

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Weka ( )

Overview for Families

SNS College of Technology, Coimbatore, India

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

K- Nearest Neighbors(KNN) And Predictive Accuracy

Predicting Results for Professional Basketball Using NBA API Data

Chapter 4: Non-Parametric Techniques

Louis Fourrier Fabien Gaie Thomas Rolf

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

The exam is closed book, closed notes except your one-page cheat sheet.

Nearest neighbor classification DSE 220

Classification of Hand-Written Numeric Digits

A Framework for Efficient Fingerprint Identification using a Minutiae Tree

BRAT Manual V1.1. May 6th, Christian Göschl & Wolfgang Busch. for correspondence:

Multi-label classification using rule-based classifier systems

SPORTS DOOD. User Guide v1

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Hidden Trends in NFL Data

Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions

6.034 Quiz 2, Spring 2005

Specialist ICT Learning

Machine Learning based session drop prediction in LTE networks and its SON aspects

The Curse of Dimensionality

Characterization of the formation structure in team sports. Tokyo , Japan. University, Shinjuku, Tokyo , Japan

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

Data Mining and Machine Learning: Techniques and Algorithms

Unsupervised Learning : Clustering

Transcription:

American Football Route Identification Using Supervised Machine Learning Hochstedler, Jeremy & Paul T. Gagnon Telemetry Sports Abstract Football organizations spend significant resources classifying on-field actions. Since 2014, radio-frequency identification (RFID) tracking technology has been used to continuously monitor the on-field locations of NFL players.[1] Using spatiotemporal American football data, this research provides a framework to autonomously classify receiver routes. 1. Introduction NFL teams spend a significant amount of time each week breaking down film and tagging plays, which enables coaches, scouts, and players to filter plays for further analysis and more efficiently consume video. Traditionally, Quality Control (QC) coaches breakdown film by tagging events, players, actions, and play features. This activity isn t new, as Steve Belichick s Football Scouting Methods (1962) outlined many of the roles of a QC coach. However, when it comes to preparing for a game, accuracy (as it relates to getting the data that coaches want) and timing (allowing coaches more time to create a game plan) are key. With the NFL s available player tracking data, automation and machine learning will enable teams to tag data more quickly and in a more uniform manner. Currently, teams rely on multiple (at times, not standardized) data collection mechanisms. For example, multiple coaches (or third parties) break down game film to tag data. Naturally, these existing methods are susceptible to human error and are time intensive. 2. Data Since the NFL player tracking data (NextGen Stats) is not available for competitive scouting or public analysis, we have created a dataset using the 2014 regular season Indianapolis Colts baseoffensive plays. Spatial coordinates of all on-field players were captured based on the All 22 game video at three frames per second to the nearest 0.25 yard for every 2014 base offensive pass play. This yields 231 total plays. Here, the base offense is defined as 1st & 2nd down, less than 15-point differential, greater than five minutes remaining in a half, and between the 20 yard lines. Fig. 1 displays the data collection process flow diagram for a given play. 1

Figure 1. Data collection process flow. From the 231 base offensive pass plays, trained coaches (used here to mimic the work of QC coaches) classified each receiver s route using a simplified route tree as described in Fig. 2. Each route is labeled using a primary tag with secondary labels used to notify the reader that route subtypes are also classified under the primary tag. An industry standard numbering system is also included. The machine learning classifier will use the primary tags associated with each route, therefore we will refer to each of the eleven route types using the primary tag going forward. Figure 2. Simplified route tree. After removing screen plays (as most receivers typically block on these plays), we analyzed the routes of all players who lined up as slot receivers or wide receivers (see formation identification), leaving 548 player routes. Table 1 displays the eleven route types, relevant secondary labels, and the number of each route type in the data set. Route No. Primary Label Secondary Labels Count 0 under drag, shallow cross 31 1 out flat, arrow, whip, quick out 70 2 slant -- 30 3 hitch -- 15 4 hook curl, stop 44 5 comeback deep out 22 2

Route No. Primary Label Secondary Labels Count 6 dig drive, cross 123 7 corner flag 23 8 post bang 50 9 go fly 121 11 wheel release, out-and-up 19 Table 1. Route types and associated counts within dataset. Route label decisions were made as the coaches interviewed focus more on the route concept as it relates to the play call, not necessarily the specific route run. For example, some routes can be option routes as variants will be run based off of defensive alignment and scheme. Specifically, an out route can be modified to an arrow route if the defender is playing off coverage. In another example, an option go/9 route can turn into a post/8 route if the middle of the field is left open (MOFO). Fig. 3 displays the trajectories of all 548 routes, specifically displaying under routes in red. Figure 3. 548 receiver trajectories from the 2014 Indianapolis base offense. 3

3. Formation Identification Individual player starting locations are defined by proximity to teammates, opponents, various fixed field points (e.g. hashes, boundaries, and yard markers), as well as floating points (e.g. line of scrimmage). Logic has been established to classify player alignment as one of ten distinct positions (LT, LG, C, RG, RT, TE, SR, WR, RB, and QB). Additionally, further information is tagged for each individual player. For example, players are identified whether they are lined up on the line of scrimmage or not, in a bunch/stacked formation, their associated skill number (e.g. L1, L2, R1, R2), and more. We will keep to an abbreviated description of the formation and player alignment methods as the focus of this research is to outline a framework for route classification. 4. Route Classification Upon the start of play execution (the snap), player trajectories are analyzed for every position on the field in order to classify the player action. Because routes are run in relation to ball position, all routes starting from the formation s left side have been mirrored to create x-coordinates as-if they were run from the right side of the formation. Figure 4. 548 receiver trajectories with skill left paths mirrored to skill right. 4

4.1 Feature Extraction The key to any machine learning algorithm is the selection/extraction of features to train the model. The slot/wide receiver routes are captured until either: a maximum of 15 frames/5 seconds (3 Hz dataset) or forward pass takes place. Basic receiver routes are comprised of three essential components: a stem, pivot, and branch. The stem is the initial segment from starting point (snap, t = 0) to the pivot (re-direction) point. The branch is the final component of the route for which the receiver runs. Using a segmented Euclidean regression, features are distilled from each receiver s trajectory. Fig. 5 highlights the three basic route components. Figure 5. A typical slant route. 4.1.1 Segmented Euclidean Regression While each route possesses unique features such as stem heading and branch length, the most significant features distilled for classification are its stem length and branch heading. These features are used in the training of supervised machine-learning algorithms. As a first pass in determining the two-segment linear regression, we first use the start point [x 0,y 0] and the route s end point [x n,y n] testing each internal node as the pivot point [x p,y p]. Straight lines are then drawn from [x 0,y 0] to [x p,y p] and again from [x p,y p] to [x n,y n] for each internal data point. The Euclidean distance (error) for all data points is then summed. The point that yields the minimized Euclidean error is denoted as the route s true pivot point. where: r = regression path t = true path p = pivot point 2 minimize f p = x - - +, x - /, + y - - +, y - /,,34 subject to: 0 < p < n 5

In order to more accurately represent the trajectory data with a three-point path (the regression path), additional pivot points near the route s true pivot point are tested. These points are found by uniformly spreading hypothetical pivot points near the original path s pivot point. Fig. 6 displays an additional 25 (5x5, edges identified by blue arrows) pivot points surrounding the originally minimized pivot point (circled in blue). Note: the spread is increased for illustration purposes. Our algorithm optimizes based on a spread with an order of magnitude smaller and order of magnitude more test points. Figure 6. Illustration of tested regression paths to minimize Euclidean error. 6

Fig. 7A-D and Table 2 provide additional examples of how the two-segment Euclidean regression produces stem length and branch heading values for some sample routes. Figures 7A-D. Clockwise from top left: slant, comeback, out, and dig routes. Route Stem Length (yds) Branch Heading (deg. from East) slant 4.0 136 comeback 13.9-63 out 4.8 0 dig 10.9 172 Table 2. Stem length and branch headings for four routes shown above. 7

4.1.2 Average Speed Average speed throughout the route is calculated using a simple average over the duration of the player s route. where: f = total number of frames x n, y n = final position coordinates v = x 2 - + y 2-3f 4.1.3 Final Route Location The two final features used for training include the final position coordinates (x n, y n) relative to the player s initial position at snap. 4.2 Model Training A 1:1 train-to-test ratio was used. Thus, the 548 player routes were split in half, allowing for 274 routes for both training and testing. The training dataset and each route s five features were used to train the classifiers. Specifically, five different supervised learning algorithms were trained and tested: Naive Bayes, k-nearest Neighbors, Random Forest, Logistic Regression, and Support Vector Machines (SVM). 5. Results Precision and recall are used to determine overall model accuracy. Precision is a measure of the ratio of accurately predicted classifications to the total number predicted for that classification category. Recall is a measure of the ratio of accurately predicted classifications to the total number of true classifications for that category. Precision = t @ t @ + f @ where: t p = true positives f p = false positives f n = false negatives Recall = t @ t @ + f 2 8

Fig. 8 displays a normalized confusion matrix (of the Naïve Bayes model) to visualize accuracy while Table 3 provides further quantitative details. Figure 8. Normalized confusion matrix of a Naïve Bayes model. Route Precision Recall Count comeback 0.36 0.45 11 corner 0.33 0.08 13 dig 0.85 0.65 68 go 0.80 0.71 68 hitch 0.25 0.57 7 hook 0.56 0.50 18 out 0.71 0.77 31 post 0.59 0.79 28 slant 0.65 1.00 11 under 0.83 0.83 12 wheel 0.54 1.00 7 total / avg 0.70 0.68 274 Table 3. Precision details displaying an overall precision of 0.70. 9

The maximum recorded precision and recall for each of the five learning algorithms is outlined in Table 4. 6. Conclusion Classifier Precision Recall Naïve Bayes 0.70 0.68 knn 0.68 0.68 Random Forest 0.66 0.64 Logistic Regression 0.57 0.62 SVM 0.45 0.29 Table 4. Maximum total precision and recall for each classification method. 6.1 Summary While this framework shows that the Naïve Bayes classifier provides the most accurate (yet mediocre) solution, it alone is not the main takeaway from this analysis. Instead, the primary focus shall remain in the process outlined, which includes feature selection and extraction. The classification process outlined here will significantly improve the efficiency of NFL QC coaches. Additionally, these (and other future) methods will enhance game plan development by more accurately and more quickly identifying opponent tendencies, keys, and strategies. These methods do not replace the need for QC coaches; rather they enable the entire coaching staff to dive deeper into the data, tendencies, and video. By enabling them to focus their time and efforts on areas that are less mundane and providing these tags earlier each week, coaches can spend more time using the data and less time generating it. 6.2 Limitations Because the number of training samples used here are relatively small, we expect increased accuracy with more training data. Because the RFID data possesses significantly better resolution (10 Hz vs. 3 Hz here), we also expect an accuracy increase. As is true with supervised learning applications, models improve with additional training data. In this case the manual data creation limits the training set and the results. Once the NextGen Stats data is available to teams for competitive scouting purposes, the value and accuracy of these methods will grow. 6.3 Future work With the current data set, we plan to further investigate how route combinations contribute to play outcomes. Additional work can be done using synthesized route datasets to train the machine learning algorithm. As NextGen Stats becomes available, we plan to work with teams to focus automation and auto-classification efforts on QC coach activities that are mundane and error prone, in addition to areas where it is difficult for QC coaches to see the underlying data based on video alone. 10

References [1] Zebra Technologies. Accessed December 10, 2015. https://www.zebra.com/us/en/nfl.html [2] Belichick, Steve. Football Scouting Methods. 1962. Martino Publishing. Mansfield Centre, CT. 11