Final Project. Analyzing Reddit Data to Determine Popularity

Project Background: The Problem

Problem: Predict post popularity, where the target/label is based on a transformed score metric.

Algorithms / Models Applied:
SVC
Random Forests
Logistic Regression

Project Background: The Data

Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and was broken out into 2,500 .csv files, one per subreddit.

Data Structure (22 Columns):
created_utc - Float
score - Integer
domain - Text
id - Integer
title - Text
author - Text
ups - Integer
downs - Integer
num_comments - Integer
permalink (aka the reddit link) - Text
self_text (aka body copy) - Text
link_flair_text - Text
over_18 - Boolean
thumbnail - Text
subreddit_id - Integer
edited - Boolean
link_flair_css_class - Text
author_flair_css_class - Text
is_self - Boolean
name - Text
url - Text
distinguished - Text

Project Background: The Data - Removed

Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and was broken out into 2,500 .csv files, one per subreddit.

Data Structure:
created_utc - Float
score - Integer
domain - Text
id - Integer
title - Text
author - Text
ups - Integer
downs - Integer
num_comments - Integer
permalink (aka the reddit link) - Text
self_text (aka body copy) - Text
link_flair_text - Text
over_18 - Boolean
thumbnail - Text
subreddit_id - Integer
edited - Boolean
link_flair_css_class - Text
author_flair_css_class - Text
is_self - Boolean
name - Text
url - Text
distinguished - Text
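
Since the data set is split across 2,500 per-subreddit .csv files, a first step is usually to combine them into a single table. A minimal sketch with pandas, assuming the files sit in a local data/ directory (the path and variable names are illustrative, not from the slides):

    import glob
    import pandas as pd

    # One .csv per subreddit; concatenate them into a single DataFrame.
    files = glob.glob("data/*.csv")
    posts = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    print(posts.shape)  # roughly 2.5 million rows x 22 columns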

Reviewing the Data: Subreddit Topics

datasets, learnpython, dataisbeautiful, MachineLearning, BirdsBeingDicks, PenmanshipPorn, TreesSuckingAtThings, CoffeeWithJesus, Otters, AnimalsWithoutNecks, CemeteryPorn, misleadingthumbnails, FortPorn, PowerWashingPorn, ShowerBeer, talesfromtechsupport, StonerPhilosophy

Reviewing the Data: Top Domains

[Bar chart of post counts by domain]
Imgur: 773,969 | YouTube: 188,526 | Reddit: 25,445 | Flickr: 17,854 | SoundCloud: 10,397
Other top domains shown: quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk

Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title and some related media content from the aforementioned domains - a link, gif, image, video, etc. Over 1.6 million posts (approximately 74% of all posts) had no body copy/text, i.e., the self_text field contained a NaN value.
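
A quick way to arrive at that figure from the combined DataFrame, as a sketch (self_text is the column name listed in the data structure above; posts is the illustrative DataFrame from the loading sketch):

    # Count posts with no body text (NaN) and express them as a share of all posts.
    missing = posts["self_text"].isna().sum()
    print(missing, f"{missing / len(posts):.0%}")  # ~1.6 million posts, ~74%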

Reviewing the Data: Time Based Data

Winter Months Saw a Dip, Fall Could Be Underrepresented Given Data Pulled in August
[Bar chart: post counts by month, January through December; y-axis 0 to 300,000]

Reviewing the Data: Time Based Data

Tuesday Is Slightly the Favorite Day to Post, While the Weekend Sees a Dip
[Bar chart: post counts by day of week, Monday through Sunday; y-axis 0 to 400,000]

Reviewing the Data: Time Based Data

Reddit While You Work: Post Volume Picks Up Around 9/10am, Peaking at 12pm, Then Dropping Off Throughout the Afternoon
[Bar chart: post counts by hour of day, 12am through 11pm; y-axis 0 to 160,000]
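
The three breakdowns above are derived from the created_utc field (a Unix timestamp in seconds). A sketch of how that derivation might look with pandas (posts is the illustrative DataFrame from the loading sketch):

    # Convert the Unix timestamp once, then group by month, weekday, and hour.
    created = pd.to_datetime(posts["created_utc"], unit="s")
    posts_by_month = created.dt.month_name().value_counts()
    posts_by_weekday = created.dt.day_name().value_counts()
    posts_by_hour = created.dt.hour.value_counts().sort_index()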

Reviewing the Data: Determining Popularity

[Histogram of score counts by bin: 50-99, 100-199, 200-299, 300-399, 400-499 | ~15% of posts | 500-999, 1000-4999, 5000-9999, 10000+]

Note: only about half the data is shown because IPython was unable to run the histogram, so the data was exported and charted in Excel.
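
The histogram's bins can be reproduced with pandas; this sketch uses the bin edges shown on the slide (the exact popularity threshold used for the label is not stated in the transcript):

    # Bucket post scores using the slide's bin edges (left-inclusive).
    bins = [50, 100, 200, 300, 400, 500, 1000, 5000, 10000, float("inf")]
    labels = ["50-99", "100-199", "200-299", "300-399", "400-499",
              "500-999", "1000-4999", "5000-9999", "10000+"]
    score_counts = pd.cut(posts["score"], bins=bins, labels=labels, right=False).value_counts()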

Analyzing the Data: Issues

Issue: The size of the initial data set (2.5 million rows), and how it expanded upon transformation (CountVectorizer and TF-IDF) to almost 100,000 columns, caused problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms. Even with this smaller subset, processes could take anywhere from 30 minutes to several hours, making playing around with the data extremely hard.

Future: Explore platforms that are better at handling large data sets, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.
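
A minimal sketch of the text transformation described above, using scikit-learn's TF-IDF vectorizer on the post titles (the slides name CountVectorizer and TF-IDF but do not show the code; column and variable names are assumptions):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Titles become a sparse matrix whose width is the vocabulary size,
    # which on the full data set approaches 100,000 columns.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(posts["title"].fillna(""))
    print(X.shape)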

Analyzing the Data: SVC

Linear kernel = .9368
C value of .1 = .9363
[Bar charts: Accuracy by kernel (Linear, Poly, Sigmoid, RBF); Accuracy w/ Linear Kernel by C value (0.001, 0.01, 0.1)]

from sklearn import svm
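
A sketch of how those settings could be fit and scored (the train/test split and the binary popularity label y are assumptions; the slide shows only the import and the accuracy values):

    from sklearn import svm
    from sklearn.model_selection import train_test_split

    # X is the TF-IDF matrix from the earlier sketch, y the popularity label.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    clf = svm.SVC(kernel="linear", C=0.1)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out set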

Analyzing the Data: Regression Trees

N estimators of 125 = .922
Max depth of 250 = .924
[Charts: accuracy vs. n_estimators (5, 10, 20, 50, 100, 125, 150); accuracy vs. max_depth (5, 40, 100, 150, 200, 250, 300)]

from sklearn import ensemble
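
A corresponding sketch for the forest model, using the best settings reported on the slide (split variables carried over from the SVC sketch):

    from sklearn import ensemble

    # n_estimators and max_depth match the best values shown above.
    rf = ensemble.RandomForestClassifier(n_estimators=125, max_depth=250, random_state=42)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))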

Analyzing the Data: Logistic

C value of 1 = .9471
L1 = .947733
L2 = .947066
[Charts: accuracy vs. C value (0.001, 0.01, 0.1, 1, 10, 50); accuracy with L1 vs. L2 penalty]

from sklearn import linear_model
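
A sketch of the logistic regression fit with the C and penalty values from the slide (the solver argument is an assumption needed for the L1 penalty in current scikit-learn versions):

    from sklearn import linear_model

    # liblinear supports both L1 and L2 penalties.
    logreg = linear_model.LogisticRegression(C=1, penalty="l1", solver="liblinear")
    logreg.fit(X_train, y_train)
    print(logreg.score(X_test, y_test))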

Totally Crushing It!

Analyzing the Data: Classification Report

[Classification report tables for SVC, Random Forests, and Logistic Regression]
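
The report tables did not survive transcription; for reference, this is how such a report is generated for any of the fitted models (using the random forest from the earlier sketch as an example):

    from sklearn.metrics import classification_report

    # Precision, recall, and F1 per class on the held-out set.
    print(classification_report(y_test, rf.predict(X_test)))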

Soooo Not Crushing It

Feature Reduction: Accuracy

SVC - All Features: 93.63% | Reduced Features: 94.5%
Random Forests - All Features: 92.4% | Reduced Features: 95.2%
Logistic Regression - All Features: 94.71% | Reduced Features: 94.3%
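
The slides do not say how the feature set was reduced. Purely as an illustration, one common option in scikit-learn is univariate selection; the method and the value of k below are assumptions, not the author's approach:

    from sklearn.feature_selection import SelectKBest, chi2

    # Keep the k title features most associated with the label (illustrative k).
    selector = SelectKBest(chi2, k=5000)
    X_train_reduced = selector.fit_transform(X_train, y_train)
    X_test_reduced = selector.transform(X_test)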

Feature Reduction: Classification Report

[Classification report tables: SVC (all vs. reduced features), Random Forests (all vs. reduced features), Logistic Regression (all vs. reduced features)]

Next Steps

Dealing with the processing issues:
Learn and try out PySpark (see the sketch after this list).

Answer some additional questions:
Reevaluate how I handle the domains. I originally bucketed domains by their frequency/occurrence in the data set; however, given that the originating domain of the content and the title make up the majority of a post, and the top ~15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
Run the data with varying n-gram levels. I tried them, but they expanded the columns to hundreds of thousands, which just seemed to freeze, so hopefully something like PySpark will help with the processing.

Predict sub-reddit/category questions:
Can I predict the category of a post?
Do certain subreddits produce more overall popular content than others? Bears With Beaks vs. ggggg (whatever the hell that is)
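
A minimal sketch of what moving the text pipeline to PySpark might look like (paths, column names, and the feature stages are assumptions; the slides only name PySpark as a direction to explore):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    spark = SparkSession.builder.appName("reddit-popularity").getOrCreate()

    # Read every per-subreddit .csv at once; Spark processes the 2.5M rows out of core.
    posts = spark.read.csv("data/*.csv", header=True, inferSchema=True).na.fill({"title": ""})

    # Title text -> tokens -> hashed term frequencies -> TF-IDF features.
    tokens = Tokenizer(inputCol="title", outputCol="words").transform(posts)
    tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18).transform(tokens)
    tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)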

APPENDIX

Reviewing the Data: Reevaluate Popularity

[Histogram of score counts by bin: 50-99, 100-199 | ~8% of posts | 200-299, 300-399, 400-499 | ~12% of posts | 500-999, 1000-4999, 5000-9999, 10000+]

Note: only about half the data is shown because IPython was unable to run the histogram, so the data was exported and charted in Excel.

Analyzing the Data: SVC

C value of .1 = 0.7077
[Accuracy score chart: accuracy vs. C value (0.001, 0.01, 0.1, 1, 10, 50); confusion matrix]

Analyzing the Data: Random Forest

N estimators of 100 = 0.8218
Max depth of 200 = 0.8247
[Accuracy score charts: accuracy vs. n_estimators (5, 10, 20, 50, 100, 125); accuracy vs. max_depth (40, 100, 150, 200, 250); confusion matrix]

Analyzing the Data: Logistic

C of 1 = .8453 (C = 1, penalty = L2)
[Accuracy score chart: accuracy vs. C value (0.001, 0.01, 0.1, 1, 10, 50); confusion matrix]
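
The confusion matrices shown in the appendix did not survive transcription; a sketch of how one is computed for the logistic model (variable names carried over from the earlier sketches):

    from sklearn.metrics import confusion_matrix

    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_test, logreg.predict(X_test)))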