Final Project: Analyzing Reddit Data to Determine Popularity
Project Background: The Problem

Problem: Predict post popularity, where the target/label is based on a transformed score metric.

Algorithms / Models Applied:
- SVC
- Random Forests
- Logistic Regression
Project Background: The Data

Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and is broken out into 2,500 .csv files, one per subreddit.

Data Structure (22 columns):
- created_utc - Float
- score - Integer
- domain - Text
- id - Integer
- title - Text
- author - Text
- ups - Integer
- downs - Integer
- num_comments - Integer
- permalink (aka the reddit link) - Text
- self_text (aka body copy) - Text
- link_flair_text - Text
- over_18 - Boolean
- thumbnail - Text
- subreddit_id - Integer
- edited - Boolean
- link_flair_css_class - Text
- author_flair_css_class - Text
- is_self - Boolean
- name - Text
- url - Text
- distinguished - Text
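As a rough sketch of how the 2,500 per-subreddit files could be combined into a single table (the directory layout and the pandas approach here are assumptions, not taken from the project code):

import glob
import pandas as pd

# Assumed layout: one CSV per subreddit in a local "data/" directory.
files = glob.glob("data/*.csv")

# Read each subreddit's file and stack them into one DataFrame.
frames = [pd.read_csv(path) for path in files]
posts = pd.concat(frames, ignore_index=True)

print(posts.shape)  # expect roughly 2.5 million rows x 22 columns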
Project Background: The Data - Removed

A subset of the 22 columns listed on the previous slide was removed before modeling.
Reviewing the Data: Subreddit Topics

Example subreddits in the data set: datasets, learnpython, dataisbeautiful, MachineLearning, BirdsBeingDicks, PenmanshipPorn, TreesSuckingAtThings, CoffeeWithJesus, Otters, AnimalsWithoutNecks, CemeteryPorn, misleadingthumbnails, FortPorn, PowerWashingPorn, ShowerBeer, talesfromtechsupport, StonerPhilosophy
Reviewing the Data: Top Domains

Top domains by post count:
- Imgur (imgur.com): 773,969
- YouTube (youtube.com): 188,526
- Reddit (reddit.com): 25,445
- Flickr (flickr.com): 17,854
- SoundCloud (soundcloud.com): 10,397

Other frequent domains include quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, and bbc.co.uk.
Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title and some related media content from the aforementioned domains - a link, gif, image, video, etc. Over 1.6 million posts, approximately 74% of all posts, had no body copy/text (a NaN value in self_text).
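A minimal sketch of how that missing-body share could be checked with pandas, assuming the combined DataFrame from the earlier sketch and the self_text column name listed on the data slide:

import pandas as pd

# posts is the combined DataFrame of all subreddit CSVs (see the earlier sketch).
# Treat NaN and empty strings as "no body text".
no_body = posts["self_text"].isna() | (posts["self_text"].str.strip() == "")

print(no_body.sum(), "posts with no body text")
print(f"{no_body.mean():.1%} of all posts")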
Reviewing the Data: Time Based Data

Winter months saw a dip; fall could be underrepresented given the data was pulled in August.

[Bar chart: post counts by month, January through December]
Reviewing the Data: Time Based Data

Tuesday is slightly the favorite day to post, while the weekend sees a dip.

[Bar chart: post counts by day of week, Monday through Sunday]
Reviewing the Data: Time Based Data

Reddit while you work: post volume picks up around 9-10am, peaking at 12pm, then dropping off throughout the afternoon.

[Bar chart: post counts by hour of day, 12am through 11pm]
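The month, weekday, and hour breakdowns above can be derived from the created_utc timestamp; a rough sketch with pandas (the exact grouping used for the charts is an assumption):

import pandas as pd

# created_utc is a Unix timestamp stored as a float.
created = pd.to_datetime(posts["created_utc"], unit="s")

posts_by_month = created.dt.month_name().value_counts()
posts_by_weekday = created.dt.day_name().value_counts()
posts_by_hour = created.dt.hour.value_counts().sort_index()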
Reviewing the Data: Determining Popularity

[Histogram: score counts bucketed as 50-99, 100-199, 200-299, 300-399, 400-499, 500-999, 1000-4999, 5000-9999, 10000+; roughly 15% of posts fall in the 500+ buckets]

Note - Only about half the data is shown because IPython was unable to render the histogram, so the data was exported and charted in Excel.
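A minimal sketch of how a score cutoff like the one above could be turned into the popularity label mentioned on the problem slide (the 500 threshold and the binary framing are assumptions based on the histogram annotation, not the project's exact transformation):

# Label posts as popular/not popular based on a score cutoff.
POPULAR_SCORE = 500  # assumed threshold, roughly the top ~15% of posts

posts["popular"] = (posts["score"] >= POPULAR_SCORE).astype(int)

print(posts["popular"].value_counts(normalize=True))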
Analyzing the Data: Issues

Issue: Given the size of the initial data set (2.5 million rows) and how it expanded upon transformation (CountVectorizer and TF-IDF) to almost 100,000 columns, I ran into problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms. Even with this smaller subset, processing could take anywhere from 30 minutes to several hours, making it extremely hard to play around with the data.

Future: Explore platforms that are better at handling large data sets, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.
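A rough sketch of the kind of 1% sample and text transformation described above (using the title as the text field, a single TfidfVectorizer in place of separate CountVectorizer/TF-IDF steps, and a fixed random seed are all assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

# Work on roughly 1% of the posts to keep things tractable locally.
sample = posts.sample(frac=0.01, random_state=42)

# Turn post titles into a sparse TF-IDF matrix
# (TfidfVectorizer combines CountVectorizer and the TF-IDF weighting).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample["title"].fillna(""))
y = sample["popular"]  # label created in the earlier sketch

print(X.shape)  # rows x vocabulary size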
Analyzing the Data: SVC

Linear kernel: .9368 accuracy. C value of .1: .9363 accuracy.

[Bar charts: accuracy by kernel (Linear, Poly, Sigmoid, RBF) and accuracy with the linear kernel by C value (0.001, 0.01, 0.1)]

from sklearn import svm
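A minimal sketch of the SVC setup those numbers describe (the train/test split and reuse of the TF-IDF features from the earlier sketch are assumptions):

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y come from the TF-IDF sketch on the previous slide.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear kernel with C=0.1 reflects the settings reported on the slide.
clf = svm.SVC(kernel="linear", C=0.1)
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))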
Analyzing the Data: Random Forests

N estimators = 125: .922 accuracy. Max depth = 250: .924 accuracy.

[Line charts: accuracy by n_estimators (5, 10, 20, 50, 100, 125, 150) and by max_depth (5, 40, 100, 150, 200, 250, 300)]

from sklearn import ensemble
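A minimal sketch of the corresponding random forest, using the best-performing settings named on the slide (all other hyperparameters are left at sklearn defaults as an assumption):

from sklearn import ensemble

# n_estimators=125 and max_depth=250 are the best values reported on the slide.
forest = ensemble.RandomForestClassifier(n_estimators=125, max_depth=250, random_state=42)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))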
Analyzing the Data: Logistic Regression

C value of 1: .9471 accuracy. L1 penalty: .947733. L2 penalty: .947066.

[Charts: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50) and by penalty (L1 vs. L2)]

from sklearn import linear_model
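A minimal sketch of the logistic regression those numbers describe (the solver choice is an assumption; liblinear is used here because it supports both L1 and L2 penalties):

from sklearn import linear_model

# C=1 with an L1 penalty matches the best settings reported on the slide.
logreg = linear_model.LogisticRegression(C=1, penalty="l1", solver="liblinear")
logreg.fit(X_train, y_train)

print(logreg.score(X_test, y_test))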
Totally Crushing It!
Analyzing the Data: Classification Report

[Classification reports for SVC, Random Forests, and Logistic Regression]
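A sketch of how those per-class reports could be produced with sklearn, assuming the fitted models from the earlier sketches:

from sklearn.metrics import classification_report

for name, model in [("SVC", clf), ("Random Forests", forest), ("Logistic Regression", logreg)]:
    print(name)
    print(classification_report(y_test, model.predict(X_test)))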
Soooo Not Crushing It
Feature Reduction: Accuracy

- SVC: 93.63% with all features vs. 94.5% with reduced features
- Random Forests: 92.4% with all features vs. 95.2% with reduced features
- Logistic Regression: 94.71% with all features vs. 94.3% with reduced features
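The slides don't say how the feature set was reduced; purely as an illustration, a chi-squared filter over the TF-IDF features is one common way to do it:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 10,000 title features most associated with the popularity label.
# (The 10,000 cutoff and chi-squared scoring are assumptions for illustration,
#  not the reduction method used in the project.)
selector = SelectKBest(chi2, k=10_000)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)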
Feature Reduction: Classification Report

[Classification reports for SVC, Random Forests, and Logistic Regression, each with all features vs. reduced features]
Next Steps

Dealing with the processing issues:
- Learn and try out PySpark.

Answer some additional questions:
- Reevaluate how I handle the domains. I originally bucketed domains by their frequency/occurrence in the data set; however, given that the originating domain and the title make up the majority of a post, and the top ~15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
- Run the data with varying n_gram levels. I tried them, but they expanded the columns into the hundreds of thousands, which just seemed to freeze everything, so hopefully something like PySpark will help with the processing.

Predict sub-reddit/category questions:
- Can I predict the category of a post?
- Do certain subreddits produce more overall popular content than others? Bears With Beaks vs. ggggg (whatever the hell that is).
APPENDIX
Reviewing the Data: Reevaluate Popularity

[Histogram: score counts in the same buckets (50-99 through 10000+), with annotations of ~8% and ~12% of posts marking two cutoffs in the 200+ and 500+ range]

Note - Only about half the data is shown because IPython was unable to render the histogram, so the data was exported and charted in Excel.
Analyzing the Data: SVC

C value of .1 = 0.7077 accuracy.

[Line chart: accuracy score by C value (0.001, 0.01, 0.1, 1, 10, 50); confusion matrix]
Analyzing the Data: Random Forest

N estimators of 100 = 0.8218 accuracy. Max depth of 200 = 0.8247 accuracy.

[Line charts: accuracy score by n_estimators (5, 10, 20, 50, 100, 125) and by max_depth (40, 100, 150, 200, 250); confusion matrix]
Analyzing the Data: Logistic

C of 1 = .8453 accuracy (C = 1, penalty = L2).

[Line chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); confusion matrix]
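The confusion matrices on these appendix slides can be produced the same way for each model; a sketch with sklearn, assuming the fitted models from the earlier sketches:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, logreg.predict(X_test)))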