Social Media Intelligence: Text and Network Mining Combined
Dr. Rosaria Silipo
rosariasilipo@yahoo.com
Previously on PAW...
PAW San Francisco 2012
Social Media Analysis
"Water, water, everywhere, and not a drop to drink"
Approaches and challenges:
- In-house text mining: sentiment but no relevance
- In-house network mining: relevance but no sentiment
- In-house scorecard: no analytics
- Cloud-based approach: no access to data
Our Goal in Social Media Analysis
- Text mining for sentiment
- Network mining for relevance
- Drill down on special cases
- Analytics for prediction
Case Study: Major European Telco
Very rich new data sources about customers!
Combine:
- Text mining
- Network analysis
- Classic predictive analytics (modeling, clustering, time series, etc.)
Combining with internal data makes the text relevant:
- Include product names/categories, exclude staff members
- Include number of web hits per page...
- Include existing marketing positioning
- Include major campaign information
Case Study Example: Slashdot Data
"News for Nerds, Stuff that Matters"
Basic facts:
- 24,532 users
- 491 threads with 15,843 responses from 12,507 users
- 113,505 posts (text mining on posts)
- 60 main topics
Combining Text and Network Mining
- Network analysis: hub and authority score per user
- Text analysis: attitude level per user
Text mining workflow:
- Remove anonymous users, group by PostID
- Tag words against the MPQA corpus (positive and negative words)
- Bag of words (BoW), standard named-entity filter
- Word frequency, user bins
- Word cloud for selected users
Slashdot Text Mining
- List of negative and positive words (MPQA Opinion Corpus)
- Tag positive and negative words
- Count words in posts
- Aggregate negative and positive counts over users
Results:
- Most positive user: dada21 (2838 positive / 1725 negative words)
- Most negative user: pnutz (43 positive / 109 negative words)
- 16,016 positive users, 7,107 negative users
Which topics do positive users have in common? Government, people, law(s), money, market, parties.
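The per-user attitude computation above can be sketched in plain Python. The word lists below are tiny made-up stand-ins for the MPQA Opinion Corpus, and the example posts are invented; only the counting logic mirrors the slides.

```python
import re
from collections import Counter

# Tiny illustrative stand-ins for the MPQA positive/negative word lists.
POSITIVE = {"good", "great", "useful", "love"}
NEGATIVE = {"bad", "broken", "hate", "useless"}

def attitude_per_user(posts):
    """posts: iterable of (user, text) pairs.
    Returns user -> (positive count, negative count, attitude)."""
    pos, neg = Counter(), Counter()
    for user, text in posts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in POSITIVE:
                pos[user] += 1
            elif word in NEGATIVE:
                neg[user] += 1
    users = set(pos) | set(neg)
    # Attitude here is simply positive minus negative word count.
    return {u: (pos[u], neg[u], pos[u] - neg[u]) for u in users}

posts = [
    ("dada21", "Great idea, I love it."),
    ("pnutz", "Bad and useless. I hate this broken thing."),
]
scores = attitude_per_user(posts)
# scores["dada21"] -> (2, 0, 2); scores["pnutz"] -> (0, 4, -4)
```

On the real data the attitude would be aggregated from all of a user's posts, and a more refined score (e.g. normalizing by post length) is equally possible.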
Network Creation
[diagram: reply network connecting User1 through User6]
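In code terms, the network on this slide is just a directed reply graph. A minimal sketch, assuming (responder, original poster) pairs as input; the user names and the reply list are invented for illustration:

```python
# Edge u -> v means user u responded to a post by user v.
replies = [
    ("User2", "User1"), ("User3", "User1"), ("User4", "User1"),
    ("User5", "User2"), ("User6", "User2"), ("User3", "User4"),
]

def build_reply_graph(replies):
    """Return a node -> set-of-successors adjacency mapping."""
    graph = {}
    for src, dst in replies:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # ensure every user appears as a node
    return graph

graph = build_reply_graph(replies)
# graph["User3"] -> {"User1", "User4"}; User1 never replies: graph["User1"] -> set()
```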
Topic Graphs
Topic Graph: NASA
Hubs & Authorities
- Hubs = followers; authorities = leaders
- Filter anonymous users and create the network
- Use a centrality index to define each user's hub weight and authority weight
- Result: users with hub and authority weights and other features
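Hub and authority weights of this kind come from Kleinberg's HITS algorithm (in the original work they are computed by KNIME's network nodes). As an illustration of the idea, here is a compact power-iteration sketch on a toy graph with invented node names:

```python
def hits(graph, iters=50):
    """Kleinberg's HITS on a node -> set-of-successors graph.
    An edge u -> v raises u's hub score and v's authority score.
    Returns (hub, authority) dicts, each normalized to sum to 1."""
    nodes = list(graph)
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority of v: sum of hub scores of all nodes pointing at v.
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        # Hub of u: sum of authority scores of all nodes u points at.
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {v: s / a_sum for v, s in auth.items()}
        hub = {u: s / h_sum for u, s in hub.items()}
    return hub, auth

# Toy reply graph: three followers all point at one leader.
graph = {
    "follower1": {"leader"},
    "follower2": {"leader"},
    "follower3": {"leader", "other"},
    "leader": set(),
    "other": set(),
}
hub, auth = hits(graph)
# "leader" collects the highest authority score; the followers carry the hub mass.
```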
Hubs & Authorities
[scatter plot of hub vs. authority scores; labeled users: dada21, Carl Bialik from the WSJ, pnutz, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF]
Hubs, Authorities & Attitudes
[plot of hub and authority scores combined with attitude; labeled users: dada21, Carl Bialik from the WSJ, Tube Steak, WebHosting Guy, Catbeller, 99BottlesOfBeerInMyF, Doc Ruby, pnutz]
What we have found...
- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users
Open questions: What identifies each group? How do I identify a new user? How do I handle each user?
User Classification
[authority score histogram and hub score histogram]
How do I define leadership?
Attitude Level Histogram
Defining thresholds on attitude might be easier.
Why Clustering?
- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required
Chosen approach: the k-means algorithm
Normalization
- Before: (authority score, hub score) in [0,1] x [0,1]; attitude level in [-66, 1113]
- All three features are normalized to a common range before clustering
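A plain min-max rescaling is one common way to map the attitude range onto the same [0, 1] interval as the hub and authority scores. The exact transform used in the original workflow is not specified here, so this is only a sketch with example values:

```python
def min_max(values):
    """Linearly rescale a list of numbers onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Example attitude values spanning the slide's observed range [-66, 1113].
attitudes = [-66, 0, 523.5, 1113]
normalized = min_max(attitudes)
# -66 -> 0.0, 1113 -> 1.0, 523.5 -> exactly halfway (0.5)
```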
Authority Score after Normalization
Leadership is now a bit easier to obtain.
Hub Score after Normalization
The follower condition is also more spread out.
Attitude after Normalization
Attitude is now the easiest parameter to identify.
Number of Clusters
Users with a negative attitude are hard to catch!
- k = 30: 10 clusters with more than 1000 users; 2 clusters with a clearly negative attitude (< 0.4)
- k = 20: 5 clusters with more than 1000 users; 2 clusters with a negative attitude (< 0.4)
- k = 10: 2 clusters with more than 5000 users, and no cluster with a negative attitude anymore
Re-sampling the Training Set
k = 10
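In the original workflow the clustering is done by KNIME's k-means node. Purely to illustrate the algorithm itself, here is a minimal Lloyd's-iteration sketch in plain Python on made-up (hub, authority, attitude) vectors:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm; points are equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep empty ones in place).
        new_centers = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Toy normalized (hub, authority, attitude) vectors forming two obvious groups.
points = [(0.1, 0.1, 0.9), (0.2, 0.1, 0.8), (0.9, 0.8, 0.1), (0.8, 0.9, 0.2)]
centers, clusters = kmeans(points, k=2)
# Each cluster collects one of the two groups (two points each).
```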
The k-means Clusters
Additional Discoveries
- There are very few real leaders! Authority and hub scores identify active participants rather than leaders.
- Superfans can be found in cluster_3.
- Negative and (sigh!) active users are collected in cluster_1.
- Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8).
- Positive users with different degrees of activity are scattered across the remaining clusters.
The k-means Clusters
[cluster plot; labeled groups: neutral users, superfans, negative users, fans]
The Operational Workflow
- Pre-processing
- Cluster extraction
- Assignment of new data
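The "assignment of new data" step amounts to finding the nearest cluster center for each new user vector. A sketch with invented center coordinates in the normalized (hub, authority, attitude) space:

```python
def assign(point, centers):
    """Index of the nearest center by squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

# Hypothetical cluster centers, e.g. "neutral/inactive" vs. "active fan".
centers = [(0.1, 0.1, 0.9), (0.8, 0.9, 0.1)]

new_user = (0.7, 0.8, 0.2)
cluster_id = assign(new_user, centers)  # nearest is centers[1]
```

Once a new user is mapped to a cluster, the action defined for that cluster (reward a superfan, intercept a negative user, and so on) can be applied.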
Summary and Conclusions
A full system to:
- Integrate text and network mining
- Find meaningful clusters in terms of attitude and activity
- Define appropriate actions for users in different clusters
- Assign new data to existing clusters
Next Steps
- Integrate topic information
- Integrate user demographic and behavioural information
- Discover time-series patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments
Where do I find more?
Whitepaper: rosariasilipo@yahoo.com
Complete workflows + data (www.knime.com):
- Text mining
- Network mining
- Combined analysis
- Clustering
(Note: the first three workflows process huge data and require 16 GB of memory.)
Open-source software: KNIME (www.knime.com)