Mining Social Media Users' Interests
Presenters: Heng Wang, Man Yuan
April 4, 2016
Agenda
- Introduction to Text Mining
- Tool & Dataset
- Data Pre-processing
- Text Mining on Twitter
- Summary & Future Improvement
Why text mining?
Approximately 90% of the world's data is held in unstructured formats:
- Web pages
- Emails
- Technical documents
- Corporate documents
- Digital libraries
- Customer complaint letters
[Chart: structured numerical or coded information, 10%; unstructured or semi-structured information, 90%]
Why text mining?
Widely used in various fields:
- Marketing
- Political campaigns
- Scientific research
Text vs Data

                            Search (goal-oriented)   Discover (opportunistic)
Structured Data             Data Retrieval           Data Mining
Unstructured Data (Text)    Information Retrieval    Text Mining
Text Mining Challenges
- Unstructured form
- Large textual databases
- High number of possible dimensions
- Sophisticated and subtle relationships
- Noisy data
Text Mining Process
1. Text Pre-processing
2. Feature Generation
3. Feature Selection
4. Text Mining
5. Interpretation of Results
Research Information
Research objective: mining Twitter users' interests
- Find the popular trends among social media users
- Sentiment analysis
- Social network analysis
Dataset: Twitter
Tools: R, Google Refine, Weka
Twitter Dataset Collection
- A collection of records extracted from tweets containing both #hashtags and URLs. Date range: November 2012 (22M rows, 6 attributes). Source: Karissa McKelvey and Filippo Menczer. Truthy: Enabling the Study of Online Social Networks. In Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW), 2013.
- A collection of records extracted from tweets collected directly from Twitter using R. Date range: March 3, 2016 and March 6, 2016 (3,000 records in total). Note: Twitter authentication is required.
Data Processing
- Non-English text removal
- Punctuation and extra space removal
- Word stemming
- Stop word removal
- Upper/lower case normalization
- Noisy data clearance
- Text transformation
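The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' R pipeline: the stop-word list, the suffix-stripping stemmer, and the ASCII-ratio test for "English" are all simplifying assumptions, and the function names are hypothetical.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "for", "my"}


def naive_stem(word):
    """Strip a couple of common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_mostly_ascii(text, threshold=0.9):
    """Rough proxy for 'English': share of ASCII characters in the text."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


def preprocess(tweet):
    """Return a list of cleaned tokens, or [] if the tweet is filtered out."""
    if not is_mostly_ascii(tweet):
        return []                                   # non-English removal (approximate)
    text = tweet.lower()                            # case normalization
    text = re.sub(r"https?://\S+", " ", text)       # drop URLs as noise
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                           # split() also collapses extra spaces
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]
```

For example, `preprocess("Playing   the Android games on my iPhone! http://t.co/abc")` yields `["play", "android", "game", "iphone"]`. A production pipeline would swap in a proper language detector and stemmer.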
Term Frequency
Most frequent words:
- gameinsight
- Android
- Android games
- ipad games
- iphone
- Instagram
- lol
- syria
- Justin Bieber
[Chart: Popular Trend; series: android, androidgames, gameinsight, ipadgames, iphone]
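As a minimal sketch of how such a ranking can be produced, the Python snippet below counts terms across tokenized tweets. The sample documents are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def term_frequencies(documents):
    """Count how often each term occurs across all tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


# Hypothetical tokenized tweets, for illustration only.
docs = [
    ["android", "gameinsight", "androidgames"],
    ["iphone", "gameinsight", "ipadgames"],
    ["android", "gameinsight"],
]
freq = term_frequencies(docs)
top = freq.most_common(2)   # the two most frequent terms with their counts
```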
Term Association: "Android"

gameinsight   android game   now playing
0.45          0.56           0.4

iphone        ipad           amazon
0.36          0.33           0.32
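The slides do not say how these association scores were computed. One common choice, assumed here, is the Pearson correlation between per-document counts of the target term and every other term. The Python sketch below illustrates that idea; the function names and the `min_corr` threshold are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def term_associations(documents, target, min_corr=0.3):
    """Return terms whose per-document counts correlate with `target` at >= min_corr."""
    vocab = sorted({t for doc in documents for t in doc})
    target_vec = [doc.count(target) for doc in documents]
    result = {}
    for term in vocab:
        if term == target:
            continue
        r = pearson(target_vec, [doc.count(term) for doc in documents])
        if r >= min_corr:
            result[term] = round(r, 2)
    return result
```

With the toy corpus `[["android", "gameinsight"], ["android", "gameinsight"], ["iphone"], ["iphone", "amazon"]]`, only `gameinsight` co-occurs with `android`, so it is the only term returned.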
Cluster Analysis
Document clustering is the application of cluster analysis to textual documents for automatic document organization, topic extraction, and fast information retrieval or filtering.
Clustering a set of objects into groups is usually motivated by the aim of identifying internally homogeneous groups according to a specific set of variables.
The starting point of clustering is computing a matrix, called the dissimilarity matrix, which contains information about the dissimilarity of the observed units.
Clustering algorithms:
- Hierarchical
- Partitional
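A dissimilarity matrix can be sketched as follows in Python, assuming each document is already represented as a numeric term-count vector and that Euclidean distance is the dissimilarity measure. The slides do not name the measure, so that choice is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(vectors):
    """Return an n x n matrix whose entry [i][j] is the distance between units i and j."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
```

The matrix is symmetric with zeros on the diagonal, so in practice only one triangle needs to be stored.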
Hierarchical Cluster Analysis: Example (datamining)
Hierarchical clustering builds a hierarchy from the bottom up and does not require the number of clusters to be specified beforehand. Once the hierarchy is built, it is usually represented by a dendrogram-like structure.
The algorithm works as follows:
1. Put each data point in its own cluster.
2. Identify the closest two clusters and combine them into one.
3. Repeat the above step until all the data points are in a single cluster.
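The three steps above can be sketched in Python as a naive agglomerative procedure. The slides do not say how the distance between two clusters is measured; single linkage (distance between the closest pair of members) with Euclidean distance is assumed here, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2, points):
    """Cluster distance = distance between the closest pair of members (an assumption)."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points):
    """Merge clusters bottom-up; return the merge history as (left, right, distance)."""
    clusters = [[i] for i in range(len(points))]       # step 1: one cluster per point
    merges = []
    while len(clusters) > 1:                           # step 3: repeat until one cluster
        best = None
        for a in range(len(clusters)):                 # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```

The returned merge history holds the information a dendrogram draws: which clusters were joined and at what distance. This brute-force search is only practical for small inputs.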
K-means Cluster Analysis
K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters.
K-means algorithm:
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
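A minimal Python sketch of this procedure follows. It alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its members. Taking the first K points as initial centroids is an assumption made to keep the example deterministic; real implementations usually use random or smarter initialization.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Return (assignment, centroids) for a list of numeric tuples."""
    centroids = [tuple(p) for p in points[:k]]     # assumption: first k points as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:           # converged: nothing changed
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids
```

For instance, `kmeans([(0, 0), (10, 10), (0, 1), (10, 11)], 2)` groups the first and third points together and the second and fourth together.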
K-means Cluster Analysis: Example (datamining)
Social Network Analysis (1)
Social network analysis is the process of investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them.
As people use Twitter, they form networks by following, replying to, and mentioning one another. These connections are visible in the text of each tweet, or can be obtained by requesting from Twitter the list of users who follow the author of each tweet.
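As a rough illustration of reading such connections from tweet text, the Python sketch below builds a directed mention graph from (author, text) pairs and counts how many distinct users mention each account. The input format and function names are assumptions, and the `@(\w+)` pattern is a simplification of Twitter's handle rules.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")   # simplified handle pattern (an assumption)


def build_mention_graph(tweets):
    """tweets: iterable of (author, text). Returns author -> {mentioned_user: count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            graph[author.lower()][mentioned.lower()] += 1   # edge: author -> mentioned
    return graph


def in_degree(graph):
    """Number of distinct authors who mention each user."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)
```

Users with a high in-degree are mentioned by many others, which is one simple indicator of prominence in the network.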
Social Network Analysis (2)
Sentiment Analysis
Sentiment analysis is an area of research that investigates people's opinions towards different matters: products, events, organisations.
It provides information for understanding collective human behaviour and is valuable to commercial interests.
Asur and Huberman (2012) used Twitter analytics to predict the amount of opening-weekend ticket sales for movies with 97.3% accuracy.
Sentiment Analysis Approach
The two main methods of sentiment analysis, the lexicon-based method (unsupervised approach) and the machine-learning-based method (supervised approach), both rely on the bag-of-words model.
- Machine learning (supervised) method: uses unigrams or their combinations (N-grams) as features.
- Lexicon-based method: unigrams found in the lexicon are assigned a polarity score; the overall polarity score of the text is then computed as the sum of the polarities of the unigrams. The average score normalizes this sum by the number of matched unigrams:
\text{Score}_{\text{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i
where w_i is the polarity of the i-th unigram found in the lexicon and n is the number of such unigrams.
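The lexicon-based scoring can be sketched in Python as below. The lexicon and its polarity values are invented for illustration; a real analysis would use a published sentiment lexicon and proper tokenization.

```python
# Hypothetical polarity lexicon, for illustration only.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}


def lexicon_scores(text, lexicon=LEXICON):
    """Return (sum, average) of the polarities of unigrams found in the lexicon."""
    tokens = text.lower().split()                        # bag-of-words: order is ignored
    polarities = [lexicon[t] for t in tokens if t in lexicon]
    total = sum(polarities)
    average = total / len(polarities) if polarities else 0.0
    return total, average
```

For the text "I love the great prices but hate the lines", the matched unigrams are love (+2), great (+2), and hate (-2), giving a sum of 2 and an average of 2/3, which is mildly positive.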
Sentiment Analysis: Walmart Example
A collection of records extracted from tweets collected directly from Twitter with the keyword "Walmart". Date: March 3, 2016 (2,500 records).
Project Summary & Future Work
By mining a sample of tweets, we identified the popular trends and hot topics on Twitter within the given period.
Social network analysis and sentiment analysis show that social media plays an important role in rating the performance of commercial services and in uncovering the relationships between terms.
In the future, some deeper work needs to be done, such as improving the accuracy of the document classifiers, expanding the volume of social media data, and finding the reasons behind the observed sentiment.
Q&A