Topic Classification in Social Media using Metadata from Hyperlinked Objects

Similar documents
From Online Community Data to RDF

Query Expansion using Wikipedia and DBpedia

Linking Entities in Chinese Queries to Knowledge Graph

Using Linked Data to Build Open, Collaborative Recommender Systems

Porting Social Media Contributions with SIOC

Zurich Open Repository and Archive. Private Cross-page Movie Recommendations with the Firefox add-on OMORE

Using Linked Data to Reduce Learning Latency for e-book Readers

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Automatic Domain Partitioning for Multi-Domain Learning

Automated Tagging for Online Q&A Forums

Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Weaving SIOC into the Web of Linked Data

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

An Empirical Study of Lazy Multilabel Classification Algorithms

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS

Analyzing Aggregated Semantics-enabled User Modeling on Google+ and Twitter for Personalized Link Recommendations

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

Query Disambiguation from Web Search Logs

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Social Networks and Data Portability using Semantic Web technologies

Ning Frequently Asked Questions

Theme Identification in RDF Graphs

Kristina Lerman University of Southern California. This lecture is partly based on slides prepared by Anon Plangprasopchok

Information Retrieval

A Content Vector Model for Text Classification

Prof. Dr. Christian Bizer

Browser-Oriented Universal Cross-Site Recommendation and Explanation based on User Browsing Logs

RSDC 09: Tag Recommendation Using Keywords and Association Rules

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Individualized Error Estimation for Classification and Regression Models

DATA MINING II - 1DL460. Spring 2014"

DBpedia-An Advancement Towards Content Extraction From Wikipedia

Quality-Based Automatic Classification for Presentation Slides

W3C Workshop on the Future of Social Networking, January 2009, Barcelona

Exploiting Index Pruning Methods for Clustering XML Collections

Open Source Software Recommendations Using Github

A Supervised Method for Multi-keyword Web Crawling on Web Forums

User Guide Written By Yasser EL-Manzalawy

History and Backgound: Internet & Web 2.0

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web

A rule-based approach to address semantic accuracy problems on Linked Data

Chapter 27 Introduction to Information Retrieval and Web Search

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Collaborative Filtering using a Spreading Activation Approach

Knowledge Representation in Social Context. CS227 Spring 2011

A Privacy Preference Ontology (PPO) for Linked Data

A Survey on Information Extraction in Web Searches Using Web Services

Similarity search in multimedia databases

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

ResPubliQA 2010

Papers for comprehensive viva-voce

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Chapter 6: Information Retrieval and Web Search. An introduction

Exploring Dynamics and Semantics of User Interests for User Modeling on Twitter for Link Recommendations

How to Publish Linked Data on the Web - Proposal for a Half-day Tutorial at ISWC2008

A P2P-based Incremental Web Ranking Algorithm

DATA MINING II - 1DL460. Spring 2017

Predict the box office of US movies

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

Wikipedia is not the sum of all human knowledge: do we need a wiki for open data?

A Time Decoupling Approach for Studying Forum Dynamics

Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University

Finding Relevant Relations in Relevant Documents

Hyperdata: Update APIs for RDF Data Sources (Vision Paper)

Feature Selection for an n-gram Approach to Web Page Genre Classification

An Approach To Web Content Mining

Module 1: Internet Basics for Web Development (II)

Influence of Word Normalization on Text Classification

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Volume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies

Sindice Widgets: Lightweight embedding of Semantic Web capabilities into existing user applications.

Network community detection with edge classifiers trained on LFR graphs

What is this Song About?: Identification of Keywords in Bollywood Lyrics

FORECASTING INITIAL POPULARITY OF JUST-UPLOADED USER-GENERATED VIDEOS. Changsha Ma, Zhisheng Yan and Chang Wen Chen

Towards an Interlinked Semantic Wiki Farm

16th International World Wide Web Conference Developers Track, May 11, DBpedia. Querying Wikipedia like a Database

Entity and Knowledge Base-oriented Information Retrieval

A Tagging Approach to Ontology Mapping

Improving the methods of classification based on words ontology

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web?

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

Untangling the semantic web:

Digital Marketing Proposal

Creating a Classifier for a Focused Web Crawler

Disambiguating Entities Referred by Web Endpoints using Tree Ensembles

Online music is constantly changing. Static artist bios and old images from the archives no longer deliver an engaging experience.

The role of vocabularies for estimating carbon footprint for food recipies using Linked Open Data

UNIVERSITY OF BOLTON WEB PUBLISHER GUIDE JUNE 2016 / VERSION 1.0

A Feature Selection Method to Handle Imbalanced Data in Text Classification

Social Tagging. Kristina Lerman. USC Information Sciences Institute Thanks to Anon Plangprasopchok for providing material for this lecture.

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

Identifying Important Communications

Transcription:

Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University of Ireland, Galway {firstname.lastname}@deri.org 2 School of Engineering and Informatics, National University of Ireland, Galway john.breslin@nuigalway.ie Abstract. Social media presents unique challenges for topic classification, including the brevity of posts, the informal nature of conversations, and the frequent reliance on external hyperlinks to give context to a conversation. In this paper we investigate the usefulness of these external hyperlinks for determining the topic of an individual post. We focus specifically on hyperlinks to objects which have related metadata available on the Web, including Amazon products and YouTube videos. Our experiments show that the inclusion of metadata from hyperlinked objects in addition to the original post content improved classifier performance measured with the F-score from 84% to 9%. Further, even classification based on object metadata alone outperforms classification based on the original post content. 1 Introduction Social media such as blogs, message boards, micro-blogging services and socialnetworking sites have grown significantly in popularity in recent years. By lowering the barriers to online communication, social media enables users to easily access and share content, news, opinions and information in general. However navigating this wealth of information can be problematic due to the fact that contributions are often much shorter than a typical Web documents, and the quality of content in social media is highly variable [1]. A potential source of context to conversations in social media is the hyperlinks which are frequently posted to refer to related information. These objects are often an integral part of the conversation. For example, a poster may recommend a book by posting a link to a webpage where you can buy the book, rather than employing the traditional method of providing the title and the name of the author. Yet there still remains the question of identifying which snippets of content from these links are most relevant to the conversation at hand. Recently, there has been a trend towards publishing structured information on the Web, resulting in an increasing amount of rich metadata associated with Web resources. Facebook have launched their Open Graph Protocol 3, which The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/8/CE/I138 (Lion-2). 3 http://opengraphprotocol.org/ visited January 211

allows external site owners to markup their content using Facebook-defined schemas, such that these enriched content items can then be used for metadata import into news feeds, profiles, etc. The Linking Open Data [5] project is a community effort to make structured datasets from diverse domains available on the Web, some of which we use as data sources in this work. Such rich representations of objects can be a useful resource for information retrieval and machine learning. In social media in particular, structured data from hyperlinked websites can provide important context in an otherwise chaotic domain. In this paper, we investigate the potential of improving topic classification in social media by using metadata retrieved from external hyperlinks in usergenerated posts 4. We compare the results of topic classification based on post content, based on metadata from external websites, and based on a combination of the two. The usage of structured metadata allows us to include only specific, relevant external data. Our experiments show that incorporating metadata from hyperlinks can significantly improve the task of topic classification in social media posts. Our approach can be applied to recommend an appropriate forum in which to post a new message, or to aid the navigation of existing, uncategorised posts. Related work has been carried out in the field of Web document classification, proving that the classification of webpages can be boosted by taking into account the text of neighbouring webpages ([2], [9]). This work differs in that we incorporate not entire webpages but only metadata which is directly related to objects discussed in a post. There has also been previous work using tags and other textual features to classify social media. Our work is closely related to that of Figueiredo et al. [6], who assess the quality of various textual features in Web 2. sites such as YouTube for classifying objects within that site. Berendt and Hanser [4] investigated automatic domain classification of blog posts with different combinations of body, tags and title. Sun et al. [1] showed that blog topic classification can be improved by including tags and descriptions. Our paper differs from these because we use metadata from objects on the Web to describe a post which links to those objects. Thus our approach combines the usage of neighbouring pages in Web search, and of metadata in social media search. 2 Data Corpus Our analysis uses the message board corpus from the boards.ie SIOC Data Competition 5 which was held in 8 and covers ten years of discussions. Each post belongs to a thread, or conversation, and each thread belongs to a forum. A forum typically covers one particular area of interest, and may contain sub-forums which are more specialised (for example, Music and Hip-Hop). Our analysis uses a subset of the forums and is limited to the posts contained in the final year of the dataset, because these are most likely to have structured data. We examined the posts in the dataset which contained hyperlinks and identified potential sources of structured metadata. These are websites which publish 4 A post is a message which a user submits as part of a conversation 5 http://data.sioc-project.org/ visited January 211

metadata about objects, such as videos (YouTube) or products (Amazon), and make it available via an API or as Linked Data. In some cases the metadata is published by external sources, e.g. DBpedia [3] provides a structured representation of Wikipedia. The sources on which our study focuses are listed in Table 1, along with the available metadata that we extracted. We consider the first paragraph of a Wikipedia article as a description. For this study, we focus on the most commonly available metadata, but in the future we plan to make use of additional information such as movie directors, book authors and music albums. Website Object type Title Description Category Tags Amazon X X X Flickr X X X IMDB X X MySpace Music Artist X X Wikipedia Article X X X YouTube X X X X Table 1. External websites and the metadata types used in our experiments. Amazon, Flickr and YouTube metadata was retrieved from the respective APIs. We obtained MySpace music artist information from DBTune 6 (an RDF wrapper of various musical sources including MySpace), IMDB movie information from LinkedMDB 7 (an export of IMDB data) and Wikipedia article information from DBpedia 8. The latter three services are part of the Linking Open Data project [5]. The data collection process is described in more detail in [8]. Two groups of forums were chosen for classification experiments - the ten rather general forums shown in Figure 1(a), which we refer to as General, and the five more specific and closely related music forums shown in Figure 1(b), Music. These forums were chosen since they have good coverage in the dataset and they each have a clear topic (as opposed to general chat forums). The percentage of posts in General that have hyperlinks varies between forums, from 4% in Poker to 14% in Musicians, with an average of 8% across forums. Of the posts with hyperlinks, 23% link to an object with available structured data. The number of posts containing links to each type of object is shown in Figure 1. For General, hyperlinks to Music Artists occur mainly in the Musicians forum, s in the Films forum, and s in the graphy forum. The other object types are spread more evenly between the remaining seven forums. Note that in rare cases, a post contains links to multiple object types, in which case that post is included twice in a column. Therefore the total counts in Figure 1 are inflated by approximately 1%. In total, General contains 6,626 posts and Music contains 1,564 posts. 6 http://dbtune.org/ visited January 211 7 http://linkedmdb.org/ visited January 211 8 http://dbpedia.org/ visited January 211

14 1 1 8 6 14 1 1 8 6 Music Artist Article Music Artist Article 6 5 4 3 6 5 4 Music Artist 3 Article Music Artist Article 4 4 1 1 (a) (b) Fig. 1. Number of posts containing each type of hyperlinked object for (a) 1 General forums and (b) 5 Music forums 3 Experiments We evaluated the usefulness of various textual representations of message board posts for classification, where the class of a post is derived from the title of the forum in which it was posted. For each post, we derive four different bag-of-words representations, in order to compare their usefulness as sources of features for topic classification. C-L denotes the original content of the post, with all hyperlinks removed, while C denotes the full original content with hyperlinks intact. M denotes the external metadata retrieved from the hyperlinks of the post. C+M denotes the combination of C and M for a given post, i.e. the full original content plus the external metadata retrieved from its hyperlinks. For the 23% of posts which have a title, this is included as part of the content. The average number of unique terms was 38 for post content, and 2 for associated metadata. At present we simply concatenate the text of the metadata values rather than considering the metadata key-value pairs, however it would be interesting to weight the metadata based on which keys provide the most useful descriptors for classification. Classification of documents was performed using the Multinomial Naïve Bayes classifier implemented in Weka [7]. For each textual representation of each post the following transformations are performed. All text is lower-cased and nonalphabetic characters are replaced with spaces. Stopwords are removed and TF- IDF and document length normalisation are applied. Ten-fold cross validation was used to assess classifier performance on each type of document representation. To avoid duplication of post content due to

one post quoting another, the data was split by thread so that posts from one thread do not occur in separate folds. Duplication of hyperlinks across splits was also disallowed, so the metadata of an object cannot occur in multiple folds. These restrictions resulted in the omission of approximately 11% of the posts in each forum group. The same ten folds are used to test C-L, C, M and C+M. The results of the classification experiments are shown in Table 2. We performed a paired t-test at the.5 level over the results of the cross-validation in order to assess statistical significance. For General, all differences in results are significant. Simply using the content without hyperlinks (C-L) gives quite good results, but incorporating URL information from hyperlinks (C) improves the results. Using only metadata from hyperlinked objects (M) gives improved performance, and combining post content with metadata from hyperlinks (C+M) gives the best performance. For Music, the scores in general are lower, likely due to the higher similarity of the topics. Here, the difference between M and C+M is not significant, but all other differences are significant. When a post has a hyperlink, object metadata, with or without post content, provides a better description of the topic of a post than the original post content. Dataset C-L C M C+M General.783.838.858.92 Music.663.694.78.83 Table 2. Micro-averaged F-scores achieved by classifier on each dataset Detailed results for General are shown in Table 3, arranged in order of descending F-score for C+M. It can be seen that there is a large variation in classification performance across different forums. For example, classification of a post in the Musicians forum is trivial, since almost all posts feature a link to MySpace. However classification of a Television forum post is more challenging, because this forum can also cover any topic which is televised. Despite the variation between topics, C+M always results in the best performance, although not all of these results are statistically significant. 4 Conclusion We investigated the potential of using metadata from external objects for classifying the topic of posts in online forums. To perform this study, we augmented a message board corpus with metadata associated with hyperlinks from posts. Our experiments reveal that this external metadata has better descriptive power for topic classification than the original posts, and that a combination of both gives best results. We conclude that for those posts that contain hyperlinks for which structured data is available, the external metadata can provide valuable features for topic classification. We plan to continue this work by comparing the usefulness of object titles, descriptions, categories and tags as features for improving topic

Forum C-L C M C+M Musicians.948.976.91.98 graphy.777.918.895.948 Soccer.812.831.99.949 Martial Arts.775.81.877.914 Motors.751.785.865.97 Films.744.831.844.88 Politics.786.81.89.844 Poker.662.739.794.838 Atheism.8.824.771.829 Television.595.634.74.736 Macro-Averaged.765.815.837.883 Table 3. F-score achieved by classifier on each General forum classification. Thus not only textual metadata content will be considered, but also the relationships linking them to the original object. The growing amount of structured data on the Web means that this type of semantically-rich information can be retrieved from a significant and increasing proportion of hyperlinks. Potential applications of the approach include suggesting appropriate forums for new posts, and categorising existing posts for improved browsing and navigation. The enhanced representation of a post as content plus hyperlink metadata also has potential for improving search in social media. References 1. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality content in social media. In: Proceedings of WSDM, 8 2. Angelova, R., Weikum, G.: Graph-based text classification: Learn from your neighbors. In: Proceedings of SIGIR, 6 3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: A nucleus for a web of open data. In: Proceedings of ISWC, 7 4. Berendt, B., Hanser, C.: Tags are not metadata, but just more content to some people. In: Proceedings of ICWSM, 7 5. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The story so far. International Journal on Semantic Web and Information Systems 5(3) (9) 6. Figueiredo, F., Belém, F., Pinto, H., Almeida, J., Gonçalves, M., Fernandes, D., Moura, E., Cristo, M.: Evidence of quality of textual features on the Web 2.. In: Proceedings of CIKM, 9 7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. ACM SIGKDD Exp. 11(1) (9) 8. Kinsella, S., Passant, A., Breslin, J.G.: Using hyperlinks to enrich message board content with Linked Data. In: Proceedings of I-SEMANTICS, 21 9. Qi, X., Davison, B.: Classifiers without borders: Incorporating fielded text from neighboring web pages. In: Proceedings of SIGIR, 8 1. Sun, A., Suryanto, M., Liu, Y.: Blog classification using tags: An empirical study. In: Proceedings of ICADL, 7