Detecting Blog Spam Hashtags Using Topic Modeling


Yoonjin Hyun
Ph.D. Candidate, Graduate School of Business Information Technology, Kookmin University
77 Jeongneung-ro, Seongbuk-gu, Seoul, 02707, Korea

Namgyu Kim
Associate Professor, School of Management Information Systems, Kookmin University
77 Jeongneung-ro, Seongbuk-gu, Seoul, 02707, Korea

ABSTRACT
Tremendous amounts of data are generated daily. Accordingly, unstructured text data distributed through news, blogs, and social media has gained much attention from researchers because it contains abundant information about consumers' opinions. However, as the usefulness of text data increases, attempts to gain profit by distorting text data, maliciously or non-maliciously, are also increasing. Various spam detection techniques have therefore been studied to prevent the side effects of spamming. The most representative studies address e-mail spam detection, web spam detection, and opinion spam detection. Spam is recognized on the basis of three characteristics and actions: (1) if a certain user is recognized as a spammer, then all content created by that user is recognized as spam; (2) if certain content is exposed to other users regardless of those users' intention, then the content is recognized as spam; and (3) any content that contains malicious or non-malicious false information is recognized as spam. Many studies have addressed type (1) and type (2) spamming by analyzing metadata such as user networks and spam words. For type (3), however, relatively few studies have been conducted because it is difficult to determine the veracity of a given word or piece of information. In this study, we regard a hashtag that is irrelevant to the content of a blog post as spam and devise a methodology to detect such spam hashtags.
CCS Concepts
Information Systems → World Wide Web → Web searching and information discovery → Web search engines → Spam detection

Keywords
Text Mining; Topic Modeling; Hash Tag Spam; Spam Detection

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICEC '16, August 17-19, 2016, Suwon, Republic of Korea
Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM /16/08 $15.00 DOI:

1. INTRODUCTION
With the growth of the Internet and the popularization of smart devices, tremendous amounts of data are generated daily. The generation of big data and its importance have been addressed in various publications, such as The Economist (2011) [1], McKinsey (2011) [2], and Gartner (2011) [3]. Accordingly, demand for and interest in big data analysis remain high. In particular, unstructured text data generated through news, blogs, and social media has gained the attention of many researchers because of the trove of information it contains about real-time consumer opinions and behavior. However, as the use of text data becomes more popular and its applicability is extended, attempts to achieve specific effects by distorting text data, maliciously or non-maliciously, are increasing. The growing volume of spam text not only causes trouble for users who want useful information but also reduces data reliability. Therefore, research on solving the spam problem is important.
Spam can be classified into three types: e-mail, social network service (SNS), and blog spam. E-mail spam is also called junk mail or bulk mail. These unwanted commercial e-mails are transmitted to random addresses that have been collected from community sites or bulletin boards or created by randomly combining words or numbers. This spam not only interrupts the process of searching e-mails but can also overload the receiver's server(s).

SNS spam is usually distributed using mentions or hashtags. However, spammers have become smarter, and their strategies have evolved. Spam is now distributed through coordinated posting behavior among spammers, generated en masse and in many variations by finite-state machine-based spam templates, or exposed passively when a user searches for a specific keyword [4]. SNS spam creates a lot of trouble for online users who want a safe and free environment for communication that excludes unwanted advertising posts. In addition, many users are exposed to an unspecified number of unwanted advertising posts through unwanted following by spammers. To address these problems, Twitter, a popular SNS, introduced a system whereby users can directly report spam, such as tweeting malicious links, sending unsolicited messages to legitimate users, and hijacking trending topics. When Twitter receives a spam report, the offending account is temporarily suspended. Recently, Facebook changed its newsfeed ranking algorithm to screen spam articles by calculating the time taken to read articles, since some article titles do not reflect the actual content or contain phishing attempts. These cases are indicative of the damage caused by spam.

Finally, blog spam is created by posting fake articles about products and services or by using unrelated hashtags to intentionally expose an article to other random users.
In this case, not only do users face difficulties in searching for their desired information but also the reliability of the relevant blog is reduced.

Spam detection has been studied for a long time in an effort to prevent the various side effects of spam activity. The most representative studies address opinion spamming, opinion spam detection, e-mail spam detection, and web spam detection [5]. However, the criterion that defines spam differs from study to study. There are three criteria: (1) if a certain user is recognized as a spammer through his/her identity or usage pattern, then all content created or posted by that user is recognized as spam; (2) if certain content is exposed to other users regardless of those users' intention, then the content is recognized as spam (e.g., e-mail spam and fake links); and (3) any content that contains malicious or non-malicious false information and is written contrary to fact is recognized as spam.

For case (1), studies using SNS data have been actively conducted on the basis of users' access patterns and demographic information [6, 7, 8, 9, 10], and several applications to block spam have been developed and are commercially available. Case (2) covers the classic field of e-mail spam detection, for e-mail spam transmitted to unspecified individuals, as well as SNS spam spread through embedded URLs [11]. For cases (1) and (2), many studies have detected spam by taking advantage of metadata rather than the content itself. However, studies of case (3) are insufficient because this case requires going through the content to determine whether or not it is spam.

In hashtag spamming, for example, a hashtag that is irrelevant to the content is used to attract other users. Suppose that an irrelevant hashtag is attached to a particular post. Even though neither the hashtag nor the post is spam in itself, the hashtag is likely to be spam when it is attached to that post. The details are illustrated in Figure 1.

Figure 1. Example of Hashtag Spam

Figure 1 shows the posts relevant to (a) Health and (b) Movie and their hashtags.
Post (a) is tagged with the Exercise and Diet hashtags, whereas post (b) is tagged with the Avengers, Black Widow, and Scarlett Johansson hashtags. None of these posts and hashtags are recognized as spam because the hashtags match the posts. However, if the Diet hashtag of post (a) is attached to post (d) Movie, which is totally unrelated, then the whole post can be considered spam even though the content of the post is not. Likewise, if the Scarlett Johansson hashtag of post (b) is attached to the unrelated post (c) Health, then the whole health-related post can be recognized as spam. In this type of spamming, spam cannot be detected using the content of the post alone; some or all of the hashtags must also be considered. Consequently, the problem cannot be solved using the methods mentioned for cases (1) and (2).

Therefore, this study conducts a content analysis to detect abnormal connections between posts and hashtags. Specifically, the topics of the content are identified through topic modeling, and the set of hashtags that is used above a certain level for each topic is then derived. If a hashtag does not belong to the derived set, then this hashtag is likely to be spam.

The remainder of this paper is organized as follows. The next section introduces related work on topic modeling and spam detection. Section 3 describes the proposed methodology in detail. Section 4 presents the concluding remarks, including the plan for future experiments, contributions, and limitations to overcome.

2. Related Works

2.1 Topic Modeling
Traditionally, data mining has been used to extract new knowledge through structured data analysis [12]. Recently, large volumes of unstructured text data have become available through various channels, such as news articles, blogs, and social media systems.
Thus, text mining has been studied as a way to discover new knowledge and useful patterns in this unstructured text data. Text mining plays an important role in many fields that use text data. It is a comprehensive technique used in information extraction, information retrieval, natural language processing, topic tracking, text summarization, and categorization [13, 14]. Thus, in addition to the traditional problems [15, 16, 17], there is scope for applying these techniques to more diverse topics. Among the various contemporary text mining-related applications, topic modeling is the most actively utilized and shows tangible results in many areas.

In topic modeling, a document, comprising its title, abstract or summary, content, and comments, is the minimum unit of analysis. Topic modeling groups a large number of documents on the basis of their similarity and describes each group through representative keywords. The keywords of each document are selected according to term frequency. Depending on the purpose, term frequency can be measured using a binary model, a three-value model, or TF-IDF (Term Frequency-Inverse Document Frequency) [18]. A document can belong to multiple topic groups simultaneously, which is a feature that differentiates this method from traditional cluster analysis.

Recently, various attempts have been made to solve problems in different areas using topic modeling. Following the extraction of a large number of issues, Kim et al. (2014) [19] and Hyun et al. (2015) [20] derived the main issues using a clustering method; however, the results vary in accordance with different perspectives. Choi et al. (2015) [21] showed that a recommendation system can be improved by using topic modeling to analyze users' interests.
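As an illustration of the term-weighting options mentioned in this subsection, the following minimal sketch computes binary weights and TF-IDF weights for a toy corpus. The documents, tokens, and function names are hypothetical, and the TF-IDF variant used here (raw term count multiplied by log(N / df)) is only one common formulation; the three-value model is not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = {
    "d1": ["beach", "flight", "hotel", "beach"],
    "d2": ["flight", "movie", "actor"],
    "d3": ["movie", "actor", "movie", "trailer"],
}

def binary_weights(tokens):
    """Binary model: weight 1 for every term that occurs, regardless of count."""
    return {term: 1 for term in set(tokens)}

def tf_idf_weights(corpus):
    """TF-IDF with raw term frequency and idf = log(N / document frequency)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in corpus.items():
        term_freq = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights

weights = tf_idf_weights(docs)
# Terms sorted by weight give candidate keywords for a document.
top_d1 = sorted(weights["d1"], key=weights["d1"].get, reverse=True)
```

In this sketch a term that occurs often in one document but in few other documents receives a high weight, which is why such terms are natural candidates for representative keywords.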
Other related studies include customer segmentation based on users' issues of interest through topic modeling [22], public opinion analysis of science and technology issues [23], and analysis of the dynamic mutation process of issues [24]. In this study, topic modeling is utilized in content analysis for content-based spam detection.

2.2 Spam Detection
To overcome the risk of spam in the unprotected Internet world, spam detection has been a topic of study for a long time. The three types of spam detection most studied are e-mail, web, and opinion spam.

Because e-mail spam shows little variation in structure, many e-mail spam-related studies have achieved relatively good results, and various methods and applications to block spam have been commercialized. The most representative methodology is spam filtering using the Bayesian approach [26, 27]. Recently, many e-mail spam studies have utilized the Enron e-mail dataset, which comes from Enron, a U.S. company involved in a massive accounting fraud scandal in 2001 [27, 28].

Web spam can be classified into two types: link spam and content spam [5]. Many attempts have been made to prevent web spam. For example, the TrustRank algorithm was proposed to compute trust scores, where good pages are given higher scores. Based on the calculated trust scores, spam pages can be filtered out during the search engine process [29]. Moreover, spam mass estimation was proposed to identify link spamming based on web link structures [30]. Web spam detection through content analysis of web pages has also been introduced [31]. However, due to the fluid nature and massive volume of web spam, its detection remains problematic.

Unlike e-mail and web spam, opinion spam has been studied relatively little. Opinion spam aims to achieve a specific purpose by spreading fake opinions about social and political issues or fake reviews of certain products or services. Opinion spam detection is far more difficult because it requires a manual review to make a final determination of spam. In the case of fake product reviews, detection is very difficult because it is impossible to confirm whether the user actually used the product. Thus, opinion spam detection still faces numerous challenges [5].

In recent years, SNS and blog spam have also been studied.
The most representative studies identify spam based on social attributes [7], detect abnormal behaviors [6], detect spam using users' account or message network structure [8, 9, 10], and detect spam by classifying tweet-embedded URLs [11]. In particular, hashtag spam detection draws much attention from researchers and practitioners. Hashtags are used to share content with unspecified individuals and to expose the content to groups of people who are interested. However, because it is easy to share information through hashtags, they are also more likely to be used as spam vectors. Therefore, attempts are being made to prevent hashtag spam or hashtag hijacking. For example, several studies analyze the types of hashtag hijacking attacks [32] and investigate the effect of spam on hashtag recommendations for tweets [33].

Nevertheless, research on hashtag spam detection remains insufficient, and most studies focus on Twitter to take advantage of its metadata, such as user accounts, follow relationships, content, and network structure. Twitter data has many disadvantages. Foremost, it is very simple and short because the content is limited to 140 characters. Furthermore, Twitter data contains significant noise, i.e., garbage data that makes content analysis quite difficult. Thus, methodologies that rely on Twitter metadata can hardly be applied to other content sources. In this study, a methodology is introduced to detect hashtag spam based on blog content analysis without metadata.

3. Contents-Based Blog Hashtag Spam Detection

3.1 Research Overview
In this section, the methodology of content-based blog hashtag spam detection is explained in detail. The term hashtag spam refers to a hashtag that does not fit the topic of the document to which it has been attached. Simply put, the hashtag does not fit the theme of the content. The blog data used in this study is assumed not to contain spam.
Figure 2 illustrates an overview of the entire process. In Figure 2, the cylinders represent the data sources (blog data, document text data, and hashtag information), the rectangles represent the main analysis processes, and the dotted-line boxes represent the outputs. There are six main processes: (1) blog data is separated into content (document text) and hashtag information; (2) topic modeling is performed on the document text to analyze the content and assign topic clusters; and (3) spam hashtags are added randomly to the existing hashtag information, and the hashtag table for each document is derived. By analyzing hashtag frequency using the results of steps (2) and (3), the valid hashtag lists for each topic and each document are derived in steps (4) and (5). Finally, in step (6), spam hashtags are detected by comparing the results of steps (3) and (5), and the detection is verified using the F-score. The whole process is explained further through examples in the following subsections.

Figure 2. Research Overview

3.2 Contents Analysis using Topic Modeling
In this subsection, processes (1) and (2) in Figure 2 are explained in detail. First, the blog data was refined to make it appropriate for analysis, and databases of (i) document text data and (ii) hashtag information were constructed. The fields of database (i) are document number, date, title, and content, whereas the fields of database (ii) are document number and hashtag. After topic modeling was performed on database (i), the document clusters for each topic were derived. Because the topic modeling process follows the general methodology, it is not described in this paper.

Figure 3 shows a virtual example of topic modeling for five documents, d1 to d5. In Figure 3, two topics were derived: travel and movie. Documents d1 to d3 were grouped under travel, and documents d2 to d5 were grouped under movie. Thus, d2 and d3 belong to travel and movie simultaneously, because one document can belong to multiple topic groups in topic modeling, unlike in clustering.

3.3 Extracting Valid Hashtag list
This subsection explains the method for extracting valid hashtag lists for each topic and document using the outcomes of the previous subsection. It covers processes (3) to (5) in Figure 2. Prior to extracting the valid hashtags, spam hashtags are added to the existing hashtag information. When collecting blog data, it is very difficult to distinguish spam from non-spam, and it is possible that no spam is collected at all. Therefore, the blog data was assumed to contain no spam, and the hashtag list for each document was derived by inserting spam hashtags randomly.

Figure 3. Example of Inserting Spam Hashtag

Figure 3 is a diagrammatic representation of the random insertion of spam hashtags. Spam 1 was added to the hashtag list of d1, and spam 2 was added to the hashtag list of d5. After the spam hashtags were inserted, the valid hashtags for each topic were derived through a frequency analysis of the hashtag list for each document. If the frequency of a hashtag is above a certain threshold in the document cluster for a specific topic, then this hashtag is selected as a valid hashtag for that topic. Figure 4 shows the virtual valid hashtag lists for each topic.

Figure 4. The process of Extracting Valid Hashtag for each Topic

Results (b) and (c) in Figure 4 list the hashtags that are used in at least half of the documents in the document cluster for each topic. Specifically, the hashtags,,, and were derived as valid hashtags of T1 because they were used in two or more documents in document cluster T1. Likewise,,, Mission Impossible, and were derived as valid hashtags because they were used in two documents in document cluster T2, which has four documents. After the valid hashtag list for each topic is derived, the list is used as a criterion to derive the valid hashtag list of each document. In other words, the valid hashtag list of a specific topic comprises the valid hashtag lists for all the documents in that topic. Table 1 shows the virtual result of extracting valid hashtag lists for each document.

Table 1. The Virtual Result of Extracting Valid Hashtag list for each Doc.
Doc. No   V_tag
d1        {,, }
d2        {,, }
d3        {,, }
d4        {, }
d5        {, }

3.4 Detecting Hashtag Spam
Finally, in process (6) of Figure 2, the spam hashtags are identified using the outcomes of the previous subsection. Spam hashtags are detected through a comparative analysis of two hashtag lists for each document: the list that includes the inserted spam hashtags and the valid hashtag list. Table 2 shows the virtual result of detecting spam hashtags.
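Processes (4) to (6) and the F-score verification can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the topic clusters are taken as given (the output of topic modeling), the hashtag names other than spam1, spam2, and Movie are hypothetical stand-ins, the threshold is assumed to be "used in at least half of the documents of the cluster", and a hashtag of a document is assumed to be valid when it appears in the valid list of at least one topic the document belongs to.

```python
from collections import Counter

# Assumed output of topic modeling: documents per topic (overlap is allowed).
topic_docs = {
    "T1": ["d1", "d2", "d3"],
    "T2": ["d2", "d3", "d4", "d5"],
}
# Hypothetical hashtag lists per document after random spam insertion (step 3).
doc_tags = {
    "d1": {"trip", "beach", "spam1"},
    "d2": {"trip", "beach", "film"},
    "d3": {"trip", "film", "actor"},
    "d4": {"actor", "Movie"},
    "d5": {"film", "actor", "spam2"},
}
inserted_spam = {("d1", "spam1"), ("d5", "spam2")}  # ground truth for evaluation

def valid_tags_per_topic(topic_docs, doc_tags, min_ratio=0.5):
    """Step 4: keep hashtags used in at least min_ratio of a topic's documents."""
    valid = {}
    for topic, docs in topic_docs.items():
        counts = Counter(tag for doc in docs for tag in doc_tags[doc])
        valid[topic] = {t for t, c in counts.items() if c >= min_ratio * len(docs)}
    return valid

def detect_spam(topic_docs, doc_tags, topic_valid):
    """Steps 5-6: flag every hashtag that is not valid for any topic of its document."""
    detected = set()
    for doc, tags in doc_tags.items():
        allowed = set()
        for topic, docs in topic_docs.items():
            if doc in docs:
                allowed |= topic_valid[topic]
        detected |= {(doc, tag) for tag in tags - allowed}
    return detected

def f_score(detected, truth):
    """Precision, recall, and their harmonic mean (F-score)."""
    true_pos = len(detected & truth)
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

topic_valid = valid_tags_per_topic(topic_docs, doc_tags)
detected = detect_spam(topic_docs, doc_tags, topic_valid)
precision, recall, f1 = f_score(detected, inserted_spam)
```

With this toy data, spam1 and spam2 are flagged in d1 and d5, and Movie is also flagged in d4 because it appears in only one document of its cluster. The sketch therefore yields a recall of 1, a precision of 2/3, and an F-score of about 0.8, which shows how a low-frequency but legitimate hashtag lowers precision.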

Table 2. The Virtual Result of Detecting Spam Hashtag
Doc. No    d1      d2    d3    d4      d5
V_tag
A_tag      spam1               Movie   spam2
Detected   spam1   X     X     Movie   spam2

In Table 2, spam1 and spam2 were detected as spam in d1 and d5, respectively. However, Movie was also detected as spam in d4 even though it is not a spam hashtag. In the actual experiment, spam hashtags will be detected in the same way as shown in the proposed methodology. Verification will be performed using the F-score, which is the harmonic mean of recall and precision.

4. Concluding Remarks
This study remains in progress, and the actual experiment using the proposed methodology is yet to be conducted. To enable further progress, the target experiment data must be collected beforehand. The most important data for this study is blog data that contains hashtag information. Using a customized crawler, we collected about 14,000 random blog samples from the most popular website in Korea. This data comprises the document number, date, title, hashtag, and content. The experiment will use only the document number, hashtag, and content. Using SAS Enterprise Guide and Excel VBA, we will refine and analyze the collected data. In addition, we will analyze the contents through topic modeling using SAS Enterprise Miner.

REFERENCES
[1] Economist Intelligence Unit. Big Data: Harnessing a Game-Changing Asset. The Economist.
[2] McKinsey Global Institute. Big Data: The next Frontier for Innovation, Competition, and Productivity. McKinsey and Company.
[3] Gartner Inc. Hype Cycle for Emerging Technologies. Gartner Inc.
[4] Chen, C., Zhang, J., Xiang, Y., and Zhou, W. Spammers Are Becoming Smarter on Twitter. Browse Journals & Magazines. 18, 2.
[5] Liu, B. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies #16, Morgan & Claypool Publishers.
[6] Egele, M., Stringhini, G., Kruegel, C., and Vigna, G. Compa: Detecting Compromised Accounts on Social Networks. Proc. Ann. Network and Distributed System Security Symp. ompa.pdf.
[7] Song, J., Lee, S., and Kim, J. Spam Filtering in Twitter Using Sender-Receiver Relationship. Recent Advances in Intrusion Detection. Volume 6961 of the series Lecture Notes in Computer Science.
[8] Yardi, S., Romero, D., Schoenebeck, G., and Boyd, D. Detecting Spam in a Twitter Network. Peer-reviewed journal on the Internet. 15, 1 (January 2010).
[9] Wang, A. H. Don't Follow Me: Spam Detection in Twitter. Security and Cryptography (SECRYPT), Proceedings of the 2010 International Conference on (July 2010).
[10] Ma, Y., Niu, Y., Ren, Y., and Xue, Y. Detecting Spam on Sina Weibo. International Workshop on Cloud Computing and Information Security (CCIS) (October 2013).
[11] Lee, S. and Kim, J. Warningbird: A Near Real-Time Detection System for Suspicious URLs in Twitter Stream. IEEE Transactions on Dependable and Secure Computing. 10, 3 (April 2013).
[12] Han, J., Kamber, M., and Pei, J. Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers.
[13] Mooney, R. J. and Bunescu, R. Mining Knowledge from Text using Information Extraction. ACM SIGKDD Explorations Newsletter - Natural language processing and text mining. 7, 1 (June 2006).
[14] Rijsbergen, C. J. V. Information Retrieval, 2nd edition, Butterworth, London.
[15] Kim, K. and Ahn, H. Development of Web-based Intelligent Recommender Systems using Advanced Data Mining Techniques. Journal of Information Technology Applications and Management. 12, 3 (September 2005).
[16] Hur, J. and Kim, J. W. Characteristics on Inconsistency Pattern Modeling as Hybrid Data Mining Techniques. Journal of Information Technology Applications and Management. 15, 1 (March 2008).
[17] Hwang, I. A Study on Dynamic Query Expansion Using Web Mining in Information Retrieval. Journal of Information Technology Applications and Management. 11, 2 (June 2004).
[18] Weiss, S. M., Indurkhya, N., and Zhang, T. Fundamentals of Predictive Text Mining, Springer.
[19] Kim, J., Kim, N., and Cho, Y. User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis. Journal of Intelligence and Information Systems. 20, 2 (June 2014).
[20] Hyun, Y., Kim, N., and Cho, Y. A Multi-Dimensional Issue Clustering from the Perspective Consumers Interests and R&D. Journal of Information Technology Services. 14, 1 (March 2015).
[21] Choi, S., Hyun, Y., and Kim, N. Improving Performance of Recommendation Systems Using Topic Modeling. Journal of Intelligence and Information Systems. 21, 3 (September 2015).
[22] Hyun, Y., Kim, N., and Cho, Y. Interest-based Customer Segmentation Methodology Using Topic Modeling. Journal of Information Technology Applications & Management. 22, 1 (March 2015).
[23] Kim, D., Wong, W. X. S., Lim, M., Liu, C., Kim, N., Park, J., Kil, W., and Yoon, H. A Methodology for Analyzing Public Opinion about Science and Technology Issues Using Text Analysis. Journal of Information Technology Services. 14, 3 (September 2015).
[24] Lim, M. and Kim, N. Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis. Journal of Intelligence and Information Systems. 22, 1 (March 2016).
[25] Grier, C., Thomas, K., Paxson, V., and Zhang, M. The Underground on 140 Characters or Less. Proceedings of the 17th ACM conference on Computer and communications security.
[26] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A bayesian approach to filtering junk e-mail. In AAAI Workshop on Learning for Text Categorization.
[27] Jia, X., Zheng, K., Li, W., Liu, T., and Shang, L. Three-Way Decisions Solution to Filter Spam E-mail: An Empirical Study. Rough Sets and Current Trends in Computing. Volume 7413 of the series Lecture Notes in Computer Science.
[28] Klimt, B. and Yang, Y. Introducing the Enron corpus. In CEAS, The Conference on E-mail and Anti-Spam.
[29] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. Combating web spam with trustrank. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases. VLDB Endowment.
[30] Gyongyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J. Link spam detection based on mass estimation. In VLDB '06: Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment.
[31] Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. Detecting spam web pages through content analysis. Proceedings of the 15th international conference on World Wide Web (May 2006).
[32] Xanthopoulos, P., Panagopoulos, O. P., Bakamitsos, G. A., and Freudmann, E. Hashtag Hijacking: What it is, why it happens and how to avoid it. Journal of Digital & Social Media Marketing. 3, 4 (February 2016).
[33] Sedhai, S. and Sun, A. Effect of Spam on Hashtag Recommendation for Tweets. Proceedings of the 25th International Conference Companion on World Wide Web (April 2016).
