Mining Spammers in Social Media: Techniques and Applications

1 Mining Spammers in Social Media: Techniques and Applications Tutorial at the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining Data Mining and Machine Learning Lab 1

Social Media http://www.athgo.org/ablog/index.

2 Social Media Data Mining and Machine Learning Lab 2

3 Social Spamming With the growing availability of social media services, social spamming has become rampant. Social spammers are employed to unfairly overwhelm normal users. Data Mining and Machine Learning Lab 3

4 A New Type of Spammers on Social Media
- Social spammers send targeted users unwanted spam content, which appears on social networks and any website with user-generated content; they often collaborate to boost their social influence, legitimacy, and credibility
- Spam content can be manifested in many ways, including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, fake friends, and personally identifiable information -- Wikipedia

5 Examples from Twitter
Spam describes a variety of prohibited behaviors that violate the Twitter Rules. -- Twitter
Here are some common tactics that spam accounts often use:
- Posting harmful links (including links to phishing or malware sites)
- Aggressive following behavior (mass following and mass unfollowing for attention)
- Abusing the @reply or @mention function to post unwanted messages to users
- Creating multiple accounts (either manually or using automated tools)
- Posting repeatedly to trending topics to try to grab attention
- Repeatedly posting duplicate updates
- Posting links with unrelated tweets
[Slide labels: Content Information, Network Information]

6 Spamming on Facebook: An Example
Large spammer population on social media:
- 83 million (8.7%) on Facebook [Facebook]
- Over 27% of the followers of the top 10 Twitter accounts are fake
Spammers are used for:
- Sharing or sending spam
- Theft of users' personal information
- Fake likes and click fraud
- Malicious URLs
- 50 likes per dollar
- Surveys for prizes

7 Spamming on Twitter: An Example
- Large spammer population on social media: over 27% of the followers of the top 10 Twitter accounts are fake
- Political AstroTurf
[Figure: follower count (700K to 900K) from Jul-4 to Aug-1; about 4,000 new followers/day, and 100,000 new followers in 1 day on July 21st]

8 Characteristics of Social Spammers Content information: Short text Unconventional use of language Adaptive to specific events Twitter spam bot replies to offer prizes related to events such as NFL or Miley Cyrus Data Mining and Machine Learning Lab 8

9 Characteristics of Social Spammers Why is the detection of social spammers so hard? It is easy to establish an arbitrarily large number of social trust relations via Twitter follower markets [Stringhini et al. 2013] Data Mining and Machine Learning Lab 9

10 Characteristics of Social Spammers Social network information: Collaborative link farming widely exists on Twitter: spammers try to infiltrate the Twitter network by building social relationships with normal users and spammers themselves [Ghosh et al. 2012] In social media, many users simply follow back when they are followed by someone for the sake of courtesy -- reflexive reciprocity [Hu et al. 2013] Data Mining and Machine Learning Lab 10

11 Combating Social Spammers for Users
In a world without social spammers, from the users' perspective:
- Information on social media services will be easier to access, more conducive and rewarding
- Social media will be less prone to cyber-attacks when acquiring useful information, and more trustworthy

12 Combating Social Spammers for Companies
Spam can inflict damage on companies:
- Spammers on social media can smear a brand and turn fans and followers into doubters
- When advertisements for a company's products are mixed with spam, this can have a profoundly negative impact on the company's social media marketing return on investment (ROI)

13 How to study the problem? A Typical Framework of Spammer Detection [Lee et al. 2010] Mining Social Spammers Data Collection Mining Applications Data Mining and Machine Learning Lab 13

14 Outline Mining Social Spammers Data Collection Mining Applications Crawling and Identification Crowdsourcing Social Honeypot-based Approach Active Learning Network-based Methods Content-based Methods Methods with Hybrid Information Online Learning Cross-media Learning

15 Crawling and Identification
Twitter account crawling and identification: an alternative approach [Hu et al. 2013, Thomas et al. 2011]
- Step 1: Crawl a Twitter dataset from a period (July 2012 to September 2012) via Twitter's streaming API
- Step 2: Query via Twitter's API to identify accounts that no longer have records, either due to deletion or suspension, followed by a request to access each missing account's Twitter profile via the web to identify requests that redirect to

16 Crowdsourcing
Two groups of users are compared to assess their effectiveness [Wang et al. 2013]:
- Experts: CS professors and graduate students
- Turkers: crowdworkers from online crowdsourcing systems

17 Real or fake? Why?
[Figure: profile-classification interface with navigation buttons for browsing and classifying profiles, and a screenshot of each profile (links cannot be clicked)]

18 System Architecture [Wang et al. 2013]
[Figure labels: Social Network; Heuristics; User Reports; Flag Suspicious Users; Suspicious Profiles; Crowdsourcing Layer; All Turkers; Turker Selection; Accurate Turkers; Very Accurate Turkers; OSN Employees; Sybils; Continuous Quality Control; Locate Malicious Workers; Rejected!; Maximize Utility of High Accurate Turkers]

19 Crowdsourcing for Data Collection A crowdsourcing spammer detection system [Wang et al. 2013] False positives and negatives <1% Resistant to infiltration by malicious workers Low cost Data Mining and Machine Learning Lab 19

20 Social Honeypot-based Approach
Two ways of collecting spammer evidence:
- Human experts
- Users report spammers

21 Social Honeypot-based Approach Create and deploy social honeypots in SNS [Lee et al. 2010, 2011] Data Mining and Machine Learning Lab 21

22 Social Honeypot-based Approach Create and deploy social honeypots in social networks [Lee et al. 2010, 2011] Data Mining and Machine Learning Lab 22

23 Social Honeypot-based Approach
Create and deploy social honeypots in social networks [Lee et al. 2011]:
- 60 social honeypots were deployed for seven months, attracting 36,000 content polluters
Some advantages of using social honeypots:
- Automatically collecting evidence of spammers
- No interference or intrusion on the activities of normal users
- Robustness of ongoing spammer identification and filtering

24 Who are the social spammers? How to effectively collect labeled data? Data Mining and Machine Learning Lab 24

25 Active Learning Traditional Data Data Mining and Machine Learning Lab 25

26 Representativeness Active Learning Representative Instances Data Mining and Machine Learning Lab 26

27 Informativeness Active Learning Informative Instances Data Mining and Machine Learning Lab 27

28 Challenges Networked Data How do we select the instances by taking advantage of relation information? Data Mining and Machine Learning Lab 28

29 Selection Strategies for Networked Data Strategy 1: Global Selection The globally important nodes in the network are selected Data Mining and Machine Learning Lab 29

30 Selection Strategies for Networked Data Strategy 2: Local Selection The important nodes from different communities are selected Data Mining and Machine Learning Lab 30
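
The two selection strategies can be contrasted with a small sketch. It assumes node degree as the importance measure and a community assignment that is already known; both are illustrative assumptions, not necessarily what ActNeT uses. The graph and the function names are hypothetical.

```python
def global_selection(neighbors, k):
    """Pick the k highest-degree nodes in the whole network.

    Degree is an assumed stand-in for "global importance".
    Ties are broken by node name so the result is deterministic.
    """
    return sorted(neighbors, key=lambda v: (-len(neighbors[v]), v))[:k]


def local_selection(neighbors, communities, per_community=1):
    """Pick the highest-degree node(s) from each community separately."""
    selected = []
    for members in communities:
        ranked = sorted(members, key=lambda v: (-len(neighbors[v]), v))
        selected.extend(ranked[:per_community])
    return selected


# Hypothetical undirected network with two communities.
neighbors = {
    "a1": {"a2", "a3", "a4", "b1"},
    "a2": {"a1", "a3", "a4"},
    "a3": {"a1", "a2"},
    "a4": {"a1", "a2"},
    "b1": {"a1", "b2"},
    "b2": {"b1"},
}
communities = [["a1", "a2", "a3", "a4"], ["b1", "b2"]]

global_pick = global_selection(neighbors, k=2)
local_pick = local_selection(neighbors, communities)
```

In this toy network the global strategy spends its whole budget on the denser community, while the local strategy takes one node from each community.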

31 ActNeT Framework
(1) relation (A) modeling from source data S; (2) text content modeling; (3) selection based on relations

32 Outline Mining Social Spammers Data Collection Mining Applications Crawling and Identification Crowdsourcing Social Honeypot-based Approach Active Learning Network-based Methods Content-based Methods Methods with Hybrid Information Online Learning Cross-media Learning

33 Network-based Methods A traditional assumption is that spammers cannot be influential in a social network Data Mining and Machine Learning Lab 33

34 Network-based Methods Q1: How to measure influence in a social network? Q2: Are spammers less influential in social networks? Q3: What are the following patterns of spammers, normal users and influential users in the social network? Data Mining and Machine Learning Lab 34

35 How to Measure Influence of Individuals? Centrality is widely used for influence measurements on social networks Important or prominent actors are those that are linked or involved with other actors extensively A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts Links are also called ties. A central actor is the one having many ties Data Mining and Machine Learning Lab 35

36 Degree Centrality
- The degree centrality measure ranks nodes with more connections higher in terms of centrality: $C_d(v_i) = d_i$
- $d_i$ is the degree (number of adjacent edges) of vertex $v_i$
- In this graph, the degree centrality of vertex $v_1$ is $d_1 = 8$, and for all others it is $d_j = 1$, $j \neq 1$
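
The example on this slide can be reproduced with a short sketch. It assumes the pictured graph is a star with one centre and eight leaves, which is what the stated degrees imply; the function name and the edge-list representation are illustrative.

```python
def degree_centrality(edges):
    """Return {vertex: degree} for an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


# Assumed star graph: v1 in the centre, v2..v9 as leaves.
star = [("v1", f"v{j}") for j in range(2, 10)]
centrality = degree_centrality(star)
```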

37 PageRank
- The centrality I derive from my network neighbors is proportional to their centrality divided by their out-degree: $x_i = \sum_j A_{ij} \frac{x_j}{k_j^{out}}$
- With $D = \mathrm{diag}(d_1, d_2, \dots, d_n)$, where $d_j = k_j^{out}$, this reads $\mathbf{x} = A D^{-1} \mathbf{x}$
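
The update described on this slide (each node passes its score, divided by its out-degree, to the nodes it points to) can be iterated until the scores settle. The sketch below is a plain power iteration; the damping factor, the uniform handling of nodes with no out-links, and the fixed iteration count are common choices added here as assumptions, and are not taken from the slide.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power-iteration PageRank.

    out_links[j] lists the nodes that j points to.
    damping, iters, and the dangling-node rule are assumptions.
    """
    nodes = sorted(set(out_links) | {i for ts in out_links.values() for i in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every node starts the round with the teleportation share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for j in nodes:
            targets = out_links.get(j, [])
            if targets:
                # j splits its score evenly over its out-links.
                share = damping * rank[j] / len(targets)
                for i in targets:
                    new[i] += share
            else:
                # Dangling node: spread its score uniformly.
                for i in nodes:
                    new[i] += damping * rank[j] / n
        rank = new
    return rank


# Hypothetical three-node graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Here `c` collects score from both `a` and `b`, so it ends up ranked highest, while `b` receives only half of `a`'s score and ranks lowest.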

38 Influence of Spammer Communities Another assumption is that spammers can form tight-knit communities [Danezis et al. 2009, Yu et al. 2008] Normal Users Spammers How to find communities? How to measure influence of a community? Data Mining and Machine Learning Lab 38

39 How to Find Communities?
- Network-centric criterion needs to consider the connections within a network globally
- Goal: partition nodes of a network into disjoint sets
- Approaches: clustering based on vertex similarity, latent space models, block model approximation, spectral clustering, modularity maximization
Source: Tang et al. Community Detection and Mining in Social Media, Morgan & Claypool Publishers

40 Clustering based on Vertex Similarity
- Apply k-means or similarity-based clustering
- Vertex similarity is defined in terms of the similarity of their neighborhoods
- Structural equivalence: two nodes are structurally equivalent iff they connect to the same set of actors
- Nodes 1 and 3 are structurally equivalent; so are nodes 5 and 7
- Structural equivalence is too restrictive for practical use
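
A minimal sketch of the structural-equivalence test follows. The graph is hypothetical (it is not the one pictured on the slide), and ignoring the two nodes themselves when comparing neighbor sets is an assumed convention.

```python
def structurally_equivalent(neighbors, u, v):
    """True iff u and v connect to the same set of other nodes.

    u and v are removed from each other's neighbor sets before
    comparing (an assumed convention).
    """
    return neighbors[u] - {v} == neighbors[v] - {u}


# Hypothetical undirected graph given as neighbor sets.
neighbors = {
    1: {2},
    2: {1, 3, 4},
    3: {2},
    4: {2},
}

same_role = structurally_equivalent(neighbors, 1, 3)
different_role = structurally_equivalent(neighbors, 1, 2)
```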

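41 Vertex Similarity
With $N_i$ the set of neighbors of $v_i$:
- Jaccard similarity: $\mathrm{Jaccard}(v_i, v_j) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}$
- Cosine similarity: $\mathrm{Cosine}(v_i, v_j) = \frac{|N_i \cap N_j|}{\sqrt{|N_i| \cdot |N_j|}}$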
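
Both measures compare the neighbor sets of two vertices. The sketch below uses the standard set-based definitions; the neighbor sets are hypothetical, and returning 0.0 for empty neighborhoods is an assumption.

```python
import math


def jaccard(neighbors, u, v):
    """|N_u & N_v| / |N_u | N_v|, or 0.0 if both neighborhoods are empty."""
    nu, nv = neighbors[u], neighbors[v]
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def cosine(neighbors, u, v):
    """|N_u & N_v| / sqrt(|N_u| * |N_v|), or 0.0 if either is empty."""
    nu, nv = neighbors[u], neighbors[v]
    if not nu or not nv:
        return 0.0
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


# Hypothetical neighbor sets: a and b share two of their neighbors.
neighbors = {"a": {"c", "d"}, "b": {"c", "d", "e"}}

jac = jaccard(neighbors, "a", "b")   # 2 shared out of 3 distinct neighbors
cos = cosine(neighbors, "a", "b")    # 2 / sqrt(2 * 3)
```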

42 Latent Space Models
- Map nodes into a low-dimensional space such that the proximity between nodes based on network connectivity is preserved in the new space, then apply k-means clustering
- Multi-dimensional scaling (MDS):
  - Given a network, construct a proximity matrix $P$ representing the pairwise distance between nodes (e.g., geodesic distance)
  - Let $S \in \mathbb{R}^{n \times l}$ denote the coordinates of nodes in the low-dimensional space
  - Objective function: $\min_S \| S S^T - \tilde{P} \|_F^2$, where $\tilde{P} = -\frac{1}{2} (I - \frac{1}{n} \mathbf{1}\mathbf{1}^T)(P \circ P)(I - \frac{1}{n} \mathbf{1}\mathbf{1}^T)$
  - Solution: $S = V \Lambda^{1/2}$, where $V$ holds the top $l$ eigenvectors of $\tilde{P}$ and $\Lambda$ is a diagonal matrix of the top $l$ eigenvalues

43 How to Measure Influence of a Community? All centrality measures defined so far measure centrality for a single node. These measures can be generalized for a group of nodes A simple approach is to replace all nodes in a group with a super node The group structure is disregarded Let S denote the set of nodes in the group and V-S the set of outsiders Data Mining and Machine Learning Lab 43

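44 Group Centrality
- Group degree centrality: the number of nodes outside the group that are connected to at least one group member, $C_d^{group}(S) = |\{ v_i \in V - S : v_i \text{ is connected to some } v_j \in S \}|$
- We can normalize it by dividing it by $|V - S|$
- Example: consider $S = \{v_2, v_3\}$; group degree centrality = 3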
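
Group degree centrality can be sketched as a count of outsiders with at least one link into the group. The graph below is hypothetical (the slide's figure is not available), and the function name is illustrative.

```python
def group_degree_centrality(neighbors, group, normalize=False):
    """Count nodes outside `group` that are linked to a group member.

    With normalize=True the count is divided by the number of
    outsiders, |V - S|.
    """
    group = set(group)
    outsiders = set(neighbors) - group
    connected = {v for v in outsiders if neighbors[v] & group}
    if normalize:
        return len(connected) / len(outsiders) if outsiders else 0.0
    return len(connected)


# Hypothetical undirected graph given as neighbor sets.
graph = {
    "v1": {"v2"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v2", "v5"},
    "v4": {"v2"},
    "v5": {"v3", "v6"},
    "v6": {"v5"},
}

raw = group_degree_centrality(graph, {"v2", "v3"})
normalized = group_degree_centrality(graph, {"v2", "v3"}, normalize=True)
```

In this graph, three of the four outsiders (v1, v4, v5) touch the group, so the raw value is 3 and the normalized value is 0.75.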

45 Are Spammers Less Influential? Social media services have become a target for link farming, where users try to acquire large numbers of follower links [Ghosh et al. 2012] Link farming in Web Websites exchange reciprocal links with other sites to improve ranking by search engines Link farming on social media Spammers follow other users and attempt to get them to follow back Data Mining and Machine Learning Lab 45

46 Link Farming by Spammers
Spammers farm links at large scale [Ghosh et al. 2012]:
- Over 15 million users (27% of total) are targeted by 41,352 spammers (0.08% of total)
- 1.3 million spam-followers, of whom 82% are also targeted
- Spammers get most links by reciprocation

47 Influential Social Spammers Spammers get more followers than an average Twitter user Some spammers acquire very high Pagerank scores 304 within top 100,000 (0.18% of all users) Social Spammers are not necessarily less influential Data Mining and Machine Learning Lab 47

48 Spammer Communities?
- Key assumption: spammers form tight-knit communities
- Spammers don't necessarily form communities on social media
[Figure: edges between Sybils vs. edges to normal users]

49 What are the Following Patterns? Reflexive Reciprocity widely exists on social media: many users simply follow back when they are followed by someone for the sake of courtesy Who are the spam-followers? Who are the top link-farmers? Before answering the two questions, we first present a brief introduction on reciprocity Data Mining and Machine Learning Lab 49

50 Reciprocity on Social Networks
- In directed networks, the frequency of loops of length two is measured by reciprocity
- It tells how likely it is that a vertex you point to also points back at you
- Directed edges between $i$ and $j$ are reciprocated iff both $(i, j)$ and $(j, i)$ exist

51 Reciprocity on Social Networks
- Reciprocity $r$ is the fraction of edges that are reciprocated: $r = \frac{1}{m} \sum_{ij} A_{ij} A_{ji} = \frac{1}{m} \mathrm{Tr}(A^2)$
- $A_{ij} = 1$ and $A_{ji} = 1$ iff there is an edge between $i$ and $j$ and also between $j$ and $i$
- $m$ is the total number of edges
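
As a worked example, reciprocity can be computed directly from a directed edge list by counting the edges whose reverse edge is also present. The edge list is hypothetical.

```python
def reciprocity(edges):
    """Fraction of directed edges (i, j) whose reverse (j, i) also exists."""
    edge_set = set(edges)
    m = len(edge_set)
    if m == 0:
        return 0.0
    reciprocated = sum(1 for (i, j) in edge_set if (j, i) in edge_set)
    return reciprocated / m


# Hypothetical follow graph: 1 and 2 follow each other,
# 2 follows 3, and 3 follows 1.
edges = [(1, 2), (2, 1), (2, 3), (3, 1)]
r = reciprocity(edges)
```

Two of the four edges are reciprocated, so `r` is 0.5.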

52 Reflexive Reciprocity 72.4% of the twitterers follow more than 80% of their followers [Weng et al. 2010] 80.5% of the twitterers have 80% of their friends follow them back Data Mining and Machine Learning Lab 52

53 Farming Links on Twitter
- A Twitter account was created and made to follow some of the top targeted spam-followers
- It followed 500 randomly selected users out of the top 100K spam-followers
- Within 3 days, 65 reciprocated by following back
- The account ranked within the top 9% of all users in Twitter in 3 days
- This shows the existence of a set of users from whom social links (hence social influence) can be farmed easily

54 Who are the Spam-Followers? Non-targeted spam-followers Mostly spammers / hired helps of spammers Most have now been suspended by Twitter Targeted spam-followers Ranked on the basis of number of links to spammers 60% of follow-links acquired by spammers come from the top 100,000 targeted followers Top spam-followers tend to reciprocate almost all links established to them by spammers Data Mining and Machine Learning Lab 54

55 Who are the Top Link-Farmers? Not spammers themselves 76% not suspended by Twitter in the last two years 235 verified by Twitter to be real, well-known users Have much higher indegree as well as outdegree compared to spammers Most of their tweets contain valid URLs Data Mining and Machine Learning Lab 55

56 Who are the Top Link-Farmers?
- Highly influential users: rank within top 5% according to Pagerank, follower-rank, retweet-rank
- Mostly social marketers, entrepreneurs,
- Want to promote some online business / website
- Heavily interconnected with each other: density of subgraph is (for whole graph: $10^{-7}$)
- Aim: to acquire social capital

57 Combating the Link-farmers
- Not practical for Twitter to suspend / blacklist top link-farmers
- Solutions [Ghosh et al. 2012]:
  - Strategy to disincentivize users from following / reciprocating to unknown people
  - Penalize users for following spammers
- Algorithm that is the inverse of Pagerank:
  - Negatively bias a small set of known spammers
  - Propagate negative scores from spammers to spam-followers

58 Collusionrank A user is penalized for following spammers, but not for being followed by spammers Data Mining and Machine Learning Lab 58
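
The idea on this slide can be sketched as a PageRank-style iteration run in the reverse direction: known spammers start with a negative bias, and each account pulls in a share of the scores of the accounts it follows. This is a minimal illustration in the spirit of [Ghosh et al. 2012], not the authors' implementation; the decay factor `alpha`, the iteration count, and the dict-based graph are assumptions.

```python
def collusionrank(follows, spammers, alpha=0.85, iters=50):
    """Propagate negative scores from known spammers to their followers.

    follows[u] is the set of accounts that u follows.
    alpha and iters are assumed values, not taken from the slides.
    """
    nodes = set(follows)
    for targets in follows.values():
        nodes |= set(targets)

    # Number of followers of each account (its in-degree).
    follower_count = {n: 0 for n in nodes}
    for targets in follows.values():
        for v in targets:
            follower_count[v] += 1

    # Negative bias on the known spammers only.
    bias = {n: (-1.0 / len(spammers) if n in spammers else 0.0) for n in nodes}
    score = dict(bias)

    for _ in range(iters):
        new = {}
        for n in nodes:
            # n inherits a share of the score of every account it follows.
            pulled = sum(score[v] / follower_count[v] for v in follows.get(n, ()))
            new[n] = alpha * pulled + (1.0 - alpha) * bias[n]
        score = new
    return score


# Hypothetical graph: alice follows a spammer; the spammer follows bob.
follows = {"alice": {"spam1"}, "spam1": {"bob"}, "bob": set()}
scores = collusionrank(follows, spammers={"spam1"})
```

Here `alice` receives a negative score for following the spammer, while `bob`, who is only followed by the spammer, keeps a score of zero.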

59 Pagerank + Collusionrank Computed Collusionrank considering 600 known spammers Rank users by Pagerank + Collusionrank Effectively filters out spammers and link-farmers (top spam-followers) from top ranks Data Mining and Machine Learning Lab 59

60 Pagerank + Collusionrank Selectively penalizes spammers & link-farmers Out of top 100K according to Pagerank, 20K demoted heavily, rest 80% not affected much (inset) The heavily demoted 20K follow many more spammers than the rest (main figure) Data Mining and Machine Learning Lab 60

61 Content-based Methods Q1: What types of features can we use? Q2: How to model content information? Data Mining and Machine Learning Lab 61

62 Feature Engineering Features used to detect Foursquare spammers and their χ 2 Rankings [Aggarwal et al. 2013] Data Mining and Machine Learning Lab 62

63 Feature Engineering Features can be grouped into four classes having as scope the message, user, topic, and propagation respectively [Castillo et al. 2011] Data Mining and Machine Learning Lab 63

64 Feature Engineering Features can be grouped into four classes having as scope the message, user, topic, and propagation respectively [Castillo et al. 2011] Data Mining and Machine Learning Lab 64

65 Modeling Content Information Supervised learning methods such as Least Squares are widely used for modeling content information [Figure: matrices X, W, Y]

66 Sparse Learning Sparse learning has been introduced to tackle the curse of dimensionality [Figure: matrices X, W, Y]

67 Matrix Factorization Another effective way of tackling the curse of dimensionality is matrix factorization [Figure: matrices X, U, V]
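
The three preceding slides name the models but show only the matrices involved. One common way to write the corresponding objectives is given below, as a sketch. It assumes that $X$ is the user-by-feature content matrix, $Y$ the label matrix, $W$ the learned weights, and $U$, $V$ the low-rank factors; the Frobenius-norm loss, the $\ell_1$ penalty, and the weight $\lambda$ are standard choices assumed here, not stated on the slides.

```latex
% Least squares: fit the labels from content features
\min_{W} \; \| X W - Y \|_F^2

% Sparse learning: the same loss plus an l1 penalty that drives
% many entries of W to zero (lambda > 0 is an assumed trade-off weight)
\min_{W} \; \| X W - Y \|_F^2 + \lambda \| W \|_1

% Matrix factorization: approximate X by a product of low-rank factors
\min_{U, V} \; \| X - U V \|_F^2
```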

68 Decision Tree Decision tree can be used for classification and feature analysis, which is effective in understanding spamming purposes Data Mining and Machine Learning Lab 68

69 Two Classification Strategies
- Flat classification: promoters (P), spammers (S), and legitimate users (L)
- Hierarchical strategy: first separate promoters (P) from non-promoters (NP), then split promoters into heavy (HP) and light promoters (LP), and non-promoters into legitimate users (L) and spammers (S)
[Figure: flat classification vs. hierarchical classification]

70 Methods with Hybrid Information How to collectively make use of content and relations for social spammer detection? Data Mining and Machine Learning Lab 70

71 Modeling Social Networks Four types of following relations on social networks: [spammer, spammer], [normal, normal], [normal, spammer], [spammer, normal] Directed Graph Laplacian: Data Mining and Machine Learning Lab 71

72 Social Spammer Detection Objective function of the proposed formulation with network and content information: Data Mining and Machine Learning Lab 72

73 Dataset for Study Crawled a Twitter dataset from July 2012 to September 2012 via the Twitter Search API The users that were suspended by Twitter during this period are considered as the gold standard of spammers in the experiment. Data Mining and Machine Learning Lab 73

74 Social Spammer Detection Results
- Different sizes of training data
- Comparison with possible solutions
- Precision, recall and F1-measure are used as metrics
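
For reference, the three metrics can be computed from true and predicted labels as follows. The labels are made up for illustration (1 = spammer, 0 = normal user) and are not results from the tutorial.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 3 true spammers, 3 predicted spammers, 2 in common.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```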

75 MFSR with Supervised Information Label Informed matrix factorization with social relations (MFSR) Network Information Data Mining and Machine Learning Lab 75

76 Outline Mining Social Spammers Data Collection Mining Applications Crawling and Identification Crowdsourcing Social Honeypot-based Approach Active Learning Network-based Methods Content-based Methods Methods with Hybrid Information Online Learning Cross-media Learning

77 Spammers Evolve Fast
- Behaviors that constitute spamming will continue to evolve as we respond to new tactics by spammers. -- Twitter
- Social spammers show dynamic content patterns in social media

78 Online Learning
- Existing systems rely on building a new model to capture newly emerging content-based and network-based patterns of social spammers
- Given the rapidly evolving nature of spamming, it is necessary to have a framework that efficiently reflects the effect of newly emerging data in social spammer detection
- How do we update the built model to efficiently incorporate newly emerging data objects?

79 Problem Statement Data Mining and Machine Learning Lab 79

80 Learning the Basic Model Data Mining and Machine Learning Lab 80

81 Learning a New Model Data Mining and Machine Learning Lab 81

82 Reformulated Objective Function Data Mining and Machine Learning Lab 82

83 Experiments How effective is the proposed framework compared with other methods of social spammer detection? How efficient is the proposed online learning framework compared with other methods for modeling? Data Mining and Machine Learning Lab 83

84 Social Spammer Detection Results Data Mining and Machine Learning Lab 84

85 Social Spammer Detection Results Data Mining and Machine Learning Lab 85

86 Cross-Media Learning
- A straightforward way to perform content-based spammer detection is to model this task as a supervised learning problem
- While the problem of social spamming is relatively new, spam has been extensively studied for years on other platforms, e.g., email communication, SMS and the web

87 Cross-Media Learning Are the resources from other media potentially helpful for spammer detection in microblogging? How do we explicitly model and make use of the resources from other media for spammer detection? Is the knowledge learned from other media helpful for microblogging spammer detection? Data Mining and Machine Learning Lab 87

88 Lexical Analysis Are the resources from other media potentially helpful for spammer detection in microblogging? Microblogging data is not significantly different from the datasets in other media Data Mining and Machine Learning Lab 88

89 Modeling Knowledge across Media Data Mining and Machine Learning Lab 89

90 Social Spammer Detection Results Data Mining and Machine Learning Lab 90

91 Open Research Issues Mining Social Spammers Data Collection Mining Applications Data Mining and Machine Learning Lab 91

92 Research Issues in Data Collection Quality Datasets are needed: Large-scale Accurate Up-to-date Labeling Issues: Active learning Crowdsourcing Data Mining and Machine Learning Lab 92

93 Research Issues in Mining Social Spammers Mining Social Networks Sophisticated centrality-based measures Community detection Mining Content Information Feature engineering Machine Learning Methods Sparse learning Online learning Multi-source learning Data Mining and Machine Learning Lab 93

94 Potential Applications Social Sciences Understanding purposes of social spammers in different events, e.g., natural disasters Understanding spamming behavior of normal users on social media Geographical and temporal patterns of social spammers Data Mining and Machine Learning Lab 94

95 Acknowledgments Members of the Data Mining and Machine Learning Lab at ASU The Office of Naval Research, Army Research Office Everyone attending our tutorial Data Mining and Machine Learning Lab 95

96 References
- Aggarwal et al. "Detection of spam tipping behaviour on foursquare." In WWW Companion.
- Castillo et al. "Information credibility on twitter." In WWW, 2011.
- Danezis, George, and Prateek Mittal. "SybilInfer: Detecting Sybil Nodes using Social Networks." In NDSS.
- Ghosh, Saptarshi, et al. "Understanding and combating link farming in the twitter social network." In WWW.
- Grier, Chris, et al. "@spam: the underground on 140 characters or less." In CCS.
- Hu et al. "Social spammer detection in microblogging." In IJCAI.
- Hu et al. "Leveraging Knowledge across Media for Spammer Detection in Microblogging." In SIGIR.
- Hu et al. "Online social spammer detection." In AAAI.
- Hu et al. "ActNeT: Active Learning for Networked Texts in Microblogging." In SDM.
- Lee et al. "Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter." In ICWSM, 2011.

97 References
- Lee et al. "Uncovering social spammers: social honeypots + machine learning." In SIGIR, 2010.
- Stringhini et al. "Follow the green: growth and dynamics in twitter follower markets." In IMC.
- Thomas et al. "Suspended accounts in retrospect: an analysis of twitter spam." In IMC.
- Tan, Enhua, et al. "UNIK: unsupervised social network spam detection." In CIKM.
- Tang et al. Community Detection and Mining in Social Media. Morgan & Claypool Publishers.
- Viswanath et al. "An analysis of social network-based sybil defenses." ACM SIGCOMM Computer Communication Review, 2011, 41(4).
- Wang et al. "Social Turing Tests: Crowdsourcing Sybil Detection." In NDSS, 2013.
- Weng, Jianshu, et al. "Twitterrank: finding topic-sensitive influential twitterers." In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM.

98 References
- Yang, Zhi, et al. "Uncovering social network sybils in the wild." In IMC, 2011.
- Yu, Haifeng, et al. "Sybillimit: A near-optimal social network defense against sybil attacks." In SP, 2008.
- Zafarani R, Abbasi M A, Liu H. Social Media Mining: An Introduction. Cambridge University Press.
- Zhu et al. "Discovering Spammers in Social Networks." In AAAI.
