Semantic Analysis of Search-Autocomplete Manipulations

Game of Missuggestions: Semantic Analysis of Search-Autocomplete Manipulations
Peng Wang (1), Xianghang Mi (1), Xiaojing Liao (2), XiaoFeng Wang (1), Kan Yuan (1), Feng Qian (1), Raheem Beyah (3)
(1) Indiana University Bloomington, (2) William and Mary, (3) Georgia Institute of Technology
NDSS 2018, San Diego

Autocomplete
How predictions are made:
- popular searches
- web content

Winter is here
- promotion target

Autocomplete Manipulation
- pollute search logs
- pollute web content
  - compromised websites
  - spam hosting webpages

Challenges
- Search log analysis: can only be done by search providers
- Web content analysis: a thorough study is non-trivial on massive data
- Little understanding of the real-world impact of illicit promotions

Sacabuche: Search AutoComplete Abuse Checking
- first detection system that works without access to search logs
- novel NLP techniques
- highly efficient, accurate, and scalable
- first large-scale analysis of autocomplete missuggestions
- first step toward understanding the ecosystem of this underground business

Observation
Semantic inconsistency (sentence similarity)
trigger: online backup free download
- legitimate: online backup software free download (semsim=0.96)
- manipulated: strongvault online backup free download (semsim=0.43)
- legitimate: norton online backup free download (semsim=0.49)

Observation
Search results inconsistency (search results similarity)
trigger: online backup free download
- missuggestion: strongvault online backup free download
- suggestion: norton online backup free download
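
One simple way to make the idea of search results inconsistency concrete is to compare which sites appear in the results for the trigger and for the suggestion. The sketch below uses Jaccard overlap of result host names. This is a stand-in chosen for illustration, not the feature defined in the talk, and the URLs are invented placeholders rather than measured data.

```python
from urllib.parse import urlparse


def result_domains(urls):
    """Return the set of host names appearing in a list of result URLs."""
    return {urlparse(u).netloc.lower() for u in urls}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical result lists (placeholder URLs, not real search results).
trigger_results = [
    "https://backup.example/a",
    "https://cloud.example/b",
    "https://norton.example/c",
]
suggestion_results = [
    "https://norton.example/x",
    "https://backup.example/y",
    "https://reviews.example/z",
]
missuggestion_results = [
    "https://strongvault.example/1",
    "https://strongvault.example/2",
    "https://promo.example/3",
]

legit_overlap = jaccard(result_domains(trigger_results),
                        result_domains(suggestion_results))
bad_overlap = jaccard(result_domains(trigger_results),
                      result_domains(missuggestion_results))
print(legit_overlap, bad_overlap)
```

In this toy setup the legitimate suggestion shares result sites with the trigger, while the missuggestion's results are dominated by the promoted target and share none.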

Architecture

Prediction Finder
- seeds
- API
- preprocessing

Search Term Analyzer
- semantic features
- classifier

Semantic Feature
example: online backup free download -> strongvault online backup free download
- sentence level: "strongvault online backup free download" vs. "online backup free download" (sentence similarity)
- phrases: "strongvault online", "online backup", "backup free", "free download" (phrase similarity)
- words: strongvault, online, backup, free, download (word vector)
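
The decomposition in the example above (sentence, then two-word phrases, then words) can be reproduced in a few lines. This is a minimal sketch that assumes whitespace tokenization and sliding two-word windows; the helper names `words` and `phrases` are made up for illustration.

```python
def words(query):
    """Split a query into lowercase, whitespace-separated words."""
    return query.lower().split()


def phrases(query, n=2):
    """Return all contiguous n-word phrases of a query (n=2 gives bigrams)."""
    tokens = words(query)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


suggestion = "strongvault online backup free download"
suggestion_words = words(suggestion)
suggestion_phrases = phrases(suggestion)
print(suggestion_words)
print(suggestion_phrases)
```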

Semantic Features

Sentence similarity:
$$F_{ss}(s_t, s_s) = \frac{SK(s_t, s_s)}{\sqrt{SK(s_t, s_t)\, SK(s_s, s_s)}}$$
$$SK(s_t, s_s) = \sum_{p_t \in s_t,\; p_s \in s_s} PK(p_t, p_s)$$
$$PK(p_t, p_s) = \prod_{i=1}^{len} WK(w_t^i, w_s^i)$$
$$WK(w_i, w_j) = \left[\tfrac{1}{2}\left(1 + \mathrm{cos\_sim}(w_i, w_j)\right)\right]^{\alpha}$$

Word similarity: $F_{ws}(w_t, w_s)$, computed from the word kernel $WK$ over the words of the trigger and the suggestion

Infrequency: $F_{if}(w_t, w_s)$, computed over the words of the trigger and the suggestion
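
A sentence-level score can be built bottom-up from word vectors: a word kernel rescales cosine similarity into [0, 1], a phrase kernel multiplies word kernels position by position, and a sentence kernel sums phrase kernels over all phrase pairs and is then normalized. The sketch below follows that scheme under several assumptions: the 2-D vectors are invented toy values (not trained embeddings), phrases are bigrams, and the exponent `alpha` defaults to 1.

```python
import math
from itertools import product

# Toy 2-D word vectors, invented for illustration only.
VECTORS = {
    "online": (1.0, 0.1), "backup": (0.9, 0.3), "free": (0.2, 1.0),
    "download": (0.3, 0.9), "software": (0.8, 0.4), "strongvault": (-0.7, 0.2),
}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def wk(w1, w2, alpha=1.0):
    """Word kernel: cosine similarity rescaled from [-1, 1] to [0, 1]."""
    return (0.5 * (1.0 + cosine(VECTORS[w1], VECTORS[w2]))) ** alpha


def pk(p1, p2):
    """Phrase kernel: product of word kernels at matching positions."""
    return math.prod(wk(a, b) for a, b in zip(p1, p2))


def bigrams(tokens):
    """All contiguous two-word phrases of a token list."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def sk(s1, s2):
    """Sentence kernel: sum of phrase kernels over all bigram pairs."""
    return sum(pk(p, q) for p, q in product(bigrams(s1), bigrams(s2)))


def sentence_similarity(s1, s2):
    """Normalized sentence kernel, equal to 1.0 for identical sentences."""
    return sk(s1, s2) / math.sqrt(sk(s1, s1) * sk(s2, s2))


trigger = "online backup free download".split()
legit = "online backup software free download".split()
manipulated = "strongvault online backup free download".split()

legit_score = sentence_similarity(trigger, legit)
manipulated_score = sentence_similarity(trigger, manipulated)
print(legit_score, manipulated_score)
```

With these toy vectors the legitimate suggestion scores higher than the manipulated one, because "strongvault" points away from every word in the trigger. The scores will not match the semsim values on the slides, which come from real embeddings.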

Search Result Analyzer
- search results features
- classifier

Search Result Features
- Result similarity
- Content impact
- Result popularity
- Result size
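
The architecture slides describe two analyzers, one working on the search terms alone and one working on search results. One plausible way to chain them is a cascade: run the cheap term-level check on every pair and fetch and inspect search results only for the pairs it flags. The sketch below illustrates that control flow. The two predicates, the word list, and the overlap scores are hypothetical stand-ins, not the trained classifiers from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    trigger: str
    suggestion: str


def detect(pairs: Iterable[Pair],
           term_is_suspicious: Callable[[Pair], bool],
           results_confirm: Callable[[Pair], bool]) -> List[Pair]:
    """Flag a pair only if both stages agree; stage 2 runs only on stage-1 hits."""
    flagged = []
    for pair in pairs:
        # `and` short-circuits, so the costly result check is skipped
        # whenever the term check finds nothing suspicious.
        if term_is_suspicious(pair) and results_confirm(pair):
            flagged.append(pair)
    return flagged


# Hypothetical stand-ins for the two classifiers.
COMMON_WORDS = {"online", "backup", "free", "download", "software"}
TOY_RESULT_OVERLAP = {
    "strongvault online backup free download": 0.0,
    "norton online backup free download": 0.5,
}
result_checks = []  # records which suggestions reached the second stage


def toy_term_check(pair):
    """Suspicious if the suggestion adds a word outside a small common vocabulary."""
    added = set(pair.suggestion.split()) - set(pair.trigger.split())
    return bool(added - COMMON_WORDS)


def toy_result_check(pair):
    """Confirm if the (made-up) result overlap with the trigger is low."""
    result_checks.append(pair.suggestion)
    return TOY_RESULT_OVERLAP[pair.suggestion] < 0.2


trigger = "online backup free download"
pairs = [
    Pair(trigger, "online backup software free download"),
    Pair(trigger, "strongvault online backup free download"),
    Pair(trigger, "norton online backup free download"),
]
flagged = detect(pairs, toy_term_check, toy_result_check)
print([p.suggestion for p in flagged])
```

Here the "norton" suggestion is flagged by the term check but cleared by the result check, mirroring the talk's point that term-level similarity alone cannot separate it from the "strongvault" missuggestion.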

Evaluation
Datasets
- Badset: 150 missuggestions, 296 result pages
- Goodset: 300 legitimate suggestions, 593 result pages
- Unknown set: 114 million trigger-suggestion pairs, 1.6 million result pages
Accuracy and coverage
- Ground truth: precision 96.23%, recall 95.63%
- Unknown set: precision 95.4% on 1K suspicious trigger-suggestion pairs
Performance
- 1.5s / trigger-suggestion pair

Scope and magnitude
- Number of missuggestions on each platform (Google: 0.48%, Bing: 0.37%, Yahoo!: 0.2%)
- Categories of the polluted triggers
- 257K polluted triggers
- 383K missuggestions

Evolution and lifetime
Number of missuggestions over time
- % of newly-appeared missuggestions related to newly-appeared polluted triggers
- 1.9% of triggers were polluted on average
Lifetime distribution of missuggestions
- % of missuggestions stay > 30 days
- 34 days vs. 63 days (missuggestion vs. legitimate)

Missuggestion content and pattern
- 20% of missuggestions are related to more than one trigger
  - "free web hosting and domain name registration services by doteasy.com" is related to 123 triggers
- missuggestion grammatical pattern: top 5 missuggestion patterns

Revenue analysis
- Manipulation service provider: ixiala
- 10K sites request suggestion manipulation
- $54K/week
- commission earned by manipulation operators
- $515K/week for 465K manipulated suggestions
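
As a back-of-envelope reading of the last figure, dividing the weekly total by the number of manipulated suggestions gives an average per suggestion. This assumes the revenue is spread evenly across suggestions, which the slide does not state.

```latex
\[
\frac{\$515{,}000 \text{ per week}}{465{,}000 \text{ suggestions}}
\approx \$1.11 \text{ per suggestion per week}
\]
```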

Discussion
Limitations
- adversary can make the manipulations mimic benign ones
- lack of ground truth, manual efforts involved
Lesson learned
- unpopular targets related to triggers
- similar keyword patterns

Conclusion
- first large-scale analysis of autocomplete missuggestions, and a first step toward understanding the underground ecosystem
- novel NLP techniques used to build the first detection system that works without access to search logs

Questions & Answers

Data collection

Datasets
Dataset        # of suggestions   # of triggers   # of result pages
Badset         150                                296
Goodset        300                                593
Unknown set    114,275,000        1,000,900       1,607,951

Validation criteria
- the missuggestion must promote a target whose own reputation cannot make it stand out in the search results of the trigger
- the missuggestion and its search results conflict with the user's original search intention

Semantic Consistency Classifier
- 100 missuggestions and legitimate trigger-suggestion pairs
- SVM classification model with 5-fold cross validation
- Precision 94.59%, recall 95.89%

Label    Feature               F-score
F_ss     sentence similarity   0.597
F_ws     word similarity       0.741
F_if     infrequency

Missuggestion Classifier
- 150 missuggestions and legitimate trigger-suggestion pairs
- SVM classification model with 5-fold cross validation
- Precision 96.23%, recall 95.63%

Label    Feature             F-score
F_rs     result similarity   0.782
F_ci     content impact      0.808
F_rp     result popularity   0.632
F_rs     result size
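
For readers unfamiliar with the evaluation protocol named above, the sketch below shows how 5-fold cross validation partitions a labeled set and how precision and recall are computed from predictions. It is a generic illustration in plain Python, not the authors' evaluation code. It omits the SVM itself, and a real run would shuffle the samples before splitting. The sample count of 450 assumes the 150 missuggestions and 300 legitimate suggestions listed in the Evaluation datasets.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (truthy = missuggestion)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


folds = list(k_fold_indices(450, k=5))
print([len(test) for _, test in folds])
print(precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))
```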

Evaluation
Accuracy and coverage
- Two-step analysis: precision 96.23%, recall 95.63% on ground truth
- One-step analysis: precision 97.68%, recall 95.59% on ground truth
Performance
- Two-step analysis: 0.016s/pair (94X faster)
- One-step analysis: 1.5s/pair
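
The trade-off on this slide can be checked with quick arithmetic: the speed-up is the ratio of the two per-pair times, and the accuracy cost can be summarized with an F1 score. The F1 summary is an addition for illustration and is not reported on the slide.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


one_step_seconds_per_pair = 1.5
two_step_seconds_per_pair = 0.016

# Ratio of per-pair processing times reported on the slide.
speedup = one_step_seconds_per_pair / two_step_seconds_per_pair

two_step_f1 = f1(0.9623, 0.9563)
one_step_f1 = f1(0.9768, 0.9559)
print(round(speedup), two_step_f1, one_step_f1)
```

The ratio rounds to the 94X quoted on the slide, while the two F1 values differ by less than one percentage point.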
