Previous: how search engines work

Size: px

Start display at page:

Download "Previous: how search engines work"

Florence Fox
5 years ago
Views:

1 detection Ricardo Baeza-Yates,3 With: L. Becchetti 2, P. Boldi 5, C. Castillo, D. Donato, A. Gionis, S. Leonardi 2, V.Murdock, M. Santini 5, F. Silvestri 4, S. Vigna 5. Yahoo! Research Barcelona Catalunya, Spain 2. Università di Roma La Sapienza Rome, Italy 3. Yahoo! Research Santiago Chile 4. ISTI-CNR Pisa,Italy 5. Università degli Studi di Milano Milan, Italy Previous: how search engines work detection

2 Search engine: issues detection Scalability (crawling, indexing, searching, ranking) Relevance (query to document match) Static ranking (content quality) Incentives for cheating ($) This is a talk about academic research! Tools for dealing with detection

3 detection Topological 6 7 detection 8 9 detection Topological 6 7 detection 8 9

The Web The sum of all human knowledge plus porn Robert Gilbert detection Graphic: www.milliondollarhomepage.

4 The Web The sum of all human knowledge plus porn Robert Gilbert detection Graphic: Adversarial IR Issues on the Web detection Link spam Content spam Cloaking Comment/forum/wiki spam -oriented blogging Click fraud 2 Reverse engineering of ranking algorithms Web content filtering Advertisement blocking Stealth crawling Malicious tagging... more?

5 Opportunities for Web spam detection X dexing Keyword stuffing Link farms blogs (splogs) Cloaking Adversarial relationship Every undeserved gain in ranking for a spammer, is a loss of precision for the search engine. Naïve detection

6 Hidden text detection Made for Advertising detection

7 Search engine? detection Fake search engine detection

8 Normal content in link farms detection Cloaking detection

9 Redirection detection Redirects using Javascript detection Simple redirect <script> document.location=" </script> Hidden redirect <script> var=24; var2=var; if(var==var2) { document.location=" } </script>

10 Problem: obfuscated code detection Obfuscated redirect <script> var a="win",a2="dow",a3="loca",a4="tion.", a5="replace",a6="( )"; var i,str=""; for(i=;i<=6;i++) { str += eval("a"+i); } eval(str); </script> Problem: really obfuscated code detection Encoded javascript <script> var s = "%5CBED%5C%5GDHJ BDE%6...%4%E"; var e =, i; eval(unescape( s%edunescape%28s%29%3bfor...%3b )); </script> More examples: [Chellapilla and Maykov, 27]

11 detection Topological 6 7 detection 8 9 Machine Learning detection

12 Training of a Decision Tree detection Decision Tree (error = 5%) detection

13 Decision Tree (error = 5% 2%) detection Machine Learning (cont.) detection

14 Feature Extraction detection Challenges: Machine Learning detection Machine Learning Challenges: Instances are not really independent (graph) Learning with few examples Scalability

15 Challenges: Information Retrieval detection Information Retrieval Challenges: Feature extraction: which features? Feature aggregation: page/host/domain Feature propagation (graph) Recall/precision tradeoffs Scalability detection Topological 6 7 detection 8 9

16 Data is really important detection It is dangerous for a search engine to provide labelled data for this Even if they do, it would never reflect a consensus Assembling Process detection Crawling of base data Elaboration of the guidelines and classification interface Labeling Post-processing

17 Crawling of base data detection U.K. collection 77.9 M pages downloaded from the.uk domain in May 26 (LAW, University of Milan) Large seed of about 5,.uk hosts,4 hosts 8 levels depth, with <=5, pages per host Classification interface detection

18 Labeling process detection We asked 2+ volunteers to classify entire hosts Asked to classify normal / borderline / spam Do they agree? Mostly... Agreement detection

19 Results Labels Label Frequency Percentage Normal 4, % Borderline 79.82%, % Can not classify % detection Agreement Category Kappa Interpretation normal.62 Substantial agreement spam.63 Substantial agreement borderline. Slight agreement global.56 Moderate agreement Result: first public collection detection Public spam collection Labels for 6,552 hosts 2,725 hosts classified by at least 2 humans 3,6 automatically considered normal (.ac.uk,.sch.uk,.gov.uk,.mod.uk,.nhs.uk or.police.uk) Upcoming challenge Track I: Information retrieval + Machine learning Track II: Machine learning AIRWeb 27 Workshop (challenge results available) Regular and short papers Track I of the Challenge

20 AIRWeb 27 in Banff, Canada detection detection Topological 6 7 detection 8 9

21 detection Scale-free networks detection

22 How to find meaningful patterns? detection Several levels of analysis: Macroscopic view: overall structure Microscopic view: nodes Mesoscopic view: regions Macroscopic view, e.g. Bow-tie detection [Broder et al., 2]

23 Macroscopic view, e.g. Bow-tie, migration detection [Baeza-Yates and Poblete, 26] Macroscopic view, e.g. Jellyfish detection [Tauro et al., 2] - Internet Autonomous Systems (AS) Topology

24 Macroscopic view, e.g. Jellyfish detection Microscopic view, e.g. Degree detection [Barabási, 22] and others

25 detection Microscopic view, e.g. Degree Greece Spain Chile Korea [Baeza-Yates et al., 26b] - compares this distribution in 8 countries... guess what is the result? Mesoscopic view, e.g. Hop-plot detection

26 Mesoscopic view, e.g. Hop-plot detection Mesoscopic view, e.g. Hop-plot detection Frequency.it (4M pages) Distance.eu.int (8K pages).3 Frequency.uk (8M pages) Distance Synthetic graph (K pages).3 Frequency.2. Frequency Distance [Baeza-Yates et al., 26a] Distance

27 detection detection Topological 6 7 detection 8 9

28 Topological spam: link farms detection Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 25] Motivation detection Fetterly [Fetterly et al., 24] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: in a number of these distributions, outlier values are associated with web spam

29 Handling large graphs detection For large graphs, random access is not possible. Large graphs do not fit in main memory Streaming model of computation Semi-streaming model detection Memory size enough to hold some data per-node Disk size enough to hold some data per-edge A small number of passes over the data

30 Restriction detection Semi-streaming model: graph on disk : for node :... N do 2: INITIALIZE-MEM(node) 3: end for 4: for distance :... d do {Iteration step} 5: for src :... N do {Follow links in the graph} 6: for all links from src to dest do 7: COMPUTE(src,dest) 8: end for 9: end for : NORMALIZE : end for 2: POST-PROCESS 3: return Something Link-Based Features detection Degree-related measures PageRank TrustRank [Gyöngyi et al., 24] Truncated PageRank [Becchetti et al., 26] Estimation of supporters [Becchetti et al., 26] 4 features per host (2 pages per host)

31 Degree-Based.2. Normal detection.4.2 Normal TrustRank Idea detection

32 TrustRank / PageRank..9 Normal detection e+ 4e+ e+2 3e+2 e+3 3e+3 9e+3 detection Topological 6 7 detection 8 9

33 High and low-ranked pages are different detection Number of Nodes x Top % % Top 4% 5% Top 6% 7% Distance Areas below the curves are equal if we are in the same strongly-connected component Probabilistic counting Propagation of bits using the OR operation detection Target page Count bits set to estimate supporters [Becchetti et al., 26] shows an improvement of ANF algorithm [Palmer et al., 22] based on probabilistic counting [Flajolet and Martin, 985]

34 Bottleneck number detection b d (x) = min j d { N j (x) / N j (x) }. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages Normal detection Topological 6 7 detection 8 9

35 Content-Based Features detection Most of the features reported in [Ntoulas et al., 26] Number of word in the page and title Average word length Fraction of anchor text Fraction of visible text Compression rate Corpus precision and corpus recall Query precision and query recall Independent trigram likelihood Entropy of trigrams 96 features per host Average word length detection Normal Figure: Histogram of the average word length in non-spam vs. spam pages for k = 5.

36 Corpus precision detection Normal Figure: pages. Histogram of the corpus precision in non-spam vs. spam Query precision detection Normal Figure: Histogram of the query precision in non-spam vs. spam pages for k = 5.

37 detection Topological 6 7 detection 8 9 General hypothesis detection Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages. Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2]

$detection Topological dependencies: in-links Histogram of fraction of spam hosts in the in-links = no in-link comes from$

38 detection Topological dependencies: in-links Histogram of fraction of spam hosts in the in-links = no in-link comes from spam hosts = all of the in-links come from spam hosts detection In-links of non spam In-links of spam

39 Topological dependencies: out-links Histogram of fraction of spam hosts in the out-links = none of the out-links points to spam hosts = all of the out-links point to spam hosts detection Out-links of non spam Outlinks of spam Idea : Clustering detection Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting Baseline Clustering Without bagging True positive rate 75.6% 74.5% False positive rate 8.5% 6.8% F-Measure With bagging True positive rate 78.7% 76.9% False positive rate 5.7% 5.% F-Measure V Reduces error rate

40 Idea 2: Propagate the label detection Classify, then interpret spamicity as a probability, then do a random walk with restart from those nodes Baseline Fwds. Backwds. Both Classifier without bagging True positive rate 75.6% 7.9% 69.4% 7.4% False positive rate 8.5% 6.% 5.8% 5.8% F-Measure Classifier with bagging True positive rate 78.7% 76.5% 75.% 75.2% False positive rate 5.7% 5.4% 4.3% 4.7% F-Measure Idea 3: Stacked graphical learning detection Classify, then add the average predicted spamicity of neighbors as a new feature for each node, then classify again[cohen and Kou, 26] Avg. Avg. Avg. Baseline of in of out of both True positive rate 78.7% 84.4% 78.3% 85.2% False positive rate 5.7% 6.7% 4.8% 6.% F-Measure V Increases detection rate

41 Idea 3: Stacked graphical learning x2 detection And repeat... Baseline First pass Second pass True positive rate 78.7% 85.2% 88.4% False positive rate 5.7% 6.% 6.3% F-Measure V Significant improvement over the baseline detection Topological 6 7 detection 8 9

42 Concluding remarks detection V The UK-26-5 dataset is harder than previous datasets V Considering content-based and link-based attributes improves the accuracy V Considering the dependencies improves the accuracy Thank you! detection

43 detection Baeza-Yates, R., Boldi, P., and Castillo, C. (26a). Generalizing pagerank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 38 35, Seattle, Washington, USA. ACM Press. Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (26b). Characterization of national web domains. To appear in ACM TOIT. Baeza-Yates, R. and Poblete, B. (26). Dynamics of the chilean web structure. Comput. Networks, 5(): Barabási, A.-L. (22). Linked: The New Science of Networks. Perseus Books Group. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (26). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. detection Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 39 32, Amsterdam, Netherlands. ACM Press. Chellapilla, K. and Maykov, A. (27). A taxonomy of javascript redirection spam. In AIRWeb 7: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 8 88, New York, NY, USA. ACM Press. Cohen, W. W. and Kou, Z. (26). Stacked graphical learning: approximating learning in markov random fields using very short inhomogeneous markov chains. Technical report. Davison, B. D. (2). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages , Athens, Greece. ACM Press. Fetterly, D., Manasse, M., and Najork, M. (24)., damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 6, Paris, France.

44 detection Flajolet, P. and Martin, N. G. (985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 3(2): Gibson, D., Kumar, R., and Tomkins, A. (25). Discovering large dense subgraphs in massive graphs. In VLDB 5: Proceedings of the 3st international conference on Very large data bases, pages VLDB Endowment. Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (24). Combating Web spam with TrustRank. In Proceedings of the 3th International Conference on Very Large Data Bases (VLDB), pages , Toronto, Canada. Morgan Kaufmann. Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (26). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83 92, Edinburgh, Scotland. Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (22). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 8 9, New York, NY, USA. ACM Press. Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2). A simple conceptual model for the internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda