Previous: how search engines work


Ricardo Baeza-Yates (1,3)
With: L. Becchetti (2), P. Boldi (5), C. Castillo (1), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)

1. Yahoo! Research Barcelona, Catalunya, Spain
2. Università di Roma La Sapienza, Rome, Italy
3. Yahoo! Research Santiago, Chile
4. ISTI-CNR, Pisa, Italy
5. Università degli Studi di Milano, Milan, Italy

Search engine: issues
- Scalability (crawling, indexing, searching, ranking)
- Relevance (query to document match)
- Static ranking (content quality)
- Incentives for cheating ($)

This is a talk about academic research!
- Tools for dealing with spam


The Web
"The sum of all human knowledge plus porn" (Robert Gilbert)

Adversarial IR Issues on the Web
- Link spam
- Content spam
- Cloaking
- Comment/forum/wiki spam
- Spam-oriented blogging
- Click fraud 2
- Reverse engineering of ranking algorithms
- Web content filtering
- Advertisement blocking
- Stealth crawling
- Malicious tagging
- ... more?

Opportunities for Web spam
- Spamdexing
- Keyword stuffing
- Link farms
- Spam blogs (splogs)
- Cloaking

Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.

Naïve spam

Hidden text

Made for Advertising

Search engine?

Fake search engine

Normal content in link farms

Cloaking

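Redirection

Redirects using Javascript

Simple redirect:
    <script>
    // The target URL is elided in the source; example.com is a placeholder.
    document.location = "http://www.example.com/";
    </script>

Hidden redirect:
    <script>
    // The condition is always true, so the redirect always fires.
    var1 = 24; var2 = var1;
    if (var1 == var2) {
      document.location = "http://www.example.com/";  // placeholder URL
    }
    </script>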

Problem: obfuscated code

Obfuscated redirect:
    <script>
    // The call is assembled from string fragments so that "window.location"
    // never appears literally. The URL is elided in the source; example.com
    // is a placeholder.
    var a1="win", a2="dow", a3="loca", a4="tion.", a5="replace",
        a6="('http://www.example.com/')";
    var i, str="";
    for (i=1; i<=6; i++) { str += eval("a"+i); }
    eval(str);
    </script>

Problem: really obfuscated code

Encoded javascript:
    <script>
    var s = "%5CBED%5C%5GDHJ BDE%6...%4%E";
    var e =, i;
    eval(unescape( s%edunescape%28s%29%3bfor...%3b ));
    </script>

More examples: [Chellapilla and Maykov, 2007]

Machine Learning

Training of a Decision Tree

Decision Tree (error = 5%)

Decision Tree (error = 5% 2%)

Machine Learning (cont.)

Feature Extraction

Challenges: Machine Learning
- Instances are not really independent (graph)
- Learning with few examples
- Scalability

Challenges: Information Retrieval
- Feature extraction: which features?
- Feature aggregation: page/host/domain
- Feature propagation (graph)
- Recall/precision tradeoffs
- Scalability

Data is really important
- It is dangerous for a search engine to provide labelled data for this
- Even if they do, it would never reflect a consensus

Assembling Process
- Crawling of base data
- Elaboration of the guidelines and classification interface
- Labeling
- Post-processing

Crawling of base data
- U.K. collection
- 77.9 M pages downloaded from the .uk domain in May 2006 (LAW, University of Milan)
- Large seed of about 5, .uk hosts
- ,4 hosts
- 8 levels depth, with <=5, pages per host

Classification interface

Labeling process
- We asked 2+ volunteers to classify entire hosts
- Asked to classify normal / borderline / spam
- Do they agree? Mostly...

Agreement

Results

Labels (Label, Frequency, Percentage): Normal 4, %; Borderline 79.82%, %; Can not classify %

Agreement
Category | Kappa | Interpretation
normal | 0.62 | Substantial agreement
spam | 0.63 | Substantial agreement
borderline | . | Slight agreement
global | 0.56 | Moderate agreement

Result: first public collection

Public spam collection
- Labels for 6,552 hosts
- 2,725 hosts classified by at least 2 humans
- 3,6 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk)

Upcoming challenge
- Track I: Information retrieval + Machine learning
- Track II: Machine learning

AIRWeb 2007 Workshop (challenge results available)
- Regular and short papers
- Track I of the Challenge
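The agreement figures above are kappa statistics. As a minimal sketch of how such a value is obtained, the following computes Cohen's kappa for two raters. The slides do not say which kappa variant was used for the collection, so the two-rater form, the function name `cohen_kappa`, and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both raters pick the same category
    # if each labels at random according to their own category frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, not taken from the collection.
rater_1 = ["normal", "normal", "spam", "spam", "normal", "borderline"]
rater_2 = ["normal", "normal", "spam", "normal", "normal", "spam"]
kappa = cohen_kappa(rater_1, rater_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most hosts are labelled normal.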

AIRWeb 2007 in Banff, Canada

Scale-free networks

How to find meaningful patterns?
Several levels of analysis:
- Macroscopic view: overall structure
- Microscopic view: nodes
- Mesoscopic view: regions

Macroscopic view, e.g. Bow-tie [Broder et al., 2000]

Macroscopic view, e.g. Bow-tie, migration [Baeza-Yates and Poblete, 2006]

Macroscopic view, e.g. Jellyfish [Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology

Microscopic view, e.g. Degree [Barabási, 2002] and others

Microscopic view, e.g. Degree: Greece, Spain, Chile, Korea. [Baeza-Yates et al., 2006b] compares this distribution in 8 countries... guess what is the result?

Mesoscopic view, e.g. Hop-plot
Figure: hop-plots (frequency vs. distance) for .it, .eu.int, .uk and a synthetic graph [Baeza-Yates et al., 2006a]
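A frequency-versus-distance plot of this kind can be computed on a small graph with one breadth-first search per node. This is only a sketch: the function name `distance_histogram` and the toy graph are hypothetical, and exhaustive BFS would not scale to the crawls named in the figure, where approximate neighbourhood counting is used instead.

```python
from collections import Counter, deque


def distance_histogram(graph):
    """Return {distance: fraction of reachable ordered pairs at that distance}.

    `graph` maps each node to the list of nodes it links to.
    """
    counts = Counter()
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    counts[dist[nxt]] += 1
                    queue.append(nxt)
    total = sum(counts.values())
    if not total:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}


# Toy graph: a cycle a -> b -> c -> a with one extra page d linked from c.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
hist = distance_histogram(toy)
```

Plotting `hist` with distance on the x-axis gives the same kind of curve as each panel of the figure.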


Topological spam: link farms
Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]

Motivation
[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: in a number of these distributions, outlier values are associated with web spam
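The idea of looking for groups of nodes that share their out-links can be illustrated with a deliberately simplified sketch that groups pages with identical out-link sets. This is not the method of [Gibson et al.], which works on massive graphs; the function name, the thresholds `min_group` and `min_links`, and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def groups_sharing_outlinks(out_links, min_group=3, min_links=2):
    """Group pages whose sets of out-links are identical.

    Groups with at least `min_group` members are returned as candidates
    for a single-level link farm.
    """
    by_target_set = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        if len(key) >= min_links:
            by_target_set[key].append(node)
    return sorted(sorted(group)
                  for group in by_target_set.values()
                  if len(group) >= min_group)


# Hypothetical toy graph: three farm pages all point to the same two targets.
toy = {
    "farm1": ["target", "hub"],
    "farm2": ["hub", "target"],
    "farm3": ["target", "hub"],
    "blog": ["news", "target"],
    "news": ["blog"],
}
suspects = groups_sharing_outlinks(toy)
```

Exact set equality is the crudest possible test. A more tolerant variant would compare out-link sets by overlap, at the cost of more comparisons.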

Handling large graphs
- For large graphs, random access is not possible.
- Large graphs do not fit in main memory
- Streaming model of computation

Semi-streaming model
- Memory size enough to hold some data per-node
- Disk size enough to hold some data per-edge
- A small number of passes over the data

Restriction

Semi-streaming model: graph on disk

    for node : 1 ... N do
        INITIALIZE-MEM(node)
    end for
    for distance : 1 ... d do {Iteration step}
        for src : 1 ... N do {Follow links in the graph}
            for all links from src to dest do
                COMPUTE(src, dest)
            end for
        end for
        NORMALIZE
    end for
    POST-PROCESS
    return Something

Link-Based Features
- Degree-related measures
- PageRank
- TrustRank [Gyöngyi et al., 2004]
- Truncated PageRank [Becchetti et al., 2006]
- Estimation of supporters [Becchetti et al., 2006]
- 4 features per host (2 pages per host)
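The semi-streaming template above can be instantiated for plain PageRank as follows. This is a sketch under stated assumptions: the edge-file format (one `src dest` pair per line), the names `read_edges` and `semi_streaming_pagerank`, and the handling of pages without out-links are choices made for this example and are not taken from the slides.

```python
import os
import tempfile


def read_edges(path):
    """Stream (src, dest) pairs from a text file with one 'src dest' pair per line."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                src, dest = line.split()
                yield int(src), int(dest)


def semi_streaming_pagerank(edge_path, num_nodes, passes=20, damping=0.85):
    # INITIALIZE-MEM: only per-node data (out-degree and score) is kept in memory.
    out_degree = [0] * num_nodes
    for src, _ in read_edges(edge_path):
        out_degree[src] += 1
    rank = [1.0 / num_nodes] * num_nodes

    for _ in range(passes):  # iteration step
        incoming = [0.0] * num_nodes
        # One sequential pass over the edges on disk.
        for src, dest in read_edges(edge_path):
            incoming[dest] += rank[src] / out_degree[src]  # COMPUTE(src, dest)
        # NORMALIZE: spread the score of pages without out-links, apply damping.
        dangling = sum(r for r, deg in zip(rank, out_degree) if deg == 0)
        rank = [(1.0 - damping) / num_nodes
                + damping * (inc + dangling / num_nodes)
                for inc in incoming]
    return rank


# Toy graph: 0 -> 1 -> 2 -> 0, plus node 3, which links to 0 and has no in-links.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0 1\n1 2\n2 0\n3 0\n")
ranks = semi_streaming_pagerank(tmp.name, num_nodes=4)
os.remove(tmp.name)
```

Memory use is proportional to the number of nodes, and the edge list is only ever read front to back, which is what makes the scheme workable when the graph itself does not fit in main memory.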

Degree-Based (figure: distributions for normal vs. spam hosts)

TrustRank Idea

TrustRank / PageRank (figure: distribution for normal vs. spam hosts)

High and low-ranked pages are different
Figure: number of nodes vs. distance, for groups of pages by rank percentile.
Areas below the curves are equal if we are in the same strongly-connected component.

Probabilistic counting
- Propagation of bits using the OR operation
- Count bits set to estimate supporters
- [Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985]
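The bit-propagation idea can be sketched as follows. Each page starts with a random bit vector, vectors are OR-ed along links for a fixed number of steps, and the number of bits set at a page is turned into an estimate of how many pages support it. The estimator used here, log(1 - ones/bits) / log(1 - eps), the parameter values, and all names are assumptions for illustration; the refinements described in [Becchetti et al.] are not reproduced.

```python
import math
import random


def estimate_supporters(in_links, nodes, distance, bits=4096, eps=0.01, seed=7):
    """Estimate, per node, how many pages reach it within `distance` links.

    The node itself is included in the count. `in_links[x]` lists the pages
    that link to x.
    """
    rng = random.Random(seed)
    vectors = {}
    for node in nodes:
        mask = 0
        for i in range(bits):
            if rng.random() < eps:  # each bit is set independently with probability eps
                mask |= 1 << i
        vectors[node] = mask

    for _ in range(distance):
        updated = {}
        for node in nodes:
            merged = vectors[node]
            for supporter in in_links.get(node, ()):
                merged |= vectors[supporter]  # propagation of bits using OR
            updated[node] = merged
        vectors = updated

    estimates = {}
    for node in nodes:
        ones = bin(vectors[node]).count("1")
        if ones == bits:
            estimates[node] = float("inf")  # vector saturated, no estimate possible
        else:
            estimates[node] = math.log(1.0 - ones / bits) / math.log(1.0 - eps)
    return estimates


# Toy graph: 50 pages link to "target"; nothing links to the leaves.
leaves = ["leaf%d" % i for i in range(50)]
nodes = leaves + ["target"]
in_links = {"target": leaves}
estimates = estimate_supporters(in_links, nodes, distance=1)
```

The appeal of the approach is that every page needs only a fixed-size bit vector, so supporters can be estimated for all pages at once with a few sequential passes over the links.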

Bottleneck number
b_d(x) = \min_{j \le d} \{ N_j(x) / N_{j-1}(x) \}
Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph, and that they have smaller bottleneck numbers than non-spam pages.
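As a worked example of the bottleneck number, the sketch below computes it exactly on a toy graph. It assumes that N_j(x) counts the pages that can reach x in at most j links, with x itself counted at distance 0. The slide does not spell out this convention, so treat it, along with the function name and the toy data, as an assumption.

```python
def bottleneck_number(in_links, x, max_distance):
    """Minimum growth ratio N_j(x) / N_{j-1}(x) for 1 <= j <= max_distance.

    `in_links[y]` lists the pages that link to y. N_j(x) is taken here as
    the number of pages within distance j of x, including x itself.
    """
    if max_distance < 1:
        raise ValueError("max_distance must be at least 1")
    reached = {x}
    frontier = {x}
    sizes = [1]  # N_0(x) = 1 under the convention above
    for _ in range(max_distance):
        frontier = {s for node in frontier
                    for s in in_links.get(node, ())} - reached
        reached |= frontier
        sizes.append(len(reached))
    return min(sizes[j] / sizes[j - 1] for j in range(1, max_distance + 1))


# Toy graph: a and b link to x, and c links to both a and b.
toy_in_links = {"x": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
b2 = bottleneck_number(toy_in_links, "x", 2)  # sizes 1, 3, 4
b3 = bottleneck_number(toy_in_links, "x", 3)  # sizes 1, 3, 4, 4
```

A value close to 1 means the neighbourhood stops growing at some distance, which is the isolation pattern the slide associates with spam.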

Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]:
- Number of words in the page and title
- Average word length
- Fraction of anchor text
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
96 features per host

Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 5.
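A few of the features listed above are simple enough to sketch directly. The following computes word count, average word length, a compression rate, and trigram entropy for a piece of text. The exact definitions (tokenisation by `\w+`, compression rate as uncompressed size over zlib-compressed size, entropy over word trigrams in bits) are assumptions made for this example and may differ from those used for the collection.

```python
import math
import re
import zlib
from collections import Counter


def content_features(text):
    """Compute a few content-based features for one page of text."""
    words = re.findall(r"\w+", text.lower())
    raw = text.encode("utf-8")
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = 0.0
    if total:
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in trigrams.values())
    return {
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Highly repetitive pages compress well and get a high value.
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }


# Hypothetical examples: a keyword-stuffed page and an ordinary sentence.
stuffed = content_features("cheap pills " * 200)
normal = content_features(
    "The university library opens at nine and closes late "
    "on weekdays during term")
```

Keyword stuffing shows up as a high compression rate and a low trigram entropy, which is why both are useful inputs to a classifier.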

Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.

Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 5.

General hypothesis
- Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
- Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]

Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
- 0 = no in-link comes from spam hosts
- 1 = all of the in-links come from spam hosts
Figure panels: In-links of non spam; In-links of spam
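The quantity plotted in these histograms can be computed per host as shown in this sketch. The function name, the choice to ignore in-links from unlabelled hosts, and the toy data are assumptions for illustration.

```python
def spam_fraction_of_in_links(in_links, labels):
    """For each host, the fraction of its labelled in-linking hosts that are spam.

    Hosts whose in-links all come from unlabelled hosts are skipped.
    """
    fractions = {}
    for host, sources in in_links.items():
        labelled = [s for s in sources if s in labels]
        if labelled:
            spam = sum(1 for s in labelled if labels[s] == "spam")
            fractions[host] = spam / len(labelled)
    return fractions


# Hypothetical toy data; "x" is an unlabelled host.
labels = {"a": "spam", "b": "spam", "c": "normal", "d": "normal"}
in_links = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c", "x"]}
fractions = spam_fraction_of_in_links(in_links, labels)
```

Splitting `fractions` by the label of the host itself and plotting the two histograms reproduces the comparison shown in the figure. Passing out-links instead of in-links gives the out-link version.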

Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
- 0 = none of the out-links points to spam hosts
- 1 = all of the out-links point to spam hosts
Figure panels: Out-links of non spam; Out-links of spam

Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting

 | Baseline | Clustering
Without bagging | |
True positive rate | 75.6% | 74.5%
False positive rate | 8.5% | 6.8%
F-Measure | |
With bagging | |
True positive rate | 78.7% | 76.9%
False positive rate | 5.7% | 5.%
F-Measure | |

✓ Reduces error rate
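The majority-voting step of the clustering idea can be sketched as follows. The clustering itself is taken as given, since the slides do not say which graph clustering algorithm was used. The function name, the rule that ties go to "normal", and the toy data are assumptions.

```python
from collections import defaultdict


def smooth_by_cluster(predicted, cluster_of):
    """Relabel every host with the majority label of its cluster.

    `predicted` maps host -> "spam" or "normal" (classifier output);
    `cluster_of` maps host -> cluster id. Ties are resolved as "normal".
    """
    votes = defaultdict(list)
    for host, label in predicted.items():
        votes[cluster_of[host]].append(label)
    majority = {}
    for cluster, cluster_labels in votes.items():
        spam_votes = sum(1 for label in cluster_labels if label == "spam")
        majority[cluster] = ("spam" if 2 * spam_votes > len(cluster_labels)
                             else "normal")
    return {host: majority[cluster_of[host]] for host in predicted}


# Hypothetical classifier output and clustering.
predicted = {"h1": "spam", "h2": "spam", "h3": "normal",
             "h4": "normal", "h5": "spam", "h6": "normal"}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1, "h6": 1}
smoothed = smooth_by_cluster(predicted, cluster_of)
```

In the toy data, `h3` is flipped to spam and `h5` to normal, because each disagrees with the other two hosts in its cluster.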

Idea 2: Propagate the label
Classify, then interpret spamicity as a probability, then do a random walk with restart from those nodes

 | Baseline | Fwds. | Backwds. | Both
Classifier without bagging | | | |
True positive rate | 75.6% | 7.9% | 69.4% | 7.4%
False positive rate | 8.5% | 6.% | 5.8% | 5.8%
F-Measure | | | |
Classifier with bagging | | | |
True positive rate | 78.7% | 76.5% | 75.% | 75.2%
False positive rate | 5.7% | 5.4% | 4.3% | 4.7%
F-Measure | | | |

Idea 3: Stacked graphical learning
Classify, then add the average predicted spamicity of neighbors as a new feature for each node, then classify again [Cohen and Kou, 2006]

 | Baseline | Avg. of in | Avg. of out | Avg. of both
True positive rate | 78.7% | 84.4% | 78.3% | 85.2%
False positive rate | 5.7% | 6.7% | 4.8% | 6.%
F-Measure | | | |

✓ Increases detection rate
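The label propagation in Idea 2 can be sketched as a random walk with restart whose restart distribution is proportional to the predicted spamicity. The restart probability, the number of iterations, the treatment of dead ends, and all names below are assumptions for this example. The forward and backward variants correspond to passing the link graph or its reverse as `links`.

```python
def propagate_spamicity(links, spamicity, restart=0.15, iterations=50):
    """Random walk with restart; the walker restarts at nodes in proportion
    to their predicted spamicity.

    `links[n]` lists the nodes the walker can move to from n.
    Returns a score per node; the scores sum to 1.
    """
    nodes = list(spamicity)
    total = sum(spamicity.values())
    if total <= 0:
        raise ValueError("at least one node needs positive spamicity")
    start = {n: spamicity[n] / total for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            targets = [t for t in links.get(n, ()) if t in start]
            if targets:
                share = (1.0 - restart) * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dead end: the walker restarts instead of following a link.
                for m in nodes:
                    nxt[m] += (1.0 - restart) * score[n] * start[m]
        score = nxt
    return score


# Hypothetical toy graph: two spam-like hosts link to each other,
# as do two normal-looking hosts.
links = {"s1": ["s2"], "s2": ["s1"], "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.5, "n1": 0.1, "n2": 0.1}
scores = propagate_spamicity(links, spamicity)
```

In the toy graph, `s2` starts with a middling prediction but ends with a score well above the normal hosts because it sits next to `s1`. Turning scores into labels still needs a threshold, which is not shown here.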

Idea 3: Stacked graphical learning x2
And repeat...

 | Baseline | First pass | Second pass
True positive rate | 78.7% | 85.2% | 88.4%
False positive rate | 5.7% | 6.% | 6.3%
F-Measure | | |

✓ Significant improvement over the baseline
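The stacked graphical learning loop can be sketched with the classifier left as a pluggable function. Everything named below is hypothetical: `train_and_predict` stands for whatever classifier is used, and `mean_score` is a trivial stand-in that only makes the example runnable, not a real learner.

```python
def neighbor_average(predictions, neighbors):
    """Average predicted spamicity over each host's neighbors.

    Hosts without any scored neighbor fall back to their own prediction.
    """
    averaged = {}
    for host in predictions:
        linked = [predictions[n] for n in neighbors.get(host, ())
                  if n in predictions]
        averaged[host] = sum(linked) / len(linked) if linked else predictions[host]
    return averaged


def stacked_learning(features, neighbors, train_and_predict, passes=2):
    """Classify, add the neighbors' average spamicity as a feature, classify again."""
    predictions = train_and_predict(features)
    for _ in range(passes):
        extra = neighbor_average(predictions, neighbors)
        augmented = {h: list(features[h]) + [extra[h]] for h in features}
        predictions = train_and_predict(augmented)
    return predictions


def mean_score(feature_table):
    # Stand-in for a real classifier: spamicity is the mean of the features.
    return {h: sum(v) / len(v) for h, v in feature_table.items()}


# Hypothetical toy data: c looks borderline but is linked with a and b.
features = {"a": [0.9], "b": [0.8], "c": [0.4], "d": [0.1]}
neighbors = {"a": ["b"], "b": ["a"], "c": ["a", "b"], "d": []}
one_pass = stacked_learning(features, neighbors, mean_score, passes=1)
```

With the stand-in classifier, one pass raises the score of `c` from 0.4 to 0.625 because its neighbors look like spam, while the isolated host `d` is unchanged. Setting `passes=2` corresponds to the "and repeat" step.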

Concluding remarks
✓ The UK-2006-05 dataset is harder than previous datasets
✓ Considering content-based and link-based attributes improves the accuracy
✓ Considering the dependencies improves the accuracy

Thank you!

Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, Seattle, Washington, USA. ACM Press.

Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national web domains. To appear in ACM TOIT.

Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks.

Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, Amsterdam, Netherlands. ACM Press.

Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb '07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, New York, NY, USA. ACM Press.

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2).

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment.

Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada. Morgan Kaufmann.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, Edinburgh, Scotland.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA. ACM Press.

Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.


More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information
