Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014

Examples of Genre Specific Search Engines MedlinePlus Naukri.com DirectIndustry RecipeBridge MobileWalla

Outline Introduction Focused Crawlers for Sandhan URL Based Genre Identification Seed Selection for Genre Specific Search Measuring Diversity of a Genre Specific Crawl

Architecture of a Web Search Engine Figure: Overview of a Web Search Engine (Source: http://www.ibm.com/developerworks/library/wa-lucene2/)

Focused Crawler Figure: Difference between a Normal Crawler and a Focused Crawler

Challenges How to decide the relevance of a web page? How to achieve a high recall (a severe problem for Indian-language genre-specific search engines)? How to select seeds for genre-specific search? Which crawling strategy to use?

Outline Introduction Focused Crawlers for Sandhan URL Based Genre Identification Seed Selection for Genre Specific Search Measuring Diversity of a Genre Specific Crawl

Introduction to Sandhan Sandhan is a search engine for Indian-language content in domains such as tourism and health. There are two ways to gather content for Sandhan: using a generic crawler followed by a genre identifier, or building a focused crawler.

Drawbacks of previous approaches None of the previous approaches can be directly used for Sandhan, for the following reasons: Most previous approaches report the precision of their focused crawlers, but none of them reports recall or crawl coverage, and good crawl coverage is essential for offering domain-specific search. Most existing approaches work for closed domains but not for open domains such as tourism or health [2]. No prior work has been done on building focused crawlers for Indian languages. [2] D. R. Fesenmaier, Domain Specific Search Engines

Challenges in building Indian-language focused crawlers The combination of an Indian language and an open domain poses the following challenges: proprietary fonts, other-language content, language identification, scarcity of content, and lack of training data. Pingali et al. proposed a working approach that overcomes some of these challenges, but it does not address issues related to domain-specific search.

Our Approach In this work we build a classifier-guided focused crawler for Hindi tourism and health pages. We use a Naïve Bayes classifier that labels a page as tourism, health or general. For the classifier to work well on real-life data we need to model the actual distribution of pages on the web, so the crux lies in collecting representative training data; a sketch of such a classifier follows.
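As an illustration, here is a minimal sketch of a three-way web page classifier built with scikit-learn's multinomial Naïve Bayes. The slides do not name an implementation, so the pipeline and the toy training documents below are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training documents; in practice the labelled pages come from the
# query-based data collection described on the following slides.
pages = ["taj mahal agra travel guide",
         "malaria symptoms and treatment",
         "cricket match score today"]
labels = ["tourism", "health", "general"]

wpc = make_pipeline(CountVectorizer(), MultinomialNB())  # web page classifier
wpc.fit(pages, labels)
print(wpc.predict(["agra travel itinerary"]))  # e.g. ['tourism']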

Data Collection Define the domains: tourism and health. Query collection: around 30 people from different states in India were each asked to provide 3 regional queries, 3 non-regional queries and 3 health queries, along with their translations in English and Hindi. In this way we collected 110 tourism queries and 60 health queries.

Firing Queries onto a Search Engine
Table: Statistics of data collected

Domain         No. of Web Pages
Tourism        885
Health         978
Miscellaneous  1354

Web Page Classification
Table: Confusion matrix of the Naïve Bayes classifier (rows: actual class; columns: predicted class, followed by per-class scores)

               Tourism  Health  Miscellaneous  Precision  Recall  F-Score
Tourism        502      48      78             0.85       0.80    0.83
Health         6        500     45             0.75       0.91    0.82
Miscellaneous  80       121     707            0.85       0.78    0.81

Focused Crawler We use a simple crawling strategy: if a page is relevant, its outlinks are likely to be relevant as well. The relevance of a page is judged using the web page classifier (WPC) described above; a sketch of the crawl loop follows. Intuition behind our approach: we assume two properties about the nature of the web. Linkage locality: web pages on a given topic are more likely to link to pages on the same topic. Sibling locality: if a web page points to certain pages on a given topic, it is likely to point to other pages on the same topic.
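A minimal sketch of this strategy, reusing the classifier wpc from above; the fetch and extract_links helpers are hypothetical placeholders for page download and link extraction.

from collections import deque

def focused_crawl(seeds, wpc, fetch, extract_links, max_pages=1000):
    # Breadth-first crawl that only expands outlinks of pages the
    # web page classifier judges to be on-genre (tourism or health).
    frontier, seen, relevant = deque(seeds), set(seeds), []
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        text = fetch(url)                        # download + extract text
        if wpc.predict([text])[0] != "general":  # page is relevant
            relevant.append(url)
            for link in extract_links(url, text):
                if link not in seen:             # expand relevant pages only
                    seen.add(link)
                    frontier.append(link)
    return relevant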

Evaluation Metrics We evaluate the quality of the crawl using three metrics: precision, recall and harvest ratio.

Precision = (# of relevant pages fetched) / (total # of pages fetched)   (1)

Recall = (# of relevant pages fetched by the focused crawler) / (total # of relevant pages)   (2)

Recall = (# of relevant pages fetched by the focused crawler) / (# of relevant pages fetched by the unfocused crawler)   (3)

Harvest ratio measures the average number of relevant pages retrieved over different time slices of the crawl.
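For instance, precision and the harvest ratio could be computed from the time-ordered list of fetched pages as in the sketch below; the slice size is an assumption, and is_relevant stands for a relevance judgment such as the WPC's prediction.

def precision(fetched, is_relevant):
    # Equation (1): fraction of fetched pages that are relevant.
    return sum(map(is_relevant, fetched)) / len(fetched)

def harvest_ratio(fetched, is_relevant, slice_size=100):
    # Average fraction of relevant pages over time slices of the crawl.
    slices = [fetched[i:i + slice_size]
              for i in range(0, len(fetched), slice_size)]
    return sum(precision(s, is_relevant) for s in slices) / len(slices)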

Results The URLs of the web pages collected for building the classifier were used as seed URLs, crawling to a depth of 3.
Table: Quality of crawl

Crawler                      Precision  Recall
Unfocused crawler (tourism)  0.105      1
Unfocused crawler (health)   0.106      1
Focused (tourism)            0.4        0.74
Focused (health)             0.36       0.58

Figure: Performance comparison of unfocused and focused crawler in tourism and health domains. Blue and red colours indicate tourism and health domains respectively.

Outline Introduction Focused Crawlers for Sandhan URL Based Genre Identification Seed Selection for Genre Specific Search Measuring Diversity of a Genre Specific Crawl

URL Based Genre Identification URL-based methods have several advantages, and they should be employed when: classification speed must be high; content filtering is needed before an objectionable or advertisement page is downloaded; the page's content is hidden in images or non-standard encodings; annotation needs to be performed on hyperlinks in a personalized web browser without fetching the target page; a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it; or the language of the page needs to be identified.

Introduction No single work has been reported for Indian languages. Most of the existing approaches [1, 2] require a huge amount of resources, such as heavy training data, an already existing corpus, or web-graph information collected from a search engine. Many works, like [1], report their results on toy datasets and hence do not scale to search engines like Sandhan. In this work we build a URL-based genre identification system on a huge dataset using minimal resources and training. [1] What's in a URL? Genre Classification from URLs [2] Topical Host Reputation for Lightweight URL Classification

Previous Approaches Baykan et al. [1] use a combination of character n-grams and SVMs to classify URLs into one of 15 predefined categories. Abramson et al. [2] train a classifier on selectively picked character n-grams (4 to 8), eliminating n-grams that are redundant. Kan et al. [3] train a classifier on an exhaustive set of features, including character n-grams, sequential n-grams, URL components, URL token lengths, and orthographic features such as the number of query parameters and the number of numerals in the URL. Hernández et al. [4] use a clustering-based approach to identify the URL category. [1] Purely URL-Based Topic Classification [2] What's in a URL? Genre Classification from URLs [3] Fast Webpage Classification Using URL Features [4] A Statistical Approach to URL-Based Web Page Clustering

Dataset and System Architecture Dataset: 94,995 Hindi web pages tagged as tourism/health/miscellaneous by the WPC described above. Features: words, character n-grams (4 to 8), and all grams combined.

Approach We use three approaches for URL-based genre identification: list based (a manually prepared list, a list derived from an external corpus, or a list derived via a retrieval system), Naïve Bayes, and incremental Naïve Bayes, which outperforms the other two; a sketch of the incremental variant follows.
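A minimal sketch of incremental Naïve Bayes over URL character n-grams; the hashing vectorizer and batch interface are assumptions, chosen because they allow the model to be updated without refitting from scratch.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: character 4- to 8-grams of the raw URL string,
# hashed to a fixed-width, non-negative feature space.
vec = HashingVectorizer(analyzer="char", ngram_range=(4, 8),
                        alternate_sign=False)
clf = MultinomialNB()
CLASSES = ["tourism", "health", "misc"]

def update(urls, labels):
    # Incremental step: fold in one batch of newly labelled URLs.
    clf.partial_fit(vec.transform(urls), labels, classes=CLASSES)

def predict(urls):
    return clf.predict(vec.transform(urls))

update(["http://example.in/hindi/taj-mahal-yatra"], ["tourism"])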

Results
Table: Incremental Naïve Bayes approach

           Tourism               Health
Feature    P      R      F       P      R      F
Words      0.849  0.834  0.842   0.862  0.704  0.775
4-grams    0.849  0.824  0.836   0.851  0.658  0.742
5-grams    0.858  0.862  0.860   0.865  0.717  0.784
6-grams    0.860  0.873  0.866   0.873  0.750  0.807
7-grams    0.860  0.880  0.870   0.875  0.765  0.816
8-grams    0.858  0.886  0.872   0.873  0.769  0.818
All grams  0.856  0.869  0.862   0.867  0.736  0.796

Comparison
Table: Comparison of the Incremental Naïve Bayes approach with earlier approaches from the literature survey

Approach                 Tourism  Health
Incremental Naïve Bayes  0.858    0.873
Baykan et al. [1]        0.81     0.37
Abramson et al. [2]      0.82     0.39
Kan et al. [3]           0.77     0.30

[1] Purely URL-Based Topic Classification
[2] What's in a URL? Genre Classification from URLs
[3] Fast Webpage Classification Using URL Features

Outline Introduction Focused Crawlers for Sandhan URL Based Genre Identification Seed Selection for Genre Specific Search Measuring Diversity of a Genre Specific Crawl

Introduction Coverage and diversity of the crawl are two pivotal aspects of a genre-specific search engine, and both are largely governed by the initial set of seed URLs. Seed URLs are generally collected manually, and the following factors influence the quality of a seed set: relevance, diversity of the seed set, and seed set size. This task requires domain expertise and manual effort, and no automated system exists for the purpose. In this work we present an approach that automates the collection of seed URLs for domain-specific search, with a special focus on diversity. For this we use Twitter data.

Why Twitter? About 25% of tweets contain URLs [3]. Users post and exchange information about a variety of trending topics such as entertainment, politics and tourism. Twitter has millions of users from different social, geographical and cultural backgrounds, which ensures a diverse audience. Twitter gives users the option of following other users, and the content similarity/overlap between tweets of different users indicates the diversity of user opinions. The trail of tweets over time can be used to ensure temporal diversity. A huge number of new URLs are posted every day, and following these URLs leads to a fresh, up-to-date crawl. [3] http://techcrunch.com/2010/09/14/twitter-event/

System Architecture Figure: System workflow. Genre-specific keywords: manually fed into the system. Twitter search: we searched tweets from November 2009. Tweet validation: filters tweets containing valid URLs. Graph construction: builds an undirected, unweighted graph from the validated tweets. Diversification engine: returns k diverse seed URLs.

Diversification Algorithm
Algorithm 1: Diverse Seed Selection
1: Input: graph G(V, E), number of seeds k, with k < |V|
2: Output: k diverse seed URLs
3: Initialize picked nodes P = {}, eliminated nodes E_l = {}, h = |V| - 1
4: while |P| < k do
5:   Pick a random node n such that n ∈ V, n ∉ P and n ∉ E_l
6:   Add n to P
7:   Add neighbours(n, h) to E_l
8:   if E_l ∪ P = V then
9:     Reinitialize P = {}, E_l = {}
10:    h = h - 1
11:  end if
12: end while
13: Return P
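A possible Python rendering of Algorithm 1, assuming the graph is an adjacency dict mapping each URL to its set of neighbours and that neighbours(n, h) means all nodes within h hops of n; the guard that avoids discarding a complete pick set is an added detail.

import random
from collections import deque

def neighbours(graph, start, h):
    # All nodes reachable from `start` in at most h hops (BFS).
    seen, out, frontier = {start}, set(), deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == h:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                out.add(nxt)
                frontier.append((nxt, depth + 1))
    return out

def diverse_seeds(graph, k):
    # Pick k seeds; after each pick, eliminate everything within h hops.
    # If the graph runs out before k picks, shrink h and restart.
    nodes = set(graph)
    assert k < len(nodes)
    h = len(nodes) - 1
    picked, eliminated = set(), set()
    while len(picked) < k:
        n = random.choice(list(nodes - picked - eliminated))
        picked.add(n)
        eliminated |= neighbours(graph, n, h)
        if picked | eliminated == nodes and len(picked) < k:
            picked, eliminated = set(), set()
            h -= 1
    return picked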

Sample Run of Diversification Algorithm Figure: Example of Algorithm 1, Number of Seeds k=3

Graph Construction Zero similarity (baseline): no two vertices are connected in the graph. Content similarity: two URLs are connected if the tweets that contain them have content overlap above a threshold. URL n-grams similarity: two URLs are connected if their URL n-grams overlap above a threshold. User similarity: two users are considered similar if at least one of them has retweeted the other's tweet. A sketch of the content-similarity construction follows.
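The slides do not specify the overlap measure, so the sketch below uses Jaccard overlap of tweet words as a stand-in for content similarity.

def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def content_similarity_graph(tweets, threshold=0.25):
    # tweets: list of (url, tweet_text) pairs. Two URLs are connected
    # when the word overlap of their tweets exceeds the threshold.
    graph = {url: set() for url, _ in tweets}
    for i, (u1, t1) in enumerate(tweets):
        for u2, t2 in tweets[i + 1:]:
            if u1 != u2 and jaccard(t1, t2) > threshold:
                graph[u1].add(u2)
                graph[u2].add(u1)
    return graph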

Evaluation Metric The diversity of a seed URL set is judged by the diversity of the web crawl that it leads to. We measure the variance across the crawled set of documents, called dispersion: the average squared distance of all documents from the mean,

\mathrm{dispersion} = \frac{\sum_{i=1}^{N} (\vec{d}_i - \vec{\mu})^2}{N}   (4)

where \vec{d}_i is document i represented as a bag-of-words vector, \vec{\mu} is the mean of all \vec{d}_i, and N is the total number of documents selected.
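Equation (4) in code, as a sketch using NumPy and assuming each row of the input matrix is one document's bag-of-words vector:

import numpy as np

def dispersion(doc_vectors):
    # Mean squared Euclidean distance of the documents from their centroid.
    D = np.asarray(doc_vectors, dtype=float)
    mu = D.mean(axis=0)
    return float(((D - mu) ** 2).sum(axis=1).mean())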

Results
Table: Dispersion values for different approaches

Threshold             0.15  0.20  0.25  0.30  0.35
Zero Similarity       35.0  35.0  35.0  35.0  35.0
Content               69.1  36.1  31.7  36.4  38.4
URL                   34.2  39.9  42.7  33.9  39.8
User                  32.0  32.0  32.0  32.0  32.0
Content + URL         49.9  47.6  34.7  59.5  42.6
User + URL            44.4  29.2  38.8  35.4  35.6
User + Content        64.2  29.4  36.6  42.0  32.7
Content + URL + User  62.6  63.4  51.4  32.6  29.6

Seed Selection Comparison
Table: Comparison of our seed collection with other types of seeds

Seed Selection Source        Dispersion
Twitter (using Algorithm 1)  63.4
Web (manual collection)      84.82
ODP                          25.83
Wikipedia                    30.62

Outline Introduction Focused Crawlers for Sandhan URL Based Genre Identification Seed Selection for Genre Specific Search Measuring Diversity of a Genre Specific Crawl

Metrics to Measure Crawl Diversity While measuring the diversity of seed sets we used only one metric: dispersion. Here we propose three new metrics to measure the diversity of a genre-specific crawl; these metrics can be used to evaluate focused crawlers and seed sets with respect to diversity: semantic distance, average cosine similarity, and KL-divergence using topic models.

Semantic Distance

SD(d_x, d_y) = \sum_{i=1}^{k} \sum_{j=1}^{k} \mathrm{WordNetDistance}(w_{xi}, w_{yj})   (5)

where SD(d_x, d_y) is the semantic distance between documents x and y, and w_{xi} and w_{yj} are the i-th word of document x and the j-th word of document y respectively. [4]

\mathrm{D1\ Score} = \frac{\sum_{x=1}^{N} \sum_{y=1}^{N} SD(d_x, d_y)}{N^2}   (6)

[4] http://rednoise.org/rita/wordnet/documentation/riwordnet_met
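A rough sketch of Equation (5); the slides use the RiWordNet distance, for which NLTK's WordNet path similarity serves here as a stand-in (so the exact values will differ), and only the first k words of each document are compared.

from itertools import product
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def word_distance(w1, w2):
    # Stand-in for WordNetDistance: 1 minus the path similarity of the
    # first synsets; unknown or incomparable words count as distance 1.
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 1.0
    sim = s1[0].path_similarity(s2[0]) or 0.0
    return 1.0 - sim

def semantic_distance(words_x, words_y, k=10):
    # Equation (5): sum of pairwise word distances over the first k words.
    return sum(word_distance(wx, wy)
               for wx, wy in product(words_x[:k], words_y[:k]))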

Average Cosine Similarity

\mathrm{ACS} = \frac{\sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} \mathrm{CosineSimilarity}(d_i, d_j)}{\text{number of unique unordered document pairs}}   (7)

\mathrm{D3\ Score} = 1 - \mathrm{ACS}   (8)
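Equations (7) and (8) in code, as a sketch using scikit-learn's cosine similarity over document vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def d3_score(doc_vectors):
    # 1 minus the average cosine similarity over unique unordered pairs.
    sims = cosine_similarity(np.asarray(doc_vectors, dtype=float))
    upper = np.triu_indices(sims.shape[0], k=1)  # pairs with i < j
    return 1.0 - sims[upper].mean()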

KL-Divergence using Topic Models

\mathrm{D4\ Score} = \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} \mathrm{KLDivergence}(t_i, t_j)   (9)

where t_i and t_j represent topic i and topic j, and

\mathrm{KLDivergence}(t_i, t_j) = \sum_{v=1}^{V} t_i(v) \ln\left(\frac{t_i(v)}{t_j(v)}\right)   (10)

where t_i(v) and t_j(v) represent the probabilities of word v in topics i and j respectively, and V represents the vocabulary size. Hence a web crawl covering varied topics will have a higher diversity score than a crawl containing similar topics.
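Equations (9) and (10) in code; the epsilon smoothing for zero probabilities is an added assumption to keep the logarithm finite.

import numpy as np

def kl_divergence(t_i, t_j, eps=1e-12):
    # Equation (10): KL divergence between two topic-word distributions.
    p, q = np.asarray(t_i) + eps, np.asarray(t_j) + eps
    return float((p * np.log(p / q)).sum())

def d4_score(topics):
    # Equation (9): sum of pairwise KL divergences over k topics.
    k = len(topics)
    return sum(kl_divergence(topics[i], topics[j])
               for i in range(k) for j in range(k) if i != j)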

Feature Space For the dispersion (Equation 4) and average cosine similarity (Equation 7) based metrics we use the following two feature spaces: 1. Bag of words (TF-IDF) 2. Context vectors. The algorithm to compute the context vector of a document is given below:

Algorithm 2: Context Vector Algorithm
1: Input: document d, number of word vectors n, window length k
2: Output: context vector of document d
3: Calculate tf-idf scores for all words in d
4: Pick the n words with the highest tf-idf values; call this set W
5: for each word w_i in W do
6:   Find all words within the k-neighbourhood of w_i; call this set w_i^k
7:   Let v_i be the vector containing the tf-idf values of all words in w_i^k
8: end for
9: CV = Centroid(v_1, v_2, ..., v_n)
10: Return CV
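A possible rendering of Algorithm 2, assuming the tf-idf scores have been precomputed into a word-to-score map (the corpus statistics behind them are outside the sketch):

import numpy as np

def context_vector(tokens, tfidf, n=10, k=5):
    # tokens: the document as a list of words; tfidf: word -> tf-idf score.
    vocab = sorted(tfidf)
    index = {w: i for i, w in enumerate(vocab)}
    # Step 4: the n words with the highest tf-idf scores.
    top = sorted(set(tokens), key=lambda w: tfidf.get(w, 0.0),
                 reverse=True)[:n]
    vectors = []
    for w in top:
        vec = np.zeros(len(vocab))
        for pos, tok in enumerate(tokens):
            if tok == w:  # every occurrence of w in the document
                # Steps 6-7: tf-idf values of words in the k-neighbourhood.
                for ctx in tokens[max(0, pos - k):pos + k + 1]:
                    if ctx in index:
                        vec[index[ctx]] = tfidf[ctx]
        vectors.append(vec)
    return np.mean(vectors, axis=0)  # Step 9: centroid of the word vectors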

How to Evaluate the Metrics We evaluate the usefulness of a metric by its ability to distinguish between a more diverse (MD) and a less diverse (LD) web crawl. We experiment on three domains: tourism, health and sports. Figure: Picking URLs from the ODP hierarchy

Results

Table: Average semantic distance based metric
         MD     LD     Ratio
Tourism  0.788  0.779  1.01
Health   0.759  0.721  1.05
Sports   0.764  0.769  0.99

Table: Topic modeling based metric
         MD     LD     Ratio
Tourism  4.475  4.470  1.001
Health   4.645  4.406  1.054
Sports   4.470  4.420  1.011

Table: Similarity based metric
         Bag of Words          Context Vectors
         MD     LD     Ratio   MD     LD     Ratio
Tourism  53.19  31.35  1.69    83.33  46.26  1.77
Health   49.54  18.61  2.66    83.34  26.52  2.91
Sports   37.60  14.07  2.67    55.55  18.21  3.04

Table: Dispersion based metric
         Bag of Words            Context Vectors
         MD      LD      Ratio   MD     LD     Ratio
Tourism  118.31  109.42  1.08    38.37  36.82  1.04
Health   169.58  110.56  1.53    52.79  37.07  1.42
Sports   118.55  100.48  1.17    40.70  36.21  1.12

Comparison with Other Diversity Measures
Table: Comparison of metrics to measure diversity

                                     Tourism               Health                Sports
Approach                             MD     LD     Ratio   MD     LD     Ratio   MD     LD     Ratio
Similarity Based Metric              83.33  46.26  1.77    83.34  26.52  2.91    55.55  18.21  3.04
Refined Diversity Jaccard Index [5]  33.33  27.02  1.23    28.57  20.40  1.39    30.30  19.60  1.54
Sampled Diversity Jaccard Index [6]  32.25  29.41  1.09    33.33  18.16  1.76    31.25  21.73  1.43
Simpson's Inverse Diversity [7]      3.717  3.533  1.05    3.584  2.604  1.37    3.003  3.717  0.81

[5] Efficient Jaccard-Based Diversity Analysis of Large Document Collections
[6] Efficient Jaccard-Based Diversity Analysis of Large Document Collections
[7] Measurement of Diversity

Conclusions Built a focused crawler for Sandhan. Established the use of incremental methods in the context of focused crawling. We were the first to automate the process of seed selection for genre-specific search. Proposed various metrics to measure the diversity of a genre-specific web crawl.

Thank You Questions?