Information Networks. Hacettepe University Department of Information Management DOK 422: Information Networks

Similar documents
Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Information Retrieval. Lecture 10 - Web crawling

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Information Retrieval Spring Web retrieval

Search Engines. Information Retrieval in Practice

Information Retrieval

Searching. Outline. Copyright 2006 Haim Levkowitz. Copyright 2006 Haim Levkowitz

Collection Building on the Web. Basic Algorithm

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

How Does a Search Engine Work? Part 1

CS6200 Information Retreival. Crawling. June 10, 2015

CS47300: Web Information Search and Management

Information Retrieval May 15. Web retrieval

CS47300: Web Information Search and Management

Information Retrieval and Web Search

THE WEB SEARCH ENGINE

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

SE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work?

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

Search Like a Pro. How Search Engines Work. Comparison Search Engine. Comparison Search Engine. How Search Engines Work

Web Search. Web Spidering. Introduction

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Searching the Deep Web

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

THE HISTORY & EVOLUTION OF SEARCH

EECS 395/495 Lecture 5: Web Crawlers. Doug Downey

SEARCH ENGINE INSIDE OUT

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

You got a website. Now what?

Search Engine Techniques: A Review

FAQ: Crawling, indexing & ranking(google Webmaster Help)

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Google Search Appliance

Searching the Deep Web

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Definition. Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

Search Engines. Charles Severance

data analysis - basic steps Arend Hintze

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Web Applications: Internet Search and Digital Preservation

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Component 4: Introduction to Information and Computer Science

Search & Google. Melissa Winstanley

Search Engine Architecture. Hongning Wang

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

The Anatomy of a Large-Scale Hypertextual Web Search Engine

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

Search Quality. Jan Pedersen 10 September 2007

2018 SEO CHECKLIST. Use this checklist to ensure that you are optimizing your website by following these best practices.

Making your agency s sites more accessible to web search engine users. Implementing the Sitemap protocol

Searching the Web for Information

Introduction. What do you know about web in general and web-searching in specific?

A Survey on Web Information Retrieval Technologies

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Searching in All the Right Places. How Is Information Organized? Chapter 5: Searching for Truth: Locating Information on the WWW

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

The Performance Study of Hyper Textual Medium Size Web Search Engine

LIST OF ACRONYMS & ABBREVIATIONS

Chapter 2. Architecture of a Search Engine

Full-Text Indexing For Heritrix

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

CSE 3. How Is Information Organized? Searching in All the Right Places. Design of Hierarchies

FILTERING OF URLS USING WEBCRAWLER

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

DEC Computer Technology LESSON 6: DATABASES AND WEB SEARCH ENGINES

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

IJESRT. [Hans, 2(6): June, 2013] ISSN:

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

3 Media Web. Understanding SEO WHITEPAPER

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Finding Information on the Information Highway. How to get around in the Internet

Accessibility of INGO FAST 1997 ARTVILLE, LLC. 32 Spring 2000 intelligence

Search Engine Optimization 101. Janette Toral

Why is Search Engine Optimisation (SEO) important?

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Computer Fundamentals : Pradeep K. Sinha& Priti Sinha

Performance Evaluation of a Regular Expression Crawler and Indexer

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Web Archiving Workshop

SEO Technical & On-Page Audit

An Approach To Web Content Mining

Why is the Internet so important? Reaching clients on the Internet

Information Retrieval

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Google Search Appliance

Web Search Strategy/Behavior. Meenu Sharma. Librarian Canadian Institute for International studies C-2, Phase-1, Industrial Area, Mohali

Google Search Appliance

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

How to Get Your Website Listed on Major Search Engines

Mining Web Data. Lijun Zhang

Deep Web. CSF: Forensics Cyber-Security. Part II.B. Techniques and Tools: Network Forensics. Fall 2015 Nuno Santos

Transcription:

Information Networks Hacettepe University Department of Information Management DOK 422: Information Networks

Search engines Some Slides taken from: Ray Larson

Search engines Web Crawling Web Search Engines and Algorithms

Standard Web Search Engine Architecture crawl the web Check for duplicates, store the documents DocIds user query create an inverted index Show results To user Search engine servers Inverted index

More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.

Web Crawling How do the web search engines get all of the items they index? Main idea: Start with known sites Record information for these sites Follow the links from each site Record information found at new sites Repeat

Web Crawlers How do the web search engines get all of the items they index? More precisely: Put a set of known sites on a queue Repeat the following until the queue is empty: Take the first page off of the queue If this page has not yet been processed: Record the information found on this page Positions of words, links going out, etc Add each link on the current page to the queue Record that this page has been processed In what order should the links be followed?

Page Visit Order Animated examples of breadth-first vs depth-first search on trees: http://www.rci.rutgers.edu/~cfs/472_html/ai_search/exhaustivesearch.html Structure to be traversed

Page Visit Order Animated examples of breadth-first vs depth-first search on trees: http://www.rci.rutgers.edu/~cfs/472_html/ai_search/exhaustivesearch.html Breadth-first search (must be in presentation mode to see this animation)

Page Visit Order Animated examples of breadth-first vs depth-first search on trees: http://www.rci.rutgers.edu/~cfs/472_html/ai_search/exhaustivesearch.html Depth-first search (must be in presentation mode to see this animation)

Page Visit Order Animated examples of breadth-first vs depth-first search on trees: http://www.rci.rutgers.edu/~cfs/472_html/ai_search/exhaustivesearch.html

Sites Are Complex Graphs, Not Just Trees Page 1 Site 1 Page 1 Site 2 Page 2 Page 3 Page 2 Page 3 Page 5 Page 1 Page 4 Site 5 Page 1 Page 6 Page 1 Page 2 Site 6 Site 3

Web Crawling Issues Keep out signs A file called robots.txt tells the crawler which directories are off limits Freshness Figure out which pages change often Recrawl these often Duplicates, virtual hosts, etc Convert page contents with a hash function Compare new pages to the hash table Lots of problems Server unavailable Incorrect html Missing links Infinite loops Web crawling is difficult to do robustly!

Searching the Web Web Directories versus Search Engines Some statistics about Web searching Challenges for Web Searching Search Engines Crawling Indexing Querying

Directories vs. Search Engines Directories Hand-selected sites Search over the contents of the descriptions of the pages Organized in advance into categories Search Engines All pages in all sites Search over the contents of the pages themselves Organized after the query by relevance rankings or other scores

Search Engines vs. Internal Engines Not long ago HotBot, GoTo, Yahoo and Microsoft were all powered by Inktomi Today Google is the search engine behind many other search services (such as Yahoo up until very recently and AOL s search service)

Statistics from Inktomi Statistics from Inktomi, August 2000, for one client, one week Total # queries: 1315040 Number of repeated queries: 771085 Number of queries with repeated words: 12301 Average words/ query: 2.39 Query type: All words: 0.3036; Any words: 0.6886; Some words:0.0078 Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT) Phrase searches: 0.198 URL searches: 0.066 URL searches w/http: 0.000 email searches: 0.001 Wildcards: 0.0011 (0.7042 '?'s ) frac '?' at end of query: 0.6753 interrogatives when '?' at end: 0.8456 composed of: who: 0.0783 what: 0.2835 when: 0.0139 why: 0.0052 how: 0.2174 where 0.1826 where-mis 0.0000 can,etc.: 0.0139 do(es)/did: 0.0

What Do People Search for on the Web? Topics Genealogy/Public Figure: 12% Computer related: 12% Business: 12% Entertainment: 8% Medical: 8% Politics & Government 7% News 7% Hobbies 6% General info/surfing 6% Science 6% Travel 5% Arts/education/shopping/images 14% (from Spink et al. 98 study)

Challenges for Web Searching: Data Distributed data Volatile data/ Freshness : 40% of the web changes every month Exponential growth Unstructured and redundant data: 30% of web pages are near duplicates Unedited data Multiple formats Commercial biases Hidden data

Challenges for Web Searching: Users Users unfamiliar with search engine interfaces (e.g., Does the query apples oranges mean the same thing on all of the search engines?) Users unfamiliar with the logical view of the data (e.g., Is a search for Oranges the same things as a search for oranges?) Many different kinds of users

Web Search Queries Web search queries are SHORT ~2.4 words on average (Aug 2000) Has increased, was 1.7 (~1997) User Expectations Many say the first item shown should be what I want to see! This works if the user has the most popular/common notion in mind