Search Engines Information Retrieval in Practice

Similar documents
An Introduction to Search Engines and Web Navigation

Database Concepts. David M. Kroenke UNIVERSITATSBIBLIOTHEK HANNOVER

Essentials of Database Management

World Wide Web PROGRAMMING THE PEARSON EIGHTH EDITION. University of Colorado at Colorado Springs

Information Retrieval

THE AVR MICROCONTROLLER AND EMBEDDED SYSTEMS. Using Assembly and С

ony Gaddis Haywood Community College STARTING OUT WITH PEARSON Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto

Part I: Data Mining Foundations

Access ComprehGnsiwG. Shelley Gaskin, Carolyn McLellan, and. Nancy Graviett. with Microsoft

PROBLEM SOLVING USING JAVA WITH DATA STRUCTURES. A Multimedia Approach. Mark Guzdial and Barbara Ericson PEARSON. College of Computing

Visual C# Tony Gaddis. Haywood Community College STARTING OUT WITH. Piyali Sengupta. Third Edition. Global Edition contributions by.

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Prelude to Programming

MariaDB Crash Course. A Addison-Wesley. Ben Forta. Upper Saddle River, NJ Boston. Indianapolis. Singapore Mexico City. Cape Town Sydney.

Modern Information Retrieval

Integrated Approach. Operating Systems COMPUTER SYSTEMS. LEAHY, Jr. Georgia Institute of Technology. Umakishore RAMACHANDRAN. William D.

Systems:;-'./'--'.; r. Ramez Elmasri Department of Computer Science and Engineering The University of Texas at Arlington

FUNDAMENTALS OF. Database S wctpmc. Shamkant B. Navathe College of Computing Georgia Institute of Technology. Addison-Wesley

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Business Driven Data Communications

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Building Search Applications

CJT^jL rafting Cm ompiler

This page intentionally left blank

Information Retrieval Spring Web retrieval

Chapter 2. Architecture of a Search Engine

FUNDAMENTALS OF SEVENTH EDITION

DB2 SQL Tuning Tips for z/os Developers

MECHATRONICS. William Bolton. Sixth Edition ELECTRONIC CONTROL SYSTEMS ENGINEERING IN MECHANICAL AND ELECTRICAL PEARSON

Programming in Python 3

Programming. In Ada JOHN BARNES TT ADDISON-WESLEY

Information Retrieval May 15. Web retrieval

Mining Web Data. Lijun Zhang

TEXT MINING APPLICATION PROGRAMMING

MACHINES AND MECHANISMS

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

60-538: Information Retrieval

Data Structures and Abstractions with Java

DATA AND COMPUTER COMMUNICATIONS

Anany Levitin 3RD EDITION. Arup Kumar Bhattacharjee. mmmmm Analysis of Algorithms. Soumen Mukherjee. Introduction to TllG DCSISFI &

Chapter 27 Introduction to Information Retrieval and Web Search

Data Structures and Abstractions with Java

HCS12 Microcontroller and Embedded Systems: Using Assembly and C with CodeWarrior 1 st Edition

CRYPTOGRAPHY AND NETWORK SECURITY

Real-Time Systems and Programming Languages

Web Development and Design Foundations with HTML5

Objects First with Java

CLASSIC DATA STRUCTURES IN JAVA

CSE 5243 INTRO. TO DATA MINING

AND ASSURANCE AN INTEGRATED APPROACH SIXTEENTH EDITION GLOBAL EDITION

^l^s^^^^^^^^^^s^^^ ^.1^L^ gs *^gs (s^s^^^^s^^ ^S^^^^ls

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

The Unified Modeling Language User Guide

Refactoring HTML. Improving the Design of Existing Web Applications. Elliotte Rusty Harold. TT rvaddison-wesley

Mining Web Data. Lijun Zhang

Software Engineering Ian Sommerville Pearson Education File Type

Natural Language Processing

The Power of Events. An Introduction to Complex Event Processing in Distributed Enterprise Systems. David Luckham

Search Engines. Information Retrieval in Practice

Information Retrieval

PYTHON. p ykos vtawynivis. Second eciitiovl. CO Ve, WESLEY J. CHUN

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Quality Code. Software Testing Principles, Practices, and Patterns. Stephen Vance. AAddison-Wesley

Search Engine Architecture II

CS290N Summary Tao Yang

CS371R: Final Exam Dec. 18, 2017

MODERN DATABASE MANAGEMENT

SQL Queries. for. Mere Mortals. Third Edition. A Hands-On Guide to Data Manipulation in SQL. John L. Viescas Michael J. Hernandez

Digital System Design with SystemVerilog

JAVASCRIPT FOR PROGRAMMERS

Chapter 6: Information Retrieval and Web Search. An introduction

Cloud Computing and SOA Convergence in Your Enterprise

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Automatic Text Processing

Networking Security Essentials 4th Edition Solution Manual

A Survey on Web Information Retrieval Technologies

DATABASE SYSTEM CONCEPTS

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

Fundamentals of. Database Systems. Shamkant B. Navathe. College of Computing Georgia Institute of Technology PEARSON.

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Network Programming With Go Essential Skills For Using And Securing Networks

Starting Out With C From Control Structures To Objects Plus Myprogramminglab With Pearson Etext Access Card Package 8th Edition

Fit for Developing Software

Application Programming

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

\ Smart Client 0" Deploymentwith v^ ClickOnce

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Domain-Specific. Languages. Martin Fowler. AAddison-Wesley. Sydney Tokyo. With Rebecca Parsons

Framework Design Guidelines

Rails AntiPatterns. Chad Pytel. Best Practice Ruby on Rails Refactoring. Tammer Saleh. AAddison-Wesley

Core Java Volume Ii Advanced Features 10th Edition

UMass at TREC 2006: Enterprise Track

Chapter IR:II. II. Architecture of a Search Engine. Indexing Process Search Process

Virtualization from the Trenches

NETWORKING KEITH W. ROSS. Polytechnic Institute of NYU. Addison-Wesley

Software Engineering Ian Sommerville 7th Edition

[Contents. Sharing. sqlplus. Storage 6. System Support Processes 15 Operating System Files 16. Synonyms. SQL*Developer

OpenGL SUPERBIBLE. Fifth Edition. Comprehensive Tutorial and Reference. Richard S. Wright, Jr. Nicholas Haemel Graham Sellers Benjamin Lipchak

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

Introductory logic and sets for Computer scientists

Transcription:

Search Engines Information Retrieval in Practice W. BRUCE CROFT University of Massachusetts, Amherst DONALD METZLER Yahoo! Research TREVOR STROHMAN Google Inc. ----- PEARSON Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

Contents 1 Search Engines and Information Retrieval 1 1.1 What Is Information Retrieval?,. "............. 1 1.2 The Big Issues............................................. 4 1.3 Search Engines............................................ 6 1.4 Search Engineers.......................................... 9 2 Architecture ofa Search Engine................................. 13 2.1 What Is an Architecrure?................................... 13 2.2 Basic Building Blocks...................................... 14 2.3 Breaking It Down 17 2.3.1 Text Acquisition..................................... 17 2.3.2 Text Transformation................................. 19 2.3.3 Index Creation...................................... 22 2.3.4 User Interaction..................................... 23 2.3.5 Ranking............................................ 25 2.3.6 Evaluation... 27 2.4 How Does It Real(y Work?.................................. 28 3 Crawls and Feeds.............................................. 31 3.1 DeciclingWhat [0 Search... 31 3.2 Crawling the Web......................................... 32 3.2.1 Retrieving Web Pages 33 3.2.2 The Web Crawler.................................... 35 3.2.3 Freshness... 37 3.2.4 Focused Crawling 41 3.2.5 Deep Web... 41

X Contems 3.2.6 Sitemaps 43 3.2.7 Distributed Crawling............................. 44 3.3 Crawling Documents and Email............................. 46 3.4 DocumentFeeds ", 47 3.5 The Conversion Problem.... 49 3.5.1 Character Encodings..... 50 3.6 Storing the Documents, 52 3.6.1 Using a Database System " "....... 53 3.6.2 Random Access 53 3.6.3 Compression and Large Files.......................... 54 3.6.4 Update... 56 3.6.5 BigTable... 57 3.7 Detecting Duplicates 60 3.8 Removing Noise, 63 4 Processing Text 75 4.1 From Words to Terms...................................... 75 4.2 Text Statistics............................................. 77 4.2.1 Vocabulary Growth, 82 4.2.2 Estimating Collection and Result Set Sizes, 85 4.3 Document Parsing, 88 4.3.1 Overview... 88 4.3.2 Tokenizing 89 4.3.3 Stopping... 92 4.3.4 Stemming... 93 4.3.5 Phrases and N-grams, 99 4.4 Document Structure and Markup, 103 4.5 Link Analysis, 106 4.5.1 Anchor Text, 107 4.5.2 PageRank 107 4.5.3 Link Quality 113 4.6 Information Extraction, 115 4.6.1 Hidden Markov Models for Extraction................. 117 4.7 Internationalization 120

Contents XI 5 Rankingwith Indexes 127 5.1 Overview 127 5.2 Abstract Model ofranking " 128 5.3 Inverted Indexes........................................... 131 5.3.1 Documents 133 5.3.2 Counts 135 5.3.3 Positions 136 5.3.4 Fields and Extents................................... 138 5.3.5 Scores... 140 5.3.6 Ordering... 141 5.4 Compression 142 5.4.1 Entropy and Ambiguity.............................. 144 5.4.2 Delta Encoding..................................... 146 5.4.3 Bit-Aligned Codes 147 5.4.4 Byte-Aligned Codes 150 5.4.5 Compression in Practice.............................. 151 5.4.6 Looking Ahead...................................... 153 5.4.7 Skipping and Skip Pointers 153 5.5 Auxiliary Structures........................................ 156 5.6 Index Construction........................................ 158 5.6.1 Simple Construction................................. 158 5.6.2 Merging... 159 5.6.3 Parallelism and Distribution 160 5.6.4 Update 166 5.7 Query Processing, 167 5.7.1 Document-at-a-time Evaluation " 168 5.7.2 Term-at-a-time Evaluation............................ 170 5.7.3 Optimization Techniques............................. 172 5.7.4 Structured Queries 180 5.7.5 Distributed Evaluation 182 5.7.6 Caching 183 6 Queries and Interfaces......................................... 191 6.1 Information Needs and Queries 191 6.2 Query Transformation and Refinement....................... 194 6.2.1 Stopping and Stemming Revisited 194 6.2.2 Spell Checking and Suggestions 197

XII Concents 6.2.3 Query Expansion, 203 6.2.4 Relevance Feedback ",, 212 6.2.5 Context and Personalization 215 6.3 Showing the Results 219 6.3.1 Result Pages and Snippets 219 6.3.2 Advertising and Search,, 222 6.3.3 Clustering the Results,,, 225 6.4 Cross-Language Search ",, 230 7 Retrieval Models, 237 7.1 Overview ofretrieval Models 237 7.1.1 Boolean Retrieval, 239 7.1.2 The Vector Space Model 241 7.2 Probabilistic Models, 247 7.2.1 Information Retrieval as Classification 248 7.2.2 The BM25 Ranking Algorithm 254 7.3 RankingBased on Language Models 256 7.3.1 Query Likelihood Ranking........................... 258 7.3.2 Relevance Models and Pseudo-Relevance Feedback 265 7.4 Complex Queries and Combining Evidence................... 271 7.4.1 The Inference NetworkModel,. 272 7.4.2 The Galago Query Language, 277 7.5 Web Search,. 283 7.6 Machine Learning and Information Retrieval.................. 287 7.6.1 Learning to Rank.................................... 288 7.6.2 Topic Models and Vocabulary Mismatch 292 7.7 Application-Based Models 295 8 Evaluating Search Engines...................................... 301 8.1 Why Evaluate? 301 8.2 The Evaluation Corpus..................................... 303 8.3 Logging.................................................. 309 8.4 Effectiveness Metries....................................... 312 8.4.1 Recall and Precision 312 8.4.2 Averaging and Interpolation 317 8.4.3 Focusing on the Top Documents 322 8.4.4 Using Preferences 325

Contents XIII 8.5 Efficiency Metrics 326 8.6 Training, Testing, and Statistics 329 8.6.1 Significance Tests.................................... 329 8.6.2 Setting Parameter VaIues 334 8.6.3 OnIine Testing...................................... 336 8.7 The Bottom Line 337 9 Classification and Clustering................................... 343 9.1 Classification and Categorization ", 344 9.1.1 Naive Bayes 346 9.1.2 SupportVector Machines 355 9.1.3 Evaluation 363 9.1.4 Classifier and Feature Selection.., 363 9.1.5 Spam, Sentiment, and OnIine Advertising.. " " 368 9.2 CIustering 377 9.2.1 Hierarchical and K -Means CIustering.................. 379 9.2.2 K Nearest Neighbor CIustering 388 9.2.3 Evaluation 390 9.2.4 How to Choose K................................... 391 9.2.5 Clustering and Search " 393 10 Social Search 401 10.1 What 1s Social Search? 401 10.2 UserTags andmanual1ndexing 404 10.2.1 SearchingTags., 406 10.2.2 1nferring Missing Tags 408 10.2.3 Browsingand Tag CIouds 410 10.3 Searchingwith Communities 412 10.3.1 WhatIs a Community? 412 10.3.2 FindingCommunities 413 10.3.3 Community-Based Question Answering................ 419 10.3.4 Collaborative Searching.............................. 424 10.4 Fihering and Recommending 427 10.4.1 Document Fihering ',", 427 10.4.2 Collaborative Fihering, 436 10.5 Peer-to-Peer and Metasearch................................ 442 10.5.1 Distributed Search 442

XIV Contents 10.5.2 P2P Networks 446 11 Beyond Bag ofwords 455 11.1 Overview,, 455 11.2 Feature-Based Retrieval Models 456 11.3 Term Dependence Models, 458 11.4 Structure Revisited 463 11.4.1 XML Retrieval. 465 11.4.2 Entity Search 468 11.5 Longer Questions, Better Answers, 470 11.6 Words, Pictures, and Music 474 11.7 One Search Fits All? 483 References, 491 Index, 517