CS377: Database Systems. Text data and information retrieval. Li Xiong, Department of Mathematics and Computer Science, Emory University



Outline
- Information Retrieval (IR) Concepts
- Text Preprocessing
- Inverted Indexing
- Retrieval Models
- Types of Queries in IR Systems
- Google architecture

Information Retrieval (IR) Concepts
- Information retrieval: the process of retrieving documents from a collection in response to a query by a user
- Unstructured data
- User's information need expressed as a free-form search request
- Keyword search query
- Query

Databases and IR Systems: A Comparison

Search Engine
- A search engine is an application of information retrieval to large-scale document collections
- Crawler: responsible for discovering, analyzing, and indexing new documents
- Query: set of terms

Generic IR Pipeline

Text Preprocessing
- Applied to both documents (before indexing) and queries
- Stopword removal: the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it
- Stemming: trimming the suffix and prefix
- Utilizing a thesaurus: UMLS, WordNet
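The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stopword list is the one from the slide, while the suffix list and the names `naive_stem` and `preprocess` are assumptions made for the example. It strips suffixes only; real systems typically use an established stemmer such as Porter's.

```python
import re

# Stopword list taken from the slide.
STOPWORDS = {"the", "of", "to", "a", "and", "in", "said", "for", "that",
             "was", "on", "he", "is", "with", "at", "by", "it"}

# Hypothetical suffix list for a toy stemmer (assumption, not from the slides).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("The crawler was indexing documents"))
```

The same function would be applied to documents before indexing and to queries before lookup, so that both sides map to the same terms.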

Types of Queries in IR Systems
- Keyword queries
- Boolean queries
- Phrase queries
- Natural language queries

Retrieval Models
- Three main statistical models
  - Boolean
  - Vector space
  - Probabilistic
- Semantic model

Boolean Model
- Documents represented as a set of terms
- Form queries using standard Boolean logic operators: AND, OR, and NOT
- Retrieval and relevance are binary concepts
- Lacks sophisticated ranking algorithms
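As a rough sketch of the Boolean model, the example below builds a small inverted index (term -> set of document ids) and answers AND, OR, and NOT queries with set operations. The three documents and the function name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Toy collection (assumption, for illustration only).
docs = {
    1: "database systems store structured data",
    2: "information retrieval systems search text",
    3: "web search engines index text",
}
index = build_inverted_index(docs)
all_ids = set(docs)

and_result = index["systems"] & index["text"]            # systems AND text
or_result = index["search"] | index["database"]          # search OR database
not_result = index["text"] & (all_ids - index["web"])    # text AND NOT web

print(and_result, or_result, not_result)
```

Every document either matches or does not; the result sets carry no ordering, which is the lack of ranking the slide refers to.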

Vector Space Model
- Documents
  - Represented as features and weights in an n-dimensional vector space
  - TF-IDF (term frequency and inverse document frequency) weighting
- Query
  - Specified as a terms vector
  - Compared to the document vectors for similarity/relevance assessment
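A minimal sketch of vector space ranking follows. It assumes raw term counts for TF, `log(N / df)` for IDF, and cosine similarity for the comparison; these are common choices but only one of several weighting variants, and the slides do not fix a particular one. The documents and function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q_vec = {t: c * idf.get(t, 0.0)
             for t, c in Counter(query.lower().split()).items()}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


docs = ["emory university database course",
        "web search engine",
        "database systems and search"]
ranked = rank("database course", docs)
print(ranked)
```

Unlike the Boolean model, each document receives a graded score, so partial matches still appear in the ranking, below the better matches.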

Probabilistic Model
- Probability ranking principle
  - Decide whether the document belongs to the relevant set or the nonrelevant set for a query
- Conditional probabilities calculated using Bayes' rule
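Written out, the decision above takes the following form. The symbols are assumptions introduced here, not notation from the slides: R and NR denote the relevant and nonrelevant sets for the query, and d is a document.

```latex
P(R \mid d) = \frac{P(d \mid R)\,P(R)}{P(d)}, \qquad
P(NR \mid d) = \frac{P(d \mid NR)\,P(NR)}{P(d)}

\text{Assign } d \text{ to the relevant set if } P(R \mid d) > P(NR \mid d)
\iff P(d \mid R)\,P(R) > P(d \mid NR)\,P(NR)
```

Because P(d) appears in both posteriors, it cancels in the comparison, so only the class-conditional likelihoods and the priors need to be estimated.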

Semantic Model
- Includes different levels of analysis
  - Morphological
  - Syntactic
  - Semantic
- Requires knowledge bases of semantic information, e.g. WordNet

Document Ranking Based on Link Structure
- The PageRank ranking algorithm
  - Used by Google
  - Highly linked pages are more important (have greater authority) than pages with fewer links
  - Measure of query-independent importance of a page/node
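The idea can be illustrated with a small power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outlinks, and the four-page graph are all assumptions for the example; the code also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy web graph (assumption): C is linked from A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
```

The scores depend only on the link graph, not on any query, which is what makes PageRank a query-independent measure of importance.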

Evaluation Measures of Search Relevance
- Topical relevance: measures the extent to which the topic of a result matches the topic of the query
- User relevance: describes the goodness of a retrieved result with regard to the user's information need
- Web information retrieval: must evaluate document ranking order

Evaluation of Search Relevance
- Recall = (number of relevant documents retrieved by a search) / (total number of existing relevant documents)
- Precision = (number of relevant documents retrieved by a search) / (total number of documents retrieved by that search)
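Both measures follow directly from the two definitions above. In this small sketch the function name `precision_recall` and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 5 relevant documents exist, 2 of them were retrieved.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4],
                                     relevant=[2, 4, 5, 6, 7])
print(precision, recall)  # 0.5 0.4
```

Neither number takes the order of results into account, so on their own they do not evaluate document ranking order.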

What happens behind a Google Query? http://www.google.com/search?emory+university

DNS Lookup and Load Balancing
- http://www.google.com/search?emory+university
- DNS lookup: google.com -> IP address
- DNS-based load balancing
  - Multiple clusters distributed worldwide
  - Selects a cluster based on geographic proximity and available capacity at clusters
- Sends HTTP request to the selected cluster
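One way to picture the cluster selection step is the toy policy below: pick the geographically nearest cluster that still has spare capacity. This is a hypothetical sketch, not Google's actual policy. The cluster names, coordinates, load figures, and the 0.9 load threshold are all invented for the example.

```python
import math


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_cluster(client, clusters, max_load=0.9):
    """Return the name of the nearest cluster whose load is below max_load."""
    candidates = [c for c in clusters if c["load"] < max_load]
    if not candidates:
        candidates = clusters  # everything is busy: fall back to nearest overall
    nearest = min(candidates, key=lambda c: distance_km(client, c["location"]))
    return nearest["name"]


# Hypothetical clusters; "load" is the fraction of capacity in use.
clusters = [
    {"name": "east", "location": (39.0, -77.5), "load": 0.5},
    {"name": "west", "location": (45.6, -121.2), "load": 0.2},
    {"name": "europe", "location": (53.3, -6.3), "load": 0.1},
]
atlanta = (33.75, -84.39)
print(pick_cluster(atlanta, clusters))
```

In a DNS-based scheme, a decision like this would be reflected in which IP address the lookup returns, so the client's HTTP request goes straight to the chosen cluster.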

Query Processing at Google
- Local load balancing
  - Web Server
  - Selects a Cache Server and Google Web Server (GWS) from a set of servers
- Query processing at GWS
  - Index search at index server
  - Document retrieval at document server

Query Processing at GWS
- Index search
  - Uses inverted index (hit list) to compute a relevance score for each document
  - Highly parallelized: index divided into pieces (index shards); each shard is served by a pool of machines
- Document retrieval
  - Retrieve the title, document summary, etc.
  - Highly parallelized: distribute documents into shards; multiple server replicas handle each shard
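The sharded index search can be sketched as a scatter-gather: each shard scores only its own documents, and the partial results are merged into one ranked list. Everything here is an illustrative assumption: the `IndexShard` class, assigning documents by `doc_id % num_shards`, and scoring by summed term counts as a stand-in for a real relevance score. The shards are queried in a simple loop, whereas a real system would query them in parallel on separate machines.

```python
import heapq
from collections import defaultdict


class IndexShard:
    """Inverted index over one slice of the collection: term -> {doc_id: count}."""

    def __init__(self, docs):
        self.postings = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                counts = self.postings[term]
                counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, terms):
        """Score this shard's documents by summed term counts (toy relevance)."""
        scores = defaultdict(int)
        for term in terms:
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count
        return list(scores.items())


def build_shards(docs, num_shards):
    """Assign each document to a shard by doc_id modulo the shard count."""
    slices = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        slices[doc_id % num_shards][doc_id] = text
    return [IndexShard(s) for s in slices]


def search_all(shards, query, k=2):
    """Send the query to every shard, then merge the partial hit lists."""
    terms = query.lower().split()
    hits = [hit for shard in shards for hit in shard.search(terms)]
    # Highest score first; ties broken by the lower document id.
    return heapq.nlargest(k, hits, key=lambda h: (h[1], -h[0]))


docs = {0: "emory university", 1: "emory campus map",
        2: "google search", 3: "emory university emory"}
shards = build_shards(docs, num_shards=2)
top = search_all(shards, "emory university")
print(top)  # [(3, 3), (0, 2)]
```

Document retrieval (titles and summaries) could be sharded the same way, keyed by the document ids that come back from the index search.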

Google Philosophy (according to Ed Austin)
- Parallelize everything
- Distribute everything
- Compress everything
- Cache (almost) everything
- Redundantize everything
- Jedis build their own lightsabres (the MS "Eat your own Dog Food")

Google's Major Glue
- Google File System: GFS
- Google database: Bigtable
- Google computation: MapReduce
- Google scheduling: GWQ

References
- Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.
- L. A. Barroso, J. Dean, and U. Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
- Ed Austin. The Anatomy of the Google Architecture. Presentation slides, 2009.