CS377: Database Systems
Text Data and Information Retrieval
Li Xiong
Department of Mathematics and Computer Science, Emory University

Outline
- Information Retrieval (IR) Concepts
- Text Preprocessing
- Inverted Indexing
- Retrieval Models
- Types of Queries in IR Systems
- Google architecture

Information Retrieval (IR) Concepts
- Information retrieval: the process of retrieving documents from a collection in response to a query by a user
- Operates on unstructured data
- The user's information need is expressed as a free-form search request (also called a keyword search query, or simply a query)

Databases and IR Systems: A Comparison

Search Engine
- A search engine is an application of information retrieval to large-scale document collections
- Crawler: responsible for discovering, analyzing, and indexing new documents
- Query: a set of terms

Generic IR Pipeline

Text Preprocessing
- Applied to both documents (before indexing) and queries
- Stopword removal: the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it
- Stemming: trimming the suffix and prefix
- Utilizing a thesaurus: UMLS, WordNet
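The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stopword set is the one listed on the slide, while `SUFFIXES`, `naive_stem`, and `preprocess` are hypothetical names, and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
import re

# Stopword list taken from the slide above.
STOPWORDS = {"the", "of", "to", "a", "and", "in", "said", "for", "that",
             "was", "on", "he", "is", "with", "at", "by", "it"}

# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("He said that the crawler was indexing documents"))
# prints ['crawler', 'index', 'document']
```

The same `preprocess` function would be applied to documents at indexing time and to queries at search time, so that both sides are reduced to the same terms.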

Types of Queries in IR Systems
- Keyword queries
- Boolean queries
- Phrase queries
- Natural language queries

Retrieval Models
- Three main statistical models
  - Boolean
  - Vector space
  - Probabilistic
- Semantic model

Boolean Model
- Documents are represented as a set of terms
- Queries are formed using the standard Boolean logic operators AND, OR, and NOT
- Retrieval and relevance are binary concepts
- Lacks sophisticated ranking algorithms
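As a rough sketch of how Boolean retrieval can work over an inverted index, the example below maps each term to the set of documents containing it and evaluates AND, OR, and NOT with set operations. The three-document collection and the name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical toy collection.
docs = {
    1: "emory university database systems",
    2: "information retrieval systems",
    3: "database query processing",
}
index = build_inverted_index(docs)
all_ids = set(docs)

# database AND systems
and_result = index["database"] & index["systems"]
# database OR retrieval
or_result = index["database"] | index["retrieval"]
# systems AND NOT database
not_result = index["systems"] & (all_ids - index["database"])

print(and_result, or_result, not_result)
# prints {1} {1, 2, 3} {2}
```

Each result is an unordered set: a document either matches or it does not, which is exactly why the model offers no ranking.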

Vector Space Model
- Documents: represented as features and weights in an n-dimensional vector space
  - TF-IDF (term frequency and inverse document frequency) weighting
- Query: specified as a terms vector
  - Compared to the document vectors for similarity/relevance assessment
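A small sketch of the vector space model follows. It assumes one common TF-IDF variant (raw term count times log(N/df)) and cosine similarity as the comparison; the slides do not fix either choice, and the documents, query, and function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection and query.
docs = ["database systems query", "information retrieval query", "web search engine"]
vectors, idf = tfidf_vectors(docs)

# The query is weighted with the same IDF table as the documents.
query = {t: c * idf.get(t, 0.0)
         for t, c in Counter("information retrieval".split()).items()}

scores = [cosine(query, v) for v in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[0], round(scores[ranking[0]], 3))
```

Unlike the Boolean model, the scores are graded, so documents can be returned in ranked order.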

Probabilistic Model
- Probability ranking principle
- Decide whether the document belongs to the relevant set or the nonrelevant set for a query
- Conditional probabilities are calculated using Bayes' rule
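The Bayes' rule step can be written out as follows. The notation is an assumption made for this sketch and does not appear in the slides: R and NR denote the relevant and nonrelevant sets, d a document, and q the query.

```latex
\[
P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)},
\qquad
P(\mathit{NR} \mid d, q) = \frac{P(d \mid \mathit{NR}, q)\, P(\mathit{NR} \mid q)}{P(d \mid q)}
\]
\[
P(R \mid d, q) > P(\mathit{NR} \mid d, q)
\iff
\frac{P(d \mid R, q)}{P(d \mid \mathit{NR}, q)} > \frac{P(\mathit{NR} \mid q)}{P(R \mid q)}
\]
```

Under this formulation a document is assigned to the relevant set when the first inequality holds, and documents can be ranked by the likelihood ratio on the left-hand side of the second.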

Semantic Model
- Includes different levels of analysis
  - Morphological
  - Syntactic
  - Semantic
- Requires knowledge bases of semantic information, e.g., WordNet

Document Ranking Based on Link Structure
- The PageRank ranking algorithm
  - Used by Google
  - Highly linked pages are more important (have greater authority) than pages with fewer links
  - Measure of query-independent importance of a page/node
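To make the idea concrete, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85 follows the value suggested in the Brin and Page paper; the four-page link graph, the iteration count, and the handling of pages without outlinks are assumptions of this example, not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [outlinks]}.

    Every link target must also appear as a key in `links`.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            if outlinks:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
# prints C
```

Note that the computation never looks at a query: the scores depend only on the link structure, which is what makes PageRank a query-independent measure.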

Evaluation Measures of Search Relevance
- Topical relevance: measures the extent to which the topic of a result matches the topic of the query
- User relevance: describes the goodness of a retrieved result with regard to the user's information need
- Web information retrieval: must evaluate document ranking order

Evaluation of Search Relevance
- Recall = (number of relevant documents retrieved by a search) / (total number of existing relevant documents)
- Precision = (number of relevant documents retrieved by a search) / (total number of documents retrieved by that search)
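The two ratios can be computed directly from sets of document ids. The function name and the example ids below are hypothetical; returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, five relevant documents exist, two overlap.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
# prints 0.5 0.4
```

Neither number accounts for the order of the results, which is why web retrieval also needs ranking-aware evaluation.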

What Happens Behind a Google Query?
http://www.google.com/search?emory+university

DNS Lookup and Load Balancing
http://www.google.com/search?emory+university
- DNS lookup: google.com -> IP address
- DNS-based load balancing
  - Multiple clusters distributed worldwide
  - Selects a cluster based on geographic proximity and available capacity at clusters
- Sends the HTTP request to the selected cluster
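The cluster selection step might look roughly like the sketch below. Everything in it is hypothetical: the `Cluster` fields, the cluster names, the `min_free` threshold, and the "nearest cluster with spare capacity" policy are illustrative assumptions, since the slides do not describe how proximity and capacity are actually combined.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    distance_km: float    # distance from the user (assumed to be known)
    free_capacity: float  # fraction of capacity still available, 0.0 to 1.0


def select_cluster(clusters, min_free=0.1):
    """Pick the nearest cluster that still has at least `min_free` capacity."""
    candidates = [c for c in clusters if c.free_capacity >= min_free]
    if not candidates:
        # Assumption: if every cluster is saturated, fall back to the nearest one.
        candidates = clusters
    return min(candidates, key=lambda c: c.distance_km)


# Hypothetical clusters: the nearest one is almost full.
clusters = [
    Cluster("us-east", distance_km=300.0, free_capacity=0.05),
    Cluster("us-central", distance_km=1200.0, free_capacity=0.60),
    Cluster("europe", distance_km=7000.0, free_capacity=0.90),
]
print(select_cluster(clusters).name)
# prints us-central
```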

Query Processing at Google
- Local load balancing (web server)
  - Selects a cache server and a Google Web Server (GWS) from a set of servers
- Query processing at GWS
  - Index search at index server
  - Document retrieval at document server

Query Processing at GWS
- Index search
  - Uses the inverted index (hit list) to compute a relevance score for each document
  - Highly parallelized: the index is divided into pieces (index shards), and each shard is served by a pool of machines
- Document retrieval
  - Retrieves the title, document summary, etc.
  - Highly parallelized: documents are distributed into shards, and multiple server replicas handle each shard
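The sharded index search can be illustrated with a scatter-gather sketch: the query is sent to every shard in parallel, each shard scores its own documents, and the partial results are merged into a global top-k. The shard contents, the per-term weights, and the additive scoring are made-up assumptions; threads stand in for the pools of machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each maps term -> {doc_id: score contribution}.
SHARDS = [
    {"emory": {1: 2.0, 2: 0.5}, "university": {1: 1.0}},
    {"emory": {7: 1.5}, "university": {7: 1.0, 8: 2.0}},
]


def search_shard(shard, terms):
    """Score the documents of a single shard for the given query terms."""
    scores = {}
    for term in terms:
        for doc_id, weight in shard.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return list(scores.items())


def search(terms, k=3):
    """Query all shards in parallel and merge the partial results into a top-k."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: search_shard(shard, terms), SHARDS)
        hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


print(search(["emory", "university"]))
# prints [(1, 3.0), (7, 2.5), (8, 2.0)]
```

Because each document lives in exactly one shard, the merge step only has to compare scores; it never needs to combine partial scores for the same document.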

Google Philosophy (according to Ed Austin)
- Parallelize everything
- Distribute everything
- Compress everything
- Cache (almost) everything
- Redundantize everything
- Jedis build their own lightsabres (the MS "eat your own dog food")

Google's Major Glue
- Google file system: GFS
- Google database: Bigtable
- Google computation: MapReduce
- Google scheduling: GWQ

References
- Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.
- L. A. Barroso, J. Dean, and U. Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
- Ed Austin. The Anatomy of the Google Architecture. Presentation slides, 2009.