CS377: Database Systems
Text Data and Information Retrieval
Li Xiong
Department of Mathematics and Computer Science, Emory University

Outline
- Information Retrieval (IR) Concepts
- Text Preprocessing
- Inverted Indexing
- Retrieval Models
- Types of Queries in IR Systems
- Google architecture
Information Retrieval (IR) Concepts
- Information retrieval
  - Process of retrieving documents from a collection in response to a query by a user
- Unstructured data
- User's information need is expressed as a free-form search request
  - Keyword search query
Databases and IR Systems: A Comparison
Search Engine
- A search engine is an application of information retrieval to large-scale document collections
- Crawler: responsible for discovering, analyzing, and indexing new documents
- Query: a set of terms
Generic IR Pipeline
Text Preprocessing
- Applied both to documents (before indexing) and to queries
- Stopword removal
  - e.g. the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, it
- Stemming
  - Trimming suffixes and prefixes to reduce a word to its stem
- Utilizing a thesaurus
  - e.g. UMLS, WordNet
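The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular IR system: the stopword set is the one listed on the slide, while `naive_stem`, `preprocess`, and the tiny `SUFFIXES` list are hypothetical names and rules chosen for the example. Production stemmers such as Porter's use far more rules.

```python
import re

# Stopword list taken from the slide above.
STOPWORDS = {
    "the", "of", "to", "a", "and", "in", "said", "for", "that",
    "was", "on", "he", "is", "with", "at", "by", "it",
}

# Hypothetical, deliberately tiny suffix list (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("He said that the crawler was indexing documents"))
# -> ['crawler', 'index', 'document']
```

The same `preprocess` function would be applied to both documents and queries, so that query terms and index terms are normalized identically.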
Types of Queries in IR Systems
- Keyword queries
- Boolean queries
- Phrase queries
- Natural language queries

Retrieval Models
- Three main statistical models
  - Boolean
  - Vector space
  - Probabilistic
- Semantic model
Boolean Model
- Documents are represented as a set of terms
- Queries are formed using the standard Boolean logic operators AND, OR, and NOT
- Retrieval and relevance are binary concepts
- Lacks sophisticated ranking algorithms
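A minimal sketch of Boolean retrieval, assuming a small made-up collection: each term maps to the set of documents containing it, and AND, OR, and NOT become set intersection, union, and difference. `DOCS`, `build_index`, and `postings` are hypothetical names for this example.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "emory university database systems",
    2: "information retrieval systems",
    3: "emory information retrieval course",
}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_DOCS = set(DOCS)


def postings(term: str) -> set[int]:
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


# emory AND retrieval
print(sorted(postings("emory") & postings("retrieval")))             # [3]
# emory OR retrieval
print(sorted(postings("emory") | postings("retrieval")))             # [1, 2, 3]
# systems AND NOT emory
print(sorted(postings("systems") & (ALL_DOCS - postings("emory"))))  # [2]
```

Every document either matches or does not, which is why the model has no notion of ranking among the matches.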
Vector Space Model
- Documents
  - Represented as features and weights in an n-dimensional vector space
  - TF-IDF (term frequency and inverse document frequency) weighting
- Query
  - Specified as a terms vector
  - Compared to the document vectors for similarity/relevance assessment
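The sketch below illustrates one common way to realize this model, under stated assumptions: weights are raw term frequency times log(N / df), and similarity is the cosine of the angle between the query and document vectors. Other TF-IDF variants and similarity measures exist; the collection and all function names here are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
DOCS = {
    "d1": ["emory", "university", "database"],
    "d2": ["database", "systems", "database"],
    "d3": ["information", "retrieval", "systems"],
}


def idf(term, docs):
    """Inverse document frequency: log(N / df); 0 if the term is unseen."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_vector(tokens, docs):
    """Sparse vector: term -> raw term frequency * idf."""
    counts = Counter(tokens)
    return {t: tf * idf(t, docs) for t, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    q = tfidf_vector(query_tokens, docs)
    scores = {d: cosine(q, tfidf_vector(toks, docs)) for d, toks in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(rank(["database", "systems"], DOCS))
```

For the query "database systems", `d2` ranks first because it contains both query terms; `d1` and `d3` each share only one term with the query and score lower.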
Probabilistic Model
- Probability ranking principle
  - Decide whether a document belongs to the relevant set or the nonrelevant set for a query
- Conditional probabilities are calculated using Bayes' rule

Semantic Model
- Includes different levels of analysis
  - Morphological
  - Syntactic
  - Semantic
- Requires knowledge bases of semantic information
  - e.g. WordNet
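As a worked form of the Bayes' rule step, one standard way to write the decision is shown below. The symbols are assumptions introduced for illustration and do not appear on the slide: $d$ is a document, $q$ a query, $R$ the relevant set, and $NR$ the nonrelevant set.

```latex
% Bayes' rule for the probability that document d is relevant to query q
P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)}

% The same rule for the nonrelevant set
P(NR \mid d, q) = \frac{P(d \mid NR, q)\, P(NR \mid q)}{P(d \mid q)}

% Dividing the two cancels P(d | q); documents are ranked by these odds
\frac{P(R \mid d, q)}{P(NR \mid d, q)}
  = \frac{P(d \mid R, q)}{P(d \mid NR, q)} \cdot \frac{P(R \mid q)}{P(NR \mid q)}
```

A document is assigned to the relevant set when the odds exceed 1, that is, when $P(R \mid d, q) > P(NR \mid d, q)$. Since the second factor is the same for every document under a given query, ranking depends only on the likelihood ratio.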
Document Ranking Based on Link Structure
- The PageRank ranking algorithm
  - Used by Google
  - Pages with many inbound links are treated as more important (having greater authority) than pages with fewer links
  - A measure of the query-independent importance of a page/node
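A minimal power-iteration sketch of PageRank, assuming a damping factor of 0.85 (the value suggested in the Brin and Page paper) and a hypothetical four-page graph. The function name, the fixed iteration count, and the handling of dangling pages are choices made for this example, not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict: page -> list of pages it links to.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
print(sorted(ranks, key=ranks.get, reverse=True))
# -> ['C', 'A', 'B', 'D']
```

Page C is linked from three pages and ends up with the highest score, while D has no inbound links and receives only the baseline share. The scores do not depend on any query, which matches the query-independent nature of the measure.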
Evaluation Measures of Search Relevance
- Topical relevance
  - Measures the extent to which the topic of a result matches the topic of the query
- User relevance
  - Describes the goodness of a retrieved result with regard to the user's information need
- Web information retrieval
  - Must evaluate document ranking order
Evaluation of Search Relevance
- Recall
  - Number of relevant documents retrieved by a search / Total number of existing relevant documents
- Precision
  - Number of relevant documents retrieved by a search / Total number of documents retrieved by that search
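The two ratios can be computed directly from sets of document ids. The sketch below uses made-up ids and a hypothetical helper name, `precision_recall`.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant documents exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).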
What happens behind a Google Query? http://www.google.com/search?emory+university
DNS Lookup and Load Balancing
http://www.google.com/search?emory+university
- DNS lookup
  - google.com -> IP address
- DNS-based load balancing
  - Multiple clusters distributed worldwide
  - Selects a cluster based on geographic proximity and available capacity at the clusters
- Sends the HTTP request to the selected cluster
Query Processing at Google
- Local load balancing
  - Selects a cache server and a Google Web Server (GWS) from a set of servers
- Query processing at GWS
  - Index search at the index server
  - Document retrieval at the document server
Query Processing at GWS
- Index search
  - Uses the inverted index (hit list) to compute a relevance score for each document
  - Highly parallelized: the index is divided into pieces (index shards), and each shard is served by a pool of machines
- Document retrieval
  - Retrieves the title, document summary, etc.
  - Highly parallelized: documents are distributed into shards, and multiple server replicas handle each shard
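The sharded index search can be illustrated with a small scatter-gather sketch. Everything here is an assumption made for illustration: the shard contents, the function names, and especially the scoring, which simply sums term counts as a stand-in for a real relevance score. The shards are queried in a plain loop, whereas the architecture described above would query them in parallel on separate machines.

```python
import heapq
from collections import Counter

# Hypothetical shards: each holds an inverted index (term -> {doc_id: term count})
# for its own disjoint slice of the collection.
SHARDS = [
    {"emory": {"d1": 2}, "university": {"d1": 1, "d2": 1}},
    {"emory": {"d3": 1}, "university": {"d3": 3}, "atlanta": {"d4": 1}},
]


def search_shard(shard, query_terms):
    """Score documents in one shard; the score is just summed term counts."""
    scores = Counter()
    for term in query_terms:
        for doc_id, count in shard.get(term, {}).items():
            scores[doc_id] += count
    return scores


def search(shards, query_terms, k=3):
    """Query every shard, then merge the partial results into a global top-k."""
    merged = Counter()
    for shard in shards:  # sequential here; a real system would fan out in parallel
        merged.update(search_shard(shard, query_terms))
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])


print(search(SHARDS, ["emory", "university"]))
# -> [('d3', 4), ('d1', 3), ('d2', 1)]
```

Because each document lives in exactly one shard, the partial score lists never conflict and merging reduces to picking the best-scoring documents overall.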
Google Philosophy (according to Ed Austin)
- Parallelize everything
- Distribute everything
- Compress everything
- Cache (almost) everything
- Redundantize everything
- Jedis build their own lightsabres (the MS "eat your own dog food")
Google's Major Glue
- Google File System: GFS
- Google database: Bigtable
- Google computation: MapReduce
- Google scheduling: GWQ
References
- Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.
- L. A. Barroso, J. Dean, and U. Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
- Ed Austin. The Anatomy of the Google Architecture. Presentation slides, 2009.