Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Similar documents
Introduction to Information Retrieval

Ranking of ads. Sponsored Search

Digital Libraries: Language Technologies

CS 6320 Natural Language Processing

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Boolean Model. Hongning Wang

Information Retrieval

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval

Introduction to Information Retrieval

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Knowledge Discovery and Data Mining 1 (VO) ( )

Information Retrieval. (M&S Ch 15)

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

Full-Text Indexing For Heritrix

Information Retrieval

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Instructor: Stefan Savev

Models for Document & Query Representation. Ziawasch Abedjan

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Natural Language Processing

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Natural Language Processing Basics. Yingyu Liang University of Wisconsin-Madison

Midterm Exam Search Engines ( / ) October 20, 2015

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Chapter 2. Architecture of a Search Engine

Information Retrieval

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Outline of the course

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Chapter 27 Introduction to Information Retrieval and Web Search

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Information Retrieval

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang

Text Analytics (Text Mining)

How Does a Search Engine Work? Part 1

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Approaches to Mining the Web

Lecture 5: Information Retrieval using the Vector Space Model

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Introduction to Information Retrieval

Information Retrieval: Retrieval Models

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

CS54701: Information Retrieval

Tag-based Social Interest Discovery

Information Retrieval

Large Crawls of the Web for Linguistic Purposes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Text Retrieval an introduction

Web Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University

A Survey on Web Information Retrieval Technologies

60-538: Information Retrieval

Information Retrieval and Web Search

Searching non-text information objects

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Information Retrieval and Organisation

A Review on Identifying the Main Content From Web Pages

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Text Analytics (Text Mining)

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Suresha M Department of Computer Science, Kuvempu University, India

Lecture 7 April 30, Prabhakar Raghavan

Information Retrieval May 15. Web retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

COMP6237 Data Mining Searching and Ranking

Component ranking and Automatic Query Refinement for XML Retrieval

Informa(on Retrieval

Relevance of a Document to a Query

Information Retrieval on the Internet (Volume III, Part 3, 213)

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Information Retrieval and Organisation

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

The Anatomy of a Large-Scale Hypertextual Web Search Engine

International Journal of Advanced Research in Computer Science and Software Engineering

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

How SPICE Language Modeling Works

Classic IR Models 5/6/2012 1

Information Retrieval Spring Web retrieval

Web Information Retrieval using WordNet

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

User Profiling for Interest-focused Browsing History

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Transcription:

Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search

Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing Query Multimedia documents Crawler 2

The Web corpus No design/coordination Distributed content creation, linking, democratization of publishing Content includes truth, lies, obsolete information, contradictions The Web Unstructured (text, html, ), semi-structured (XML, annotated photos), structured (Databases) Scale much larger than previous text corpora but corporate records are catching up. Content can be dynamically generated 3

The Web: Dynamic content A page without a static html version E.g., current status of flight AA129 Current availability of rooms at a hotel Usually, assembled at the time of a request from a browser Typically, URL has a? character in it Browser AA129 Application server Back-end databases 4

Dynamic content Most dynamic content is ignored by web spiders Many reasons including malicious spider traps Some dynamic content (news stories from subscriptions) are sometimes delivered as dynamic content Application-specific spidering Spiders commonly view web pages just as Lynx (a text browser) would Note: even static pages are typically assembled on the fly (e.g., headers are common) 5

Index updates: rate of change Fetterly et al. study (2002): several views of data, 150 million pages over 11 weekly crawls Bucketed into 85 groups by extent of change 6

What is the size of the web? What is being measured? Number of hosts Number of (static) html pages Volume of data Number of hosts netcraft survey http://news.netcraft.com/archives/web_server_survey.html Monthly report on how many web hosts & servers are out there Number of pages numerous estimates 7

Total Web sites http://news.netcraft.com/ 8

Market share for top servers http://news.netcraft.com/ 9

Market share for active sites 10

Clue Web 2012 Most of the web documents were collected by five instances of the Internet Archive's Heritrix web crawler at Carnegie Mellon University Running on five Dell PowerEdge R410 machines with 64GB RAM. Heritrix was configured to follow typical crawling guidelines. The crawl was initially seeded with 2,820,500 uniq URLs. This is a small sample of the Web. 733 million pages 25TBytes Even at this low-scale there are many challenges: How can we store it and make it searchable? How can we refresh the search index? 11

Basic techniques 1. Text pre-processing 2. Terms weighting 3. Vector Space Model 4. Indexing 5. Web page segments 12

1. Text pre-processing Instead of aiming at fully understanding a text document, IR takes a pragmatic approach and looks at the most elementary textual patterns e.g. a simple histogram of words, also known as bag-of-words. Heuristics capture specific text patterns to improve search effectiveness Enhances the simplicity of word histograms The most simple heuristics are stop-words removal and stemming 13

Character processing and stop-words Term delimitation Punctuation removal Numbers/dates Stop-words: remove words that are present in all documents a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will Chapter 2: Introduction to Information Retrieval, Cambridge University Press, 2008 14

Stemming and lemmatization Stemming: Reduce terms to their roots before indexing Stemming suggest crude affix chopping e.g., automate(s), automatic, automation all reduced to automat. http://tartarus.org/~martin/porterstemmer/ http://snowball.tartarus.org/demo.php Lemmatization: Reduce inflectional/variant forms to base form, e.g., am, are, is be car, cars, car's, cars' car Chapter 2: Introduction to Information Retrieval, Cambridge University Press, 2008 15

N-grams An n-gram is a sequence of items, e.g. characters, syllables or words. Can be applied to text spelling correction interactive meida >>>> interactive media Can also be used as indexing tokens to improve Web page search You can order the Google n-grams (6DVDs): http://googleresearch.blogspot.com/2006/08/all-our-n-gram-arebelong-to-you.html N-grams were under some criticism in NLP because they can add noise to information extraction tasks...but are widely successful in IR to infer document topics. 16

Tolerant retrieval Wildcards Enumerate all k-grams (sequence of k chars) occurring in any term Maintain a second inverted index from bigrams to dictionary terms that match each bigram. Spelling corrections Edit distances: Given two strings S 1 and S 2, the minimum number of operations to convert one to the other Phonetic corrections Class of heuristics to expand a query into phonetic equivalents Chapter 3: Introduction to Information Retrieval, Cambridge University Press, 2008 17

Documents representation After the text analysis steps, a document (e.g. Web page) is represented as a vector of terms, n-grams, (PageRank if a Web page), etc. : previou step document eg web page repres vector term ngram pagerank web page etc d = w 1,, w L, ng 1,, ng M, PR, 18

Comparision of text parsing techniques 19

Index for boolean retrieval docid 10 40 33 9 2 99... multimedia search docid 3 2 99 40... engines index crawler ranking inverted-file............... docid... docid URI 1 http://www.di.fct.unl... 2 http://nlp.stanford.edu/ir-book/... 3... 4......... 20

2. Term weighting Boolean retrieval looks for terms overlap What s wrong with the overlap measure? It doesn t consider: Term frequency in document Term scarcity in collection (document mention frequency) of is more common than ideas or march Length of documents (And queries: score not normalized) 21

Term weighting Term weighting tries to reflect the importance of a document for a given query term. Term weighting must consider two aspects: The frequency of a term in a document Consider the rarity of a term in the repository Several weighting intuitions were evaluated throughout the years: Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5 (Aug. 1988), 513-523. Robertson, S. E. and Sparck Jones, K. 1988. Relevance weighting of search terms. In Document Retrieval Systems, P. Willett, Ed. Taylor Graham Series In Foundations Of Information Science, vol. 3. 22

TF-IDF: Term Freq.-Inverted Document Freq. Text terms should be weighted according to their importance for a given document: tf i (d) = n i d and how rare a word is: idf i = log D df t i df t i = d: t i d The final term weight is: w i,j = tf i d j idf i Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 23

Inverse document frequency idf i = log D df t i df t i = d: t i d Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 24

3. Vector Space Model Each doc d can now be viewed as a vector of tf idf values, one component for each term So, we have a vector space where: terms are axes docs live in this space even with stemming, it may have 50,000+ dimensions First application: Query-by-example Given a doc d, find others like it. Now that d is a vector, find vectors (docs) near it. Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (Nov. 1975). 25

Documents and queries Documents are represented as an histogram of terms, n- grams and other indicators: d = w 1,, w L, ng 1,, ng M, PR, The text query is processed with the same text preprocessing techniques. A query is then represented as a vector of text terms and n-grams (and possibly other indicators): q = w 1,, w L, ng 1,, ng M 26

Intuition If d 1 is near d 2, then d 2 is near d 1. If d 1 near d 2, and d 2 near d 3, then d 1 is not far from d 3. No doc is closer to d than d itself. t 3 d 2 d 3 Postulate: Documents that are close together in the vector space talk about the same things. φ θ d 1 t 1 t 2 d 4 d 5 Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 27

First cut Idea: Distance between d 1 and d 2 is the length of the vector d 1 d 2. Euclidean distance Why is this not a great idea? We still haven t dealt with the issue of length normalization Short documents would be more similar to each other by virtue of length, not topic However, we can implicitly normalize by looking at angles instead 28

Angle as a similarity Distance between vectors d 1 and d 2 captured by the cosine of the angle x between them. Vectors pointing in the same direction Note this is similarity, not distance No triangle inequality for similarity. t 3 d 2 θ d 1 t 1 t 2 29

Vectors normalization A vector can be normalized (given a length of 1) by dividing each of its components by its length here we use the L 2 norm x 2 = i x i 2 This maps vectors onto the unit sphere. Then, n d j = w i,j = 1 i=1 Longer documents don t get more weight Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 30

Cosine similarity Cosine of angle between two vectors The denominator involves the lengths of the vectors. sim q, d i = cos q, d i = q d i q d i sim q, d i = cos q, d i = σ t q t d i,t σ t q t 2 σ 2 t d i,t Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 31

Improved ranking How to enhace/improve the previous model? Positional indexing Documents with distances between query terms greater than n are discarded Distance between query terms in the documents affect rank score Other ranking functions BM-25 Bayesian networks Learning to rank Chapter 6: Introduction to Information Retrieval, Cambridge University Press, 2008 32

BM-25 ranking function Proposed by Robertson et al. in 1994. score q, d j = q t f t,d k 1 + 1 k 1 1 b + b l avg l d + f t,d IDF t IDF q i = log N nd q i + 0.5 nd q i + 0.5 Chapter 11: Introduction to Information Retrieval, Cambridge University Press, 2008 33

4. Indexing This stage creates an index to quickly locate relevant documents An index is an aggregation of several data structures (e.g. B- trees or hash tables) Index compression is used to reduce the amount of space and the time needed to compute similarities The distribution of the index pages across a cluster improves the search engine responsiveness 34

Inverted file index v2 docid 10 40 33 9 2 99... weight 0.837 0.634 0.447 0.401 0.237 0.165... multimedia search engines docid 3 2 99 40... weight 0.901 0.438 0.420 0.265... index crawler ranking inverted-file............... docid... weight... docid URI 1 http://www.di.fct.unl... 2 http://nlp.stanford.edu/ir-book/... 3... 4......... 35

Inverted file index v3 docid 10 40 33... weight 0.837 0.634 0.447... multimedia pos 2,56,890 1,89,456 4,5,6 search engines index crawler ranking inverted-file............... docid 3 2 99 40... weight 0.901 0.438 0.420 0.265... pos 64,75 4,543,234 23,545 docid... weight... pos docid URI 1 http://www.di.fct.unl... 2 http://nlp.stanford.edu/ir-book/... 3... 4......... 36

5. Web page segments Web pages are divided into different parts (title, abstract, body, etc) Each part has a specific relevance to the main content A Web page can be divided by its HTML structure (e.g., <div> tags) or by its visual aspect. 37

Web page segmentation methods Segmenting visually Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. (2003). VIPS: A vision-based page segmentation algorithm. Linguistic approach Kohlschütter, C., Fankhauser, P., and Nejdl, W. (2010). Boilerplate detection using shallow text features. ACM Web Search and Data Mining. Densitometric approach Kohlschütter, C., and Nejdl, W., (2008). A densitometric approach to web page segmentation. ACM Conference on Information and Knowledge Management (CIKM '08). https://boilerpipe-web.appspot.com/ https://github.com/kohlschutter/boilerpipe 38

Multiple indexes Title, abstract, corpus, sections are all indexed separately Each zone index is inspected for the each query term The final rank is a linear combination of the multiple ranks Discussed in class 10. Field 1 Field 2 Field 3 39

Summary of basic techniques Text processing Text representation, stop words, stemming, lemmatization Term weighting Term weighting Vector space model TF-IDF and cosine distance Chapter 2, 6 Inverted index Web page segments 40