Information retrieval systems


2 Information retrieval systems Information retrieval (IR): n. searching for documents or information in documents. Question-answering: respond with a specific answer to a question (e.g., Wolfram Alpha). Document retrieval: find documents relevant to a query, ranked by relevance (e.g., Google). Text analytics/data mining: general organization of large textual databases (e.g., Lexis-Nexis, OpenText, MedSearch, ...). CSC401/2511 Spring

3 Terminology Information retrieval has slightly different terminology than the tasks we've seen previously. Document: a book, article, web page, or paragraph (depending on the task and data). Collection: a corpus of documents. Term: a word type. Stop word: a functional (non-content) word (e.g., the).

4 Query types Different kinds of questions can be asked. Factoid questions, e.g., "How often were the peace talks in Ireland delayed or disrupted as a result of acts of violence?" Narrative (open-ended) questions, e.g., "Can you tell me about contemporary interest in the Greek philosophy of stoicism?" Complex/hybrid questions, e.g., "Who was involved in the Schengen agreement to eliminate border controls in Western Europe and what did they hope to accomplish?"

5 Question answering (QA) Question answering (QA) usually involves a specific answer to a question, e.g., "Which woman has won more than 1 Nobel prize?" (Marie Curie).

6 Document retrieval vs IR One strategy is to turn question answering into information retrieval (IR) and let the human complete the task.

7 Question answering (QA)

8 Knowledge-based QA 1. Build a structured semantic representation of the query. Extract times, dates, locations, and entities using regular expressions. Fit to well-known templates. 2. Query databases with these semantics: ontologies (Wikipedia infoboxes), restaurant review databases, calendars, movie schedules.

9 IR-based QA

10 IR-based QA

11 IR-based QA (Figure labels: information retrieval, question answering.)

12 IBM's Watson (Architecture diagram. Components: Human 1 and Human 2; the game control system with the clue grid; Watson's game controller, which handles strategy and the decisions to buzz and bet and drives text-to-speech; and Watson's QA engine, labelled with clue and category, answers and confidences. Clues, scores, and other game data are exchanged with the players.) Watson's QA engine: 2,880 IBM Power750 compute cores, 15 TB of memory, and content equivalent to about 1,000,000 books. Source: "A Brief Overview and Thoughts for Healthcare Education and Performance Improvement" by the IBM Watson team.

13 IBM's Watson: search Example clue: "This man became the 44th President of the United States in 2008."

14 IBM's Watson: search Title-oriented search: in some cases, the solution is in the title of highly ranked documents. E.g., "This pizza delivery boy celebrated New Year's at Applied Cryogenics."

15 BASICS

16 Queries A query is a textual key which orders a specific subset of documents (or answers) in a collection. Historically, these were highly structured in a logical language, but in modern search engines queries are more often streams of syntactically disconnected keywords. A boolean query is a logical combination of boolean membership predicates, e.g., Brutus AND Caesar AND NOT Calpurnia.

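17 Term-document incidence (Table: a binary term-document incidence matrix with one column per play (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth) and one row per term (ANTONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER). An entry is 1 if the term occurs in the play and 0 otherwise.) For the query Brutus AND Caesar AND NOT Calpurnia, take the row for Brutus, the row for Caesar, and the complement of the row for Calpurnia, then combine them with a bitwise AND. The 1s in the result mark the matching plays.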
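A minimal sketch of the bitwise evaluation described on this slide. The bit patterns are assumed values chosen for illustration (one bit per play, leftmost bit for the first play), and `decode` is a hypothetical helper, not part of the lecture.

```python
# Illustrative incidence rows (assumed values, one bit per play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "brutus":    0b110100,
    "caesar":    0b110111,
    "calpurnia": 0b010000,
}

ALL_DOCS = (1 << len(PLAYS)) - 1  # mask with one bit set per document


def decode(bits):
    """Return the titles whose bit is set (leftmost bit = first play)."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Brutus AND Caesar AND NOT Calpurnia
result = (INCIDENCE["brutus"]
          & INCIDENCE["caesar"]
          & (~INCIDENCE["calpurnia"] & ALL_DOCS))

print(format(result, "06b"), decode(result))
```

The mask `ALL_DOCS` is needed because Python integers are unbounded, so a bare `~` would set bits beyond the six documents.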

18 Boolean Queries and big collections If we have 1 million documents, each with 1000 tokens, that is 1 billion tokens, so there are at most 1 billion 1s in the matrix. If we have 500,000 distinct terms, the term-document incidence matrix will have 500,000 × 1,000,000 = 500,000,000,000 elements. With at most 1 billion 1s, no more than 0.2% of the entries are non-zero. The matrix is very sparse and a waste of space. Can there be a better way?

19 Inverted index Given a query word, the inverted index for that word gives us all documents that contain that word in the title, the abstract (summary), some hidden metadata, or the entire text. More sophisticated versions also include the frequency and positions of the query word in each document. (Figure: the query "Python" is looked up in the inverted index, which returns the matching documents.) How does one construct such indices?

20 Inverted index construction 1. Collect the documents to be indexed: "Friends, Romans, countrymen", "So let it be with Caesar", ... 2. Tokenize the text: Friends, Romans, countrymen, So, ... 3. Do preprocessing and normalization, resulting in the indexing terms: friend, roman, countryman, so, ... 4. Create a dictionary (hash) that maps each term to the documents containing it.
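The four steps can be sketched as follows. This is a toy under stated assumptions: `normalize` is a hypothetical stand-in for real stemming or lemmatization (it only lowercases and strips a trailing "s", so it does not map "countrymen" to "countryman"), and the third document is invented for illustration.

```python
import re
from collections import defaultdict


def normalize(token):
    """Toy normalization: lowercase and strip a plural 's'."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_index(docs):
    """Map each normalized term to a sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                         # step 1: collect
        tokens = re.findall(r"[A-Za-z]+", docs[doc_id])  # step 2: tokenize
        terms = {normalize(t) for t in tokens}           # step 3: normalize
        for term in terms:                               # step 4: dictionary
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Friends, Romans, countrymen",
    2: "So let it be with Caesar",
    3: "Romans and friends of Caesar",  # hypothetical extra document
}
index = build_index(docs)
print(index["roman"], index["caesar"])
```

Because documents are visited in increasing id order, every postings list comes out sorted, which is what the linear-time intersection on the next slide relies on.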

21 Simple conjunctive query Given the query Brutus AND Calpurnia: 1. Locate Brutus in the dictionary and retrieve its document list. 2. Locate Calpurnia in the dictionary and retrieve its document list. 3. Intersect the two document lists and return the result to the user. If the lists are sorted, the intersection is linear in their combined length.
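A sketch of step 3, the linear-time merge of two sorted document lists. The document ids below are made up for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of document ids in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return out


# Hypothetical document lists for the two query terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, which is why the cost is linear rather than quadratic.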

22 Constructing indices Spiders (i.e., bots, crawlers) start with root (seed) URLs and follow all links on these pages recursively. Novel pages are processed and indexed. Despite the exponential growth in memory with depth, breadth-first search is quite popular. Depth-first search needs memory only linear in depth, but can get lost. Trivia: if you repeatedly click on the first contentful link starting from almost any Wikipedia page, you will eventually be led to the Philosophy article.
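A minimal sketch of breadth-first crawling, assuming the link structure is already available as an in-memory dictionary (a real spider would fetch and parse pages instead). The page names and the `max_pages` cap are illustrative assumptions.

```python
from collections import deque


def crawl_bfs(seeds, links, max_pages=100):
    """Visit pages breadth-first from the seeds; return them in visit order."""
    seen = set(seeds)      # pages already queued, so none is indexed twice
    queue = deque(seeds)   # the frontier; this is what grows with depth
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # "process and index" the novel page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical toy web: page -> outgoing links.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
print(crawl_bfs(["a"], toy_web))
```

Swapping `popleft()` for `pop()` turns the same loop into a depth-first crawl.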

23 Increasing entropy? Boolean retrieval is precise and was very popular for decades (it is still used for structured data, e.g., eHealth records). The amount and value of unstructured data (i.e., text) has grown faster than structured data on the web. (Figure: charts comparing unstructured and structured data by data volume and market cap; data from Chris Manning.)

24 Zipf's law on the web These variables have Zipfian distributions: number of links to and links from a page; length of web pages; number of web page hits. (Graph from Ray Mooney.)

25 New challenges for IR on the web Distributed data: documents spread over millions of web servers. Volatile data: documents change or disappear frequently and rapidly. Large volume: petabytes of data. Poor quality: no editorial control, false information, poor writing, typographic errors. Heterogeneity: various media, languages, encodings. Unstructured: no uniform structure, HTML errors, duplicate documents.

26 Detecting duplicates duplicates The user will become annoyed when many top-ranking hits are identical or similar. Nearly identical pages can be determined by hashing. E.g., don't index en.m.wikipedia.org/wiki/ if you've indexed en.wikipedia.org/wiki/. Zero marginal relevance occurs when a highly relevant document becomes irrelevant by being ranked below a (near-)duplicate.

27 Detecting duplicates duplicates Compute similarity with some edit-distance measure. Syntactic similarity (e.g., overlap of bigrams) is easier to measure than semantic similarity. If this measure is above some threshold $\theta$ for some pair of documents, we consider them duplicates. Jaccard coefficient: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. It is a measure of similarity on $[0, 1]$: $J(A, A) = 1$, and $J(A, B) = 0$ iff $A \cap B = \emptyset$.

28 Jaccard coefficient on 2-grams Documents: $d_1$: Jack London went to Toronto. $d_2$: Jack London went to the city of Toronto. $d_3$: Jack went from Toronto to London. $J(d_1, d_2) = \frac{3}{8} = 0.375$ and $J(d_1, d_3) = 0$.
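The slide's numbers can be checked with a short sketch that tokenizes on whitespace (an assumption; real systems normalize more carefully) and compares sets of word bigrams.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a whitespace-tokenized text."""
    words = text.split()
    return set(zip(words, words[1:]))


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


d1 = bigrams("Jack London went to Toronto")
d2 = bigrams("Jack London went to the city of Toronto")
d3 = bigrams("Jack went from Toronto to London")

print(jaccard(d1, d2), jaccard(d1, d3))
```

`d1` and `d2` share 3 of their 8 distinct bigrams, while `d1` and `d3` share none even though they use almost the same words, which is the point of comparing bigrams rather than unigrams.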

29 LINKS

30 Link analysis When we're crawling the web and indexing, we want to retain some record of similarity between (non-duplicate) documents in terms of their link structure. This will help in searching.

31 Bibliometrics: citation analysis Impact factor: developed in 1972 to measure the quality and influence of scientific journals; it measures how often articles are cited. Bibliographic coupling: a measure of similarity between documents according to the intersection of their citations (Kessler, 1963).

32 Bibliometrics: citation analysis Co-citation: a measure of similarity between documents according to the intersection of the documents that cite them (Small, 1973).

33 Links are not citations Many links are navigational within a website. Many pages with high in-degree are portals without much content. Some links are not necessarily endorsements. Relevance of citations in scientific settings is (theoretically) enforced by peer review. Can we mimic the enforcement of relevance usually done by human experts in scientific articles?

34 Authorities and hubs Authorities are pages recognized as significant, trustworthy, and useful for a topic. In-degree (number of incoming links) is an estimate of authority. Should incoming links from authoritative pages count more than others? Hubs are index pages that provide lots of links to relevant content pages, e.g., reddit.com is a hub page for memes.

35 HITS (hits hits hits hits) The HITS algorithm (Kleinberg, 1998) attempts to learn hubs and authorities on a given topic given relevant web subgraphs. Hubs and authorities tend to form bipartite graphs. (Figure: a bipartite graph with hubs on one side linking to authorities on the other.)

36 HITS First, find the (top $n$) most relevant pages for a query; this is the root set, $R$ (we'll see how to do this next lecture). Next, look at the link structure relative to $R$. The base set, $S$, is $R$ and all pages that link to and are linked from pages in $R$, so $R \subseteq S$.

37 HITS: Authorities and In-degree Even within $S$, nodes with high in-degree may not be authorities; they may just be generically popular pages. Authority should be determined by strong hubs. Iteratively (slowly) converge on a mutually reinforcing set of hubs and authorities. For every page $p \in S$, maintain an authority score $a_p$ (initialized to $1/|S|$) and a hub score $h_p$ (initialized to $1/|S|$).

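38 HITS update rules Authorities $p$ are pointed to ($\rightarrow$) by lots of good hubs $q$: $a_p = \sum_{q: q \rightarrow p} h_q$, e.g., $a_4 = h_1 + h_2 + h_3$ when pages 1, 2, and 3 link to page 4. Hubs point to lots of good authorities: $h_q = \sum_{p: q \rightarrow p} a_p$, e.g., $h_4 = a_1 + a_2 + a_3$ when page 4 links to pages 1, 2, and 3.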
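A sketch of the two update rules applied iteratively. Two choices here are assumptions rather than something the slide states: scores are rescaled to unit Euclidean length after every round so they stay bounded, and the hub update uses the freshly computed authority scores. The toy graph is invented.

```python
import math


def hits(links, iterations=50):
    """Iterate the authority and hub updates on a graph {page: [linked pages]}."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 / len(pages) for p in pages}
    hub = dict(auth)
    for _ in range(iterations):
        # a_p = sum of h_q over all q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # h_q = sum of a_p over all p that q links to
        new_hub = {q: sum(new_auth[p] for p in links.get(q, ()))
                   for q in pages}
        # Assumed normalization step: rescale both vectors to unit length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {q: v / h_norm for q, v in new_hub.items()}
    return auth, hub


# Hypothetical bipartite toy graph: three hubs pointing at two authorities.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(toy_links)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph `a1` ends up the strongest authority because all three hubs point to it, and `h3` is the weakest hub because it points to only one authority.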

39 PageRank PageRank (Brin & Page, 1998) is an alternative to HITS that does not distinguish between hub and authority.

40 PageRank initial idea In-degree alone does not account for the authority of the source of a link. For page $p$, the page rank is $R(p) = c \sum_{q: q \rightarrow p} \frac{R(q)}{N_q}$, where $N_q$ is the total number of out-links from $q$ and $c$ is a normalizing constant. A page's rank flows out equally among its outgoing links.

41 PageRank flow of authority PageRank iteratively adjusts all $R(p)$ until the overall page ranking converges to a steady state.

42 PageRank problem Groups of purely self-referential pages (linked from the outside) are sinks that absorb all the rank in the system during the iterative rank assignment process.

43 PageRank rank source An ethereal rank source $E$ continually replenishes the rank of each page $p$ by a fixed amount $E(p)$: $R(p) = c \left( \sum_{q: q \rightarrow p} \frac{R(q)}{N_q} + E(p) \right)$.
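A sketch of the iteration with a rank source. The assumptions: every page receives the same fixed amount $E(p)$ (here `source / number_of_pages`), and the normalizing constant $c$ is chosen each round so that the ranks sum to 1. The toy graph, with a self-referential pair acting as a sink, is invented.

```python
def pagerank(links, source=0.15, iterations=100):
    """Iterate R(p) = c * (sum_{q -> p} R(q) / N_q + E(p)) on {page: [out-links]}."""
    pages = sorted(set(links) | {p for ts in links.values() for p in ts})
    rank = {p: 1.0 / len(pages) for p in pages}
    e = source / len(pages)  # E(p): the same fixed amount for every page
    for _ in range(iterations):
        new = {p: e for p in pages}
        for q, targets in links.items():
            for p in targets:
                # q's rank flows out equally among its N_q out-links
                new[p] += rank[q] / len(targets)
        total = sum(new.values())
        rank = {p: v / total for p, v in new.items()}  # c = 1 / total
    return rank


# Hypothetical toy graph: c and d only link to each other (a rank sink).
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank(toy_graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

Without the rank source, `a` and `b` would drain to zero and the sink pair would hold all the rank; with it, every page keeps a non-zero score.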

44 Complete ranking A complete ranking involves combining: PageRank; preferences using HTML tags (e.g., title or abstract are often highly informative); similarity of query words and documents. How do we relate query words and documents in the first place?

45 VECTOR SPACES

46 The vector space model In the vector space model, queries and documents are both represented by (often unit-length) vectors in word space. Each dimension can be a word in the vocabulary. The domain of each dimension can be, e.g., $\{0, 1\}$: absent/present; $\mathbb{N}$: term frequency of word $w_i$ in document $d_j$ ($tf_{ij}$); $\mathbb{R}_{\geq 0}$: damped weight, e.g., $1 + \log(tf_{ij})$ if $tf_{ij} > 0$ and $0$ if $tf_{ij} = 0$. Note: vectors in the above domains can easily be normalized to unit vectors, if desired. Note 2: you don't always want these vectors to be unit length; sometimes you explicitly don't.

47 The vector space model If the query and the available documents can be represented by vectors, we can determine similarity according to the cosine method. Vectors that are near each other (within a certain angular radius) are considered relevant. (Figure: document vectors and a query vector $q$; the document with the smallest angle to $q$ is the closest.)

49 The cosine measure The cosine measure (a.k.a. "normalized correlation coefficient") is $\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \sqrt{\sum_{i=1}^{n} d_i^2}}$, where $q$ and $d$ are $n$-dimensional vectors for the query and document, respectively. Larger values of $\cos(q, d)$ mean stronger correlation, so $q$ is closer to $d_1$ than $d_2$ iff $\cos(q, d_1) > \cos(q, d_2)$.
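The cosine measure in a few lines, as a sketch over plain Python lists. The three vectors are made-up term-count vectors, and returning 0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention: an empty vector matches nothing
    return dot / (norm_q * norm_d)


# Hypothetical term-count vectors over a 3-word vocabulary.
query = [1, 1, 0]
doc1 = [2, 2, 0]   # same direction as the query, different length
doc2 = [0, 1, 3]

print(cosine(query, doc1), cosine(query, doc2))
```

`doc1` points in the same direction as the query, so its cosine is 1 regardless of its length; this is why the measure does not simply favour long documents.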

50 Term weighting What if we want to weight words in the vector space model? Term frequency, $tf_{ij}$: number of occurrences of word $w_i$ in document $d_j$. Document frequency, $df_i$: number of documents in which $w_i$ appears. Collection frequency, $cf_i$: total occurrences of $w_i$ in the collection.

51 Term frequency Higher values of $tf_{ij}$ (for contentful words) suggest that word $w_i$ is a good indicator of the content of document $d_j$. When comparing the relevance of a document $d_j$ to a keyword $w_i$, $tf_{ij}$ should be maximized. We often dampen $tf_{ij}$ to temper these comparisons. E.g., even if $tf_{ij} = 3\,tf_{ik}$, empirically we don't want to say that document $d_j$ is thrice as relevant as document $d_k$. $tf_{damped} = 1 + \log(tf)$, if $tf > 0$.

52 Document frequency The document frequency, $df_i$, is the number of documents in which $w_i$ appears. Meaningful words may occur repeatedly in a related document, but functional (or less meaningful) words may be distributed evenly over all documents. (Table: collection frequency and document frequency of kernel and try; both collection frequencies begin "10,", and the remaining digits and the document frequencies are missing.) E.g., kernel occurs about as often as try in total, but it occurs in fewer documents, so it represents a more specific concept.

53 Inverse document frequency Very specific words, $w_i$, would give smaller values of $df_i$. To maximize specificity, the inverse document frequency is $idf_i = \log\left(\frac{N}{df_i}\right)$, where $N$ is the total number of documents and we scale with $\log$, as before. This measure gives full weight to words that occur in 1 document, and zero weight to words that occur in all documents.

54 tf.idf We combine the term frequency and the inverse document frequency to give us a joint measure of relatedness between words and documents: $tf.idf(w_i, d_j) = (1 + \log(tf_{ij})) \log\left(\frac{N}{df_i}\right)$ if $tf_{ij} \geq 1$, and $0$ if $tf_{ij} = 0$.
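A sketch of the combined weight: damped term frequency times inverse document frequency. The natural logarithm is an assumption (the slide does not fix a base), and the three tiny documents are invented.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, collection):
    """(1 + log tf) * log(N / df) if the term occurs in the document, else 0."""
    tf = Counter(doc_tokens)[term]
    if tf == 0:
        return 0.0
    df = sum(1 for d in collection if term in d)
    if df == 0:
        return 0.0  # only possible if the document is not in the collection
    return (1 + math.log(tf)) * math.log(len(collection) / df)


# Hypothetical collection of tokenized documents.
collection = [
    ["the", "kernel", "kernel", "panic"],
    ["the", "try", "block"],
    ["the", "try", "again"],
]

for word in ("kernel", "the", "try"):
    print(word, tf_idf(word, collection[0], collection))
```

"the" occurs in every document, so its idf is $\log(3/3) = 0$ and it gets zero weight; "kernel" occurs twice in one document only, so it scores $(1 + \log 2)\log 3$.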

55 Aspects of tf.idf The $tf.idf$ score has been criticised for being ad hoc, but it has been shown to be robust and effective in a wide range of applications where a rough estimate of relatedness is needed. However: the effectiveness of $tf.idf$ can vary depending on the number of words in a query; ambiguous words can cause spurious matches; vectors of terms collected independently in this way can't capture semantic connections between words. We somehow need vectors of ideas/concepts.

56 Latent semantic indexing Co-occurrence: n. when two or more terms occur in the same documents more often than by chance. Note: this is not quite the same thing as collocations. Consider the following:

             Term 1   Term 2      Term 3   Term 4
Query        user     interface
Document 1   user     interface   HCI      interaction
Document 2                        HCI      interaction

Document 2 appears to be related to the query although it contains none of the query terms. The query and document 2 are semantically related.

57 Latent semantic indexing Latent semantic indexing projects queries and documents into a space with latent (i.e., hidden) semantic dimensions. Co-occurring terms are projected onto the same latent dimensions. Non-co-occurring terms are not. In latent space, a query and a document can have high cosine similarity even if they do not share any terms. The latent semantic space has fewer dimensions than the original space (which had 1 dimension per term).

58 Latent semantic indexing There are many different mappings from high-dimensional spaces to low-dimensional spaces. The low dimensions are not usually selected from among the existing dimensions (in this case). Recall, however, feature selection from Lecture 3-1. We must learn some function that projects our original dimensions onto some new dimensions.

59 Latent semantic indexing example This original space has 5 dimensions (one per term: cosmonaut, astronaut, moon, car, truck), with documents $d_1, \ldots, d_6$ as the columns of the term-by-document matrix. The reduced space has two dimensions, perhaps vaguely referring to outer space and to vehicles, respectively. What are some ways of projecting onto fewer dimensions?

60 1. Principal components analysis (PCA) PCA is an eigendecomposition of the covariance of the data. It gives us a sequence of orthogonal vectors, each of which captures the maximum amount of variance remaining in the data. (Figure: a 2-D scatter plot in which most of the variance lies along the first principal component, which captures the most change in the data.) We can rotate the data so that it's expressed in the new dimensions.

61 1. Principal components analysis (PCA) If we express the data only in terms of the first principal component, it will be easier to learn a model (we have fewer parameters), but we will lose some information. (What if the classes we want to distinguish are all differentiated along the discarded component? This should be rare.)

62 2. Singular value decomposition Singular value decomposition (SVD) can be used both as a method of co-occurrence analysis between words that aids in similarity judgements, and as a method for dimensionality reduction. Just as linear regression projects 2-dimensional data onto a 1-dimensional line, SVD projects an $n$-dimensional matrix, $A$, onto a $k$-dimensional matrix, $\hat{A}$, where $n \gg k$. The matrix $\hat{A}$ is produced such that some maximal amount of information in $A$ is retained.

63 2. Singular value decomposition The SVD projection is computed by decomposing the term-by-document matrix $A_{t \times d}$ into the product of three matrices, $T_{t \times n}$, $S_{n \times n}$, and $D_{d \times n}$, where $t$ is the number of words (terms), $d$ is the number of documents, and $n = \min(t, d)$. Specifically, $A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^{\top}$.

64 2. Singular value decomposition (Figure: schematic of the decomposition of $A$.)

65 SVD example (Matrices: the term-by-document matrix $A$, with rows cosmonaut, astronaut, moon, car, truck and columns $d_1, \ldots, d_6$, shown as the product of $T$, $S$, and $D^{\top}$; the numeric entries are missing.) What do these matrices mean?

66 SVD example $A$ is the matrix of term frequencies, $tf_{ij}$. E.g., moon occurs once in $d_1$ and once in $d_2$.

67 SVD example Matrices $T$ and $D$ represent terms and documents, respectively, in this new space. E.g., the first row of $T$ corresponds to the first row of $A$, and so on. $T$ and $D$ are orthonormal, so all columns are orthogonal to each other and $T^{\top} T = D^{\top} D = I$.

68 SVD example The matrix $S$ contains the singular values of $A$ in descending order. The $i^{th}$ singular value hints at the amount of variation on the $i^{th}$ axis.

69 SVD example By restricting $T$, $S$, and $D$ to their first $k < n$ columns, their product gives us $\hat{A}$, a best least squares approximation of $A$. (Matrices: the reduced $T$ and $D$ for this example; the numeric entries are missing.)
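The truncation can be written out explicitly. The block below is a sketch in the notation of the decomposition $A = T S D^{\top}$; the symbols $s_i$ for the singular values and the subscripted sizes of the truncated matrices are notation introduced here. The error identity is the standard Eckart-Young result for the Frobenius norm, which is the sense in which the approximation is "best least squares".

```latex
\hat{A}_{t \times d}
  = T_{t \times k} \, S_{k \times k} \, (D_{d \times k})^{\top}
  = \sum_{i=1}^{k} s_i \, \mathbf{t}_i \, \mathbf{d}_i^{\top},
\qquad
\lVert A - \hat{A} \rVert_F^2 = \sum_{i=k+1}^{n} s_i^2
```

Here $\mathbf{t}_i$ and $\mathbf{d}_i$ are the $i^{th}$ columns of $T$ and $D$. Because the singular values are in descending order, dropping the last $n - k$ of them discards the directions that contribute least to $A$.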

70 SVD in practice (Figure: words projected into a low-dimensional space, forming clusters of body parts, place names, and animals.) Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:

71 SVD in practice Trained on the Google news corpus with over 300 billion words.

72 SVD projection Recall that the purpose of doing SVD was to find co-occurrence patterns among terms and documents. When SVD finds the optimal projection to a lower-dimensional space, the result is that words that have similar co-occurrence patterns are projected onto the same dimensions, e.g., car and truck could be projected onto the same dimensions. Therefore, we can identify similarities between queries and documents even if they share no terms in common. Note: PCA typically uses SVD (e.g., it does in Matlab).

73 Similarities in lower dimensions Consider $B = S_{2 \times 2} D_{2 \times d}$ from the previous example, whose columns correspond to documents $d_1, \ldots, d_6$. If we normalize the columns of $B$ and multiply its transpose with it, we get the $6 \times 6$ matrix $B^{\top} B$ of document-document similarities. (The numeric entries are missing.)

74 Similarities in lower dimensions This matrix gives us the correlation coefficients between documents in their latent (hidden) semantic space, regardless of their constituent terms. (Table: pairwise correlation coefficients for $d_1, \ldots, d_6$, with 1 on the diagonal.) E.g., some pairs of documents are very similar in this space, whereas others are not.

75 Latent semantic indexing in IR The process of projecting different (but related) terms to common underlying dimensions in the semantic space can be thought of as soft clustering. This is desirable: if a user poses a particular query, we want to identify all documents that might be of use, regardless of the specific terms used. Occasionally this process results in some spurious matches, especially with polysemous (many-meaning) words, but it remains a useful tool in practice.

76 Aside: neural networks (cut for time) We've seen word embeddings before, from word2vec. Everything is the same, but we swap out SVD for NNs. Sometimes, small amounts of labeled data can be used to retrain the network. Mitra B, Craswell N. (2017) Neural Models for Information Retrieval.

77 Aside: neural networks (cut for time) Often word2vec or GloVe vectors are trained on some massive global corpus. Global word embeddings risk capturing only coarse representations of topics dominant in the corpus. Diaz F, Mitra B, Craswell N. (2016) Query Expansion with Locally-Trained Word Embeddings, Proc. of ACL, doi: /v1/p

78 Aside: neural networks (cut for time) Query expansion involves reweighting likelihoods, usually through deleted interpolation, as we saw with HMMs: $p_q^1(w) = \lambda p_q(w) + (1 - \lambda) p_{q^+}(w)$. $p_{q^+}$ comes from taking the $|V| \times k$ term embedding matrix $U$ and the $|V| \times 1$ query term vector $q$, taking the top terms from $U U^{\top} q$, and normalizing their weights. We can also recompute the probability of a document as the softmax over KL divergences between distributions for query terms and document terms. This emphasizes documents that are more similar to the queries. Diaz F, Mitra B, Craswell N. (2016) Query Expansion with Locally-Trained Word Embeddings, Proc. of ACL, doi: /v1/p

79 Mitra B, Craswell N. (2017) Neural Models for Information Retrieval. Zhang Y, Rahman MM, Braylan A, et al. (2016) Neural Information Retrieval: A Literature Review.

80 MISCELLANEOUS

81 Naïve Bayes Naïve Bayes can be used for ad hoc document retrieval. Imagine each of $N$ documents is a class with one training example: the document itself. Rank documents $d_i$ based on the posterior probability of a query $q$ being generated by those documents, $P(q; d_i)$. This is sometimes called a language modelling approach.

82 Naïve Bayes generative model Example query $q$: "The greatest and best song". (Figure: each document's model assigns the query a probability, and the ranked retrievals are listed in decreasing order of that probability: 0.3, 0.28, 0.1, 0.08.)

83 Smoothing Since each language model, $d_i$, will be very sparse, we need to smooth. Linear interpolation: $P(w; d_i) = \lambda \hat{P}(w; d_i) + (1 - \lambda) \hat{P}(w)$, where $\hat{P}(w; d_i)$ is the probability of $w$ in the language model learned on $d_i$ and $\hat{P}(w)$ is the probability of $w$ given the entire corpus. Add-$\delta$: same as in Assignment 2.
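A sketch of ranking by query likelihood with linear interpolation. The assumptions: unigram models estimated by relative frequency, a fixed $\lambda = 0.5$, and three invented documents; the query is the one from the previous slide, lowercased.

```python
from collections import Counter


def query_likelihood(query, doc, corpus_counts, corpus_len, lam=0.5):
    """P(q; d) as a product of interpolated unigram probabilities."""
    doc_counts = Counter(doc)
    prob = 1.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)          # estimate from the document
        p_corpus = corpus_counts[w] / corpus_len  # estimate from the corpus
        prob *= lam * p_doc + (1 - lam) * p_corpus
    return prob


# Hypothetical collection; each document is its own "class".
docs = {
    "d1": "the greatest song".split(),
    "d2": "the best and greatest song ever".split(),
    "d3": "a song about the sea".split(),
}
corpus_counts = Counter(w for d in docs.values() for w in d)
corpus_len = sum(corpus_counts.values())

query = "the greatest and best song".split()
scores = {name: query_likelihood(query, d, corpus_counts, corpus_len)
          for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Without the corpus term, `d1` and `d3` would score exactly 0 because each lacks at least one query word; interpolation keeps them comparable while still ranking `d2`, which contains every query word, first.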

84 Experimental results Linear interpolation performs almost as well as vector-space ranking (VSR). Add-$\delta$ (called Laplace here) not so much. (Figure from Ray Mooney.)

85 Evaluation How can we decide which of two search engines is better at responding to a query? Precision is the proportion of returned documents that are relevant. Recall is the proportion of relevant documents that are returned. Rank is also important: we want the correct documents to be near the top of the ordered list.

86 Recall is not enough If there are 20 relevant documents in a collection, each of the following systems has a recall of $5/20 = 25\%$. (Table: ranked results for three systems, Boogle, Ging, and Whoopie.)

87 Precision is not enough Each of the following systems has a precision of 50%. (Table: ranked results for Boogle, Ging, and Whoopie.)

88 Cutoff precision Measure precision down to some cutoff, e.g., the first 5 results. (Table: ranked results for Boogle, Ging, and Whoopie.) Cutoff precision (CP): Boogle 0%, Ging 100%, Whoopie 60%.

89 Uninterpolated Average Precision Measure precision at each relevant document returned. For Boogle, the relevant documents appear at ranks 6 to 10, with precisions 1/6, 2/7, 3/8, 4/9, and 5/10. UAP sums these precisions and normalizes at the end by the number of relevant documents.
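A sketch of the computation. Which count the slide normalizes by is not fully recoverable, so `normalizer` is an explicit, hypothetical parameter: by default it is the number of relevant documents returned, and passing the collection-wide count (e.g., 20) gives the other common variant.

```python
import math


def uninterpolated_average_precision(relevance, normalizer=None):
    """Average the precision measured at each relevant result.

    relevance: booleans in rank order (True = relevant).
    normalizer: count to divide by; defaults to the relevant results returned.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    if normalizer is None:
        normalizer = hits
    return sum(precisions) / normalizer if normalizer else 0.0


# Relevance patterns assumed from the cutoff-precision slide:
# Boogle's relevant results are at ranks 6-10, Ging's at ranks 1-5.
boogle = [False] * 5 + [True] * 5
ging = [True] * 5 + [False] * 5

print(uninterpolated_average_precision(boogle),
      uninterpolated_average_precision(ging))
```

Both systems return the same five relevant documents, so precision and recall over the full list cannot tell them apart, but UAP rewards the system that ranks them first.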

90 Uninterpolated Average Precision (Table: UAP for Boogle, Ging, and Whoopie.)

91 Precision versus recall UAP is useful when your primary concern is relevance order. Returning all documents available will give you 100% recall, but would be useless to the user. Returning one document may give you high precision, but very low recall. The F-score combines precision and recall with a parameter $\alpha$ that weighs the two: $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$.
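A sketch of precision, recall, and a weighted F-score. The weighted harmonic-mean form with a single parameter `alpha` is an assumption about which variant the slide intends; with `alpha = 0.5` it reduces to the usual $2PR / (P + R)$. The document ids are invented to match the earlier slides (20 relevant documents, 5 of them among 10 returned).

```python
import math


def precision_recall(returned, relevant):
    """Precision and recall of a returned set against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f_score(precision, recall, alpha=0.5):
    """Weighted harmonic mean; alpha weighs precision, 1 - alpha recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


# Hypothetical ids: documents 0-19 are relevant; the system returns
# five of them plus five irrelevant ones.
relevant_docs = range(20)
returned_docs = [0, 1, 2, 3, 4, 100, 101, 102, 103, 104]

p, r = precision_recall(returned_docs, relevant_docs)
print(p, r, f_score(p, r))
```

With precision 0.5 and recall 0.25 the balanced F-score is 1/3, closer to the smaller of the two values, which is the behaviour that makes it harder to game than either measure alone.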

92 Conclusion Some slides and material are based on those of Ray J. Mooney (UTexas, CS371R), Hinrich Schütze, Christina Lioma, Chris Manning (Stanford, CS276), and Dan Jurafsky (Stanford, CS124).


More information

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39 boolean model September 9, 2014 1 / 39 Outline 1 boolean queries 2 3 4 2 / 39 taxonomy of IR models Set theoretic fuzzy extended boolean set-based IR models Boolean vector probalistic algebraic generalized

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2013 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures

More information



