COSC 431 Information Retrieval. Phrase Search & Structured Search

1 COSC 431 Information Retrieval: Phrase Search & Structured Search

2 Outline
- Phrase searching
- What are
  - Structured documents
  - Meta-data
  - Semi-structured documents
- Searching
  - Structured documents
  - Meta-data
  - Semi-structured documents
- Searching semi-structured documents
  - Progressive filters
  - Embedded paths
  - Dewey decimal codes

3 Phrase Searching Using AND
- Phrase search: "Information Retrieval"
  - A = postings for "Information"
  - B = postings for "Retrieval"
  - P = A & B
- Advantages
  - No false negatives
  - Fast
  - No changes to the search engine (just to the query parser)
  - High recall
- Disadvantages
  - Many false positives
  - Low precision (apparently)
  - Can't do proximity ("within n words of") searching
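
The P = A & B step can be sketched as a linear merge of two sorted postings lists. This is a minimal illustration, not the course's implementation; the function name and the document IDs in the example lists are made up.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs with a linear merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical postings (document IDs only) for the two query terms.
postings_information = [1, 3, 4, 7]
postings_retrieval = [2, 3, 7, 9]
candidates = intersect(postings_information, postings_retrieval)
```

Every document in `candidates` contains both words, but nothing guarantees that they are next to each other, which is where the false positives come from.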

4 Phrase Searching Using Bi-grams
- Index adjacent words as a single term
- Example documents
  - Doc1: University of Otago
  - Doc2: Otago University
- Bi-gram postings:
  - Of Otago <1,1>
  - Otago University <2,1>
  - University Of <1,1>
- Searching for the bi-gram "Otago University" is now the same as searching for any single term
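
A small sketch of bi-gram indexing for the two example documents. The function name, the lowercasing, and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import defaultdict


def build_bigram_index(docs):
    """Map each pair of adjacent words to a sorted list of (doc_id, frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            bigram = first + " " + second
            index[bigram][doc_id] = index[bigram].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


docs = {1: "University of Otago", 2: "Otago University"}
bigram_index = build_bigram_index(docs)
```

Looking up `bigram_index["otago university"]` is then an ordinary single-term lookup.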

5 Phrase Searching Using N-grams
- How to search for "University of Otago"?
  - Search for "University of" and "of Otago" and assume any document containing both is correct
  - Search for "University of" and "of Otago" adjacent to each other?
  - Could use tri-grams
- Where should n in n-gram stop?

6 Phrase Searching
- For arbitrary proximity / adjacency search it is necessary to store word positions
- Where before <d_n, f_n> was used for postings, now <d_n, w_n> is used
  - w_n is the word number within the given document
- <d_n, f_n> can be calculated from <d_n, w_n> by counting the number of w_n for a given d_n
  - Relevance ranking is therefore not affected
- You can compute idf_phrase from these postings
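
Recovering <d_n, f_n> from <d_n, w_n> is a matter of counting positions per document. A minimal sketch, assuming the positional postings are sorted by document; the function name and the example postings are illustrative.

```python
from itertools import groupby


def to_frequency_postings(positional):
    """Collapse sorted (doc, word_position) postings into (doc, term_frequency)."""
    return [
        (doc, sum(1 for _ in group))
        for doc, group in groupby(positional, key=lambda posting: posting[0])
    ]


# Hypothetical term that occurs twice in document 1 and once in document 2.
frequency_postings = to_frequency_postings([(1, 3), (1, 9), (2, 1)])
```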

7 Short-Phrase Searching
- Example documents
  - Doc1: University of Otago
  - Doc2: Otago University
- Postings become:
  - Of <1,2>
  - Otago <1,3><2,1>
  - University <1,1><2,2>
- To find "Otago University"
  - Load postings for Otago (L_1)
  - Load postings for University (L_2)
  - Merge together looking for
    - L_1.d = L_2.d
    - L_1.w = L_2.w - 1
  - That is, the document is the same and the position in L_2 is one larger than the position in L_1
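
The merge can be sketched as below, using the postings from this slide. The function name `adjacent` and the choice to return the position of the second word are assumptions.

```python
def adjacent(l1, l2):
    """Return the (d, w) postings of l2 that directly follow a posting of l1.

    Both lists must be sorted by (document, word position).
    """
    i = j = 0
    result = []
    while i < len(l1) and j < len(l2):
        d1, w1 = l1[i]
        wanted = (d1, w1 + 1)  # where the second word would have to be
        if wanted == l2[j]:
            result.append(l2[j])
            i += 1
            j += 1
        elif wanted < l2[j]:
            i += 1
        else:
            j += 1
    return result


otago = [(1, 3), (2, 1)]
university = [(1, 1), (2, 2)]
hits = adjacent(otago, university)
```

Swapping the arguments searches for "University Otago" instead, which neither document contains.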

8 Proximity Searching
- To find Otago within n words of University
  - Load postings for Otago (L_1)
  - Load postings for University (L_2)
  - Merge together looking for
    - L_1.d = L_2.d
    - |L_2.w - L_1.w| <= n
  - That is, the document is the same and the distance between the two words is less than or equal to n
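
A sketch of the proximity test, assuming "within n words" means an absolute distance of at most n in either direction. For brevity it groups the second list by document and uses a binary search rather than a strict two-pointer merge; the function name is illustrative.

```python
from bisect import bisect_left
from collections import defaultdict


def within(l1, l2, n):
    """Documents in which some l1 position is at most n words from an l2 position."""
    positions2 = defaultdict(list)
    for d, w in l2:
        positions2[d].append(w)  # stays sorted because l2 is sorted
    matches = []
    for d, w in l1:
        candidates = positions2.get(d, [])
        k = bisect_left(candidates, w - n)  # first l2 position >= w - n
        found = k < len(candidates) and candidates[k] <= w + n
        if found and (not matches or matches[-1] != d):
            matches.append(d)
    return matches


otago = [(1, 3), (2, 1)]
university = [(1, 1), (2, 2)]
near_1 = within(otago, university, 1)
near_2 = within(otago, university, 2)
```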

9 Long-Phrase Searching
- E.g. "University of Otago"
  - T_1 = university
  - T_2 = of
  - T_3 = otago
- Algorithm
    L_1 = postings(T_1)
    For n = 2 to |T|
      L_2 = postings(T_n)
      L_1 = adjacent(L_1, L_2)
- In other words, the result of adjacency is a postings list, so phrases of varying length can be found using the same algorithm
- But more commonly, a multi-way merge is performed where the shortest list drives the merge (with skipping)
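
The loop above translates almost directly into code. In this sketch `adjacent` is implemented with a set lookup instead of a merge to keep it short, and the in-memory `index` dictionary stands in for real postings storage; both are simplifications.

```python
def adjacent(l1, l2):
    """Postings of l2 that sit one word after a posting of l1 in the same document."""
    follows = {(d, w + 1) for d, w in l1}
    return [posting for posting in l2 if posting in follows]


def phrase(index, terms):
    """Fold adjacency over the terms; the result is itself a postings list."""
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = adjacent(result, index.get(term, []))
    return result


index = {
    "university": [(1, 1), (2, 2)],
    "of": [(1, 2)],
    "otago": [(1, 3), (2, 1)],
}
hits = phrase(index, ["university", "of", "otago"])
```

Each returned posting marks the position of the last word of the phrase.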

10 Efficient Phrase Searching
- How do we reduce the computation?
- Use w_n counted from the start of the collection, and insert a gap between documents (e.g. 100 words) to prevent finding phrases across document boundaries
- Example documents
  - Doc1: University of Otago
  - Doc2: Otago University
- Postings become:
  - Of <1,2>
  - Otago <1,3><2,104>
  - University <1,1><2,105>

11 Efficient Phrase Searching
- There is no longer any need to compare d_n, as word numbers are no longer re-used
- If L_2.w = L_1.w + 1 then the terms must be adjacent
- Postings become <w_1><w_2>...<w_n>
  - Of <2>
  - Otago <3><104>
  - University <1><105>
- Then convert into document IDs by merging with a boundary list: <3><105>
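
A sketch of the collection-wide position scheme, using the postings from this slide. It assumes the boundary list holds the last word position of each document (3 for Doc1, 105 for Doc2) and uses a binary search in place of a merge; the function names are illustrative.

```python
from bisect import bisect_left


def adjacent_global(l1, l2):
    """Positions in l2 that are exactly one word after a position in l1."""
    follows = {w + 1 for w in l1}
    return [w for w in l2 if w in follows]


def to_documents(positions, boundaries):
    """Map collection-wide positions to document IDs (documents numbered from 1)."""
    docs = []
    for w in positions:
        d = bisect_left(boundaries, w) + 1
        if not docs or docs[-1] != d:
            docs.append(d)
    return docs


otago = [3, 104]
university = [1, 105]
boundaries = [3, 105]  # last word position of Doc1 and Doc2

hits = adjacent_global(otago, university)
docs = to_documents(hits, boundaries)
```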

12 Region Algebra
- Storing term positions rather than document numbers is the approach taken by the Wumpus search engine from the University of Waterloo
- We'll see a little later on that it can be adapted to semi-structured documents with little extra effort
- This branch of research is known as region algebra

13 Structured Documents
- Not all documents are flowing text (most contain some form of structure)
  - Card catalogue
  - EndNote, Reference Manager, ProCite, MEDLARS
    - Two Letter Format
    - Example from PubMed
- Today, XML and JSON are the preferred formats

14 Structured Documents
ID -
AU - Gritzalis D
AU - Kokolakis S
DP - 2003
TI - Security policy development for Healthcare Information Systems.
TA - Stud Health Technol Inform
VI - 96
PG - 105-10
AB - In this paper the issue of security policy development for health information systems is addressed. Security policy development involves the definition of the policy content, the analysis of the social, organisational, and technical contexts, as well as the organisation of the policy development process. We present the structure of security policies, analyse the characteristics of the HIS context, and analyse the different categories of methodologies, which can be used towards this end.

15 Structured Documents
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="Completed">
    <PMID> </PMID>
    <MedlineJournalInfo>
      <MedlineTA>Stud Health Technol Inform</MedlineTA>
    </MedlineJournalInfo>
    <Article>
      <Journal>
        <JournalIssue PrintYN="Y">
          <Volume>96</Volume>
          <PubDate><Year>2003</Year></PubDate>
        </JournalIssue>
      </Journal>
      <ArticleTitle>Security policy development for Healthcare Information Systems.</ArticleTitle>
      <Pagination><MedlinePgn>105-10</MedlinePgn></Pagination>
      <Abstract>
        <AbstractText>In this paper the issue of security policy development for health information systems is addressed. Security policy development involves the definition of the policy content, the analysis of the social, organisational, and technical contexts, as well as the organisation of the policy development process. We present the structure of security policies, analyse the characteristics of the HIS context, and analyse the different categories of methodologies, which can be used towards this end.</AbstractText>
      </Abstract>
      <AuthorList CompleteYN="Y">
        <Author>
          <LastName>Gritzalis</LastName><Initials>D</Initials>
        </Author>
        <Author>
          <LastName>Kokolakis</LastName><Initials>S</Initials>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>

16 Structured Documents
- The document could be stored in a relational database (e.g. Oracle)
- Article table: ARTID, JID, ABID, Date, Vol, Page, TITLE
  - TITLE: Security policy development for Healthcare Information Systems.
- Abstract table: ABID, Abstract
  - 1: In this paper the issue of security policy development for health information systems is addressed. Security policy development involves the definition of the policy content, the analysis of the social ...
- Journal table: JID, Journal
  - 1: Stud Health Technol Inform
- Article-author link table: ARTID, AID
- Author table: AID, Surname, Initial
  - 1: Gritzalis, d
  - 2: Kokolakis, s

17 Properties of Structured Documents
- Information is in structures (fields)
- Not all structures need to appear in every document
- Structures are flat (even if they appear to be hierarchical)
- Can easily be kept in a database
  - Many structured documents originate in databases
  - When converting to XML, the relational references get dropped and the entities get strung together to form documents
- New questions can be asked of this structured information

18 Structured Documents
- What has Dr Brown written?
- Compare the precision of "Brown as author" to "document contains Brown"
- Compare the recall of "Brown as author" to "document contains Brown"
- If documents are marked up correctly, structured information retrieval should increase precision while not forfeiting recall

19 Structured Searching
- Can't ask relational questions
  - Who publishes with Dr Brown?
- Can only ask IR questions
  - What documents contain author Brown?
- Examples
  - What has been published in SIGIR?
  - What was published in 2009?
  - What cited Dr Brown in the 2009 SIGIR?

20 Metadata
- Metadata is information known about the document that is not part of the document
- In HTML, this includes the web-page URL, and often the text of anchors pointing to the page

21 Metadata
- How does Google know which pages come from New Zealand when this (often) isn't in the page itself?
- Google allows a structured search on the metadata in combination with the document contents
  - e.g. otago site:nz

22 Metadata
- Metadata is often structured
  - It can be thought of as part of the document, only you can't see it
  - In HTML, <meta> tags are used; the contents of the <meta> tags are not displayed
- If the metadata can be constructed and linked to the document at index time, it becomes possible to index it along with the document (convert it into structured data)
  - Either add the metadata to the document, or index it as terms in the document without adding it to the document (i.e. don't make the document longer)
- eBay (and others) do a substantial amount of this

23 Metadata
- If it's possible to index virtual metadata, it's possible to index virtual documents
  - select * from table where author = 'brown'
  - Index each row the database returns as a document, then discard the row
  - The document's id should be somehow linked back into the database (the rowid?)

24 Searching Structured Documents
- In a structured document, all the data is in non-overlapping structures
- Earlier example:
  AU - Gritzalis D
  AU - Kokolakis S
  DP - 2003
  TI - Security policy development for Healthcare Information Systems.
  TA - Stud Health Technol Inform
  VI - 96
  PG - 105-10
- Unique words by structure (stopping numbers):
  - AU: d gritzalis kokolakis s
  - TA: health inform stud technol
  - TI: development for healthcare information policy security systems
- Build an inverted file index for each unique structure

25 Searching Structured Documents
- AU dictionary: d gritzalis kokolakis s
  - AU postings: <1,1>
- TA dictionary: health inform stud technol
  - TA postings: <1,1>
- TI dictionary: development for healthcare information policy security systems
  - TI postings: <1,1>

26 Searching Structured Documents
- Each structure is in its own index
- Given the query "Gritzalis as author" (Gritzalis:AU)
  - Determine which inverted index to use and load the postings from there
  - From the AU dictionary
    - Binary search for the word Gritzalis
    - If found, load the postings
    - Process postings as usual
- Given the query "security in title or abstract" (security:ti or security:ab)
  - First search the TI index, then the AB index

27 Searching Structured
- The cost of searching one inverted index is two disk seeks and two disk reads
- The cost of searching every index for documents with 9 structures is therefore 18 seeks and 18 reads
  - This is too slow!

28 Solution: Searching Structured
- Users usually perform simple searches without structural constraints, so we build a global index too:
  - TI dictionary: development for healthcare information policy security systems (points to TI postings)
  - TA dictionary: health inform stud technol (points to TA postings)
  - AU dictionary: d gritzalis kokolakis s (points to AU postings)
  - ALL dictionary: d development for gritzalis health healthcare inform information kokolakis policy s security stud systems technol (points to ALL postings)

29 eBay
- The eBay search engine sometimes does this, but merges the vocabularies by prefixing each term with the zone
- For efficiency it includes a default zone
- Vocabulary (each entry points into the postings):
  - Default zone: :d :development :for :gritzalis :health :healthcare :inform :information :kokolakis :policy :s :security :stud :systems :technol
  - AU: AU:d AU:gritzalis AU:kokolakis AU:s
  - TA: TA:health TA:inform TA:stud TA:technol
  - TI: TI:development TI:for TI:healthcare TI:information TI:policy TI:security TI:systems
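
A sketch of a single merged vocabulary with zone prefixes and a default (empty) zone, built from the example record. The function name, the tokenisation, and the use of document-ID sets as postings are assumptions for illustration, not a description of eBay's actual engine.

```python
from collections import defaultdict


def build_zone_index(docs):
    """Index every word under 'ZONE:word' and under the default zone ':word'."""
    index = defaultdict(set)
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for word in text.lower().replace(".", "").split():
                index[zone + ":" + word].add(doc_id)
                index[":" + word].add(doc_id)  # default zone: whole document
    return index


docs = {
    1: {
        "AU": "Gritzalis D Kokolakis S",
        "TA": "Stud Health Technol Inform",
        "TI": "Security policy development for Healthcare Information Systems.",
    }
}
zone_index = build_zone_index(docs)
```

A structured query such as Gritzalis:AU and an unstructured query such as security are now both a single dictionary lookup (`"AU:gritzalis"` and `":security"`).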

30 Semi-Structured
- Semi-structured data is not flat-structured
- Compare
    <title>Evolution of the second beta-galactosidase of Escherichia coli</title>
  to
    <title>Evolution of the second beta-galactosidase of <species>Escherichia coli</species></title>
- Does "Escherichia coli" lie in the <title> structure or the <species> structure?
  - In a semi-structured document it is in <title>, it is in <species>, and it is in <species> in <title>
- Semi-structured formats include: SGML, XML, HTML, and so on

31 Semi-Structured
- Many different kinds of queries
- Find documents containing:
  - The given term
  - The given path
  - The given term in a given tag
  - The given term in a partially specified path
  - The given term in a fully specified path
- And there are new complications in phrase search:
    <title><species>E. coli</species> inquiry calls for stricter laws</title>
  - Find "coli inquiry" crossing tag boundaries

32 Semi-Structured Documents
- What has Dr Brown authored?
- What cites Dr Brown?
- In what papers does Dr Brown self cite?
<article>
  <tig>
    <au> <fnm>J.M.</fnm> <snm>Brown</snm> </au>
    <atl>Real-time Process Control</atl>
    <ti>Annal Hist Comput</ti>
    <obi> <volno>117</volno> <issno>1</issno> </obi>
    <pp>3-3</pp>
  </tig>
  <bb>
    <au> <fnm>J.M.</fnm> <snm>Brown</snm> </au>
    <atl>Computer Controlled Processing</atl>
    <ti>Chem Eng Prog</ti>
    <obi> <volno>156</volno> <issno>5</issno> </obi>
    <pp>63-67</pp>
  </bb>
</article>

33 Region Algebra: Series of Filters
- The document collection is divided into a series of contiguous extents, one per tag (the extents are chosen by the indexer)
- Each extent is represented: (start, end)
- Each series of extents is a list: [(start, end), ...]
- Figure: extent lists for <collection>, <document>, <section> and <title>, drawn against this collection:
<collection>
  <document>
    <section> <title> </title> </section>
    <section> </section>
    <section> </section>
  </document>
  <document>
    <title> </title>
    <section> </section>
    <section> </section>
  </document>
</collection>

34 Series of Filters
- Structures are represented as contiguous extents: [(start, end), (start, end), ...], e.g. <title>
- Terms are represented as contiguous extents: [(start, start + 1), ...], e.g. coli
- Phrases are represented as contiguous extents: [(start, start + len), ...], e.g. "local enquiry"
- All operations on the collection can now be expressed as a series of filters

35 Series of Filters
- Find coli in <title>
  - Load the extent list for coli
  - Load the extent list for <title>
  - Filter coli by <title>
  - The coli extents that fall inside a <title> extent are the answer

36 Series of Filters
- Find coli in <title> in <section>
  - As above, filter coli by <title> (coli in <title>)
  - Now filter the result by <section>
  - The surviving extents are the answer

37 Series of Filters
- Problem: <a> in <b> or <b> in <a>
  - "coli in <title> in <section>" is not the same as "coli in <section> in <title>"
  - coli in <title> in <section>: filter coli by <title>, then by <section>
  - coli in <section> in <title>: filter coli by <section>, then by <title>
  - But the series of filters gives the same result!
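
The filter step, and the ordering problem, can be sketched as follows. The extents are made-up positions, and the function assumes each list is sorted and that the outer extents do not nest or overlap (the flat case).

```python
def filter_by(inner, outer):
    """Keep the extents of `inner` that lie wholly inside some extent of `outer`."""
    result = []
    j = 0
    for start, end in inner:
        # Skip outer extents that finish before this inner extent does.
        while j < len(outer) and outer[j][1] < end:
            j += 1
        if j < len(outer) and outer[j][0] <= start:
            result.append((start, end))
    return result


# Hypothetical extents as (start, end) word positions.
coli = [(5, 6), (20, 21), (40, 41)]
title = [(3, 8), (38, 45)]
section = [(1, 30)]

title_then_section = filter_by(filter_by(coli, title), section)
section_then_title = filter_by(filter_by(coli, section), title)
```

Both orders return the same extents, because each filter only tests containment and keeps no record of which structure encloses which.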

38 Series of Filters
- Problem: Self-containing structures
    <b>the <b>brown</b> cows</b>
  - Where is the beginning and end of the <b> extent?
    - <b> <b> = (start, start, end, end)
  - How is the query "brown in <b> in <b>" resolved?
- Problem: Frequent structures
  - Some structures are very common (e.g. <p> tags)
  - These tags can be more common than the most common words
  - The index can become clogged with these extents

39 Series of Filters
- These problems have been addressed by keeping not only the extent for a tag, but also the depth
- Problematically, this increases the size of the postings list even further and adds to the computational cost of calculating the results

40 eBay
- The eBay search engine sometimes does this
- Terms are stored with term frequencies and word positions:
  - Lectures <d,tf,w,w,w>,<d,tf,w,w>, ...
- Zone markers are stored as (start, length) rather than (start, end):
  - zone_markers:title <d,tf,s,l,s,l>,<d,tf,s,l>, ...
- If the d's match and a w lies between s and s+l, then d is a matching document
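
A sketch of the matching rule, assuming a zone of (s, l) covers positions s up to but not including s + l. The dictionaries stand in for the flat <d,tf,...> postings, and all positions are invented for the example.

```python
def term_in_zone(term_postings, zone_postings):
    """Documents where at least one term position falls inside a zone extent.

    term_postings: {doc: [w, ...]}      word positions of the term
    zone_postings: {doc: [(s, l), ...]} zone start and length
    """
    matches = []
    for d, positions in term_postings.items():
        for s, l in zone_postings.get(d, []):
            if any(s <= w < s + l for w in positions):
                matches.append(d)
                break  # one hit is enough for this document
    return sorted(matches)


# Hypothetical postings for "lectures" and for the title zone.
lectures = {1: [4, 17, 30], 2: [8, 9]}
title_zone = {1: [(1, 5)], 2: [(1, 3), (20, 4)]}
matching_docs = term_in_zone(lectures, title_zone)
```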

41 Embedded Paths
- In place of each <d_n, f_n> in the postings, additionally embed the location in the document: <d_n, p_n, f_n>
- Many variants exist
- Example documents:
  <doc> <sec> <p>Fox in Sox</p> </sec> </doc>
  <doc> <sec> <p>The Cat in the Hat</p> </sec> <sec> <p>Comes Back</p> </sec> </doc>
  <doc> <sec> <p>Green Eggs</p> <p>and Ham</p> </sec> </doc>
- Structure tree labels: doc:1, sec:2, p:3, sec:4, p:5, p:6
- Example postings
  - green: <3,3,1>
  - ham: <3,6,1>
  - hat: <2,3,1>
  - in: <1,3,1><2,3,1>

42 Searching Embedded Paths
- Determine which nodes of the tree are relevant and select postings from only those positions
  - Query: Fox in <p>
  - <p> is structures (3, 5, 6)
  - Load postings for fox, use only those where p_n = 3, 5 or 6
- Self-containing structures
  - As the entire structure is represented, self-containing structures are supported
- Frequent structures
  - Each structure is represented once and each posting is a fixed size
- Figure: the structure tree with nodes doc:1, sec:2, sec:4, p:3, p:5, p:6
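
A sketch of searching embedded-path postings with the example collection. The parent links in `TREE` are one reading of the example documents (p:6 is taken to be the second <p> of the first <sec>), and the posting for "fox" is derived from Doc1 rather than listed on the slides; both are assumptions.

```python
# Structure tree: node id -> (tag, parent node id)
TREE = {
    1: ("doc", None),
    2: ("sec", 1),
    3: ("p", 2),
    4: ("sec", 1),
    5: ("p", 4),
    6: ("p", 2),
}

# Postings as (document, tree node, term frequency)
POSTINGS = {
    "fox": [(1, 3, 1)],
    "green": [(3, 3, 1)],
    "ham": [(3, 6, 1)],
    "hat": [(2, 3, 1)],
    "in": [(1, 3, 1), (2, 3, 1)],
}


def inside(node, tag):
    """True if the node, or one of its ancestors, has the given tag."""
    while node is not None:
        node_tag, parent = TREE[node]
        if node_tag == tag:
            return True
        node = parent
    return False


def search(term, tag):
    """Return (document, frequency) for postings located inside the tag."""
    return [(d, f) for d, p, f in POSTINGS.get(term, []) if inside(p, tag)]


fox_in_p = search("fox", "p")
```

Because the tree is walked upwards, a query such as ham in <sec> also matches postings attached to a <p> below that <sec>.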

43 Embedded Paths
- Additionally, <d_n, p_n, f_n> can be converted into <d_n, f_n> by collecting all occurrences of the term where d_n is the same
- Ranking
  - Because postings can be converted into <d_n, f_n> we can rank whole documents using any of the standard ranking functions
- Various optimizations to store the embedded paths for fast processing exist:
  - term <p_n> <d_pn, f_pn><d_pn, f_pn><d_pn, f_pn><d_pn, f_pn> <p_n> <d_pn, f_pn> ...

44 Embedded Paths
- The problem with the embedded path approach is that the whole tree for the entire collection must be kept in memory
  - This does not scale for semi-structured free-text data such as HTML
- One solution to this is to trim the tree by ignoring tags:
  - Smaller than some length
    - Smaller than, say, 50 words
  - Pre-determined to be unlikely to be useful
    - <i>, <b>, <firstname>, etc.

45 Dewey Decimal Codes
- Originally from libraries, Dewey is a hierarchical classification scheme
- The tree from a single document can be labelled using a similar scheme
  - The root is 1, the children are 1:1, 1:2, etc.
  - Their children are 1:1:1, 1:1:2, etc.
- These codes are stored directly in the postings, <d_n, p_n, f_n>, where p_n is the Dewey code
- From p_n it is possible to know:
  - Which postings are for the same node (p_n1 = p_n2)
  - The child / parent relationship between postings
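
The relationships that can be read off a Dewey code can be sketched with two small helpers. The colon-separated string representation follows the slide; the function names are illustrative.

```python
def is_ancestor(a, b):
    """True if Dewey code `a` labels a proper ancestor of the node labelled `b`."""
    parts_a, parts_b = a.split(":"), b.split(":")
    return len(parts_a) < len(parts_b) and parts_b[: len(parts_a)] == parts_a


def parent(code):
    """Dewey code of the parent node, or None for the root."""
    parts = code.split(":")
    return ":".join(parts[:-1]) if len(parts) > 1 else None
```

Comparing components rather than raw string prefixes matters: "1:1" is an ancestor of "1:1:2" but not of "1:12".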

46 Dewey Decimal Codes
- The problem with the Dewey encoding of postings is that the Dewey codes can get very long
- They can get so long that more space is spent storing the codes than storing the postings

47 Structured IR
- It is an unfortunate state of affairs that the three best systems for semi-structured IR all have scalability problems
- The best solution to the problem is not known
- One possible hack is to search for documents first and then apply a post-filter to each document
  - This post-filter can result in a linear search over the entire collection (remember signature files?)

48 Summary
- Structured Information Retrieval
  - Structured documents
  - Meta-data
  - Semi-structured documents
- Methods of searching
  - The best method is not currently known
