信息检索与搜索引擎 Introduction to Information Retrieval GESC1007


Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities. philfv8@yahoo.com. Spring 2018

Last week
What is information retrieval (信息检索)?
We discussed the «Boolean retrieval model» (布尔检索模型): searching documents using terms and Boolean operators (e.g. AND, OR, NOT).
QQ Group: 623881278. The PPT slides are available on the course website.

Course schedule (日程安排)
Lecture 1: Introduction; Boolean retrieval (布尔检索模型)
Lecture 2: Term vocabulary and posting lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores, and a complete search system
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, and conclusion

An exercise
This is an exercise that you can do at home if you want to review what we learnt last week.
b. Draw the dictionary (also called the inverted index representation) for this collection.
c. What are the returned results for these queries?
schizophrenia AND drug
for AND NOT (drug OR approach)

Introduction
To be able to search for documents quickly, we need to create an index (索引).
What kind of index?
Term-document matrix (关联矩阵)
Dictionary (词典), also called an inverted index (倒排索引)
There are four steps to create an index.

How to create an index?
Step 1: Collect the documents to be indexed: Book1, Book2, Book3, ..., Book100.
Step 2: Tokenize the text (标记文本): separate it into words.
Book1: «The city of Shenzhen is located in China» → token1, token2, ..., token7, token8

How to create an index?
Step 3: Linguistic preprocessing (语言的预处理). Keep only the terms that are useful for indexing documents.
«The city of Shenzhen is located in China» → token1, token2, ..., token7, token8
During this step, words can also be transformed if necessary:
friends → friend
wolves → wolf
eaten → eat

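How to create an index?
Step 4: Create the dictionary.
The terms City, Shenzhen, Located and China become the entries of the dictionary. Each entry points to the list of documents that contain the term.
Example lists of documents: Book1, Book2, Book 7, Book 10, ...; Book1, Book3, Book 5, Book 9, ...; Book1, Book 20, Book 34, ...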

How to create an index?
The index has been created! It can then be used to search documents.
Dictionary (terms in sorted order): China, City, Located, Shenzhen
Example lists of documents: Book1, Book 20, Book 34, ...; Book1, Book2, Book 7, Book 20, ...; Book1, Book3, Book 5, Book 9, ...
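The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list `STOP_WORDS`, the table `BASE_FORMS` and the second document are hypothetical, and tokenization is done by simple whitespace splitting.

```python
from collections import defaultdict

# Hypothetical stop list and base-form table, for illustration only.
STOP_WORDS = {"the", "of", "is", "in"}
BASE_FORMS = {"friends": "friend", "wolves": "wolf", "eaten": "eat"}

def build_index(documents):
    """Build a dictionary mapping each term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():      # Step 1: the collected documents
        for token in text.split():              # Step 2: tokenize on whitespace
            term = token.strip(".,!?").lower()  # Step 3: linguistic preprocessing
            term = BASE_FORMS.get(term, term)
            if term and term not in STOP_WORDS:
                postings[term].add(doc_id)      # Step 4: add to the dictionary
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    "Book1": "The city of Shenzhen is located in China",
    "Book2": "The city is large",  # hypothetical second document
}
index = build_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

The terms are stored in sorted order, as in the dictionary shown on the slide, and each term points to the list of documents that contain it.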

CHAPTER 2: TERM VOCABULARY AND POSTING LISTS (p. 56)

In Chapter 2
We will discuss:
Reading documents (2.1)
Tokenization (标记化) and linguistic processing (2.2)
Posting lists (2.3)
An extended model to handle phrase and proximity queries (2.4), e.g. «City of Shenzhen»

2.1 Reading digital documents
Data (数据) stored in computers are represented as bits (比特). To read documents, an IR system must convert these bits into characters.
01001000 01100101 01101100 01101100 01101111 → Hello
http://www.binaryhexconverter.com/ascii-text-to-binary-converter

Reading documents (2)
How to convert from bits to characters? There exist several encodings (文本编码) such as ASCII and UTF-8:
01001000 → H
01100101 → e
01101100 → l
01101100 → l
01101111 → o
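As a small sketch of this conversion, the bits from the slide can be decoded in Python. The Chinese string used at the end is only an illustrative assumption, added to show that UTF-8 may need several bytes per character.

```python
# The five bytes shown on the slide, written as groups of eight bits.
bits = "01001000 01100101 01101100 01101100 01101111"

# Convert each group of bits to a number, then interpret the bytes as ASCII.
raw = bytes(int(group, 2) for group in bits.split())
text = raw.decode("ascii")
print(text)

# ASCII text is also valid UTF-8, but UTF-8 uses several bytes
# for characters outside ASCII (illustrative example).
utf8_length = len("深圳".encode("utf-8"))
print(utf8_length)
```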

Reading documents (3)
An IR system will only extract the relevant content (相关内容) from a document (e.g. the text). For example, in a webpage (网页), the text (文本) is kept and the pictures (图片) can be ignored.

Reading documents (4)
In this course, we consider English documents. English is read from left to right.
Some other languages are more complex to read. For example, Arabic (阿拉伯语) mixes both left-to-right and right-to-left text, and some vowels (元音) are not written.
Creating an index is difficult for such languages!

Reading documents (5)
Some IR systems process each document individually, e.g. indexing each e-mail individually.
Some IR systems process documents as groups, e.g. indexing all the e-mails of a given day together.

Reading documents (6)
It is also important to choose the granularity (粒度) carefully.
Should we index a book as a single document? It can be a bad idea! For example, suppose we search for books about «Food from China», but «Food» appears only in the first chapter of a book and «China» appears only in its last chapter. The book matches the query, but it is probably not about food from China.
Or should we index each chapter of the book separately?

2.2 Tokenization (1)
After reading a document, the next step is tokenization (标记化). This means splitting a text into pieces called tokens (标记) while throwing away some characters such as punctuation (标点符号).
A text → tokenization (标记化) → Token1, Token2, Token3, Token4, Token5, Token6, Token7

Tokenization (2)
A token is a sequence of characters (字符) appearing at a specific location in a document. Two tokens that are identical are said to be of the same type.
Example: «This house is close to my house.» The two occurrences of «house» are two tokens of the same type.

Tokenization (3)
Naive approach for tokenization (幼稚的方法):
Remove punctuation.
Split the text according to the whitespaces (空格).

Tokenization (3)
This approach has some problems, e.g. «Mr. O'Neill and his friends aren't».
How should «O'Neill» and «aren't» be tokenized? Each of them can be split in several ways. Which one is better?

Tokenization (4)
In general, choosing how to tokenize a text influences how we can search for documents, e.g. «Mr. O'Neill and his friends aren't».
If «aren't» is considered to be a single token, then a person who searches for the term «are» may not find the document.

Tokenization (5)
e.g. «Mr. O'Neill and his friends aren't»
If «aren't» is considered to be two tokens («are» and «n't»), then a person who searches for «aren't» may not find the document.
Solution:
1. Tokenize the documents.
2. Tokenize the queries of users in the same way.
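A minimal sketch of the naive approach (remove punctuation, then split on whitespace) is given below. The function name `naive_tokenize` is hypothetical. It also shows why the same tokenizer should be applied to documents and to queries.

```python
import string

def naive_tokenize(text):
    """Naive tokenization: remove punctuation, then split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

doc_tokens = naive_tokenize("Mr. O'Neill and his friends aren't")
print(doc_tokens)

# Tokenizing the query with the same function gives a token that matches
# the document token, although neither is a real English word.
query_tokens = naive_tokenize("aren't")
print(query_tokens[0] in doc_tokens)
```

Removing the apostrophe turns «O'Neill» into «ONeill» and «aren't» into «arent». These tokens are not ideal, but a query processed in the same way still matches the document.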

Tokenization (6)
In general, tokenization is different for each language. For this reason, it is useful to first identify the language of a document before performing tokenization and indexing.
In Chinese, a difficulty is that there are no whitespaces (空格) between words, e.g. 我喜欢这节课 (I like this class).

Tokenization (6)
Word segmentation (分词) is the process of dividing a text into words.
In Chinese, there are some ambiguities (歧义): should 和尚 be read as «monk», or as «and» + «still»?
Simple solution: find the longest words.
Other solutions: use Markov models and other techniques.
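The «find the longest words» idea can be sketched as a greedy forward matching procedure. This is a simplified illustration, assuming a small hypothetical vocabulary `vocab`. Real segmenters use much larger dictionaries or statistical models.

```python
def max_match(text, vocabulary):
    """Greedy segmentation: at each position, take the longest known word."""
    longest = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no known word matches.
            if candidate in vocabulary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical vocabulary, for illustration only.
vocab = {"我", "喜欢", "这", "节", "课", "和", "尚", "和尚"}
print(max_match("我喜欢这节课", vocab))
print(max_match("和尚", vocab))
```

With this strategy, 和尚 is always kept as one word («monk»), which is the right choice in most texts but not in all of them.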

Tokenization (7)
In English, there are whitespaces between words. But splitting a text using whitespaces may cause problems:
«San Francisco» is the name of a city (it should not be considered as two tokens).
«1st January 2016» is a date.
«Hunan University» should be considered as a single token.

Tokenization (8)
A solution: for a given query such as «Hunan University», a search engine can retrieve documents using all the different tokenizations (the two tokens «Hunan» and «University», and the single token «HunanUniversity») and combine the results.

Tokenization (9)
In many languages, there are some unusual tokens, e.g.:
B-52 is an aircraft.
C++ and C# are programming languages (编程语言).
M*A*S*H is the name of a TV show (电视节目).
http://www.hitsz.edu.cn is a web page.
It is important to consider these special tokens.

Tokenization (10)
Some tokens can be ignored because it is unlikely that someone will search for them: amounts of money (e.g. 56 元) and numbers (e.g. 56.7869).
Advantage: this reduces the size of the dictionary.
Disadvantage: we cannot search for the tokens that are ignored.

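Removing common words
In text documents, there are some words that are very common and may not be useful for retrieving documents.
In English, 25 common words are, for example, those of the stop list given in Manning et al. (2008): a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with.
Such words are called «stop words» (停用词).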

Removing common words (2)
These words can be ignored when indexing documents. In general, this will not cause problems when searching for documents.
However, stop words are useful when searching for phrases (短语), i.e. consecutive words. For example, «Airplane tickets to Beijing» is more precise than «Airplane AND tickets AND Beijing».

Removing common words (3)
Removing stop words results in a smaller index, but it does not make a big difference in terms of performance (speed).
Most Web search engines do not remove stop words. Instead, they use other strategies to cope with common words, based on statistics about words.
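Stop word removal can be sketched as a simple filter over the tokens. The short stop list below is a hypothetical example, not the list used by any particular system.

```python
# Hypothetical short stop list, for illustration only.
STOP_WORDS = frozenset({"a", "and", "in", "is", "of", "the", "to"})

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

kept = remove_stop_words("Airplane tickets to Beijing".split())
print(kept)
```

After filtering, the word «to» is gone, so the index can no longer distinguish the phrase «Airplane tickets to Beijing» from any other document containing the three remaining words.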

Normalization (规范化)
When a person enters a query in a search engine, e.g. the user (用户) types the query (查询) «cars shenzhen», the IR system will also «tokenize» the query.

Normalization (规范化) (2)
It is possible that the tokens obtained from the query, e.g. «cars» and «shenzhen», do not match the tokens from documents.

Normalization (规范化) (3)
Example:
«cars» is used instead of «car», but these two tokens refer to the same object.
«cars» is used instead of «automobile», but these two tokens have the same meaning (they are synonyms, 同义词).

Normalization (规范化) (4)
Normalization (规范化) is the process of converting tokens to a standard form so that matches will occur despite small differences.
cars → car
car → automobile
windows → window (but «Windows» can also be the operating system)

Normalization: accents and diacritics
Diacritic (变音符): a sign written above or below a letter that indicates a difference in pronunciation, e.g. à, é, ê.
Should we just ignore them? In some languages, they are important. In Spanish:
peña = a cliff
pena = sorrow

Normalization: accents and diacritics
But it is possible that users will not type the diacritics, because they may be lazy or may not know how to type them on the computer.
Thus, a strategy is to remove them: peña → pena. The words «peña» (a cliff) and «pena» (sorrow) are then treated as the same token.
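One possible way to remove diacritics, shown here only as a sketch, is to use Unicode decomposition from the Python standard library. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(token):
    """Decompose each character, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("peña"))
print(strip_diacritics("àéê"))
```

The same function must be applied to documents and to queries, so that a user who types «pena» also finds documents containing «peña».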

Capitalization
Lower-case letters (小写): a, b, c, d, ...
Upper-case letters (大写): A, B, C, D, ...
A common strategy is to transform everything to lower-case letters:
Ferrari → ferrari
Australia → australia
This can be a good idea because people often do not type upper-case letters when searching for documents.

Capitalization
But sometimes capitalization is important:
Bush: a person named «Bush» (布什); bush: a bush (灌木)
C.A.T: a company; cat: an animal (猫)

Capitalization
A good solution for English: convert the first letter of a sentence to a lower-case letter.
«Saturday, Jim went out to eat something.» → «saturday Jim went out to eat something»
This is not a perfect solution, but it works most of the time.
However, as mentioned, users may not type the upper-case letters anyway. Thus, transforming everything to lower-case is often the best solution.
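The two capitalization strategies can be compared with a short sketch. The function name `lowercase_sentence_start` is hypothetical, and the sentence is assumed to be already separated from the rest of the text.

```python
def lowercase_sentence_start(sentence):
    """Lower-case only the first word of a sentence; keep the other words."""
    words = sentence.split()
    if words:
        words[0] = words[0].lower()
    return " ".join(words)

sentence = "Saturday, Jim went out to eat something."
first_only = lowercase_sentence_start(sentence)  # keeps the name «Jim»
everything = sentence.lower()                    # lower-cases «Jim» too
print(first_only)
print(everything)
```

The first strategy keeps the capital letter of names such as «Jim», but it also lower-cases «Saturday», which shows why it is not a perfect solution.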

Other issues in English
British spelling vs American spelling: colour / color
Dates can be written in several ways: 3/12/16, 3rd March 2016, Mar. 3, 2016

Lemmatization
Sometimes the same word may have different forms: organize, organizes, organizing.
Lemmatization: converting a word to a common base form called the lemma.
am, are, is → be
car, cars, car's, cars' → car («car» is the lemma for «car», «cars», ...)

Lemmatization (2)
How to perform lemmatization? A simple way, called stemming, consists of removing the end of words:
cars → car
airplanes → airplane
But it may give some incorrect results:
saw → s
The result should be «see»!
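To illustrate the idea of removing the end of words, here is a toy stemmer with a single rule. It is a hypothetical sketch, much simpler than real stemmers such as the Porter Stemmer.

```python
def naive_stem(word):
    """Toy stemmer: remove a final «s» from words longer than three letters."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for word in ["cars", "airplanes", "wolves", "saw"]:
    print(word, "->", naive_stem(word))
```

The rule works for «cars» and «airplanes», but it turns «wolves» into «wolve» instead of «wolf», and it cannot map «saw» to «see». Such cases require a real lemmatizer that analyzes how words are used.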

Lemmatization (2)
If we want to perform lemmatization in a better way, it is necessary to analyze how the words are used in the text. This can be quite complicated.
There exists software (free or commercial) to analyze texts and perform stemming for different languages.
For English: the Porter Stemmer, http://www.tartarus.org/~martin/PorterStemmer/

[Figure: example of applying the Porter Stemmer to a text]

Lemmatization (3)
In some cases, lemmatization can help to provide better results when searching for documents.
But in some other cases, it does not help and leads to worse results.
Thus, lemmatization may not always be used in practice. An example of such a problem is given next.

Lemmatization (5)
Example: the Porter Stemmer converts all these words to «oper»:
operate, operating, operates, operation, operative, operatives, operational
But these words have different meanings.

Lemmatization (6)
In general, applying lemmatization allows users to find more documents using an information retrieval system. But these documents may be less relevant.
In other words, lemmatization may increase recall and decrease precision.

Precision (准确率)
Precision: what fraction of the returned results are relevant to the information need?
Example: a person searches for webpages about Beijing. The search engine returns 5 relevant webpages and 5 irrelevant webpages.
Precision = 5 / 10 = 0.5 (50 %)

Recall (召回)
Recall: what fraction of the relevant documents in a collection were returned by the system?
Example: a database contains 1000 documents about HITSZ. The user searches for documents about HITSZ. Only 100 documents about HITSZ are retrieved.
Recall = 100 / 1000 = 0.1 (10 %)
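The two measures can be written as small functions and checked against the two examples above. The function and parameter names are chosen only for this sketch.

```python
def precision(relevant_returned, total_returned):
    """Fraction of the returned results that are relevant."""
    return relevant_returned / total_returned

def recall(relevant_returned, total_relevant):
    """Fraction of the relevant documents that were returned."""
    return relevant_returned / total_relevant

# Beijing example: 5 relevant webpages among 10 returned webpages.
print(precision(5, 10))
# HITSZ example: 100 retrieved out of 1000 relevant documents.
print(recall(100, 1000))
```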

2.3 HOW TO SEARCH FASTER USING A DICTIONARY

Introduction
Last week, we saw how we can use a dictionary to search for documents. Here is an example.

Example
QUERY: CITY AND CHINA
Dictionary terms: City, Shenzhen, Located, China
City → Book1, Book2, Book 10, ..., Book 20
China → Book1, Book 20, ...
We need to compute the intersection (交集) of the two lists. To do that, we compare both lists, posting by posting.
RESULT: Book 1, Book 20
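Comparing two lists posting by posting can be sketched as a merge of two sorted lists. In this sketch, documents are represented by their book numbers, and the two lists are illustrative values consistent with the example.

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists by comparing them posting by posting."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # the document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance in the list with the smaller ID
        else:
            j += 1
    return answer

city = [1, 2, 10, 20]   # illustrative posting list for «City»
china = [1, 20, 34]     # illustrative posting list for «China»
print(intersect(city, china))
```

Each list is read at most once, which is why the posting lists must be kept sorted.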

How to search faster?
There are some techniques to allow faster search. One such technique is to use skip pointers. We will see the main idea (without the details).

Example
QUERY: CITY AND CHINA
The same two lists (City and China) are intersected as in the previous example, this time with skip pointers.
RESULT: Book 1, Book 20

Skip pointers
The idea is to use some «shortcuts» (arrows) to skip some entries when comparing lists. By doing this, we can compare lists of documents faster (we don't need to read the lists completely).
This is just the main idea. We will not discuss the technical details!
This idea only works for queries using the AND operator (it does not work for OR).
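For readers who want to see one possible form of the details, here is a sketch of an intersection with skip pointers. It assumes that a skip pointer is placed at every k-th posting, with k close to the square root of the list length. This is a common heuristic, used here only as an assumption. The posting lists are hypothetical.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists, following skip pointers when possible."""
    skip1 = max(1, int(math.sqrt(len(p1))))  # assumed distance between skips
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the shortcut only if its target does not go past p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# Hypothetical posting lists (document numbers in increasing order).
list_a = [2, 4, 8, 16, 19, 23, 28, 43]
list_b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(list_a, list_b))
```

A shortcut is taken only when the skipped postings are all smaller than the current posting of the other list, so no common document can be missed.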

2.4 PHRASE QUERIES

Phrase query (精确查询)
Phrase query: a query where the words must appear consecutively (one after the other) in documents, e.g. «Harbin Institute of Technology».
This query is written with quotes (" "). It will find all documents containing these words one after the other.
This type of query is not supported by all Web search engines.

Phrase query (2)
Some Web search engines will instead consider the proximity between words in documents. Documents where the words from a query appear closer to each other will be preferred to other documents.
How to answer a phrase query?

Biword indexes
A solution is to consider each pair of consecutive terms in a document as a term.
«I walked in Beijing» → «I walked», «walked in», «in Beijing»
Those terms are called «biwords». The biwords can be used to create an index that we call a «biword index».

Illustration of a biword index
Dictionary:
I walked → Book1, Book5, Book 10, ..., Book 20
walked in → Book1, Book7, ...
in Beijing → Book1, Book 12, ...

Biword indexes
Using a biword index, we can search using the biwords.
A query: «Harbin Institute» AND «Institute of» AND «of Technology»
This query would work pretty well. But it could still find documents where the words of the phrase «Harbin Institute of Technology» do not appear consecutively.
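The construction of biwords, and the problem mentioned above, can be sketched as follows. The function name `biwords` and the example document are hypothetical.

```python
def biwords(text):
    """Return each pair of consecutive words of a text as one term."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

print(biwords("I walked in Beijing"))

# A phrase query is answered with an AND over the biwords of the phrase.
query = biwords("Harbin Institute of Technology")

# Hypothetical document that contains every biword of the query
# although the phrase itself never appears in it.
document = "the Harbin Institute of Science and the School of Technology"
matches_all_biwords = all(term in biwords(document) for term in query)
contains_phrase = "Harbin Institute of Technology" in document
print(matches_all_biwords, contains_phrase)
```

The document is returned by the biword query even though it does not contain the phrase, which is the kind of incorrect result described above.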

Biword indexes
How to solve this problem? A solution is to generalize the concept of biword index to more than two words (e.g. three words). Then, we may find more relevant documents.
But a problem is that the index would become much larger (there will be more entries in the dictionary).

Positional indexes (位置索引)
A better solution is to use another type of index called a positional index.
Positional index: a dictionary where the positions of terms in documents are stored.
Dictionary terms: City, Shenzhen, Located, China
City → Book1 (3, 25, 38), Book20 (4, 100, 1000)
Shenzhen → Book1 (2, 24, 35), Book20 (3, 500)
This indicates that «Shenzhen» appears as the 2nd, 24th and 35th word in Book1, and as the 3rd and 500th word in Book20.

Positional indexes
Positional indexes can be used to answer phrase queries.

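Example
Phrase query: «Shenzhen City»
City → Book1 (3, 25, 38), Book20 (4, 100, 1000)
Shenzhen → Book1 (2, 24, 35), Book20 (3, 500)
Result: Book 1 and Book 20. In both books, «City» appears at the position just after «Shenzhen» (e.g. positions 2 and 3 in Book1, positions 3 and 4 in Book20).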
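Answering a two-word phrase query with a positional index can be sketched as follows. The assignment of the position lists to «City» and «Shenzhen» is an assumption that follows the positions stated on the slides, and the function name `phrase_query` is hypothetical.

```python
# Positional index: term -> document -> positions of the term in the document.
positional_index = {
    "city": {"Book1": [3, 25, 38], "Book20": [4, 100, 1000]},
    "shenzhen": {"Book1": [2, 24, 35], "Book20": [3, 500]},
}

def phrase_query(first, second, index):
    """Return the documents where `second` appears just after `first`."""
    results = []
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    for doc, positions in docs_first.items():
        if doc in docs_second:
            next_positions = set(docs_second[doc])
            if any(pos + 1 in next_positions for pos in positions):
                results.append(doc)
    return results

print(phrase_query("shenzhen", "city", positional_index))
print(phrase_query("city", "shenzhen", positional_index))
```

The order of the words matters: «Shenzhen City» is found in both books, while «City Shenzhen» is found in none of them.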

Positional indexes
Positional indexes can also be used to answer proximity queries, e.g. «Shenzhen (within five words of) City».
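A proximity test on the positions of two terms inside one document can be sketched like this. The function name `within_k` is hypothetical, and the position lists are illustrative values.

```python
def within_k(positions_a, positions_b, k):
    """True if some occurrence of term A is at most k words from term B."""
    return any(abs(a - b) <= k for a in positions_a for b in positions_b)

# Illustrative positions of «Shenzhen» and «City» in one document.
shenzhen_positions = [2, 24, 35]
city_positions = [3, 25, 38]
print(within_k(shenzhen_positions, city_positions, 5))

# Illustrative positions that are far from each other.
print(within_k([3, 500], [100, 1000], 5))
```

This simple version compares every pair of positions. Since the position lists are sorted, a real system could scan both lists once, in the same way as for the intersection of posting lists.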

Conclusion
Today, we have discussed in more detail how indexes are created: tokenization, normalization, lemmatization.
The PPT slides are on the website.
QQ Group: 623881278

References
Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.