SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano

INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1

PRESENTATION SCHEMA
- GOALS AND ARCHITECTURES OF INFORMATION RETRIEVAL SYSTEMS
- PHYSICAL AND LOGICAL STORAGE STRUCTURES
- AUTOMATIC TEXT ANALYSIS AND INDEX BUILDING
- INTERNET SEARCHING

INFORMATION MANAGEMENT TECHNOLOGIES
- DATA WAREHOUSE
- DECISION SUPPORT SYSTEMS
- DATA MINING
- INFORMATION SYSTEMS ANALYSIS
- DATA INTEGRATION
- DISTRIBUTED HETEROGENEOUS DATA MANAGEMENT
- WEB INFORMATION SYSTEMS
- REAL-TIME, MAIN MEMORY, TEMPORAL DATABASES
- NON STRUCTURED, SEMISTRUCTURED AND MULTIMEDIA INFORMATION
- EMBEDDED SYSTEMS
- MOBILE AND CONTEXT-AWARE COMPONENTS
- INFORMATION RETRIEVAL SYSTEMS

MANAGEMENT INFORMATION SYSTEMS
- INFORMATION: COMPLEX, HIGHLY STRUCTURED
- QUERIES: COMPLEX, MOSTLY RECURRENT
- UPDATES: FREQUENCY IS RANDOM, BUT HIGH; OFTEN ON-LINE
- USED TECHNOLOGY: DATABASE MANAGEMENT SYSTEMS

INFORMATION SEARCH
- INFORMATION
  - SIMPLE (authors, keywords, colours, patterns, ...)
  - POORLY STRUCTURED
- QUERIES
  - COMPLEX
  - CLAUSES ARE LOGICALLY CONNECTED
  - PARTIALLY SPECIFIED
  - ITERATIVE REFINEMENT
  - NOT FORESEEABLE

INFORMATION SEARCH
- UPDATES
  - MOSTLY PERIODIC, WITH LOW FREQUENCY
  - OFTEN OFF-LINE
- USED TECHNOLOGY
  - INDEXING AND SEARCHING BY KEYWORDS
  - DIRECT SEARCH ON TEXT: FULL TEXT, ABSTRACT, SIGNATURE

NON STRUCTURED INFORMATION
DOCUMENT: ANY COLLECTION OF INFORMATION THAT IS SEARCHABLE BY ITS CONTENT
- TEXTS
- STATISTICAL DATA
- IMAGES
- SOUNDS

FUNCTIONAL ARCHITECTURE OF AN INFORMATION RETRIEVAL SYSTEM (IRS)
(FIGURE: BLOCK DIAGRAM WITH THE FOLLOWING LABELS)
- QUERIES
- FORMAL LANGUAGE
- SEARCH FORMULATION PROCESS
- DOCUMENTS
- DOCUMENTS STORAGE PROCESS
- INDEXED DOCUMENTS
- SIMILARITY ASSESSMENT
- SIMILAR ITEMS EXTRACTION

DOCUMENT SPACE W.R.T. A QUERY RESULT
(FIGURE: THE SET OF ALL DOCUMENTS, WITH THE OVERLAPPING SUBSETS "RELEVANT" AND "RETRIEVED")
- RETRIEVED AND RELEVANT (RITRIL)
- RETRIEVED, BUT NOT RELEVANT (RITNRIL)
- NOT RETRIEVED, BUT RELEVANT (NRITRIL)
- NOT RETRIEVED AND NOT RELEVANT (NRITNRIL)

INFORMATION RETRIEVAL SYSTEMS
THE GOAL OF AN IRS IS TO EFFECTIVELY RETRIEVE ALL THE DOCUMENTS WHICH ARE RELEVANT TO A GIVEN QUERY, AND ONLY THEM
PERFORMANCE INDEXES
- RECALL: EFFECTIVENESS IN FINDING THE USEFUL MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RELEVANT DOCUMENTS)
  $$\mathrm{RECALL} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{NRITRIL}}$$
- PRECISION: EFFECTIVENESS IN REMOVING THE USELESS MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RETRIEVED DOCUMENTS)
  $$\mathrm{PRECISION} = \frac{\mathrm{RITRIL}}{\mathrm{RITRIL} + \mathrm{RITNRIL}}$$
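The two performance indexes can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document identifiers are assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents returned by the system
    relevant:  ids of the documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    ritril = len(retrieved & relevant)    # retrieved and relevant
    ritnril = len(retrieved - relevant)   # retrieved, but not relevant
    nritril = len(relevant - retrieved)   # relevant, but not retrieved
    precision = ritril / (ritril + ritnril) if retrieved else 0.0
    recall = ritril / (ritril + nritril) if relevant else 0.0
    return precision, recall


# Hypothetical query: 5 documents retrieved, 4 of them relevant,
# out of 20 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d1", "d2", "d3", "d4"] + [f"r{i}" for i in range(16)]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.20
```

The empty-set guards return 0.0 by convention; other conventions are possible when nothing is retrieved or nothing is relevant.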

INFORMATION RETRIEVAL SYSTEMS
(FIGURE: DOCUMENT SPACE PARTITIONED INTO THE REGIONS NRITNRIL, RITRIL, NRITRIL AND RITNRIL)
EXPERIMENTAL FINDING: THE USER IS (PSYCHOLOGICALLY) HAPPY WITH LOW RECALL VALUES (~20%), BUT HIGH PRECISION (~80%) IS REQUIRED

STORAGE STRUCTURES
THEY DEPEND ON THE PHYSICAL NATURE OF THE DOCUMENT (text, image, ...) AND ON THE INTENDED USAGE
- TEXT
  - INVERTED FILES: FOR EACH TERM OR ATTRIBUTE VALUE A DENSE INDEX TO THE FILE IS BUILT; THE SET OF ALL THE INDEXES CONSTITUTES THE INVERTED FILE
  - BIT MAPS
- GRAPHICS
  - QUADTREES OF DIFFERENT TYPES: THE IMAGE SPACE IS RECURSIVELY DECOMPOSED INTO SQUARES UNTIL A SQUARE CONTAINS A SINGLE MEANINGFUL ELEMENT; THE RESULTING TREE IS CODED AND STORED IN A COMPACT FORMAT
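A minimal sketch of an inverted file, assuming whitespace tokenization and integer document ids (both are simplifications chosen for the example, and `build_inverted_file` is a hypothetical name): each term maps to the list of documents that contain it.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy collection, invented for the example.
docs = {
    1: "data base management system",
    2: "information retrieval system",
    3: "data mining",
}
inverted = build_inverted_file(docs)
print(inverted["system"])  # [1, 2]
print(inverted["data"])    # [1, 3]
```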

INVERTED FILES
- PHYSICAL ARCHITECTURE
  - INVERTED FILE
  - DOCUMENT REPOSITORY
  - INVERSION INDEX
  - FILE SYSTEM
  - KEYWORDS (CONTROLLED VOCABULARY)
- LOGICAL STRUCTURE
  - THESAURUS: SYNONYMS, HOMONYMS, DIFFERENT SPELLINGS
  - SEMANTIC LINKS (CROSS REFERENCE, KWIC)
  - HIERARCHICAL RELATIONS (GENERAL.-SPECIAL.)

STAIRS STORAGE STRUCTURE
- DICTIONARY (TERMS): TERM, POINTER TO THE INVERSION FILE, POINTER TO SYNONYMS
- INVERSION FILE: # OF DOCUMENTS, # OF OCCURRENCES, OCCURR. 1, OCCURR. 2, ..., OCCURR. n
  - EACH OCCURRENCE: UPPER/LOWER CASE, NO. OF THE DOCUMENT, SECTION CODE, NO. OF THE SENTENCE, NO. OF THE WORD
- INDEX TO TEXT: DOCUMENT ADDRESS, PRIVACY CODE, FORMATTED FIELDS
- TEXT FILE: DOCUMENT HEADER, HEADER OF 1, TEXT 1, HEADER OF 2, TEXT 2, ...
FROM: SALTON 89

REGION QUADTREE
(FIGURE: AN IMAGE REGION DECOMPOSED INTO BLOCKS A-Q, WITH SUB-BLOCKS 37-40 AND 57-60, AND THE CORRESPONDING TREE)
FROM: SAMET 90
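A region quadtree can be sketched as a recursive decomposition of a square binary image. The version below is an assumption-laden illustration: it expects a side length that is a power of two, stops when a block is uniform, and uses a nested-list representation and NW, NE, SW, SE ordering chosen for the example, not taken from the figure.

```python
def build_quadtree(image, x=0, y=0, size=None):
    """Recursively decompose a square binary image into uniform blocks.

    Returns the pixel value (0 or 1) for a uniform block, otherwise a list
    of four subtrees in the order NW, NE, SW, SE.
    """
    if size is None:
        size = len(image)
    first = image[y][x]
    uniform = all(image[y + dy][x + dx] == first
                  for dy in range(size) for dx in range(size))
    if uniform:
        return first
    half = size // 2
    return [
        build_quadtree(image, x, y, half),                # NW quadrant
        build_quadtree(image, x + half, y, half),         # NE quadrant
        build_quadtree(image, x, y + half, half),         # SW quadrant
        build_quadtree(image, x + half, y + half, half),  # SE quadrant
    ]


def count_leaves(node):
    """Count the uniform blocks (leaves) of a quadtree."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


# Toy 4x4 image, invented for the example.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
tree = build_quadtree(image)
print(tree)                # [0, 1, 0, [1, 0, 0, 0]]
print(count_leaves(tree))  # 7
```

Only the south-east quadrant is mixed, so it is the only one split further; the 16 pixels are stored as 7 leaves.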

BITMAP SUPERIMPOSED CODING
- IN ITS BASIC FORM, EACH DOCUMENT IS REPRESENTED BY A ROW IN A BINARY ARRAY, THE COLUMNS OF WHICH REPRESENT THE b RELEVANT TERMS (very expensive)
- THE SUPERIMPOSED VARIANT CODES EACH DOCUMENT WITH A SHORTER (n << b) BIT STRING
  - RELEVANT TERMS ARE CODED WITH n-BIT STRINGS IN WHICH k (k < n) BITS ARE SET TO 1
  - THE TERM CODES OF A DOCUMENT ARE OR-ed TOGETHER TO PRODUCE ITS SIGNATURE
  - FALSE DROPS (i.e., coding synonyms) ARE GENERATED

BITMAP SUPERIMPOSED CODING
TERM        CODE
Data        0000 0010 0000 1000
base        0100 0010 0000 0000
management  0000 0100 0001 0000
system      0000 0000 0101 0000
SIGNATURE   0100 0110 0101 1000
IN LARGE DOCUMENT REPOSITORIES, DENSE INDEXES CAN BE BUILT ON THE MAIN TABLE
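The signature in the example is the bitwise OR of the four term codes. The sketch below reproduces it and shows how a query is matched against a signature. The function names and the code for the extra term `code` are hypothetical, introduced only to demonstrate a false drop.

```python
# Term codes from the example (n = 16 bits, k = 2 bits set per term).
TERM_CODES = {
    "data":       0b0000_0010_0000_1000,
    "base":       0b0100_0010_0000_0000,
    "management": 0b0000_0100_0001_0000,
    "system":     0b0000_0000_0101_0000,
    # Hypothetical term that does NOT occur in the document.
    "code":       0b0100_0000_0000_1000,
}


def signature(terms, codes=TERM_CODES):
    """Superimpose (OR together) the codes of the given terms."""
    sig = 0
    for term in terms:
        sig |= codes[term]
    return sig


def may_contain(doc_signature, query_terms, codes=TERM_CODES):
    """True if every bit of the query signature is set in the document signature."""
    query_sig = signature(query_terms, codes)
    return (doc_signature & query_sig) == query_sig


doc_sig = signature(["data", "base", "management", "system"])
print(format(doc_sig, "016b"))           # 0100011001011000
print(may_contain(doc_sig, ["system"]))  # True
print(may_contain(doc_sig, ["code"]))    # True: a false drop
```

The last call matches although `code` is not in the document, because its two bits happen to be set by `base` and `data`. Candidate documents therefore still have to be checked against the text.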

BITMAPS AND INVERTED FILES
- BITMAPS ARE PROFITABLY USED TO REPRESENT SHORT TEXTS WITH A MOSTLY HOMOGENEOUS VOCABULARY
- MEMORY OVERHEAD VERSUS THE NUMBER OF DOCUMENTS CONTAINING THE SAME KEY
  - BIT MAP: CONSTANT
  - INVERTED LISTS: LINEAR GROWTH
- WITH BITMAP ORGANIZATIONS, QUERY PROCESSING BECOMES A SIMPLE BINARY STRING MATCHING BETWEEN THE QUERY BITMAP AND THOSE OF THE DOCUMENTS

AUTOMATIC TEXT ANALYSIS
ITS GOAL IS TO EXTRACT THE TERMS TO BE INCLUDED IN THE INDEXES AND THEIR MUTUAL RELATIONSHIPS
- SINGLE TERMS (KWOC)
- TERMS IN CONTEXT (KWIC)
- EXHAUSTIVE INDEXING (> RECALL)
- SPECIFIC INDEXING (> PRECISION)
- DEEP INDEXING (> PERFORMANCE, > COST)
- SHALLOW INDEXING (< PERFORMANCE, < COST)

AUTOMATIC TEXT ANALYSIS
ZIPF LAW (least effort principle)
- ORDERING THE SET OF WORDS IN A TEXT BY DECREASING FREQUENCY (RANK), IT CAN BE OBSERVED THAT
  $$\mathrm{RANK}(i) \cdot \mathrm{FREQ}(i) = \mathrm{CONSTANT}$$
- FOR THE ENGLISH LANGUAGE: CONSTANT ≈ 0.1 (WITH FREQ EXPRESSED AS A RELATIVE FREQUENCY)
- 50% OF DISTINCT WORDS ARE FOUND ONLY ONCE
- 80% OF DISTINCT WORDS DO NOT APPEAR MORE THAN 4 TIMES
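The rank-frequency product can be checked on any text with a sketch like the following. The function names and the toy sentence are assumptions; a text this short only demonstrates the computation and is far too small to exhibit the law itself.

```python
from collections import Counter


def zipf_table(text):
    """Return rows (rank, word, relative_frequency, rank * relative_frequency)."""
    words = text.lower().split()
    total = len(words)
    ranked = Counter(words).most_common()  # sorted by decreasing count
    return [(rank, word, count / total, rank * count / total)
            for rank, (word, count) in enumerate(ranked, start=1)]


def fraction_occurring_once(text):
    """Fraction of distinct words that appear exactly once."""
    counts = Counter(text.lower().split())
    return sum(1 for c in counts.values() if c == 1) / len(counts)


text = "the cat and the dog and the bird"
for rank, word, freq, product in zipf_table(text):
    print(rank, word, round(freq, 3), round(product, 3))
print(fraction_occurring_once(text))  # 0.6
```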

OPERATIONS ON TEXT
- COMPRESSION
  - VARIABLE LENGTH CODES
    - MOST FREQUENT WORDS: SHORTER CODE
    - MOST FREQUENT LETTERS: SHORTER CODE
  - HUFFMAN CODE: 3 BITS FOR E, 10 BITS FOR Z, AVERAGE LENGTH: 4.12 BITS, 48% COMPRESSION
  - DIGRAMS, TRIGRAMS, ... CODING
- CRYPTOGRAPHY
  - REVERSIBLE TEXT TRANSFORMATION
  - INFORMATION PRIVACY
  - ACCESS RIGHTS
  - AUTHENTICATION
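A character-level Huffman code can be sketched with a priority queue, as below. This is an illustrative implementation built on the frequencies of the input string itself; it does not reproduce the code lengths quoted for English letters, and the names `huffman_codes`, `encode` and `decode` are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix property: first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "aaaaaaaabbbc"
codes = huffman_codes(text)
bits = encode(text, codes)
print(len(bits), "bits instead of", 8 * len(text))  # 16 bits instead of 96
```

The comparison with 8 bits per character assumes a fixed-length one-byte encoding of the original text.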

AUTOMATIC INDEXING
THE CHOICE OF INSERTING A TERM INTO AN INDEX IS TO BE MADE ON THE BASIS OF TWO PARAMETERS
- ITS RELEVANCE FOR IDENTIFYING A DOCUMENT: RECALL
- ITS WEIGHT FOR SINGLING OUT A DOCUMENT FROM A COLLECTION OF SIMILAR DOCUMENTS: PRECISION
TERM OCCURRENCE PROPERTIES IN A WHOLE COLLECTION OF N DOCUMENTS MUST BE EXAMINED
- THE MOST COMMON FUNCTIONAL TERMS ARE REMOVED (ARTICLES, PREPOSITIONS, ETC.): STOP LIST
- THE FREQUENCY $tf_{ij}$ OF THE REMAINING TERMS $T_j$ IN EACH DOCUMENT $D_i$ IS COMPUTED
- A THRESHOLD FREQUENCY T IS CHOSEN, AND TO EACH DOCUMENT $D_i$ ALL THE TERMS $T_j$ ARE ASSIGNED FOR WHICH $tf_{ij} > T$

AUTOMATIC INDEXING
TERMS WHICH ALLOW A GOOD INDEXING BOTH FOR RECALL AND PRECISION APPEAR
- OFTEN IN INDIVIDUAL DOCUMENTS
- SELDOM IN THE REMAINING COLLECTION
A GOOD PERFORMANCE INDEX IS THE WEIGHT
$$w_{ij} = tf_{ij} \cdot \log(N / df_j)$$
WHERE THE DOCUMENT FREQUENCY $df_j$ REPRESENTS THE NUMBER OF DOCUMENTS IN THE COLLECTION IN WHICH THE TERM $T_j$ APPEARS
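The weight can be computed directly from term counts. In this sketch the natural logarithm, the whitespace tokenization and the name `tfidf_weights` are assumptions, since the slide does not fix the base of the logarithm or the tokenizer.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return {doc_id: {term: w_ij}} with w_ij = tf_ij * log(N / df_j)."""
    n_docs = len(documents)
    term_freqs = {doc_id: Counter(text.lower().split())
                  for doc_id, text in documents.items()}
    doc_freq = Counter()  # df_j: number of documents containing term j
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_freqs.items()
    }


# Toy collection, invented for the example.
docs = {
    1: "retrieval retrieval index",
    2: "index file",
    3: "index quadtree",
}
weights = tfidf_weights(docs)
print(weights[1]["index"])      # 0.0: the term appears in every document
print(weights[1]["retrieval"])  # 2 * log(3), about 2.197
```

A term that occurs in all N documents gets weight zero, which matches the requirement that good index terms appear seldom in the rest of the collection.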

AUTOMATIC INDEXING
- INDEXING ON
  - TITLE ONLY
  - TITLE AND ABSTRACT (best cost/performance)
  - FULL TEXT
- PROCESS STEPS
  - REMOVE STOP WORDS
  - CREATE WORD STEMS BY REMOVING PREFIXES AND SUFFIXES
  - COALESCE EQUIVALENT STEMS (THESAURI)
  - WEIGHT REMAINING TERMS
  - APPLY POSSIBLE THRESHOLDS
  - INSERT REMAINING TERMS INTO THE INDEX
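The process steps can be strung together as in the sketch below. Everything specific here is a toy assumption: the stop list, the three suffix rules, the minimum stem length of three characters, the single thesaurus entry, and the use of the raw term frequency as the weight.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "is"}  # toy stop list
SUFFIXES = ("ing", "es", "s")       # toy suffix-stripping rules
THESAURUS = {"retriev": "search"}   # hypothetical: stem -> preferred equivalent stem


def stem(word):
    """Strip the first matching suffix, keeping stems of at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def index_terms(text, threshold=0):
    """Stop-word removal, stemming, coalescing, weighting, thresholding."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    stems = [stem(w) for w in words]
    stems = [THESAURUS.get(s, s) for s in stems]    # coalesce equivalent stems
    weights = Counter(stems)                        # weight = term frequency
    return {t: w for t, w in weights.items() if w > threshold}


def build_index(documents, threshold=0):
    """Insert the surviving terms of every document into the index."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, weight in index_terms(text, threshold).items():
            index[term][doc_id] = weight
    return dict(index)


# Toy documents, invented for the example.
docs = {
    1: "Searching the indexes of a retrieving system",
    2: "The king searches for indexing systems",
}
index = build_index(docs)
print(index["search"])  # {1: 2, 2: 1}
print(stem("king"))     # king (stripping "ing" would leave only "k")
```

Raising the threshold to 1 keeps only `search` for document 1, which illustrates how thresholds trade recall for precision.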

THESAURI
- THESAURI ALLOW A LARGER RECALL BY SUBSTITUTING OVERLY SPECIFIC TERMS WITH MORE COMMON SYNONYMS
- STEM USAGE REQUIRES THAT THE CORRECT LEXICAL RULES ARE FOLLOWED FOR EACH LANGUAGE (e.g. SUBSTITUTION OF THE FINAL I WITH Y)
- STEMS MUST BE AT LEAST THREE CHARACTERS LONG IN ORDER TO BE SIGNIFICANT (the progressive tense rule would truncate King TO K)

DOCUMENT SEARCH
- INTERACTIVITY
  - AFTER THE FIRST QUERY, THE SYSTEM SHOWS THE NUMBER OF RELEVANT DOCUMENTS
  - IN EACH FURTHER ITERATION, THE USER TRIES TO ENHANCE THE PRECISION UNTIL THE NUMBER OF RETRIEVED DOCUMENTS IS SMALL ENOUGH TO BE INSPECTED DIRECTLY
- RANKING
  - DOCUMENTS ARE PRESENTED IN RELEVANCE ORDER BASED ON WEIGHTS ASSIGNED TO THE DIFFERENT TERMS
- BROWSING
  - SIMILAR DOCUMENTS ARE GROUPED IN A SINGLE CLASS AND INSPECTED BY PROXIMITY

DOCUMENT SEARCH
- RELEVANCE FEEDBACK
  - THE SYSTEM INVITES THE USER TO EVALUATE THE RELEVANCE OF EACH RETRIEVED DOCUMENT
  - FROM THE ANSWERS, THE SYSTEM TUNES THE TERM WEIGHTS IN THE DOCUMENTS
- USER PROFILES
  - INFORMATION ABOUT MOST CONSULTED DOCUMENTS
  - RELEVANCE ANALYSIS RESULTS
  - INFORMATION ABOUT THE WORK CONTEXT
  - DYNAMIC MANAGEMENT IS NEEDED
  - CAN BE USED IN WORKING ENVIRONMENTS WITH WELL DEFINED, CUSTOMARY USERS

LANGUAGES FOR DOCUMENT SEARCHING
- QUERY LANGUAGES ARE MOSTLY BASED ON FUNDAMENTAL SET OPERATORS - AND, OR, NOT - AND THEIR COMBINATIONS
- SUPPLEMENTARY OPERATORS
  - TERMS ORDERING
  - TERMS CONTIGUITY
  - WILDCARDS (truncation or separation)
  - SEARCH FIELD (title, abstract, full text)
- OTHER COMMANDS
  - DOCUMENT DATA BANK CHOICE
  - THESAURUS INSPECTION
  - SEARCH RESULT MEMORIZATION
  - ...
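Because AND, OR and NOT are set operators, a Boolean query maps naturally onto set operations over an inverted index. The sketch below assumes a query written as nested tuples and a trailing `*` for right truncation; both notations, and the function name `evaluate`, are inventions for the example rather than the syntax of any real query language.

```python
def evaluate(query, index, all_docs):
    """Evaluate a Boolean query against an inverted index.

    query is either a term (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2) or ("NOT", q).
    A trailing "*" on a term is a right-truncation wildcard.
    """
    if isinstance(query, str):
        if query.endswith("*"):
            prefix = query[:-1]
            result = set()
            for term, docs in index.items():
                if term.startswith(prefix):
                    result |= docs
            return result
        return set(index.get(query, set()))
    op = query[0]
    if op == "AND":
        return evaluate(query[1], index, all_docs) & evaluate(query[2], index, all_docs)
    if op == "OR":
        return evaluate(query[1], index, all_docs) | evaluate(query[2], index, all_docs)
    if op == "NOT":
        return all_docs - evaluate(query[1], index, all_docs)
    raise ValueError(f"unknown operator: {op!r}")


# Toy inverted index: term -> set of document ids.
index = {
    "information": {1, 2},
    "retrieval": {1, 3},
    "retrieve": {4},
    "database": {2, 4},
}
all_docs = {1, 2, 3, 4}

print(evaluate(("AND", "information", "retrieval"), index, all_docs))        # {1}
print(evaluate(("AND", "retriev*", ("NOT", "database")), index, all_docs))   # {1, 3}
```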

NETWORK SEARCH
THE MAIN DIFFERENCES BETWEEN WEB SEARCHING AND TRADITIONAL INFORMATION RETRIEVAL ARE:
- HIGHER HETEROGENEITY OF WEB INFORMATION
- EXTREMELY LARGE DIMENSIONS OF THE SEARCH DOMAIN (year 2005)
  - $8 \times 10^{9}$ STATIC WEB PAGES AMOUNTING TO $10^{2}$ TBYTE
  - 1 MILLION NEW PAGES/DAY (very high volatility)
  - $140 \times 10^{3}$ SEARCHES/MINUTE (Google 2004)
- EVEN IF THE RECALL IS LARGE, ONLY THE VERY FIRST DOCUMENTS ARE EXAMINED
- OWING TO THEIR COMMERCIAL VALUE TO ADVERTISERS, SORTING AND RANKING ALGORITHMS ARE AMONG THE BEST KEPT INDUSTRIAL SECRETS!

NETWORK SEARCH
- SEARCH ENGINES USE CENTRALIZED SEARCH INDEXES WITH TREE CATEGORIZATION OF CONTENTS
  - BOTH CONTENT AND CONTEXT
  - EFFECTIVE DOCUMENT CLASSIFICATION
- PORTALS (SUBJECT GATEWAYS)
  - TRADITIONAL ENGINES INDEX INDIVIDUAL PAGES
  - A PORTAL, AMONG OTHER FEATURES, RECOGNIZES A DOCUMENT AS SUCH, AND IT KEEPS INFORMATION COHERENCE

SEARCH ENGINES
- DIRECTORY BASED (Magellan, ...)
  - KNOWLEDGE IS ORGANIZED INTO TREE STRUCTURES; WEB PAGES ARE CLASSIFIED ACCORDINGLY
  - CLASSIFICATION IS A HEAVY JOB
  - IF THE REQUIRED INFORMATION DOES NOT FALL INTO THE CLASSIFICATION, FINDING IT IS IMPOSSIBLE
- SPIDER BASED (Alta Vista, Lycos, Google, ...)
  - SPECIFIC PROGRAMS LOOK FOR EVERYTHING AND ORGANIZE THE TOPICS IN ANY WAY
  - THE SPIDER EXPLORES THE WEB AND FINDS THE PAGES
  - A DATABASE STORES THE RETRIEVED INFORMATION AND THE RELEVANCE SORTING ALGORITHMS
  - A USER INTERFACE ALLOWS QUERY FORMULATION AND RESULT PRESENTATION

SEARCH ENGINES
GOOGLE
- BORN AS A RESEARCH PRODUCT AT STANFORD
- IT USES AN INDEX WITH MORE THAN $10^{9}$ PAGES
- THE SPIDER ADDS ABOUT $10^{6}$ PAGES/DAY
- IT MANAGES 200 MILLION SEARCHES/DAY
- SEARCH RESULTS ARE EVALUATED BY MEANS OF PageRank TECHNOLOGY
  - RELEVANCE IS COMPUTED BY MEANS OF MATHEMATICAL FORMULAS WITH $500 \times 10^{6}$ VARIABLES AND $2 \times 10^{9}$ TERMS
  - IT ACCOUNTS BOTH FOR PAGE CONTENT AND FOR REFERENCES MADE FROM OTHER PAGES, CLASSIFIED AS TO RELEVANCE
  - IT TRIES TO AVOID USER INTERFERENCE IN RANKING
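The idea of ranking a page by the references made to it from other pages can be illustrated with the commonly published textbook formulation of PageRank, computed by power iteration. This is a sketch under stated assumptions, not the production algorithm: the damping factor of 0.85, the handling of pages without outgoing links, the iteration count and the four-page graph are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy web graph, invented for the example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C, which is referenced by three other pages, ends up with the highest score, while D, which nobody references, gets the lowest.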