SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano
INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1
PRESENTATION OUTLINE
  GOALS AND ARCHITECTURES OF INFORMATION RETRIEVAL SYSTEMS
  PHYSICAL AND LOGICAL STORAGE STRUCTURES
  AUTOMATIC TEXT ANALYSIS AND INDEX BUILDING
  INTERNET SEARCHING
INFORMATION MANAGEMENT TECHNOLOGIES
  DATA WAREHOUSE
  DECISION SUPPORT SYSTEMS
  DATA MINING
  INFORMATION SYSTEMS ANALYSIS
  DATA INTEGRATION
  DISTRIBUTED HETEROGENEOUS DATA MANAGEMENT
  WEB INFORMATION SYSTEMS
  REAL-TIME, MAIN MEMORY, TEMPORAL DATABASES
  NON STRUCTURED, SEMISTRUCTURED AND MULTIMEDIA INFORMATION
  EMBEDDED SYSTEMS
  MOBILE AND CONTEXT-AWARE COMPONENTS
  INFORMATION RETRIEVAL SYSTEMS
MANAGEMENT INFORMATION SYSTEMS
  INFORMATION
    COMPLEX
    HIGHLY STRUCTURED
  QUERIES
    COMPLEX
    MOSTLY RECURRENT
  UPDATES
    FREQUENCY IS RANDOM, BUT HIGH
    OFTEN ON-LINE
  USED TECHNOLOGY
    DATABASE MANAGEMENT SYSTEMS
INFORMATION SEARCH
  INFORMATION
    SIMPLE (authors, keywords, colours, patterns, ...)
    POORLY STRUCTURED
  QUERIES
    COMPLEX
    CLAUSES ARE LOGICALLY CONNECTED
    PARTIALLY SPECIFIED
    ITERATIVE REFINEMENT
    UNFORESEEABLE
INFORMATION SEARCH
  UPDATES
    MOSTLY PERIODIC, WITH LOW FREQUENCY
    OFTEN OFF-LINE
  USED TECHNOLOGY
    INDEXING AND SEARCHING BY KEYWORDS
    DIRECT SEARCH ON TEXT
      FULL TEXT
      ABSTRACT
      SIGNATURE
NON STRUCTURED INFORMATION
  DOCUMENT: ANY COLLECTION OF INFORMATION SEARCHABLE BY ITS CONTENT
    TEXTS
    STATISTICAL DATA
    IMAGES
    SOUNDS
FUNCTIONAL ARCHITECTURE OF AN INFORMATION RETRIEVAL SYSTEM (IRS)
  [FIGURE: BLOCK DIAGRAM. LABELS: QUERIES, FORMAL LANGUAGE, SEARCH FORMULATION PROCESS, DOCUMENTS, DOCUMENT STORAGE PROCESS, INDEXED DOCUMENTS, SIMILARITY ASSESSMENT, SIMILAR ITEMS EXTRACTION]
DOCUMENT SPACE W.R.T. A QUERY RESULT
  ALL DOCUMENTS, PARTITIONED BY THE TWO SETS "RELEVANT" AND "RETRIEVED":
    RETRIEVED AND RELEVANT (RITRIL)
    RETRIEVED, BUT NON RELEVANT (RITNRIL)
    NON RETRIEVED, BUT RELEVANT (NRITRIL)
    NON RETRIEVED AND NON RELEVANT (NRITNRIL)
INFORMATION RETRIEVAL SYSTEMS
  THE GOAL OF AN IRS IS TO EFFECTIVELY RETRIEVE ALL THE DOCUMENTS WHICH ARE RELEVANT TO A GIVEN QUERY, AND ONLY THEM
  PERFORMANCE INDEXES
    RECALL
      RECALL = RITRIL / (RITRIL + NRITRIL)
      EFFECTIVENESS IN FINDING THE USEFUL MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RELEVANT DOCUMENTS)
    PRECISION
      PRECISION = RITRIL / (RITRIL + RITNRIL)
      EFFECTIVENESS IN REMOVING THE USELESS MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RETRIEVED DOCUMENTS)
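The two performance indexes can be sketched in a few lines of Python. This is only an illustration: the function name `recall_precision` and the sample document identifiers are assumptions, not part of the slides.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query result.

    Both arguments are collections of document identifiers.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    ritril = len(retrieved & relevant)    # retrieved and relevant
    nritril = len(relevant - retrieved)   # non retrieved, but relevant
    ritnril = len(retrieved - relevant)   # retrieved, but non relevant
    recall = ritril / (ritril + nritril) if relevant else 0.0
    precision = ritril / (ritril + ritnril) if retrieved else 0.0
    return recall, precision


# Hypothetical query result: 5 relevant documents, 4 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = {"d1", "d2", "d6", "d7"}
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)  # 0.4 0.5
```

Here recall is 2/5 and precision is 2/4, which shows how a result can miss most of the relevant material while still being half useful.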
INFORMATION RETRIEVAL SYSTEMS
  [FIGURE: DOCUMENT SPACE SPLIT INTO RETRIEVED AND RELEVANT (RITRIL), RETRIEVED, BUT NON RELEVANT (RITNRIL), NON RETRIEVED, BUT RELEVANT (NRITRIL), AND NRITNRIL]
  EXPERIMENTAL FINDING: THE USER IS (PSYCHOLOGICALLY) SATISFIED WITH LOW RECALL VALUES (~20%), BUT HIGH PRECISION (~80%) IS REQUIRED
STORAGE STRUCTURES
  THEY DEPEND ON THE PHYSICAL NATURE OF THE DOCUMENT (text, image, ...) AND ON THE INTENDED USAGE
  TEXT
    INVERTED FILES
      FOR EACH TERM OR ATTRIBUTE VALUE, A DENSE INDEX TO THE FILE IS BUILT
      THE SET OF ALL THE INDEXES CONSTITUTES THE INVERTED FILE
    BIT MAPS
  GRAPHICS
    QUADTREES OF DIFFERENT TYPES
      THE IMAGE SPACE IS RECURSIVELY DECOMPOSED INTO SQUARES UNTIL EACH SQUARE CONTAINS A SINGLE MEANINGFUL ELEMENT
      THE RESULTING TREE IS CODED AND STORED IN A COMPACT FORMAT
INVERTED FILES
  PHYSICAL ARCHITECTURE
    INVERTED FILE
    DOCUMENT REPOSITORY
    INVERSION INDEX
    FILE SYSTEM
  KEYWORDS (CONTROLLED VOCABULARY)
  LOGICAL STRUCTURE
    THESAURUS
      SYNONYMS
      HOMONYMS
      DIFFERENT SPELLINGS
      SEMANTIC LINKS (CROSS REFERENCE, KWIC)
      HIERARCHICAL RELATIONS (GENERALIZATION-SPECIALIZATION)
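A minimal sketch of the inverted file idea: for each term, a list of the documents that contain it. The function name `build_inverted_file`, the whitespace tokenization and the three sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of document ids that contain it."""
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        # Each distinct term of the document gets one posting.
        for term in set(documents[doc_id].lower().split()):
            inverted[term].append(doc_id)
    return dict(inverted)


# Hypothetical document repository.
documents = {
    1: "data base management",
    2: "data mining",
    3: "management information system",
}
inverted_file = build_inverted_file(documents)
print(inverted_file["data"])        # [1, 2]
print(inverted_file["management"])  # [1, 3]
```

A real system would also keep positions and counts for each occurrence; this sketch keeps only document identifiers.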
STAIRS STORAGE STRUCTURE
  DICTIONARY (TERMS)
    TERM
    POINTER TO THE INVERSION FILE
    POINTER TO SYNONYMS
  INVERSION FILE
    # OF DOCUMENTS
    # OF OCCURRENCES
    OCCURR. 1, OCCURR. 2, ..., OCCURR. n
      UPPER/LOWER CASE
      DOCUMENT NUMBER
      SECTION CODE
      SENTENCE NUMBER
      WORD NUMBER
  INDEX TO TEXT
    DOCUMENT ADDRESS
    PRIVACY CODE
    FORMATTED FIELDS
  TEXT FILE
    DOCUMENT HEADER
    HEADER OF 1, TEXT 1, HEADER OF 2, TEXT 2, ...
  FROM: SALTON 89
REGION QUADTREE
  [FIGURE: A REGION, ITS DECOMPOSITION INTO BLOCKS, AND THE CORRESPONDING QUADTREE; NODES LABELLED A TO Q, LEAF BLOCKS 37 TO 40 AND 57 TO 60]
  FROM: SAMET 90
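A region quadtree can be sketched as a recursive decomposition of a square binary image. The stopping rule used here (a block is a leaf when all its pixels share the same value) is an assumption standing in for "a single meaningful element"; the function name and the 4x4 sample image are hypothetical, and the image side is assumed to be a power of two.

```python
def build_quadtree(image, row=0, col=0, size=None):
    """Return a pixel value for a uniform block, else a list [NW, NE, SW, SE]."""
    if size is None:
        size = len(image)
    first = image[row][col]
    uniform = all(
        image[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if uniform:
        return first  # leaf: the whole block has one value
    half = size // 2
    return [
        build_quadtree(image, row, col, half),                # NW quadrant
        build_quadtree(image, row, col + half, half),         # NE quadrant
        build_quadtree(image, row + half, col, half),         # SW quadrant
        build_quadtree(image, row + half, col + half, half),  # SE quadrant
    ]


# Hypothetical 4x4 binary image.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
tree = build_quadtree(image)
print(tree)  # [0, 1, [0, 1, 0, 0], 0]
```

Only the south-west quadrant needs a further split, so 16 pixels are stored as 7 leaves.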
BITMAP SUPERIMPOSED CODING
  IN ITS BASIC FORM, EACH DOCUMENT IS REPRESENTED BY A ROW IN A BINARY ARRAY, THE COLUMNS OF WHICH REPRESENT THE b RELEVANT TERMS (very expensive)
  THE SUPERIMPOSED VARIANT CODES EACH DOCUMENT WITH A SHORTER (n<<b) BIT STRING
  EACH RELEVANT TERM IS CODED WITH AN n-BIT STRING IN WHICH k (k<n) BITS ARE SET TO 1
  THE TERM CODES OF A DOCUMENT ARE OR-ed TOGETHER TO PRODUCE ITS SIGNATURE (false drops, i.e. coding synonyms, are generated)
BITMAP SUPERIMPOSED CODING
  Data        0000 0010 0000 1000
  base        0100 0010 0000 0000
  management  0000 0100 0001 0000
  system      0000 0000 0101 0000
  SIGNATURE   0100 0110 0101 1000
  IN LARGE DOCUMENT REPOSITORIES, DENSE INDEXES CAN BE BUILT ON THE MAIN TABLE
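The example signature can be reproduced by OR-ing the four 16-bit term codes. The term codes are taken from the slide; the names `TERM_CODES`, `signature` and `format_bits` are assumptions introduced for this sketch.

```python
# 16-bit term codes from the example, each with 2 bits set to 1.
TERM_CODES = {
    "data":       0b0000_0010_0000_1000,
    "base":       0b0100_0010_0000_0000,
    "management": 0b0000_0100_0001_0000,
    "system":     0b0000_0000_0101_0000,
}


def signature(terms, codes=TERM_CODES):
    """Superimpose (bitwise OR) the codes of all terms of a document."""
    sig = 0
    for term in terms:
        sig |= codes[term]
    return sig


def format_bits(value, width=16):
    """Render an integer as groups of four bits, as on the slide."""
    bits = format(value, f"0{width}b")
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))


doc_signature = signature(["data", "base", "management", "system"])
print(format_bits(doc_signature))  # 0100 0110 0101 1000
```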
BITMAPS AND INVERTED FILES
  BITMAPS ARE PROFITABLY USED TO REPRESENT SHORT TEXTS WITH A MOSTLY HOMOGENEOUS VOCABULARY
  MEMORY OVERHEAD VERSUS THE NUMBER OF DOCUMENTS CONTAINING THE SAME KEY
    BIT MAP: CONSTANT
    INVERTED LISTS: LINEAR GROWTH
  WITH BITMAP ORGANIZATIONS, QUERY PROCESSING BECOMES A SIMPLE BINARY STRING MATCHING BETWEEN THE QUERY BITMAP AND THOSE OF THE DOCUMENTS
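Query matching on superimposed codes can be sketched as a bitwise test: a document is a candidate when every 1 bit of the query code is also set in the document signature. The document signature below is the "data base management system" example; the code for "mining" is a hypothetical value chosen to show a false drop, and `may_contain` is an assumed helper name.

```python
def may_contain(doc_signature, query_code):
    """True if all the 1 bits of the query code appear in the signature."""
    return (doc_signature & query_code) == query_code


doc_signature = 0b0100_0110_0101_1000  # "data base management system"

system_code = 0b0000_0000_0101_0000    # term actually in the document
mining_code = 0b0000_0100_0000_1000    # hypothetical code, term NOT in the document
absent_code = 0b1000_0000_0000_0001    # hypothetical code with bits the signature lacks

print(may_contain(doc_signature, system_code))  # True
print(may_contain(doc_signature, mining_code))  # True: a false drop
print(may_contain(doc_signature, absent_code))  # False
```

The second result is why candidates returned by a signature match still have to be checked against the text itself.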
AUTOMATIC TEXT ANALYSIS
  ITS GOAL IS TO EXTRACT THE TERMS TO BE INCLUDED IN THE INDEXES AND THEIR MUTUAL RELATIONSHIPS
    SINGLE TERMS (KWOC)
    TERMS IN CONTEXT (KWIC)
  EXHAUSTIVE INDEXING (HIGHER RECALL)
  SPECIFIC INDEXING (HIGHER PRECISION)
  DEEP INDEXING (HIGHER PERFORMANCE, HIGHER COST)
  SHALLOW INDEXING (LOWER PERFORMANCE, LOWER COST)
AUTOMATIC TEXT ANALYSIS
  ZIPF LAW (least effort principle)
    ORDERING THE SET OF WORDS IN A TEXT BY DECREASING FREQUENCY (RANK), IT CAN BE OBSERVED THAT
      RANK(i) * FREQ(i) = CONSTANT
    FOR THE ENGLISH LANGUAGE:
      CONSTANT ≈ 0.1
      50% OF DISTINCT WORDS ARE FOUND ONLY ONCE
      80% OF DISTINCT WORDS DO NOT APPEAR MORE THAN 4 TIMES
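The rank-frequency product can be checked on any text with a short script. This sketch assumes that FREQ is the relative frequency of a word (its count divided by the total number of words) and uses plain whitespace tokenization; the function names and the tiny sample text are hypothetical, and a text this short is far too small to exhibit the law.

```python
from collections import Counter


def zipf_table(text):
    """Return (rank, word, count, rank * relative frequency) per distinct word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Decreasing frequency; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, count, rank * count / total)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


def share_of_rare_words(text, max_count):
    """Fraction of distinct words that occur at most max_count times."""
    counts = Counter(text.lower().split())
    rare = sum(1 for count in counts.values() if count <= max_count)
    return rare / len(counts)


sample = "the cat and the dog and the bird"
table = zipf_table(sample)
for row in table:
    print(row)
print(share_of_rare_words(sample, 1))  # 0.6
```

Running the same two functions on a large English corpus is the way to compare against the 50% and 80% figures quoted above.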
OPERATIONS ON TEXT
  COMPRESSION
    VARIABLE LENGTH CODES
      MOST FREQUENT WORDS → SHORTER CODE
      MOST FREQUENT LETTERS → SHORTER CODE
      HUFFMAN CODE: 3 BITS FOR E, 10 BITS FOR Z, AVERAGE LENGTH 4.12 BITS, 48% COMPRESSION
    DIGRAMS, TRIGRAMS, ... CODING
  CRYPTOGRAPHY
    REVERSIBLE TEXT TRANSFORMATION
    INFORMATION PRIVACY
    ACCESS RIGHTS
    AUTHENTICATION
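A compact sketch of Huffman coding on letters, showing that frequent symbols receive shorter codes. The function name `huffman_code` and the sample string are assumptions; the code lengths it produces depend on the sample and are not the 3-bit and 10-bit figures quoted for English.

```python
import heapq
from collections import Counter


def huffman_code(frequencies):
    """Return a dict symbol -> bit string built with Huffman's algorithm."""
    if not frequencies:
        return {}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [
        (freq, index, {symbol: ""})
        for index, (symbol, freq) in enumerate(sorted(frequencies.items()))
    ]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {symbol: "0" for symbol in heap[0][2]}
    next_index = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next_index, merged))
        next_index += 1
    return heap[0][2]


text = "eeeeeeeetttaaz"  # hypothetical sample: e is frequent, z is rare
codes = huffman_code(Counter(text))
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded), "bits instead of", 8 * len(text))
```

For this sample the encoded length is 23 bits against 112 bits with a fixed 8-bit code.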
AUTOMATIC INDEXING
  THE CHOICE OF INSERTING A TERM INTO AN INDEX IS TO BE MADE ON THE BASIS OF TWO PARAMETERS
    ITS RELEVANCE FOR IDENTIFYING A DOCUMENT → RECALL
    ITS WEIGHT FOR SINGLING OUT A DOCUMENT FROM A COLLECTION OF SIMILAR DOCUMENTS → PRECISION
  TERM OCCURRENCE PROPERTIES IN A WHOLE COLLECTION OF N DOCUMENTS MUST BE EXAMINED
    THE MOST COMMON FUNCTIONAL TERMS ARE REMOVED (ARTICLES, PREPOSITIONS, ETC.) → STOP LIST
    THE FREQUENCY tf_ij OF THE REMAINING TERMS T_j IN EACH DOCUMENT D_i IS COMPUTED
    A THRESHOLD FREQUENCY T IS CHOSEN AND TO EACH DOCUMENT D_i ALL THE TERMS T_j ARE ASSIGNED FOR WHICH tf_ij > T
AUTOMATIC INDEXING
  TERMS WHICH ALLOW A GOOD INDEXING BOTH FOR RECALL AND PRECISION APPEAR
    OFTEN IN INDIVIDUAL DOCUMENTS
    SELDOM IN THE REMAINING COLLECTION
  A GOOD PERFORMANCE INDEX IS THE WEIGHT
    w_ij = tf_ij * log(N / df_j)
  WHERE N IS THE NUMBER OF DOCUMENTS IN THE COLLECTION AND THE DOCUMENT FREQUENCY df_j IS THE NUMBER OF DOCUMENTS IN THE COLLECTION IN WHICH THE TERM T_j APPEARS
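The weight formula can be sketched directly. The slide does not state the base of the logarithm, so the natural logarithm is an assumption here, as are the function name `tf_idf_weights`, the whitespace tokenization and the three sample documents.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {doc_id: {term: tf * log(N / df)}} for a small collection."""
    n_docs = len(documents)
    term_counts = {
        doc_id: Counter(text.lower().split())
        for doc_id, text in documents.items()
    }
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


# Hypothetical collection of N = 3 documents.
documents = {
    "D1": "retrieval retrieval index",
    "D2": "index storage",
    "D3": "index query",
}
weights = tf_idf_weights(documents)
print(weights["D1"])
```

The term "index" appears in every document, so its weight is 0 everywhere, while "retrieval" appears twice in D1 and nowhere else and gets the highest weight, 2 * log(3).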
AUTOMATIC INDEXING
  INDEXING ON
    TITLE ONLY
    TITLE AND ABSTRACT (best cost/performance)
    FULL TEXT
  PROCESS STEPS
    REMOVE STOP WORDS
    CREATE WORD STEMS BY REMOVING PREFIXES AND SUFFIXES
    COALESCE EQUIVALENT STEMS (THESAURI)
    WEIGHT REMAINING TERMS
    APPLY POSSIBLE THRESHOLDS
    INSERT REMAINING TERMS INTO THE INDEX
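The process steps can be sketched as a small pipeline. Everything named here is an assumption for illustration: the stop list, the three suffix rules, the one-entry thesaurus, the minimum stem length of three characters, and the use of the raw term frequency as the weight.

```python
from collections import Counter

STOP_LIST = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "with"}
SUFFIXES = ("ing", "es", "s")      # hypothetical suffix-stripping rules
THESAURUS = {"lookup": "search"}   # hypothetical: maps a stem to its preferred term
MIN_STEM_LENGTH = 3


def stem(word):
    """Strip the first matching suffix unless the stem would become too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def index_terms(text, threshold=1):
    """Return {term: tf} for the terms whose frequency exceeds the threshold."""
    words = [w for w in text.lower().split() if w not in STOP_LIST]  # stop words
    stems = [stem(w) for w in words]                                 # word stems
    terms = [THESAURUS.get(s, s) for s in stems]                     # coalesce stems
    counts = Counter(terms)                                          # weight = tf
    return {term: tf for term, tf in counts.items() if tf > threshold}


text = "indexing the indexes of the king and searching for an index with lookups"
print(index_terms(text))               # {'index': 3, 'search': 2}
print(index_terms(text, threshold=0))  # also keeps 'king' with tf 1
```

The minimum stem length keeps "king" intact instead of reducing it to "k" when the "ing" suffix is stripped.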
THESAURI
  THESAURI ALLOW A LARGER RECALL BY SUBSTITUTING TOO SPECIFIC TERMS WITH MORE COMMON SYNONYMS
  STEM USAGE REQUIRES THAT CORRECT LEXICAL RULES ARE FOLLOWED FOR EACH LANGUAGE (e.g. SUBSTITUTION OF THE FINAL I WITH Y)
  STEMS MUST BE AT LEAST THREE CHARACTERS LONG IN ORDER TO BE SIGNIFICANT (the progressive tense rule, which strips -ING, would truncate KING to K)
DOCUMENT SEARCH
  INTERACTIVITY
    AFTER THE FIRST QUERY, THE SYSTEM SHOWS THE NUMBER OF RETRIEVED DOCUMENTS
    IN EACH FURTHER ITERATION, THE USER TRIES TO ENHANCE THE PRECISION UNTIL THE NUMBER OF RETRIEVED DOCUMENTS IS SMALL ENOUGH TO BE INSPECTED DIRECTLY
  RANKING
    DOCUMENTS ARE PRESENTED IN RELEVANCE ORDER, BASED ON WEIGHTS ASSIGNED TO THE DIFFERENT TERMS
  BROWSING
    SIMILAR DOCUMENTS ARE GROUPED IN A SINGLE CLASS AND INSPECTED BY PROXIMITY
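Ranking by term weights can be sketched as follows. The scoring rule (the sum of the weights of the query terms found in each document) is one simple assumption among many possible ones, and the function name and the weight table are hypothetical.

```python
def rank_documents(query_terms, doc_weights):
    """Return (doc_id, score) pairs in decreasing score order, dropping zero scores."""
    scores = {
        doc_id: sum(weights.get(term, 0.0) for term in query_terms)
        for doc_id, weights in doc_weights.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


# Hypothetical term weights per document.
doc_weights = {
    "D1": {"index": 0.5, "retrieval": 2.0},
    "D2": {"index": 1.5},
    "D3": {"storage": 1.0},
}
ranking = rank_documents(["index", "retrieval"], doc_weights)
print(ranking)  # [('D1', 2.5), ('D2', 1.5)]
```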
DOCUMENT SEARCH
  RELEVANCE FEEDBACK
    THE SYSTEM INVITES THE USER TO EVALUATE THE RELEVANCE OF EACH RETRIEVED DOCUMENT
    FROM THE ANSWERS, THE SYSTEM TUNES THE TERM WEIGHTS IN THE DOCUMENTS
  USER PROFILES
    INFORMATION ABOUT MOST CONSULTED DOCUMENTS
    RELEVANCE ANALYSIS RESULTS
    INFORMATION ABOUT THE WORK CONTEXT
    DYNAMIC MANAGEMENT IS NEEDED
    CAN BE USED IN WORKING ENVIRONMENTS WITH WELL DEFINED, CUSTOMARY USERS
LANGUAGES FOR DOCUMENT SEARCHING
  QUERY LANGUAGES ARE MOSTLY BASED ON THE FUNDAMENTAL SET OPERATORS - AND, OR, NOT - AND THEIR COMBINATIONS
  SUPPLEMENTARY OPERATORS
    TERMS ORDERING
    TERMS CONTIGUITY
    WILDCARDS (truncation or separation)
    SEARCH FIELD (title, abstract, full text)
  OTHER COMMANDS
    DOCUMENT DATA BANK CHOICE
    THESAURUS INSPECTION
    SAVING OF SEARCH RESULTS
    ...
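The set operators map directly onto set operations over the postings of an inverted file. The tiny index, the document identifiers and the helper names (`postings`, `complement`, `truncation`) are assumptions for this sketch; truncation is modelled as a prefix match over the indexed terms.

```python
# Hypothetical inverted file: term -> set of document ids.
INVERTED_FILE = {
    "data": {1, 2, 4},
    "database": {3},
    "mining": {2},
    "system": {1, 3, 4},
}
ALL_DOCUMENTS = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return INVERTED_FILE.get(term, set())


def complement(docs):
    """NOT operator: every document outside the given set."""
    return ALL_DOCUMENTS - docs


def truncation(prefix):
    """Wildcard by truncation: union of postings of terms starting with prefix."""
    result = set()
    for term, docs in INVERTED_FILE.items():
        if term.startswith(prefix):
            result |= docs
    return result


# data AND system AND NOT mining
result_and_not = postings("data") & postings("system") & complement(postings("mining"))
# mining OR database
result_or = postings("mining") | postings("database")
# data* (matches "data" and "database")
result_wildcard = truncation("data")

print(result_and_not, result_or, result_wildcard)
```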
NETWORK SEARCH
  THE MAIN DIFFERENCES BETWEEN WEB SEARCHING AND TRADITIONAL INFORMATION RETRIEVAL ARE:
    HIGHER HETEROGENEITY OF WEB INFORMATION
    EXTREMELY LARGE DIMENSIONS OF THE SEARCH DOMAIN (year 2005)
      8x10^9 STATIC WEB PAGES, AMOUNTING TO 10^2 TBYTE
      1 MILLION NEW PAGES/DAY (very high volatility)
      140x10^3 SEARCHES/MINUTE (Google 2004)
    EVEN IF THE RECALL IS LARGE, ONLY THE VERY FIRST DOCUMENTS ARE EXAMINED
    OWING TO THEIR COMMERCIAL VALUE TO ADVERTISERS, SORTING AND RANKING ALGORITHMS ARE AMONG THE BEST KEPT INDUSTRIAL SECRETS!
NETWORK SEARCH
  SEARCH ENGINES
    USE CENTRALIZED SEARCH INDEXES WITH TREE CATEGORIZATION OF CONTENTS
    BOTH CONTENT AND CONTEXT
    EFFECTIVE DOCUMENT CLASSIFICATION
  PORTALS (SUBJECT GATEWAYS)
    TRADITIONAL ENGINES INDEX INDIVIDUAL PAGES
    A PORTAL, AMONG OTHER FEATURES, RECOGNIZES A DOCUMENT AS SUCH AND KEEPS INFORMATION COHERENCE
SEARCH ENGINES
  DIRECTORY BASED (Magellan, ...)
    KNOWLEDGE IS ORGANIZED INTO TREE STRUCTURES; WEB PAGES ARE CLASSIFIED ACCORDINGLY
    CLASSIFICATION IS A HEAVY JOB
    IF THE REQUIRED INFORMATION DOES NOT FALL INTO THE CLASSIFICATION, FINDING IT IS IMPOSSIBLE
  SPIDER BASED (Alta Vista, Lycos, Google, ...)
    SPECIFIC PROGRAMS LOOK FOR EVERYTHING AND ORGANIZE THE TOPICS IN ANY WAY
    THE SPIDER EXPLORES THE WEB AND FINDS THE PAGES
    A DATABASE STORES THE RETRIEVED INFORMATION AND THE RELEVANCE SORTING ALGORITHMS
    A USER INTERFACE ALLOWS QUERY FORMULATION AND RESULT PRESENTATION
SEARCH ENGINES
  GOOGLE
    BORN AS A RESEARCH PROJECT AT STANFORD
    IT USES AN INDEX WITH MORE THAN 10^9 PAGES
    ITS SPIDER ADDS MORE OR LESS 10^6 PAGES/DAY
    IT MANAGES 200 MILLION SEARCHES/DAY
    SEARCH RESULTS ARE EVALUATED BY MEANS OF PageRank TECHNOLOGY
      RELEVANCE IS COMPUTED BY MEANS OF MATHEMATICAL FORMULAS WITH 500*10^6 VARIABLES AND 2*10^9 TERMS
      IT ACCOUNTS BOTH FOR PAGE CONTENT AND FOR REFERENCES MADE FROM OTHER PAGES, CLASSIFIED AS TO RELEVANCE
      IT TRIES TO AVOID USER INTERFERENCE IN RANKING
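The slides only name PageRank; the sketch below follows the commonly published power-iteration formulation of the idea that a page is important when important pages link to it. It is not Google's production algorithm. The damping factor 0.85, the fixed number of iterations, the handling of pages without outgoing links and the four-page link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return {page: rank} by power iteration over a dict page -> linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share, independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is referenced by A, B and D; nobody links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this graph C ends up with the highest rank and D with the lowest, which reflects references made from other pages rather than page content.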