SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT
- Juliana Golden
- 6 years ago
Transcription
1 SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano
2 INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1
3 PRESENTATION SCHEMA
- GOALS AND ARCHITECTURES OF INFORMATION RETRIEVAL SYSTEMS
- PHYSICAL AND LOGICAL STORAGE STRUCTURES
- AUTOMATIC TEXT ANALYSIS AND INDEX BUILDING
- INTERNET SEARCHING
4 INFORMATION MANAGEMENT TECHNOLOGIES
- DATA WAREHOUSE
- DECISION SUPPORT SYSTEMS
- DATA MINING
- INFORMATION SYSTEMS ANALYSIS
- DATA INTEGRATION
- DISTRIBUTED HETEROGENEOUS DATA MANAGEMENT
- WEB INFORMATION SYSTEMS
- REAL-TIME, MAIN MEMORY, TEMPORAL DATABASES
- NON STRUCTURED, SEMISTRUCTURED AND MULTIMEDIA INFORMATION
- EMBEDDED SYSTEMS
- MOBILE AND CONTEXT-AWARE COMPONENTS
- INFORMATION RETRIEVAL SYSTEMS
5 MANAGEMENT INFORMATION SYSTEMS
- INFORMATION: COMPLEX, HIGHLY STRUCTURED
- QUERIES: COMPLEX, MOSTLY RECURRENT
- UPDATES: FREQUENCY IS RANDOM, BUT HIGH; OFTEN ON-LINE
- USED TECHNOLOGY: DATABASE MANAGEMENT SYSTEMS
6 INFORMATION SEARCH
- INFORMATION
  - SIMPLE (authors, keywords, colours, patterns, ...)
  - POORLY STRUCTURED
- QUERIES
  - COMPLEX
  - CLAUSES ARE LOGICALLY CONNECTED
  - PARTIALLY SPECIFIED
  - ITERATIVE REFINEMENT
  - UNFORESEEABLE
7 INFORMATION SEARCH
- UPDATES
  - MOSTLY PERIODIC, WITH LOW FREQUENCY
  - OFTEN OFF-LINE
- USED TECHNOLOGY
  - INDEXING AND SEARCHING BY KEYWORDS
  - DIRECT SEARCH ON TEXT: FULL TEXT, ABSTRACT, SIGNATURE
8 NON STRUCTURED INFORMATION
DOCUMENT: ANY COLLECTION OF INFORMATION SEARCHABLE BY ITS CONTENT
- TEXTS
- STATISTICAL DATA
- IMAGES
- SOUNDS
9 FUNCTIONAL ARCHITECTURE OF AN INFORMATION RETRIEVAL SYSTEM (IRS)
[Diagram labels: QUERIES; SEARCH FORMULATION PROCESS; FORMAL LANGUAGE; DOCUMENTS; DOCUMENTS STORAGE PROCESS; INDEXED DOCUMENTS; SIMILARITY ASSESSMENT; SIMILAR ITEMS EXTRACTION]
10 DOCUMENT SPACE W.R.T. A QUERY RESULT
[Diagram: ALL DOCUMENTS, with the RELEVANT and RETRIEVED sets overlapping]
- RETRIEVED AND RELEVANT (RITRIL)
- RETRIEVED, BUT NOT RELEVANT (RITNRIL)
- NOT RETRIEVED, BUT RELEVANT (NRITRIL)
- NOT RETRIEVED AND NOT RELEVANT (NRITNRIL)
11 INFORMATION RETRIEVAL SYSTEMS
THE GOAL OF AN IRS IS TO EFFECTIVELY RETRIEVE ALL THE DOCUMENTS WHICH ARE RELEVANT TO A GIVEN QUERY, AND ONLY THEM
PERFORMANCE INDEXES
- RECALL = RITRIL / (RITRIL + NRITRIL)
  EFFECTIVENESS IN FINDING THE USEFUL MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RELEVANT DOCUMENTS)
- PRECISION = RITRIL / (RITRIL + RITNRIL)
  EFFECTIVENESS IN REMOVING THE USELESS MATERIAL (RELEVANT AND RETRIEVED DOCUMENTS W.R.T. ALL THE RETRIEVED DOCUMENTS)
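The two performance indexes can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch: the function name `recall_precision` and the document identifiers are made up for illustration, and the variable names mirror the RITRIL, NRITRIL and RITNRIL regions of the slide.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    ritril = len(retrieved & relevant)    # retrieved and relevant
    nritril = len(relevant - retrieved)   # not retrieved, but relevant
    ritnril = len(retrieved - relevant)   # retrieved, but not relevant
    recall = ritril / (ritril + nritril) if relevant else 0.0
    precision = ritril / (ritril + ritnril) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 5 documents retrieved, 10 relevant, 2 in common.
recall, precision = recall_precision(
    retrieved={1, 2, 3, 4, 5},
    relevant={4, 5, 6, 7, 8, 9, 10, 11, 12, 13},
)
print(recall, precision)
```

In this made-up case the system finds 2 of the 10 relevant documents and 2 of its 5 answers are useful.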
12 INFORMATION RETRIEVAL SYSTEMS
[Diagram: document space with the regions RETRIEVED AND RELEVANT (RITRIL), NOT RETRIEVED BUT RELEVANT (NRITRIL), (RITNRIL) and (NRITNRIL)]
EXPERIMENTAL FINDING: THE USER IS (PSYCHOLOGICALLY) HAPPY WITH LOW RECALL VALUES (~20%), BUT HIGH PRECISION (~80%) IS REQUIRED
13 STORAGE STRUCTURES
THEY DEPEND ON THE PHYSICAL NATURE OF THE DOCUMENT (text, image, ...) AND ON THE INTENDED USAGE
- TEXT
  - INVERTED FILES: FOR EACH TERM OR ATTRIBUTE VALUE, A DENSE INDEX TO THE FILE IS BUILT; THE SET OF ALL THE INDEXES CONSTITUTES THE INVERTED FILE
  - BIT MAPS
- GRAPHICS
  - QUADTREES OF DIFFERENT TYPES: THE IMAGE SPACE IS RECURSIVELY DECOMPOSED INTO SQUARES UNTIL A SQUARE CONTAINS A SINGLE MEANINGFUL ELEMENT; THE RESULTING TREE IS CODED AND STORED IN A COMPACT FORMAT
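As an illustration of the inverted file idea, the sketch below builds, for each term, the list of documents that contain it. It assumes whitespace tokenization and an in-memory dictionary, which are simplifications; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        # set() keeps one entry per document even if a term occurs many times
        for term in sorted(set(documents[doc_id].lower().split())):
            inverted[term].append(doc_id)
    return dict(inverted)


# Hypothetical three-document collection.
docs = {
    1: "information retrieval systems",
    2: "database management systems",
    3: "information management",
}
inverted_file = build_inverted_file(docs)
print(inverted_file["systems"])
```

Looking up a term is then a single dictionary access instead of a scan of every document.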
14 INVERTED FILES
- PHYSICAL ARCHITECTURE
  - INVERTED FILE
  - DOCUMENT REPOSITORY
  - INVERSION INDEX
  - FILE SYSTEM
- KEYWORDS (CONTROLLED VOCABULARY)
- LOGICAL STRUCTURE
  - THESAURUS
  - SYNONYMS
  - HOMONYMS
  - DIFFERENT SPELLINGS
  - SEMANTIC LINKS (CROSS REFERENCE, KWIC)
  - HIERARCHICAL RELATIONS (GENERALIZATION-SPECIALIZATION)
15 STAIRS STORAGE STRUCTURE
[Diagram, FROM: SALTON 89]
- DICTIONARY: TERMS; for each TERM, a POINTER TO THE INVERSION FILE and a POINTER TO SYNONYMS
- INVERSION FILE: # OF DOCUMENTS, # OF OCCURRENCES, OCCURR. 1, OCCURR. 2, ..., OCCURR. n; each occurrence records UPPER/LOWER CASE, NO. OF THE DOCUMENT, SECTION CODE, NO. OF THE SENTENCE, NO. OF THE WORD
- INDEX TO TEXT: DOCUMENT ADDRESS, PRIVACY CODE, FORMATTED FIELDS
- TEXT FILE: DOCUMENT HEADER, HEADER OF 1, TEXT 1, HEADER OF 2, TEXT 2, ...
16 REGION QUADTREE
[Figure: an image region decomposed into blocks and the corresponding quadtree, with elements labelled A to Q. FROM: SAMET 90]
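The recursive decomposition behind a region quadtree can be sketched as follows. This is an illustration under stated assumptions, not the coding used in the cited figure: the image is a square binary matrix whose side is a power of two, a leaf is stored as the pixel value of its uniform block, and an internal node is a list of its four quadrants in NW, NE, SW, SE order.

```python
def build_quadtree(image, row=0, col=0, size=None):
    """Recursively split a square block until every block is uniform."""
    if size is None:
        size = len(image)
    first = image[row][col]
    uniform = all(
        image[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if uniform:
        return first  # leaf: the whole block has a single value
    half = size // 2
    return [
        build_quadtree(image, row, col, half),                # NW quadrant
        build_quadtree(image, row, col + half, half),         # NE quadrant
        build_quadtree(image, row + half, col, half),         # SW quadrant
        build_quadtree(image, row + half, col + half, half),  # SE quadrant
    ]


# Hypothetical 4x4 binary image.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(image)
print(tree)
```

Three of the four top-level quadrants are uniform and become leaves at once; only the SE quadrant needs a further split.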
17 BITMAP SUPERIMPOSED CODING
- IN ITS BASIC FORM, EACH DOCUMENT IS REPRESENTED BY A ROW IN A BINARY ARRAY WHOSE COLUMNS REPRESENT THE b RELEVANT TERMS (very expensive)
- THE SUPERIMPOSED VARIANT CODES EACH DOCUMENT WITH A SHORTER (n << b) BIT STRING
- RELEVANT TERMS ARE CODED WITH n-BIT STRINGS IN WHICH k (k < n) BITS ARE SET TO 1, AND WHICH ARE OR-ed (false drops, i.e. coding synonyms, are generated)
- THE GENERATED TERM CODES ARE LINKED TOGETHER TO PRODUCE THE SIGNATURE
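A minimal sketch of superimposed coding follows. The slide does not say how the k bit positions of a term are chosen; here they are derived from a SHA-256 hash, which is an assumption made only for the example, as are the default values n = 32 and k = 3 and all function names.

```python
import hashlib


def term_code(term, n=32, k=3):
    """Return an n-bit code for the term with exactly k bits set to 1."""
    positions = set()
    salt = 0
    while len(positions) < k:  # re-hash until k distinct positions are found
        digest = hashlib.sha256(f"{salt}:{term}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % n)
        salt += 1
    code = 0
    for position in positions:
        code |= 1 << position
    return code


def signature(terms, n=32, k=3):
    """OR together the codes of all the terms of a document."""
    sig = 0
    for term in terms:
        sig |= term_code(term, n, k)
    return sig


def may_contain(doc_signature, term, n=32, k=3):
    """True if every bit of the term code is set in the signature.

    A True answer can be a false drop; a False answer is always correct.
    """
    code = term_code(term, n, k)
    return (doc_signature & code) == code


doc_sig = signature(["data", "base", "management", "system"])
print(may_contain(doc_sig, "management"))
```

A term that really occurs in the document always matches; a term that does not occur may still match when its bits happen to be covered by other terms, which is the false drop mentioned on the slide.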
18 BITMAP SUPERIMPOSED CODING
[Figure: SIGNATURE of the example text "Data base management system"]
IN LARGE DOCUMENT REPOSITORIES, DENSE INDEXES CAN BE BUILT ON THE MAIN TABLE
19 BITMAPS AND INVERTED FILES
- BITMAPS ARE PROFITABLY USED TO REPRESENT TEXTS THAT ARE SHORT AND MOSTLY HOMOGENEOUS IN THEIR VOCABULARY
- MEMORY OVERHEAD VERSUS THE NUMBER OF DOCUMENTS CONTAINING THE SAME KEY
  - BIT MAP: CONSTANT
  - INVERTED LISTS: LINEAR GROWTH
- WITH BITMAP ORGANIZATIONS, QUERY PROCESSING BECOMES A SIMPLE BINARY STRING MATCHING BETWEEN THE QUERY BITMAP AND THOSE OF THE DOCUMENTS
20 AUTOMATIC TEXT ANALYSIS
ITS GOAL IS TO EXTRACT THE TERMS TO BE INCLUDED IN THE INDEXES AND THEIR MUTUAL RELATIONSHIPS
- SINGLE TERMS (KWOC)
- TERMS IN CONTEXT (KWIC)
- EXHAUSTIVE INDEXING (> RECALL)
- SPECIFIC INDEXING (> PRECISION)
- DEEP INDEXING (> PERFORMANCE, > COST)
- SHALLOW INDEXING (< PERFORMANCE, < COST)
21 AUTOMATIC TEXT ANALYSIS
ZIPF LAW (least effort principle)
ORDERING THE SET OF WORDS IN A TEXT IN DECREASING FREQUENCY ORDER (RANK), IT CAN BE OBSERVED THAT
  RANK(i) * FREQ(i) = CONSTANT
FOR THE ENGLISH LANGUAGE:
- CONSTANT [...]
- [...] % OF DISTINCT WORDS ARE FOUND ONLY ONCE
- 80% OF DISTINCT WORDS DO NOT APPEAR MORE THAN 4 TIMES
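To check the rank-frequency relation on a given text, the words can be counted, ranked by decreasing frequency, and the product RANK(i) * FREQ(i) listed. The sketch below assumes whitespace tokenization and uses a made-up toy text; on such a tiny sample the product is only roughly stable, and a real corpus is needed to see the law emerge.

```python
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical toy text.
table = rank_frequency_table("a a a b b c")
for row in table:
    print(row)
```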
22 COMPRESSION OPERATIONS ON TEXT
- VARIABLE LENGTH CODES
  - MOST FREQUENT WORDS: SHORTER CODE
  - MOST FREQUENT LETTERS: SHORTER CODE
  - HUFFMAN CODE: 3 BIT FOR E, 10 BIT FOR Z, AVERAGE LENGTH: [...] % COMPRESSION
  - DIGRAMS, TRIGRAMS, ... CODING
- CRYPTOGRAPHY
  - REVERSIBLE TEXT TRANSFORMATION
  - INFORMATION PRIVACY
  - ACCESS RIGHTS
  - AUTHENTICATION
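The way a Huffman code gives shorter codes to more frequent symbols can be sketched with the standard greedy construction, which repeatedly merges the two least frequent subtrees. The four-symbol alphabet and its frequencies below are invented for the example and do not reproduce the English letter statistics quoted on the slide.

```python
import heapq
from itertools import count


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    tiebreak = count()  # avoids comparing dicts when frequencies are equal
    heap = [(freq, next(tiebreak), {symbol: 0}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq1, _, lengths1 = heapq.heappop(heap)
        freq2, _, lengths2 = heapq.heappop(heap)
        # Every symbol of the two merged subtrees moves one level deeper.
        merged = {s: depth + 1 for s, depth in {**lengths1, **lengths2}.items()}
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical symbol frequencies (they sum to 1).
freqs = {"e": 0.5, "t": 0.25, "a": 0.125, "z": 0.125}
lengths = huffman_code_lengths(freqs)
average_length = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, average_length)
```

With these frequencies the average code length is 1.75 bits per symbol, against 2 bits for a fixed-length code over four symbols.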
23 AUTOMATIC INDEXING
THE CHOICE OF INSERTING A TERM INTO AN INDEX IS TO BE MADE ON THE BASIS OF TWO PARAMETERS
- ITS RELEVANCE FOR IDENTIFYING A DOCUMENT: RECALL
- ITS WEIGHT FOR SINGLING OUT A DOCUMENT FROM A COLLECTION OF SIMILAR DOCUMENTS: PRECISION
TERM OCCURRENCE PROPERTIES IN A WHOLE COLLECTION OF N DOCUMENTS MUST BE EXAMINED
- THE MOST COMMON FUNCTIONAL TERMS ARE REMOVED (ARTICLES, PREPOSITIONS, ETC.): STOP LIST
- THE FREQUENCY tf_ij OF THE REMAINING TERMS T_j IN EACH DOCUMENT D_i IS COMPUTED
- A THRESHOLD FREQUENCY T IS CHOSEN, AND TO EACH DOCUMENT D_i ALL THE TERMS T_j ARE ASSIGNED FOR WHICH tf_ij > T
24 AUTOMATIC INDEXING
TERMS WHICH ALLOW A GOOD INDEXING BOTH FOR RECALL AND PRECISION APPEAR
- OFTEN IN INDIVIDUAL DOCUMENTS
- SELDOM IN THE REMAINING COLLECTION
A GOOD PERFORMANCE INDEX IS THE WEIGHT
  w_ij = tf_ij * log(N / df_j)
WHERE THE DOCUMENT FREQUENCY df_j REPRESENTS THE NUMBER OF DOCUMENTS IN THE COLLECTION IN WHICH THE TERM T_j APPEARS
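The weight w_ij = tf_ij * log(N / df_j) can be computed for a small collection as in the sketch below. It assumes the documents are already tokenized and stripped of stop words, and it uses the natural logarithm, since the slide does not specify the base; the function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, with w = tf * log(N / df)."""
    n_docs = len(documents)
    term_freqs = [Counter(doc) for doc in documents]                # tf_ij
    doc_freq = Counter(t for counts in term_freqs for t in counts)  # df_j
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_freqs
    ]


# Hypothetical collection of N = 3 tokenized documents.
collection = [
    ["retrieval", "retrieval", "index"],
    ["index", "database"],
    ["index", "query"],
]
weights = tf_idf_weights(collection)
print(weights[0])
```

The term "index" appears in every document, so its weight is 0 everywhere, while "retrieval", frequent in one document and absent from the others, gets the highest weight.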
25 ON AUTOMATIC INDEXING
- TITLE ONLY
- TITLE AND ABSTRACT (best cost/performance)
- FULL TEXT
PROCESS STEPS
- REMOVE STOP WORDS
- CREATE WORD STEMS BY REMOVING PREFIXES AND SUFFIXES
- COALESCE EQUIVALENT STEMS (THESAURI)
- WEIGHT THE REMAINING TERMS
- APPLY POSSIBLE THRESHOLDS
- INSERT THE REMAINING TERMS INTO THE INDEX
26 THESAURI
- THESAURI ALLOW A LARGER RECALL BY SUBSTITUTING TOO SPECIFIC TERMS WITH MORE COMMON SYNONYMS
- STEM USAGE REQUIRES THAT THE CORRECT LEXICAL RULES ARE FOLLOWED FOR EACH LANGUAGE (e.g. SUBSTITUTION OF THE FINAL I WITH Y)
- STEMS MUST BE AT LEAST THREE CHARACTERS LONG IN ORDER TO BE SIGNIFICANT (otherwise the progressive tense rule, which strips -ING, would truncate KING to K)
27 DOCUMENT SEARCH
- INTERACTIVITY
  - AFTER THE FIRST QUERY, THE SYSTEM SHOWS THE NUMBER OF RELEVANT DOCUMENTS
  - IN EACH FURTHER ITERATION, THE USER TRIES TO ENHANCE THE PRECISION UNTIL THE NUMBER OF RETRIEVED DOCUMENTS IS SMALL ENOUGH TO BE INSPECTED DIRECTLY
- RANKING
  - DOCUMENTS ARE PRESENTED IN RELEVANCE ORDER, BASED ON WEIGHTS ASSIGNED TO THE DIFFERENT TERMS
- BROWSING
  - SIMILAR DOCUMENTS ARE GROUPED IN A SINGLE CLASS AND INSPECTED BY PROXIMITY
28 DOCUMENT SEARCH
- RELEVANCE FEEDBACK
  - THE SYSTEM INVITES THE USER TO EVALUATE THE RELEVANCE OF EACH RETRIEVED DOCUMENT
  - FROM THE ANSWERS, THE SYSTEM TUNES THE TERM WEIGHTS IN THE DOCUMENTS
- USER PROFILES
  - INFORMATION ABOUT MOST CONSULTED DOCUMENTS
  - RELEVANCE ANALYSIS RESULTS
  - INFORMATION ABOUT THE WORK CONTEXT
  - DYNAMIC MANAGEMENT IS NEEDED
  - CAN BE USED IN WORKING ENVIRONMENTS WITH WELL DEFINED, CUSTOMARY USERS
29 LANGUAGES FOR DOCUMENT SEARCHING
QUERY LANGUAGES ARE MOSTLY BASED ON THE FUNDAMENTAL SET OPERATORS - AND, OR, NOT - AND THEIR COMBINATIONS
- SUPPLEMENTARY OPERATORS
  - TERMS ORDERING
  - TERMS CONTIGUITY
  - WILDCARDS (truncation or separation)
  - SEARCH FIELD (title, abstract, full text)
- OTHER COMMANDS
  - DOCUMENT DATA BANK CHOICE
  - THESAURUS INSPECTION
  - SEARCH RESULT MEMORIZATION
  - ...
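Because AND, OR and NOT are set operators, a Boolean query can be evaluated with set intersection, union and difference over the document lists of an inverted file. The sketch below assumes a tiny hand-written index; the helper names and the document identifiers are hypothetical.

```python
# Hypothetical inverted file: term -> set of ids of the documents containing it.
index = {
    "information": {1, 3, 4},
    "retrieval": {1, 4},
    "database": {2, 3},
}
all_documents = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term (empty set for an unknown term)."""
    return index.get(term, set())


def q_and(left, right):
    return left & right           # AND is set intersection


def q_or(left, right):
    return left | right           # OR is set union


def q_not(operand):
    return all_documents - operand  # NOT is the complement w.r.t. the collection


# information AND retrieval
both = q_and(postings("information"), postings("retrieval"))
# information AND NOT retrieval
only_information = q_and(postings("information"), q_not(postings("retrieval")))
# retrieval OR database
either = q_or(postings("retrieval"), postings("database"))
print(both, only_information, either)
```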
30 NETWORK SEARCH
THE MAIN DIFFERENCES BETWEEN WEB SEARCHING AND TRADITIONAL INFORMATION RETRIEVAL ARE:
- HIGHER HETEROGENEITY OF WEB INFORMATION
- EXTREMELY LARGE DIMENSIONS OF THE SEARCH DOMAIN (year 2005)
  - 8x10^9 STATIC WEB PAGES AMOUNTING TO 10^2 TBYTE
  - 1 MILLION NEW PAGES/DAY (very high volatility)
  - 140x10^3 SEARCHES/MINUTE (Google 2004)
- EVEN IF THE RECALL IS LARGE, ONLY THE VERY FIRST DOCUMENTS ARE EXAMINED
- OWING TO THEIR COMMERCIAL VALUE TO ADVERTISERS, SORTING AND RANKING ALGORITHMS ARE AMONG THE BEST KEPT INDUSTRIAL SECRETS!
31 NETWORK SEARCH
- SEARCH ENGINES
  - USE CENTRALIZED SEARCH INDEXES WITH TREE CATEGORIZATION OF CONTENTS
  - BOTH CONTENT AND CONTEXT
  - EFFECTIVE DOCUMENT CLASSIFICATION
- PORTALS (SUBJECT GATEWAYS)
  - TRADITIONAL ENGINES INDEX INDIVIDUAL PAGES
  - A PORTAL, AMONG OTHER FEATURES, RECOGNIZES A DOCUMENT AS SUCH AND KEEPS THE INFORMATION COHERENT
32 SEARCH ENGINES
- DIRECTORY BASED (Magellan, ...)
  - KNOWLEDGE IS ORGANIZED INTO TREE STRUCTURES; WEB PAGES ARE CLASSIFIED ACCORDINGLY
  - CLASSIFICATION IS A HEAVY JOB
  - IF THE REQUIRED INFORMATION DOES NOT FALL INTO THE CLASSIFICATION, FINDING IT IS IMPOSSIBLE
- SPIDER BASED (Alta Vista, Lycos, Google, ...)
  - SPECIFIC PROGRAMS LOOK FOR EVERYTHING AND ORGANIZE THE TOPICS IN ANY WAY
  - THE SPIDER EXPLORES THE WEB AND FINDS THE PAGES
  - A DATABASE STORES THE RETRIEVED INFORMATION AND THE RELEVANCE SORTING ALGORITHMS
  - A USER INTERFACE ALLOWS QUERY FORMULATION AND RESULT PRESENTATION
33 SEARCH ENGINES
GOOGLE
- BORN AS A RESEARCH PRODUCT AT STANFORD
- IT USES AN INDEX WITH MORE THAN 10^9 PAGES
- ITS SPIDER ADDS ABOUT 10^6 PAGES/DAY
- IT MANAGES 200 MILLION SEARCHES/DAY
- SEARCH RESULTS ARE EVALUATED BY MEANS OF PageRank TECHNOLOGY
  - RELEVANCE IS COMPUTED BY MEANS OF MATHEMATICAL FORMULAS WITH 500*10^6 VARIABLES AND 2*10^9 TERMS
  - IT TAKES INTO ACCOUNT BOTH PAGE CONTENT AND REFERENCES MADE FROM OTHER PAGES, CLASSIFIED BY RELEVANCE
  - IT TRIES TO AVOID USER INTERFERENCE IN RANKING
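The idea of ranking a page by the references made to it from other pages can be illustrated with the simplified, publicly described form of PageRank computed by power iteration. This is only a sketch and not Google's production ranking: the damping factor 0.85, the iteration count, the treatment of pages without outgoing links and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it references.

    Every referenced page must also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page shares its rank equally among the pages it references.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is referenced by three other pages.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
print(best_page, ranks)
```

Page "c" ends up with the highest rank because it collects references from "a", "b" and "d", while "d", which no page references, keeps only the minimum share.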
More informationSemantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
More informationTo search and summarize on Internet with Human Language Technology
To search and summarize on Internet with Human Language Technology Hercules DALIANIS Department of Computer and System Sciences KTH and Stockholm University, Forum 100, 164 40 Kista, Sweden Email:hercules@kth.se
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Skiing Seminar Information Retrieval 2010/2011 Introduction to Information Retrieval Prof. Ulrich Müller-Funk, MScIS Andreas Baumgart and Kay Hildebrand Agenda 1 Boolean
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationComponent ranking and Automatic Query Refinement for XML Retrieval
Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationSession 10: Information Retrieval
INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!
More informationWord Indexing Versus Conceptual Indexing in Medical Image Retrieval
Word Indexing Versus Conceptual Indexing in Medical Image Retrieval (ReDCAD participation at ImageCLEF Medical Image Retrieval 2012) Karim Gasmi, Mouna Torjmen-Khemakhem, and Maher Ben Jemaa Research unit
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationInformation Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer
More informationInformation Retrieval
Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University
More informationOutline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.
Outline Lecture 3: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University February 5, 2013 A. Ardö, EIT Lecture 3: EITN01 Web Intelligence
More informationTowards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento
Towards Understanding Latent Semantic Indexing Bin Cheng Supervisor: Dr. Eleni Stroulia Second Reader: Dr. Mario Nascimento 0 TABLE OF CONTENTS ABSTRACT...3 1 INTRODUCTION...4 2 RELATED WORKS...6 2.1 TRADITIONAL
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,
More informationSec. 8.7 RESULTS PRESENTATION
Sec. 8.7 RESULTS PRESENTATION 1 Sec. 8.7 Result Summaries Having ranked the documents matching a query, we wish to present a results list Most commonly, a list of the document titles plus a short summary,
More informationOracle Database: SQL and PL/SQL Fundamentals Ed 2
Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Database: SQL and PL/SQL Fundamentals Ed 2 Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals
More informationReading group on Ontologies and NLP:
Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.
More informationNew Features in Oracle Data Miner 4.2. The new features in Oracle Data Miner 4.2 include: The new Oracle Data Mining features include:
Oracle Data Miner Release Notes Release 4.2 E64607-03 March 2017 This document provides late-breaking information and information that is not yet part of the formal documentation. This document contains
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationEfficient Implementation of Postings Lists
Efficient Implementation of Postings Lists Inverted Indices Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 2 Skip Pointers J. Pei:
More information