Information Retrieval and Search Engine

Size: px
Start display at page:

Download "Information Retrieval and Search Engine"

Transcription

1 Information Retrieval and Search Engine ดร. ชชาต หฤไชยะศกด Choochart Haruechaiyasak, Ph.D. งานว+จ-ยและพ-ฒนาเทคโนโลย8สารสนเทศ Research and Development on Information (RDI) ศFนยGเทคโนโลย8อ+เลIกทรอน+กสGและคอมพ+วเตอรGแหNงชาต+ (เนคเทค) National Electronics and Computer Technology Center (NECTEC)

2 Lecture Outline Information Retrieval (IR) History and overview System functionalities: Indexing: Inverted file Retrieval: Vector space model Search Engine (SE) Web Crawler Link analysis and search result ranking Open-Source IR Platform Lucene: Open-source IR library Sansarn Look!: A system for developing search engines and IR systems for Thai language

3 Information Retrieval (IR)

4 What is Information Retrieval? An information retrieval system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of the system is to capture wanted items and to filter out unwanted items. Information retrieval deals with the representation, storage, and access to documents or representatives of documents.

5 Data Retrieval VS. Information Retrieval

6 History s: Initial exploration of text retrieval systems for small corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area s: Large document database systems, many run by companies: Lexis-Nexis / MEDLINE

7 History (cont'd) 1990 s: Searching FTPable documents on the Internet Archie / WAIS Searching the World Wide Web Lycos / Yahoo / Altavista Organized Competitions NIST TREC Recommender Systems Amazon.com / CDnow.com / imdb.com Automated Text Categorization & Clustering

8 History (cont'd) 2000 s: Link analysis for Web Search Google Automated Information Extraction Whizbang / Fetch / Burning Glass Question Answering TREC Q/A track Multimedia IR Image / Video / Audio and music Cross-Language IR Document Summarization

9 Related Research Areas Database Management Library and Information Science Artificial Intelligence (AI) Natural Language Processing (NLP) Machine Learning (ML)

10 Database Management Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.

11 Library and Information Science Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.

12 Artificial Intelligence (AI) Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: First-order Predicate Logic Bayesian Networks Recent work on web ontologies and intelligent information agents brings it closer to IR.

13 Natural Language Processing (NLP) Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

14 NLP: IR Directions Methods for determining the sense of an ambiguous word based on context (Word-Sense Disambiguation) Methods for identifying specific pieces of information in a document (Information Extraction) Methods for answering specific natural-language questions from document corpora. (Question & Answering)

15 Machine Learning (ML) Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

16 ML : IR Directions Text Categorization Automatic hierarchical classification (Yahoo). Adaptive filtering/routing/recommending. Automated spam filtering. Text Clustering Clustering of IR query results. Automatic formation of hierarchies (Yahoo). Learning for Information Extraction Text Mining

17 Overview of IR IR is concerned with indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent killer application. Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

18 Relevance Relevance is a subjective judgement and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

19 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

20 Problems with Keywords May not retrieve relevant documents that include synonymous terms. restaurant vs. café PRC vs. China May retrieve irrelevant documents that include ambiguous terms. bat (baseball vs. mammal) Apple (company vs. fruit) bit (unit of data vs. act of eating)

21 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

22 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

23 IR Functional Components Two IR main functions: (1) Indexing: (system perspective) - Text processing - Index construction (2) Retrieval: (user perspective) - User interface - Query processing - Searching from index (index lookup) - Search result ranking

24 IR System: (1) Indexing

25 IR System: (2) Retrieval

26 Text Processing Strip unwanted characters/markup e.g. HTML tags, punctuation, numbers, etc. Break into tokens (keywords) on whitespace Stem tokens to root words (Porter stemming) computational -> comput Remove common stopwords (e.g. a, the, it, etc.). Detect common phrases and named entities possibly by using a domain specific dictionary

27 Indexing Issues Index is a data structure constructed from the text to speed up the search. Indexing issues are: How to structure the index How to create the index (storage, time) How to store the index (storage, compression) How to process the index (storage, time) How to update the index (storage, time) The most critical issue is logarithmic or constant-time access to token information. Less important issue is the storage space.

28 Indexing: Inverted File Inverted file structure is composed of two elements: Vocabulary: A set of all different words or terms in the text Occurrence: A list of all documents including text positions where the word appears The word-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (such as trie, B-tree). Inverted file is the most widely adopted indexing approach for implementing IR systems and search engines.

29 Indexing: Inverted File

30 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion of relevance can be either Binary: yes or no Continuous: sorted by score

31 Classes of Retrieval Models Boolean models (set theoretic) Extended Boolean Vector space models (statistical/algebraic) Generalized Latent Semantic Indexing Probabilistic models Bayesian networks Inference network model Belief network model

32 Retrieval: Boolean Model A document is represented as a set of words. Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope. ((Pattaya OR Phuket) AND Thailand AND hotel AND NOT Hilton Output: Document is relevant or not. No partial matches or ranking.

33 Retrieval: Boolean Model Popular retrieval model because: Easy to understand for simple queries. Clean formalism. Boolean models can be extended to include ranking. (Extended Boolean model) Reasonably efficient implementations possible for normal queries.

34 Retrieval: Boolean Model - Problems Very rigid: AND means all; OR means any. Difficult to express complex user requests. Difficult to control the number of documents retrieved. All matched documents will be returned. Difficult to rank output. All matched documents logically satisfy the query. Difficult to perform relevance feedback. If a document is identified by the user as relevant or irrelevant, how should the query be modified?

35 Retrieval: Vector Space Model (VSM) Statistical-based model A document is typically represented by a bag of words (unordered words with frequencies). Note: Bag = set that allows multiple occurrences of the same element. User specifies a set of desired terms with optional weights: Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 > Unweighted query terms: Q = < database; text; information > No Boolean conditions specified in the query

36 Retrieval: Vector Space Model (VSM) Retrieval based on similarity between query and documents. Output documents are ranked according to similarity to query. Similarity based on occurrence frequencies of keywords in query and document. Automatic relevance feedback can be supported: Relevant documents added to query. Irrelevant documents subtracted from query.

37 Retrieval: Vector Space Model (VSM) How to determine important words in a document? Word sense? Word n-grams (and phrases, idioms, ) -> terms How to determine the degree of importance of a term within a document and within the entire collection? How to determine the degree of similarity between a document and the query? In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

38 Retrieval: Vector Space Model (VSM) Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. These orthogonal terms form a vector space. Dimension = t = vocabulary Each term, i, in a document or query, j, is given a real-valued weight, w ij Both documents and queries are expressed as t-dimensional vectors: d j = (w 1j, w 2j,, w tj )

39 Retrieval: Vector Space Model (VSM) Graphic Representation Example: D 1 = 2T 1 + 3T 2 + 5T 3 T 3 D 2 = 3T 1 + 7T 2 + T 3 Q = 0T 1 + 0T 2 + 2T 3 5 D 1 = 2T 1 + 3T 2 + 5T 3 Q = 0T 1 + 0T 2 + 2T T 1 D 2 = 3T 1 + 7T 2 + T 3 T 2 7 Is D 1 or D 2 more similar to Q? How to measure the degree of similarity? Distance? Angle? Projection?

40 Retrieval: VSM document collection A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the weight of a term in the document; zero means the term has no significance in the document or it simply doesn t exist in the document. T 1 T 2. D 1 w 11 w 21 T t w t1 D 2 w 12 w 22 w t2 : : : : : : : : D n w 1n w 2n w tn

41 Retrieval: VSM Term-frequency More frequent terms in a document are more important, i.e. more indicative of the topic. f ij = frequency of term i in document j May want to normalize term frequency (tf) across the entire corpus: tf ij = f ij / max{f ij }

42 Retrieval: VSM Inverse Document Frequency Terms that appear in many different documents are less indicative of overall topic. df i = document frequency of term i = number of documents containing term i idf i = inverse document frequency of term i, = log 2 (N/ df i ) (N: total number of documents) An indication of a term s discrimination power. Log used to dampen the effect relative to tf.

43 Retrieval: VSM TF-IDF weighting A typical combined term importance indicator is tfidf weighting: w ij = tf ij x idf i = tf ij x log 2 (N/ df i ) A term occurring frequently in the document but rarely in the rest of the collection is given high weight. Many other ways of determining term weights have been proposed. Experimentally, tf-idf has been found to work well.

44 Retrieval: VSM TF-IDF weighting Given a document containing terms with given frequencies: A(3), B(2), C(1) Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250) Then: A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3 B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3 C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2

45 Retrieval: Similarity Measure A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document: It is possible to rank the retrieved documents in the order of presumed relevance. It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

46 Retrieval: Cosine Similarity Measure Cosine similarity measures the cosine of the angle between two vectors. CosSim(d j, q) = dj dj q q = i= 1 t t ( w w ij ij 2 w i= 1 i= 1 t iq ) w iq 2 D 1 θ 1 θ 2 t 3 Q t 1 t 2 D 2 D 1 = 2T 1 + 3T 2 + 5T 3 CosSim(D 1, Q) = 10 / (4+9+25)(0+0+4) = 0.81 D 2 = 3T 1 + 7T 2 + 1T 3 CosSim(D 2, Q) = 2 / (9+49+1)(0+0+4) = 0.13 Q = 0T 1 + 0T 2 + 2T 3

47 Retrieval: VSM Drawbacks Missing semantic information (e.g. word sense). Missing syntactic information (e.g. phrase structure, word order, proximity information). Assumption of term independence (e.g. ignores synonomy). Lacks the control of a Boolean model (e.g., requiring a term to appear in a document). Given a two-term query A B, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.

48 Retrieval: VSM Summary Simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections. (scalable)

49 IR System Evaluation Coverage of the text collection: extent to which the system includes relevant material Response time: average interval between the time request is made and the time answer is given Presentation of the output (user interface) Effort involved by user in obtaining answers to request Recall: proportion of relevant documents retrieved Precision: proportion of the retrieved documents that are actually relevant

50 Search Engine (SE)

51 Search Engine VS. Librarian

52 Search Engine : An Overview

53 Web Challenges for IR Distributed Data: Documents spread over millions of different web servers. Volatile Data: Many documents change or disappear rapidly (e.g. dead links). Large Volume: Billions of separate documents. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. Quality of Data: No editorial control, false information, poor quality writing, typos, etc. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

54 What is Web search? Access to heterogeneous, distributed information Heterogeneous in creation Heterogeneous in motives Heterogeneous in accuracy Multi-billion dollar business Source of new opportunities in marketing Strains the boundaries of trademark and intellectual property laws A source of unending technical challenges

55 Search Engine Market Share

56 Search Engine's Components Web crawler (robot /spider) builds corpus Collects web pages recursively Additional pages from direct submissions & other sources The indexer creates inverted indexes Various policies on which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc. Query processor serves query results Front end query reformulation, word stemming, capitalization, optimization of Booleans, etc. Back end finds matching documents and ranks them

57 Web Crawlers Start with a comprehensive set of root URL s from which to start the search. Follow all links on these pages recursively to find additional pages. Index all new found pages in an inverted index as they are encountered. May allow users to directly submit pages to be indexed (and crawled from).

58 Web Crawlers Search Strategies Two possible crawling strategies: Breadth-First Search (BFS) Depth-First Search (DFS) BFS explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). Standard spidering method. DFS requires memory of only depth times branching-factor (linear in depth) but gets lost pursuing a single thread.

59 Web Crawlers Summary Must avoid page duplication - Web is a graph not a tree Must obey page-owner restrictions - Robot exclusion protocol - robots.txt Crawling is a time-consuming process. - Distributed crawlers - Update strategy Many open-source crawlers are avaiable from sourceforge.net

60 Search Result Ranking Most search engines use two components in determining their ranking algorithms: (1) On-Page Factors Domain name, file name/url, image file name, title tag, meta tags, image ALT tag, header/font tags, body/visible text (2) Off-Page Factors Link popularity, link reputation

61 Search Result Ranking: On-page factor Domain name Directory and File name / URL /keyword1-keyword2/ Image file name

62 Search Result Ranking: On-page factor Title Tag The title tag is one of the most important factors in achieving high search engine rankings. <HEAD> <TITLE>keyword phrase1, keyword phrase2... </TITLE> </HEAD> Meta Description Tag <META NAME="Description" CONTENT= "> Meta description tag is a snippet of HTML code that belongs inside the <Head> </Head> section of a Web page.

63 Search Result Ranking: On-page factor Meta Keywords Tag <meta name="keywords" content="keyword phrase1, keyword phrase2, keyword phrase3"> Image Alt Tags (IMG ALT) <img src="../../images/keyword.gif" alt= sentence including keyword that define an image."> Body Copy / Body Text / Visible Text Include keyword phrase at least once in the following tags: headers (H1, H2), bold, italic, text link

64 Search Result Ranking: Off-page factor Link Popularity Number of incoming links to your Web page Link Reputation Anchor text & description Quality of the Web pages in the same category that links to your Web page Authority Sites: Links from highly trusted Web sites such as Yahoo!, Dmoz, etc.

65 Link Analysis: Overview Two well-known link analysis approach: HITS (Hypertext Induced Topic Selection) By Jon Kleinberg PageRank By Brin & Page of Google

66 Link Analysis: HITS Hubs are index pages that provide lots of useful links to relevant content pages. Web portal or directories behave like hub pages for various topics. Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority.

67 Link Analysis: PageRank Just measuring in-degree (citation count or authority of the page) R( p) = q: q p R( q) Nq Nq is the total number of out-links from page q. c A page, q, gives an equal fraction of its authority to all the pages it points to (e.g. p). c, or damping factor, is a normalizing constant set so that the rank of all pages always sums to 1.

68 Link Analysis: Summary Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search. It is the primary reason for Google s success.

69 Open-Source IR Platform

70 Lucene: An Open Source IR Library Lucene is a high performance, scalable Information Retrieval (IR) library. Fact: created by Doug Cutting free, open-source, originally implemented in Java a member of the popular Apache Jakarta projects licensed under the Apache Software License not a complete application also ported to Perl, Python, C++,.NET active user and developer communities widely used, e.g., IBM and Microsoft.

71 Lucene: A Typical Application Integration

72 Lucene: Architecture

73 Lucene: API org.apache.lucene.document A text is organized as a document which represents a collection of fields. org.apache.lucene.analysis Analysis contains classes for performing the text processing. org.apache.lucene.index Index package contains classes responsible for indexing process. org.apache.lucene.search Search package contains classes related to search operations.

74 Lucene: Analyzer Lucene provides classes for various text analyzing approaches: WhitespaceAnalyzer: divides text at whitespace. SimpleAnalyzer: normalizes token text to lower case. StopAnalyzer: normalizes token text to lower case and removes stop words from a token stream. StandardAnalyzer: combine all above tasks, suitable for English texts These classes are designed for English language. To make the analysis applicable to particular language, a language-specific analyzer package must be developed to segment or tokenize texts into a set of words.

75 Lucene: Analyzer Analyzer Demo 1: "The quick brown fox jumped over the lazy dogs" WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs] StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

76 Lucene: Analyzer Analyzer Demo 2: "The XY&Z Corporation: - xyz@example.com" WhitespaceAnalyzer: [The] [XY&Z] [Corporation:] [-] [xyz@example.com] SimpleAnalyzer: [the] [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]

77 Lucene: ThaiAnalyzer ThaiAnalyzer(th) package contains the following classes: LexTo.java : Lexeme Tokenizer (Dictionary-based word segmentation) ParseTree.java : Construct parse tree while tokenizing texts lexitron.txt : Thai dictionary VTrie.java : Trie data structure for efficient dictionary storage and retrieval ThaiTokenizer.java : calls LexTo for word segmentation ThaiFilter.java : check for special characters ThaiAnalyzer.java : calls ThaiTokenizer, ThaiFilter, StandardFilter, LowerCaseFilter and StopFilter, respectively.

78 Sansarn Look! : A system platform for developing search engines and IR systems for Thai language

79 Sansarn Look!: Architecture

80 Sansarn-Lucene API

81 Sansarn Look! : Key Features Built on a widely used open-source IR library Lucene Support full-text and field-text search Support local files and Web pages Support multiple file formats Support English-Thai texts User-friendly via Browser Interface Configurability Additional features on Thai languages: Query suggestion Query spell-check

82 References 1) R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison Wesley, ) W. B. Frakes and R. Baeza-Yates, eds., Information Retrieval: Data Structures & Algorithms, Prentice Hall, ) E. Hatcher and O. Gospo, Lucene in Action, Manning Publications, ) Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, 46(5), ) L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 7th Int. World Wide Web Conference, pp , ) The Lucene web site: 7) Sansarn Look! :

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Information Retrieval on the Internet (Volume III, Part 3, 213)

Information Retrieval on the Internet (Volume III, Part 3, 213) Information Retrieval on the Internet (Volume III, Part 3, 213) Diana Inkpen, Ph.D., University of Toronto Assistant Professor, University of Ottawa, 800 King Edward, Ottawa, ON, Canada, K1N 6N5 Tel. 1-613-562-5800

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users

More information

Natural Language Processing Basics. Yingyu Liang University of Wisconsin-Madison

Natural Language Processing Basics. Yingyu Liang University of Wisconsin-Madison Natural Language Processing Basics Yingyu Liang University of Wisconsin-Madison Natural language Processing (NLP) The processing of the human languages by computers One of the oldest AI tasks One of the

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Unstructured Data. CS102 Winter 2019

Unstructured Data. CS102 Winter 2019 Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Web Search. Web Spidering. Introduction

Web Search. Web Spidering. Introduction Web Search. Web Spidering Introduction 1 Outline Information Retrieval applied on the Web The Web the largest collection of documents available today Still, a collection Should be able to apply traditional

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

A short introduction to the development and evaluation of Indexing systems

A short introduction to the development and evaluation of Indexing systems A short introduction to the development and evaluation of Indexing systems Danilo Croce croce@info.uniroma2.it Master of Big Data in Business SMARS LAB 3 June 2016 Outline An introduction to Lucene Main

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

The Topic Specific Search Engine

The Topic Specific Search Engine The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)

More information

Information Retrieval

Information Retrieval Introduction Information Retrieval Information retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information Gerard Salton, 1968 J. Pei: Information

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,

More information

An Introduction to Search Engines and Web Navigation

An Introduction to Search Engines and Web Navigation An Introduction to Search Engines and Web Navigation MARK LEVENE ADDISON-WESLEY Ал imprint of Pearson Education Harlow, England London New York Boston San Francisco Toronto Sydney Tokyo Singapore Hong

More information

Building Search Applications

Building Search Applications Building Search Applications Lucene, LingPipe, and Gate Manu Konchady Mustru Publishing, Oakton, Virginia. Contents Preface ix 1 Information Overload 1 1.1 Information Sources 3 1.2 Information Management

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

How Does a Search Engine Work? Part 1

How Does a Search Engine Work? Part 1 How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Word Disambiguation in Web Search

Word Disambiguation in Web Search Word Disambiguation in Web Search Rekha Jain Computer Science, Banasthali University, Rajasthan, India Email: rekha_leo2003@rediffmail.com G.N. Purohit Computer Science, Banasthali University, Rajasthan,

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.

More information

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

A Survey on Web Information Retrieval Technologies

A Survey on Web Information Retrieval Technologies A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University Overview Web Information

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Session 10: Information Retrieval

Session 10: Information Retrieval INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1.

Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1. Outline Lecture 2: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University January 23, 2013 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence

More information

Approaches to Mining the Web

Approaches to Mining the Web Approaches to Mining the Web Olfa Nasraoui University of Louisville Web Mining: Mining Web Data (3 Types) Structure Mining: extracting info from topology of the Web (links among pages) Hubs: pages pointing

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Chapter 3 - Text. Management and Retrieval

Chapter 3 - Text. Management and Retrieval Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Text Management and Retrieval Literature: Baeza-Yates, R.;

More information

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval Overview of Information Retrieval and Organization CSC 575 Intelligent Information Retrieval 2 How much information? Google: ~100 PB a day; 1+ million servers (est. 15-20 Exabytes stored) Wayback Machine

More information

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO INDEX Proposal Recap Implementation Evaluation Future Works Proposal Recap Keyword Visualizer (chrome

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria Open Source Search Andreas Pesenhofer max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria max.recall information systems max.recall is a software and consulting company enabling

More information