Text Algorithms (4AP): Information Retrieval. Jaak Vilo, 2008 fall. MTAT Text Algorithms.
Materials
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto (ACM Press). New edition in May 2009.
- Google Books: Information Retrieval
- ESSCaSS 08: Ricardo Baeza-Yates and Filippo Menczer
Given a set of documents, find those relevant to topic X.
- The user formulates a query; documents are returned and retrieved by the user.
- Looking at the first 100 results: how many are relevant to the topic, and how many of all relevant documents fit in the first 100?
- Given an interesting document (just one?), how to find similar ones?
- Which keywords characterise documents similar to other documents?
- How to present the answer to the user?
  - Topic hierarchies
  - Self-organising maps (see WebSom)
  - ...
What was searched for?
car, bus, train, tram, trolleybus
petrol, diesel, wood, hydrogen, natural gas, electricity
Which document should be the most similar?
- Similarity of document and query
- Ranking of documents
- Inflections (declensions/conjugations)
- Ontologies (structure of concepts)
- Importance of the document itself (e.g. PageRank)

Information retrieval (IR)
- Finding relevant information
- From unstructured document database(s)
- Relevance, measures
- Presenting information (UI, relevance)
- Free text queries (Natural Language Processing)
- User feedback
Information Retrieval is the "science of search": the study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms.
History of IR
- Before the 1980s: small text retrieval systems; basic boolean and vector space retrieval models
- 1980s: large document database systems, many run by companies (e.g. Lexis Nexis, Dialog, MEDLINE)
- 1990s: searching FTPable documents on the Internet (e.g. Archie, WAIS); searching the World Wide Web (e.g. Lycos, Yahoo, Altavista)

History cont.
- 2000s:
  - Link analysis for Web search (e.g. Google)
  - Automated information extraction (e.g. Whizbang, Fetch, Burning Glass)
  - Question answering (e.g. TREC Q/A track, Ask Jeeves)
  - Multimedia IR (image, video, audio and music)
  - Cross-language IR (e.g. DARPA Tides)
  - Document summarization
History cont.
- 2000s:
  - Recommender systems (e.g. MovieLens, Pandora, LastFM)
  - Automated text categorization & clustering
  - iTunes: top songs
  - Amazon: "people who bought this also bought"
  - Bloglines: similar blogs
  - Del.icio.us: most popular bookmarks
  - Flickr.com: most viewed pictures
  - NYTimes: most e-mailed articles

IR is the discipline that deals with
- retrieval, representation, storage, organization and access
- of structured, semi-structured and unstructured data (information objects)
- in response to a query (topic statement), which may be
  - structured (e.g. a boolean expression)
  - unstructured (e.g. a sentence, a document)
Concepts
- Information Retrieval: the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.
- Information Filtering: given a large amount of data, return the data that the user wants to see.
- Information Need: what the user really wants to know; a query is an approximation to the information need.
- Query: a string of words that characterizes the information that the user seeks.
- Browsing: a sequence of user interaction tasks that characterizes the information that the user seeks.

IR is the process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query.
Efficiency with respect to:
- algorithms
- query building
- data organization/structure
Data vs. Information Retrieval
Information retrieval:
- Set of keywords (loose semantics)
- Semantics of the information need
- Errors are tolerable
Data retrieval:
- Regular expression (well-defined query)
- Constraints for the objects in the answer set
- A single error results in a failure

Summary of the retrieval task
- Start from the information need
- Compare the information need with the information
- Generate a ranking which reflects relevance
Diagram: user -> query -> IR system -> ranked list of documents, with feedback from the user.
(Lecture 2: Query Languages & Operations, 2ID10: Information Retrieval, Lora Aroyo)
IR introduction
IR research issues
Applications of IR
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantics in IR
8. Multimedia IR
9. Structured Content

- classification and categorization (catalogues)
- systems and languages (NL-based systems)
- user interfaces and visualization

The Web phenomenon
- universal repository of knowledge
- free (low cost) universal access
- no central editorial board
- IR is the key to finding the solutions
Logical View of Documents
- Document representation continuum: from full text to index terms, through intermediate representations (transformations)
- Text operations reduce the complexity of documents: accents and spacing, stop-words, noun groups, stemming, automatic or manual indexing
- Input is text + structure; the structure can be kept alongside the text
(Lecture 1: Introduction, 2ID10: Information Retrieval)

The Retrieval Process
Figure (diagram not preserved). Labelled components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted file), DB Manager Module, Text Database.
- The user specifies the user need through the User Interface.
- Text Operations define the logical view of the documents and of the query.
- Query Operations generate the query.
- Indexing builds the Index (an inverted file) over the Text Database, through the DB Manager Module.
- Searching returns the retrieved docs; Ranking orders them into ranked docs.
- User feedback changes the query.
Inverted index: document level
T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Q: "what", "is", "it": {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}

Inverted index: word level
T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
Q: "what is it" (phrase): combine
{(0, 2), (1, 0)}
{(0, 1), (0, 4), (1, 1), (2, 1)}
{(0, 0), (0, 3), (1, 2), (2, 0)}
All three words occur in T0 and T1, but at consecutive positions only in T1: (1, 0), (1, 1), (1, 2).
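The two index levels can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: it assumes whitespace tokenisation and T1 = "what is it" (consistent with the postings listed on the slide), and the function names are hypothetical.

```python
from collections import defaultdict

def build_document_index(docs):
    """Document-level index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return index

def build_word_index(docs):
    """Word-level index: word -> list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index

def and_query(index, words):
    """Documents containing every query word: intersect the posting sets."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

def phrase_query(index, words):
    """Documents in which the query words occur at consecutive positions."""
    if not words:
        return set()
    hits = set()
    for doc_id, pos in index.get(words[0], []):
        if all((doc_id, pos + k) in index.get(w, [])
               for k, w in enumerate(words[1:], start=1)):
            hits.add(doc_id)
    return hits

docs = ["it is what it is", "what is it", "it is a banana"]
doc_index = build_document_index(docs)
word_index = build_word_index(docs)

print(and_query(doc_index, ["what", "is", "it"]))      # {0, 1}
print(phrase_query(word_index, ["what", "is", "it"]))  # {1}
```

The document-level index answers "which documents contain all these words", while the word-level index is what makes the phrase query possible, at the cost of longer posting lists.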
The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. A forward index, which stores the list of words per document, is built first; it is then inverted to produce the inverted index. Answering the query from the forward index alone would require iterating sequentially through every document and every word in it to check for a match. The time, memory, and processing resources needed for such a query are not always technically realistic. Instead of listing the words per document, as the forward index does, the inverted index lists the documents per word.
Measures
Precision is the fraction of the documents retrieved that are relevant to the user's information need.
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

Measures: Precision & Recall

                         Retrieved   Not retrieved   Total
User need (relevant)     TP          FN              Relevant
Not needed (irrelevant)  FP          TN              Irrelevant
Total                    TP+FP       TN+FN

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

$\text{Precision} = \frac{TP}{TP + FP} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$

$\text{Recall} = \frac{TP}{TP + FN} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$
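As a small worked example, precision and recall can be computed directly from the set of retrieved document ids and the set of relevant ones. The function name and the sample ids are made up for illustration; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    true_positives = len(retrieved & relevant)  # relevant AND retrieved
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}  # what the system returned
relevant = {2, 4, 5}      # what the user actually needed
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here TP = 2 (documents 2 and 4), FP = 2 and FN = 1, so precision is 2/4 and recall is 2/3.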
Measures: Precision & Recall

$\text{Precision} = \frac{TP}{TP + FP}$ (also called positive predictive value; not the same as specificity, which is $\frac{TN}{TN + FP}$)

$\text{Recall} = \frac{TP}{TP + FN}$ (also called sensitivity)

Measure: F-measure
The weighted harmonic mean of precision and recall. The traditional F-measure, or balanced F-score, is:

$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Two other commonly used variants are the $F_2$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.
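The balanced score and the F2 and F0.5 variants are all instances of the standard generalisation $F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$, which the slide does not write out. A minimal sketch, with a hypothetical function name:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1 gives the balanced F-score, beta = 2 favours recall,
    beta = 0.5 favours precision.
    """
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 1.0
print(f_measure(p, r))            # balanced F
print(f_measure(p, r, beta=2))    # closer to recall
print(f_measure(p, r, beta=0.5))  # closer to precision
```

With recall higher than precision, as in this example, the scores order as F2 > F1 > F0.5.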
ROC: Receiver Operating Characteristic
AUC: Area Under Curve
Figure (plot not preserved): ROC curves of 3 systems compared; axes TP (relevant) and FP (irrelevant).

Vector space model
- Document: a vector of words
  - A sparse vector over all possible words
- Similarity between query and document:
  - Scalar product
  - The angle between the two vectors
Scalar product
Query Q is a document with perhaps just a single word.
Similarity of query and document:

$M(Q, D_i) = Q \cdot D_i$, where $X \cdot Y = \sum_i x_i y_i$

Weighted version
- The more often a word occurs, the more relevant it is
- Same word vectors, but count occurrences:

$M(X, Y) = \sum_i w_{q,i} \, w_{d,i}$

- The weight w is different for a word in each document
- Extension: add weight for a word in a "more important" context
- Can you add term weights on query words?
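Both similarity measures named for the vector space model, the scalar product and the angle between the vectors, are easy to compute on sparse vectors. The sketch below stores each vector as a dictionary from word to weight, which is an implementation choice made here for illustration; the weights are simple occurrence counts.

```python
import math

def dot(x, y):
    """Scalar product of two sparse vectors stored as {word: weight} dicts."""
    return sum(weight * y[word] for word, weight in x.items() if word in y)

def cosine(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is empty."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / norm if norm else 0.0

query = {"banana": 1}
doc = {"it": 1, "is": 1, "a": 1, "banana": 1}  # "it is a banana"
print(dot(query, doc))     # 1
print(cosine(query, doc))  # 0.5
```

The cosine divides the scalar product by the vector lengths, so a long document does not score higher just because it contains more words.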
Limitations of vector space
- Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
- Search keywords must precisely match document terms; word substrings might result in a "false positive match".
- Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
- The order in which the terms appear in the document is lost in the vector space representation.

Term Weight Calculation
- Quantification of intra-document (intra-cluster) contents (similarity) = tf factor
  - the term frequency within a document
  - how well a term describes a document
- Quantification of inter-document (inter-cluster) separation (dissimilarity) = idf factor
  - the inverse document frequency
  - based on the frequency of the term in the docs of the collection

$w_{ij} = tf(i,j) \cdot idf(i)$
TF and IDF Factors
Let
- $N$ be the total number of docs in the collection
- $n_i$ be the number of docs which contain $k_i$
- $freq(i,j)$ be the raw frequency of $k_i$ within $d_j$

A normalized frequency (tf factor) is given by

$f(i,j) = \frac{freq(i,j)}{\max_l freq(l,j)}$

where the max is computed over all terms occurring in doc $d_j$.

The idf factor is computed as

$idf(i) = \log \frac{N}{n_i}$

The log makes the values of tf and idf comparable; it can also be read as the amount of information associated with term $k_i$.

The vector model with tf-idf weights
- is a good ranking strategy in general collections
- is simple and fast to compute
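The two factors can be computed as follows. This is a minimal sketch under stated assumptions: documents are already tokenised into word lists, the natural logarithm is used (the slide does not fix the base), and the function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents that contain term k_i
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        freq = Counter(tokens)                 # freq(i, j)
        max_freq = max(freq.values(), default=1)
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights

docs = [text.split() for text in
        ["it is what it is", "what is it", "it is a banana"]]
weights = tf_idf(docs)
print(weights[2])
```

In this toy collection "it" and "is" occur in every document, so their idf is log(3/3) = 0 and they contribute nothing to the ranking, while "banana" (one document out of three) gets the weight log 3.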
Pros & Cons
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
Disadvantages:
- assumes independence of index terms; it is not clear whether this is bad, though

Ontology
Ontology: a conceptualisation of things.
An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.
Example: vehicles: watercraft, cars, aeroplanes; seaplane.
Ontology-driven search
- Query => map to an ontology
- Use the ontology to guide what you really want
- Map documents to the same ontology
- Fetch the documents most relevant to the term, ontology, etc.
Example: GoPubMed
Importance of a document
Can we say that some documents are a priori more important than others?
- Type of a document (law, news, chat, ...)
- Good source
- Relevant (often cited, popular)

What is a Markov Chain?
A Markov chain has two components:
1) A network structure much like a web site, where each node is called a state. So the complete web is the set of all possible states.
2) A transition probability of traversing a link given that the chain is in a state. For each state the sum of outgoing probabilities is one.
A sequence of steps through the chain is called a random walk.
The Random Surfer
- Assume the web is a Markov chain.
- Surfers randomly click on links, where the probability of following an outlink from page A is 1/m, where m is the number of outlinks from A.
- The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
- Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

Dangling Pages
Figure (graph not preserved): pages A, B and C.
Problem: A and B have no outlinks.
Solution: assume A and B have links to all web pages with equal probability.
Rank Sink
Problem: pages in a loop accumulate rank but do not distribute it.
Solution: teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

PageRank (PR) Definition

$PR(P) = \frac{d}{N} + (1 - d) \left( \frac{PR(P_1)}{O(P_1)} + \frac{PR(P_2)}{O(P_2)} + \dots + \frac{PR(P_n)}{O(P_n)} \right)$

- $P$ is a web page
- $P_i$ are the web pages that have a link to $P$
- $O(P_i)$ is the number of outlinks from $P_i$
- $d$ is the teleportation probability
- $N$ is the size of the web
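The definition can be turned into a simple fixed-point iteration. The sketch below is one possible reading, not the method used by any particular search engine: it keeps the slide's convention that d is the teleportation probability, treats a dangling page as if it linked to every page with equal probability, and runs a fixed number of iterations instead of testing convergence. The function name and the three-page graph are hypothetical, and every link target is assumed to appear as a key of the dictionary.

```python
def pagerank(outlinks, d=0.15, iterations=100):
    """outlinks: {page: list of pages it links to}. Returns {page: PR}."""
    pages = list(outlinks)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        new = {p: d / n for p in pages}  # teleportation term d/N
        for q in pages:
            # A dangling page is treated as linking to all pages.
            targets = outlinks[q] or pages
            share = (1 - d) * pr[q] / len(targets)  # (1-d) * PR(q) / O(q)
            for p in targets:
                new[p] += share
        pr = new
    return pr

# Hypothetical graph: C links to A and B, which have no outlinks.
ranks = pagerank({"A": [], "B": [], "C": ["A", "B"]})
print(ranks)
```

Because each page hands out exactly its own rank, the values keep summing to one, so they can be read as the probabilities of finding the random surfer on each page.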
Example Web Graph
Figure (graph not preserved): pages A, B and C.

Iteratively Computing PageRank
- Replace d/N in the definition of PR(P) by d, so PR will take values between 1 and N.
- d is normally set to 0.15, but for simplicity let's set it to 0.5.
- Set the initial PR values to 1.
- Solve the following equations iteratively:

$PR(A) = 0.5 + 0.5 \, PR(C)$

$PR(B) = 0.5 + 0.5 \, (PR(A) / 2)$

$PR(C) = 0.5 + 0.5 \, (PR(A) / 2 + PR(B))$
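The iteration for this example can be sketched directly. The code assumes the example graph has the links A -> B, A -> C, B -> C and C -> A, which is what the terms PR(C), PR(A)/2 and PR(A)/2 + PR(B) in the three equations imply, and it uses d = 0.5 with d/N replaced by d as stated on the slide. The iteration count of 50 is an arbitrary choice.

```python
d = 0.5                     # teleportation probability used on the slide
pr_a = pr_b = pr_c = 1.0    # initial PR values

for _ in range(50):
    # The right-hand side uses the values from the previous iteration.
    pr_a, pr_b, pr_c = (
        d + (1 - d) * pr_c,               # only C links to A
        d + (1 - d) * (pr_a / 2),         # A splits its rank between B and C
        d + (1 - d) * (pr_a / 2 + pr_b),  # C gets half of A and all of B
    )

print(round(pr_a, 4), round(pr_b, 4), round(pr_c, 4))
```

The values settle at PR(A) = 14/13, PR(B) = 10/13 and PR(C) = 15/13, which can be checked by substituting them back into the three equations; they sum to N = 3.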
Example Computation of PR
Table (values not preserved): PR(A), PR(B) and PR(C) per iteration.

Large Matrix Computation
- Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.
- The matrix is sparse, as the average number of outlinks is between 7 and 8.
- Setting d = 0.15 or above requires at most 100 iterations to reach convergence.
- Researchers are still trying to speed up the computation.
PageRank - Motivation
- A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.
- The number of incoming links to a page is a measure of the importance and authority of the page.
- Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
"In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at"
Personalized PageRank
- Teleportation to a set of pages defining the preferences of a particular user
Topic-sensitive PageRank [Haveliwala 02]
- Teleportation to a set of pages defining a particular topic
TrustRank [Gyöngyi 04]
- Teleportation to trustworthy pages
Many papers on analyzing PageRank and on numerical methods for efficient computation
Future? Or current?
- Recommendations (tagging)
- Common behaviour (news/epidemics spread)
- Social networks
- Focus
- Generalisation
- Rich get richer; Googlearchy?
- Your contribution?
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationPAGE RANK ON MAP- REDUCE PARADIGM
PAGE RANK ON MAP- REDUCE PARADIGM Group 24 Nagaraju Y Thulasi Ram Naidu P Dhanush Chalasani Agenda Page Rank - introduction An example Page Rank in Map-reduce framework Dataset Description Work flow Modules.
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationInformation Retrieval
Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationUnit VIII. Chapter 9. Link Analysis
Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2
More informationLecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods
Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur
More informationExam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
More informationInforma/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields
Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,
More informationArchitecture and Implementation of Database Systems (Summer 2018)
Jens Teubner Architecture & Implementation of DBMS Summer 2018 1 Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2018 Jens
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,
More informationPagerank Scoring. Imagine a browser doing a random walk on web pages:
Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably
More informationCOMP5331: Knowledge Discovery and Data Mining
COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationRoadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases
Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationHow to organize the Web?
How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper
More informationInformation Networks: PageRank
Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the
More informationLearning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search
1 / 33 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Bernd Wittefeld Supervisor Markus Löckelt 20. July 2012 2 / 33 Teaser - Google Web History http://www.google.com/history
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More information1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a
!"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationInformation retrieval
Information retrieval Lecture 8 Special thanks to Andrei Broder, IBM Krishna Bharat, Google for sharing some of the slides to follow. Top Online Activities (Jupiter Communications, 2000) Email 96% Web
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More informationWeb Structure Mining using Link Analysis Algorithms
Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.
More informationOverview of Web Mining Techniques and its Application towards Web
Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous
More informationCS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University
CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When
More information5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval
Acknowledgement Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014 Contents of lectures, projects are extracted
More information