Introduction to Information Retrieval


1 Introduction to Information Retrieval WS 2008/ Information Systems Group Mohammed AbuJarour

2 Contents
- Basics of Information Retrieval (IR)
- Foundations:
  - Extensible Markup Language (XML)
  - Probability & Statistics

3 What is IR?
IR by examples:
- Credit card example.
- Searching for a book or paper in a library system.
- Using Google to find a restaurant in Potsdam.
- Looking through a product catalog to find an item.
- Browsing a movie catalog to find an interesting movie.

4 What is IR?
IR by definition (scientifically):
- Representation of information items
- Storage of information items
- Organization of information items
- Access to information items
Characterizing the user's information need is not a simple problem.
Example: Find all web pages of researchers who studied in Germany, participated in at least 3 EU-funded projects, and have been doing research in IR for more than 10 years!
This description cannot be used directly to retrieve what the user needs. It is first translated into a query, typically a set of keywords.
IR system (query) -> relevant information.

5 Data vs. Information Retrieval
Data Retrieval:
- Deals with data that has well-defined structure and semantics.
- Queries are clearly defined conditions, such as regular expressions or relational algebra.
- Each element in the result must satisfy the conditions in the query.
- Example:
    Name     Grade  Major
    Michael  B      Physics
    Martin   C      Mathematics
    John     A      Bioinformatics
Information Retrieval:
- Deals with natural language text that is usually not well structured and could be semantically ambiguous.
- Queries are a set of keywords or terms.
- An element in the result might be inaccurate; a small number of errors is tolerated.

6 Unstructured (text) vs. Structured (database) Data in 1996 and
[Figure: comparison of unstructured and structured data; values not preserved]

7 The World Wide Web
- Huge amount of information.
- Unusual and diverse documents, e.g., HTML, XHTML, XML, multimedia, etc.
- Unusual and diverse users, queries, and information needs.
[Figure: number of webpages, size in billions of webpages. Legend: GYWA = sorted on Google, Yahoo!, Windows Live Search (MSN Search), and Ask; YGWA = sorted on Yahoo!, Google, Windows Live Search (MSN Search), and Ask.]

8 The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows between them: user need, user feedback, text, logical view, query, retrieved docs, ranked docs.]
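
The indexing, searching, and ranking stages named in this slide can be illustrated with a toy inverted index. This is only a minimal sketch, not the system the slides describe: the sample documents, the whitespace tokenizer, and the helper names tokenize, build_inverted_index, and search are assumptions made for illustration, and the ranking simply counts matching query terms.

```python
from collections import defaultdict

def tokenize(text):
    """Text operations (assumed): lowercase and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Indexing: map each term to the set of DocIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching and ranking: score each document by the number of
    distinct query terms it contains, highest score first."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection keyed by DocID.
docs = {
    1: "restaurant guide for Potsdam",
    2: "movie catalog with interesting movies",
    3: "Potsdam library book search",
}
index = build_inverted_index(docs)
print(search(index, "Potsdam restaurant"))  # [1, 3]
```

Document 1 matches both query terms and document 3 matches one, so document 1 is ranked first. Real systems replace the term-count score with a proper ranking model.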

9 Basic Concepts in IR
- Documents: whatever units we have decided to build a retrieval system over, e.g., web page, XML file, PDF file, article, paper, book chapter, product, etc.
- Collection (corpus): the group of documents over which we perform retrieval.
- Information need: the topic about which the user desires to know more.
- Query: what the user conveys to the computer (system) in an attempt to communicate the information need.
- Relevance: the degree to which the user perceives a document as containing information of value with respect to their personal information need.
- DocID: the unique serial number for each document in the collection.

10 Basic Concepts in IR
The effectiveness of an IR system is the quality of its search results.
- Precision (Präzision): the fraction of the returned results that are relevant to the information need.
- Recall (Ausbeute): the fraction of the relevant documents in the collection that were returned by the system.
[Figure: the Collection with the overlapping sets Relevant and Retrieved; their intersection is the set of relevant retrieved documents.]
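
The two definitions translate directly into set arithmetic: precision is |Relevant ∩ Retrieved| / |Retrieved| and recall is |Relevant ∩ Retrieved| / |Relevant|. The sketch below assumes documents are identified by DocIDs; the function name precision_recall and the sample DocIDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of DocIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical DocIDs: 4 returned, 5 relevant in the collection, 3 in both.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 4, 8, 9})
print(p, r)  # 0.75 0.6
```

Returning 0.0 for an empty result set or an empty relevant set is a convention chosen here to avoid division by zero, not something the slides specify.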



18 Extensible Markup Language (XML)
Overview of XML:
- XML was designed to carry and transfer data, not to display it.
- XML tags are not predefined; you must define your own tags.
- XML is designed to be self-descriptive.
- XML documents may conform to well-defined schemata, e.g., DTD, XSD.
- XML is a W3C Recommendation.
Example: [XML listing not preserved in the transcript]

19 Extensible Markup Language (XML): Elements and Relationships
[Figure: XML elements and their relationships; not preserved in the transcript]


23 Extensible Markup Language (XML)
Modeling XML documents:
- XML documents are modeled as trees.
- Tree traversal algorithms: preorder, inorder, postorder.
- Pre + Post: the pair of preorder and postorder ranks uniquely identifies a node.
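
The pre/post numbering can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch only: the sample document and the helper names pre_post_numbers and is_ancestor are assumptions, not part of the slides. Besides identifying nodes, the two ranks also allow an ancestor test without walking the tree.

```python
import xml.etree.ElementTree as ET

def pre_post_numbers(root):
    """Assign each element its preorder and postorder rank (starting at 0)."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre[node] = counters["pre"]    # numbered before its children
        counters["pre"] += 1
        for child in node:
            visit(child)
        post[node] = counters["post"]  # numbered after its children
        counters["post"] += 1

    visit(root)
    return pre, post

# Hypothetical sample document.
doc = ET.fromstring(
    "<bookstore><book><title>A</title><price>30</price></book>"
    "<book><title>B</title></book></bookstore>"
)
pre, post = pre_post_numbers(doc)

def is_ancestor(a, d):
    """a is an ancestor of d iff a is earlier in preorder and later in postorder."""
    return pre[a] < pre[d] and post[a] > post[d]

for node in doc.iter():  # iter() walks the tree in document (pre) order
    print(node.tag, pre[node], post[node])
```

For this document the (pre, post) pairs are (0, 5) for bookstore, (1, 2) and (4, 4) for the two book elements, and all six pairs are distinct.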


29 Extensible Markup Language (XML)
What is XPath?
- XPath is a language for finding information in an XML document.
- XPath is a syntax for defining parts of an XML document.
- XPath uses path expressions to navigate through elements and attributes and to select nodes or node-sets in an XML document.
- A node is selected by following a path, that is, a sequence of steps.
- XPath contains a library of standard functions.
- XPath is a W3C Recommendation.

30 Extensible Markup Language (XML)
Examples: XPath syntax

/bookstore
    Selects the root element bookstore.
bookstore/book
    Selects all book elements that are children of bookstore.
//book
    Selects all book elements no matter where they are in the document.
bookstore//book
    Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element.
//@lang
    Selects all attributes that are named lang.
/bookstore/book[price>35.00]/title
    Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00.
//*
    Selects all elements in the document.
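
Several of these path expressions can be tried with Python's standard xml.etree.ElementTree, which implements only a subset of XPath. The sketch below makes some assumptions: the sample document is hypothetical, paths are written relative to the root element, attribute nodes cannot be selected directly (so elements carrying the attribute are selected instead), and the numeric price comparison is done in Python because ElementTree predicates do not support it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
XML = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = ET.fromstring(XML)  # root is the bookstore element

books = root.findall("./book")           # like bookstore/book
all_books = root.findall(".//book")      # like //book
with_lang = root.findall(".//*[@lang]")  # elements that have a lang attribute
everything = list(root.iter())           # like //* (includes the root element)

# Counterpart of /bookstore/book[price>35.00]/title, filtered in Python.
expensive = [
    book.find("title").text
    for book in root.findall("./book")
    if float(book.findtext("price")) > 35.00
]
print(expensive)  # ['Learning XML']
```

A full XPath engine would evaluate the original expressions unchanged; the standard library was chosen here only to keep the example dependency-free.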


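37 Extensible Markup Language (XML)
An XPath expression consists of a sequence of one or more location steps:
    //descendant::book[position()=1]/child::title[contains(text(), "XML")]
Parts of the expression:
- Axis name: descendant, child
- Node test: book, title
- Predicate expression: [position()=1], [contains(text(), "XML")]
- Predicates & functions: position(), contains()
- Location steps: descendant::book[position()=1] and child::title[contains(text(), "XML")]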

38 Extensible Markup Language (XML)
What is XQuery?
- XQuery is the language for querying XML data.
- XQuery is to XML what SQL is to databases.
- XQuery is built on XPath expressions.
- XQuery is a W3C Recommendation.
Example:
    for $x in doc("books.xml")/bookstore/book
    where $x/price > 30
    order by $x/title
    return $x/title
Result:
    <title lang="eng">Learning XML</title>
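
To make the for / where / order by / return clauses of the query concrete, the same selection can be written in plain Python. This is an analogue, not XQuery itself: the contents of BOOKS_XML are a hypothetical stand-in for books.xml, which the slides do not show, and the function name titles_over is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml.
BOOKS_XML = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""

def titles_over(xml_text, min_price):
    root = ET.fromstring(xml_text)
    # for $x in .../bookstore/book
    books = root.findall("./book")
    # where $x/price > min_price
    selected = [b for b in books if float(b.findtext("price")) > min_price]
    # order by $x/title
    selected.sort(key=lambda b: b.findtext("title"))
    # return $x/title (serialized back to XML text)
    return [ET.tostring(b.find("title"), encoding="unicode").strip()
            for b in selected]

print(titles_over(BOOKS_XML, 30))
# ['<title lang="eng">Learning XML</title>']
```

With this sample data only one book costs more than 30, which matches the single title in the slide's result.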


40 Basics from Probability Theory
A probability space is a triple $(\Omega, E, P)$ with
- a set $\Omega$ of elementary events (sample space),
- a family $E$ of subsets of $\Omega$ (events) with $\Omega \in E$, which is closed under $\cup$, $\cap$, and complement with a countable number of operands (note: with finite $\Omega$, usually $E = 2^{\Omega}$),
- a probability measure $P: E \to [0,1]$ with $P[\Omega] = 1$ and $P[\bigcup_i A_i] = \sum_i P[A_i]$ for countably many, pairwise disjoint $A_i$.
Properties of $P$:
- $P[A] + P[\neg A] = 1$
- $P[A \cup B] = P[A] + P[B] - P[A \cap B]$
- $P[\emptyset] = 0$ (null / impossible event)
- $P[\Omega] = 1$ (true / certain event)

41 Basics from Probability Theory: Example
Drawing one playing card:
- Elementary events: $\Omega = \{s, h, d, c\}$
- Events: $E = \{\emptyset, S, H, D, C, SH, SD, SC, HD, HC, DC, SHD, SHC, HDC, SDC, SHDC\}$
- $P[S] + P[\neg S] = 1/4 + 3/4 = 1$
- $P[H \cup D] = P[H] + P[D] - P[H \cap D] = 1/4 + 1/4 - 0 = 1/2$
- $P[\emptyset] = 0$ (null / impossible event)
- $P[SHDC] = 1/4 + 1/4 + 1/4 + 1/4 = 1$
Note: $SHDC$ means $S \cup H \cup D \cup C$.
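
The card example can be checked mechanically with exact fractions. The sketch assumes a uniform measure over the four suits, as the values 1/4 in the example imply; the names OMEGA, prob, and events are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset("shdc")  # elementary events: the four suits

def prob(event):
    """Uniform probability measure: |event| / |Omega|."""
    return Fraction(len(set(event) & OMEGA), len(OMEGA))

# With finite Omega the event family E is the power set: 2^4 = 16 events.
events = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(sorted(OMEGA), r)]

S, H, D = frozenset("s"), frozenset("h"), frozenset("d")
print(len(events))                # 16
print(prob(S) + prob(OMEGA - S))  # 1
print(prob(H | D))                # 1/2
```

Here set union `|` plays the role of $\cup$, so `prob(H | D)` is $P[H \cup D]$, and `OMEGA - S` is the complement of S.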

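42 Independence and Conditional Probabilities
Two events $A$, $B$ of a probability space are independent if $P[A \cap B] = P[A] \, P[B]$.
A finite set of events $A = \{A_1, \ldots, A_n\}$ is independent if for every subset $S \subseteq A$ the equation
$P[\bigcap_{A_i \in S} A_i] = \prod_{A_i \in S} P[A_i]$
holds.
The conditional probability $P[A \mid B]$ of $A$ under the condition (hypothesis) $B$ is defined as:
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$, for $P[B] > 0$.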

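43 Total Probability and Bayes' Theorem
Total probability theorem: for a partitioning of $\Omega$ into events $B_1, \ldots, B_n$:
$P[A] = \sum_{i=1}^{n} P[A \mid B_i] \, P[B_i]$
Bayes' theorem:
$P[A \mid B] = \frac{P[B \mid A] \, P[A]}{P[B]}$
$P[A \mid B]$ is called the posterior probability. $P[A]$ is called the prior probability.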

44 Total Probability and Bayes' Theorem: Example
M: a man is chosen. E: the one chosen is employed.

              Male  Female  Total
  Employed     460     140    600
  Unemployed    40     260    300
  Total        500     400    900

- $P[M] = 500/900 = 5/9$
- $P[E] = 600/900 = 2/3$
- $P[M \cap E] = 460/900 = 46/90$
- $P[M \mid E] = P[M \cap E] / P[E] = 460/600 = 23/30$
- $P[E \mid M] = P[M \mid E] \times P[E] / P[M] = (23/30 \times 2/3) / (5/9) = 23/25$
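
The arithmetic of this example can be reproduced with exact fractions. The sketch uses only the three counts quoted in the slide (500 men, 600 employed, 460 employed men, out of 900 people); the variable names are hypothetical. The last lines also check the total probability theorem with the partition into men and women.

```python
from fractions import Fraction

total = 900
p_m = Fraction(500, total)        # P[M]
p_e = Fraction(600, total)        # P[E]
p_m_and_e = Fraction(460, total)  # P[M and E]

p_m_given_e = p_m_and_e / p_e          # definition of conditional probability
p_e_given_m = p_m_given_e * p_e / p_m  # Bayes' theorem

print(p_m_given_e)  # 23/30
print(p_e_given_m)  # 23/25

# Total probability: P[E] = P[E|M] P[M] + P[E|F] P[F], with F = not M.
p_f = 1 - p_m
p_e_given_f = (p_e - p_m_and_e) / p_f
print(p_e_given_m * p_m + p_e_given_f * p_f == p_e)  # True
```

Using Fraction instead of floats keeps every intermediate value exact, so the results can be compared directly with the fractions in the slide.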



Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea What is this course about? Processing Indexing Retrieving textual data (or audio, video, geo-spatial,, data) Fits in four

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

Introduction to Database Systems CSE 414

Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 14-15: XML CSE 414 - Spring 2013 1 Announcements Homework 4 solution will be posted tomorrow Midterm: Monday in class Open books, no notes beyond one hand-written

More information

Data Formats and APIs

Data Formats and APIs Data Formats and APIs Mike Carey mjcarey@ics.uci.edu 0 Announcements Keep watching the course wiki page (especially its attachments): https://grape.ics.uci.edu/wiki/asterix/wiki/stats170ab-2018 Ditto for

More information

Introduction to Database Systems CSE 414

Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 13: XML and XPath 1 Announcements Current assignments: Web quiz 4 due tonight, 11 pm Homework 4 due Wednesday night, 11 pm Midterm: next Monday, May 4,

More information

10/24/12. What We Have Learned So Far. XML Outline. Where We are Going Next. XML vs Relational. What is XML? Introduction to Data Management CSE 344

10/24/12. What We Have Learned So Far. XML Outline. Where We are Going Next. XML vs Relational. What is XML? Introduction to Data Management CSE 344 What We Have Learned So Far Introduction to Data Management CSE 344 Lecture 12: XML and XPath A LOT about the relational model Hand s on experience using a relational DBMS From basic to pretty advanced

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 11: XML and XPath 1 XML Outline What is XML? Syntax Semistructured data DTDs XPath 2 What is XML? Stands for extensible Markup Language 1. Advanced, self-describing

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Introduction to Information Retrieval. Lecture Outline

Introduction to Information Retrieval. Lecture Outline Introduction to Information Retrieval Lecture 1 CS 410/510 Information Retrieval on the Internet Lecture Outline IR systems Overview IR systems vs. DBMS Types, facets of interest User tasks Document representations

More information

XML and Semi-structured Data

XML and Semi-structured Data XML and Semi-structured Data Krzysztof Trawiński Winter Semester 2008 slides 1/27 Outline 1. Introduction 2. Usage & Design 3. Expressions 3.1 Xpath 3.2 Datatypes 3.3 FLWOR 4. Functions 5. Summary 6. Questions

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Introduction to Database Systems CSE 444

Introduction to Database Systems CSE 444 Introduction to Database Systems CSE 444 Lecture 25: XML 1 XML Outline XML Syntax Semistructured data DTDs XPath Coverage of XML is much better in new edition Readings Sections 11.1 11.3 and 12.1 [Subset

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Chapter 13 XML: Extensible Markup Language

Chapter 13 XML: Extensible Markup Language Chapter 13 XML: Extensible Markup Language - Internet applications provide Web interfaces to databases (data sources) - Three-tier architecture Client V Application Programs Webserver V Database Server

More information

Introduction to Information Retrieval. Hongning Wang

Introduction to Information Retrieval. Hongning Wang Introduction to Information Retrieval Hongning Wang CS@UVa What is information retrieval? 2 Why information retrieval Information overload It refers to the difficulty a person can have understanding an

More information

Information Retrieval (Part 1)

Information Retrieval (Part 1) Information Retrieval (Part 1) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Università di Padova Anno Accademico 2008/2009 1 Bibliographic References Copies of slides Selected

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while

One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while 1 One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while leaving the engine to choose the best way of fulfilling

More information

CS506/606 - Topics in Information Retrieval

CS506/606 - Topics in Information Retrieval CS506/606 - Topics in Information Retrieval Instructors: Class time: Steven Bedrick, Brian Roark, Emily Prud hommeaux Tu/Th 11:00 a.m. - 12:30 p.m. September 25 - December 6, 2012 Class location: WCC 403

More information

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents Section 5.5 Binary Tree A binary tree is a rooted tree in which each vertex has at most two children and each child is designated as being a left child or a right child. Thus, in a binary tree, each vertex

More information

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1 Slide 27-1 Chapter 27 XML: Extensible Markup Language Chapter Outline Introduction Structured, Semi structured, and Unstructured Data. XML Hierarchical (Tree) Data Model. XML Documents, DTD, and XML Schema.

More information

XML, DTD, and XPath. Announcements. From HTML to XML (extensible Markup Language) CPS 116 Introduction to Database Systems. Midterm has been graded

XML, DTD, and XPath. Announcements. From HTML to XML (extensible Markup Language) CPS 116 Introduction to Database Systems. Midterm has been graded XML, DTD, and XPath CPS 116 Introduction to Database Systems Announcements 2 Midterm has been graded Graded exams available in my office Grades posted on Blackboard Sample solution and score distribution

More information

Trees 11/15/16. Chapter 11. Terminology. Terminology. Terminology. Terminology. Terminology

Trees 11/15/16. Chapter 11. Terminology. Terminology. Terminology. Terminology. Terminology Chapter 11 Trees Definition of a general tree A general tree T is a set of one or more nodes such that T is partitioned into disjoint subsets: A single node r, the root Sets that are general trees, called

More information

: Semantic Web (2013 Fall)

: Semantic Web (2013 Fall) 03-60-569: Web (2013 Fall) University of Windsor September 4, 2013 Table of contents 1 2 3 4 5 Definition of the Web The World Wide Web is a system of interlinked hypertext documents accessed via the Internet

More information

Seleniet XPATH Locator QuickRef

Seleniet XPATH Locator QuickRef Seleniet XPATH Locator QuickRef Author(s) Thomas Eitzenberger Version 0.2 Status Ready for review Page 1 of 11 Content Selecting Nodes...3 Predicates...3 Selecting Unknown Nodes...4 Selecting Several Paths...5

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

CS290N Summary Tao Yang

CS290N Summary Tao Yang CS290N Summary 2015 Tao Yang Text books [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Publisher: Addison-Wesley, 2010. Book website. [MRS] Christopher

More information

CSE 544 Principles of Database Management Systems. Lecture 4: Data Models a Never-Ending Story

CSE 544 Principles of Database Management Systems. Lecture 4: Data Models a Never-Ending Story CSE 544 Principles of Database Management Systems Lecture 4: Data Models a Never-Ending Story 1 Announcements Project Start to think about class projects If needed, sign up to meet with me on Monday (I

More information

Relational Approach. Problem Definition

Relational Approach. Problem Definition Relational Approach (COSC 416) Nazli Goharian nazli@cs.georgetown.edu Slides are mostly based on Information Retrieval Algorithms and Heuristics, Grossman, Frieder Grossman, Frieder 2002, 2010 1 Problem

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

11. EXTENSIBLE MARKUP LANGUAGE (XML)

11. EXTENSIBLE MARKUP LANGUAGE (XML) 11. EXTENSIBLE MARKUP LANGUAGE (XML) Introduction Extensible Markup Language is a Meta language that describes the contents of the document. So these tags can be called as self-describing data tags. XML

More information

Information Retrieval and Extraction

Information Retrieval and Extraction Information Retrieval and Extraction Berlin Chen (Picture from the TREC web site) Textbooks Textbook and References R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman,

More information

Equivalence Detection Using Parse-tree Normalization for Math Search

Equivalence Detection Using Parse-tree Normalization for Math Search Equivalence Detection Using Parse-tree Normalization for Math Search Mohammed Shatnawi Department of Computer Info. Systems Jordan University of Science and Tech. Jordan-Irbid (22110)-P.O.Box (3030) mshatnawi@just.edu.jo

More information

Information Retrieval

Information Retrieval s Information Retrieval Information system management system Model Processing of queries/updates Queries Answer Access to stored data Patrick Lambrix Department of Computer and Information Science Linköpings

More information

Introduction to XML. Yanlei Diao UMass Amherst April 17, Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

Introduction to XML. Yanlei Diao UMass Amherst April 17, Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau. Introduction to XML Yanlei Diao UMass Amherst April 17, 2008 Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau. 1 Structure in Data Representation Relational data is highly

More information

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University CS490W: Web Information Search & Management CS-490W Web Information Search and Management Luo Si Department of Computer Science Purdue University Overview Web: Growth of the Web The world produces between

More information

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

State of the Art and Trends in Search Engine Technology. Gerhard Weikum State of the Art and Trends in Search Engine Technology Gerhard Weikum (weikum@mpi-inf.mpg.de) Commercial Search Engines Web search Google, Yahoo, MSN simple queries, chaotic data, many results key is

More information

Relational Approach. Problem Definition

Relational Approach. Problem Definition Relational Approach (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Slides are mostly based on Information Retrieval Algorithms and Heuristics, Grossman & Frieder 1 Problem Definition Three conceptual

More information

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Overview Introduction Classic

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

CS-490WIR Web Information Retrieval and Management. Luo Si

CS-490WIR Web Information Retrieval and Management. Luo Si CS490W: Web Information Retrieval & Management CS-490WIR Web Information Retrieval and Management Luo Si Department of Computer Science Purdue University Overview Web: Growth of the Web The world produces

More information

User Interaction: XML and JSON

User Interaction: XML and JSON User Interaction: XML and JSON Assoc. Professor Donald J. Patterson INF 133 Fall 2012 1 HTML and XML 1989: Tim Berners-Lee invents the Web with HTML as its publishing language Based on SGML Separates data

More information

Outline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.

Outline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö. Outline Lecture 3: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University February 5, 2013 A. Ardö, EIT Lecture 3: EITN01 Web Intelligence

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Introduction & Administrivia

Introduction & Administrivia Introduction & Administrivia Information Retrieval Evangelos Kanoulas ekanoulas@uva.nl Section 1: Unstructured data Sec. 8.1 2 Big Data Growth of global data volume data everywhere! Web data: observation,

More information

XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9

XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9 XML databases Jan Chomicki University at Buffalo Jan Chomicki (University at Buffalo) XML databases 1 / 9 Outline 1 XML data model 2 XPath 3 XQuery Jan Chomicki (University at Buffalo) XML databases 2

More information

XML: Extensible Markup Language

XML: Extensible Markup Language XML: Extensible Markup Language CSC 375, Fall 2015 XML is a classic political compromise: it balances the needs of man and machine by being equally unreadable to both. Matthew Might Slides slightly modified

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 00 Motivation What is Information Retrieval? The meaning of the term Information Retrieval (IR) can be

More information

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University Web Search Basics Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction

More information

CMSC th Lecture: Graph Theory: Trees.

CMSC th Lecture: Graph Theory: Trees. CMSC 27100 26th Lecture: Graph Theory: Trees. Lecturer: Janos Simon December 2, 2018 1 Trees Definition 1. A tree is an acyclic connected graph. Trees have many nice properties. Theorem 2. The following

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline Introduction to Information Retrieval (COSC 488) Spring 2012 Nazli Goharian nazli@cs.georgetown.edu Course Outline Introduction Retrieval Strategies (Models) Retrieval Utilities Evaluation Indexing Efficiency

M359 Block5 - Lecture12 Eng/ Waleed Omar Documents and markup languages The term XML stands for extensible Markup Language. Used to label the different parts of documents. Labeling helps in: Displaying the documents in a formatted way Querying

Chapter 11.!!!!Trees! 2011 Pearson Addison-Wesley. All rights reserved 11 A-1 Chapter 11!!!!Trees! 2011 Pearson Addison-Wesley. All rights reserved 11 A-1 2015-12-01 09:30:53 1/54 Chapter-11.pdf (#13) Terminology Definition of a general tree! A general tree T is a set of one or

Chapter 11.!!!!Trees! 2011 Pearson Addison-Wesley. All rights reserved 11 A-1 Chapter 11!!!!Trees! 2011 Pearson Addison-Wesley. All rights reserved 11 A-1 2015-03-25 21:47:41 1/53 Chapter-11.pdf (#4) Terminology Definition of a general tree! A general tree T is a set of one or more

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

ADT 2009 Other Approaches to XQuery Processing Other Approaches to XQuery Processing Stefan Manegold Stefan.Manegold@cwi.nl http://www.cwi.nl/~manegold/ 12.11.2009: Schedule 2 RDBMS back-end support for XML/XQuery (1/2): Document Representation (XPath

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL BOMA P.

Research Topics in Information Retrieval Research Topics in Information Retrieval Cristina Ribeiro Sérgio Nunes FEUP / INESC TEC Information Systems Research Group http://infolab.fe.up.pt Information Retrieval "Information retrieval (IR) is finding

Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University U Kang (2016) 1 In This Lecture The concept of binary tree, its terms, and its operations Full binary tree theorem Idea

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

UNIT 3 XML DATABASES UNIT 3 XML DATABASES XML Databases: XML Data Model DTD - XML Schema - XML Querying Web Databases JDBC Information Retrieval Data Warehousing Data Mining. 3.1. XML Databases: XML Data Model The common method

XML and Databases. Outline. Outline - Lectures. Outline - Assignments. from Lecture 3 : XPath. Sebastian Maneth NICTA and UNSW Outline XML and Databases Lecture 10 XPath Evaluation using RDBMS 1. Recall / encoding 2. XPath with //,, @, and text() 3. XPath with / and -sibling: use / size / level encoding Sebastian Maneth NICTA

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016 + Databases and Information Retrieval Integration TIETS42 Autumn 2016 Kostas Stefanidis kostas.stefanidis@uta.fi http://www.uta.fi/sis/tie/dbir/index.html http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 10: XML Retrieval Hinrich Schütze, Christina Lioma Center for Information and Language Processing, University of Munich 2010-07-12

Search Engine Architecture. Hongning Wang Search Engine Architecture Hongning Wang CS@UVa CS@UVa CS4501: Information Retrieval 2 Document Analyzer Classical search engine architecture The Anatomy of a Large-Scale Hypertextual Web Search Engine

Web scraping and crawling, open data, markup languages and data shaping. Paolo Boldi Dipartimento di Informatica Università degli Studi di Milano Web scraping and crawling, open data, markup languages and data shaping Paolo Boldi Dipartimento di Informatica Università degli Studi di Milano Data Analysis Three steps Data Analysis Three steps In every

XML. extensible Markup Language. ... and its usefulness for linguists XML extensible Markup Language... and its usefulness for linguists Thomas Mayer thomas.mayer@uni-konstanz.de Fachbereich Sprachwissenschaft, Universität Konstanz Seminar Computerlinguistik II (Miriam Butt)

XML Query Languages. Content. Slide 1 Norbert Gövert. January 11, XML documents as trees. Slide 2. Overview on XML query languages XQL XML Query Languages Slide 1 Norbert Gövert January 11, 2001 Content Slide 2 XML documents as trees Overview on XML query languages XQL XIRQL: IR extension for XQL 1 XML documents as trees Slide 3

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

Models for Document & Query Representation. Ziawasch Abedjan Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview

Informatics 1: Data & Analysis Informatics 1: Data & Analysis Lecture 9: Trees and XML Ian Stark School of Informatics The University of Edinburgh Tuesday 11 February 2014 Semester 2 Week 5 http://www.inf.ed.ac.uk/teaching/courses/inf1/da

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation Volume 3, No.5, May 24 International Journal of Advances in Computer Science and Technology Pooja Bassin et al., International Journal of Advances in Computer Science and Technology, 3(5), May 24, 33-336

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

Beyond Ten Blue Links Seven Challenges Beyond Ten Blue Links Seven Challenges Ricardo Baeza-Yates VP of Yahoo! Research for EMEA & LatAm Barcelona, Spain Thanks to Andrei Broder, Yoelle Maarek & Prabhakar Raghavan Agenda Past and Present Wisdom

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

XML and Databases. Lecture 10 XPath Evaluation using RDBMS. Sebastian Maneth NICTA and UNSW XML and Databases Lecture 10 XPath Evaluation using RDBMS Sebastian Maneth NICTA and UNSW CSE@UNSW -- Semester 1, 2009 Outline 1. Recall pre / post encoding 2. XPath with //, ancestor, @, and text() 3.

CSE 544 Data Models. Lecture #3. CSE544 - Spring, CSE 544 Data Models Lecture #3 1 Announcements Project Form groups by Friday Start thinking about a topic (see new additions to the topic list) Next paper review: due on Monday Homework 1: due the following

Diversification of Query Interpretations and Search Results Diversification of Query Interpretations and Search Results Advanced Methods of IR Elena Demidova Materials used in the slides: Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova,

CS145 Introduction. About CS145 Relational Model, Schemas, SQL Semistructured Model, XML CS145 Introduction About CS145 Relational Model, Schemas, SQL Semistructured Model, XML 1 Content of CS145 Design of databases. E/R model, relational model, semistructured model, XML, UML, ODL. Database

Elementary IR: Scalable Boolean Text Search. (Compare with R & G ) Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context

Tree. A path is a connected sequence of edges. A tree topology is acyclic there is no loop. Tree A tree consists of a set of nodes and a set of edges connecting pairs of nodes. A tree has the property that there is exactly one path (no more, no less) between any pair of nodes. A path is a connected

Modern Information Retrieval Modern Information Retrieval Ricardo Baeza-Yates Berthier Ribeiro-Neto ACM Press NewYork Harlow, England London New York Boston. San Francisco. Toronto. Sydney Singapore Hong Kong Tokyo Seoul Taipei. New

8/1/2016. XSL stands for EXtensible Stylesheet Language. CSS = Style Sheets for HTML XSL = Style Sheets for XML. XSL consists of four parts: XSL stands for EXtensible Stylesheet Language. CSS = Style Sheets for HTML XSL = Style Sheets for XML http://www.w3schools.com/xsl/ kasunkosala@yahoo.com 1 2 XSL consists of four parts: XSLT - a language

EMERGING TECHNOLOGIES EMERGING TECHNOLOGIES XML (Part 2): Data Model for XML documents and XPath Outline 1. Introduction 2. Structure of XML data 3. XML Document Schema 3.1. Document Type Definition (DTD) 3.2. XMLSchema 4.

Querying XML. COSC 304 Introduction to Database Systems. XML Querying. Example DTD. Example XML Document. Path Descriptions in XPath COSC 304 Introduction to Database Systems XML Querying Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Querying XML We will look at two standard query languages: XPath

A Universal Model for XML Information Retrieval A Universal Model for XML Information Retrieval Maria Izabel M. Azevedo 1, Lucas Pantuza Amorim 2, and Nívio Ziviani 3 1 Department of Computer Science, State University of Montes Claros, Montes Claros,

Querying XML Data. Querying XML has two components. Selecting data. Construct output, or transform data Querying XML Data Querying XML has two components Selecting data pattern matching on structural & path properties typical selection conditions Construct output, or transform data construct new elements

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016 CS 572: Information Retrieval Lecture 1: Course Overview and Introduction 11 January 2016 1/11/2016 CS 572: Information Retrieval. Spring 2016 1 Lecture Plan What is IR? (the big questions) Course overview

F453 Module 7: Programming Techniques. 7.2: Methods for defining syntax 7.2: Methods for defining syntax 2 What this module is about In this module we discuss: explain how functions, procedures and their related variables may be used to develop a program in a structured way,

Chapter 2 XML, XML Schema, XSLT, and XPath Summary Chapter 2 XML, XML Schema, XSLT, and XPath Ryan McAlister XML stands for Extensible Markup Language, meaning it uses tags to denote data much like HTML. Unlike HTML though it was designed to carry

Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML Kikori-KS An Effective and Efficient Keyword Search System for Digital Libraries in XML Toshiyuki Shimizu 1, Norimasa Terada 2, and Masatoshi Yoshikawa 1 1 Graduate School of Informatics, Kyoto University

An Effective and Efficient Approach for Keyword-Based XML Retrieval. Xiaoguang Li, Jian Gong, Daling Wang, and Ge Yu retold by Daryna Bronnykova An Effective and Efficient Approach for Keyword-Based XML Retrieval Xiaoguang Li, Jian Gong, Daling Wang, and Ge Yu retold by Daryna Bronnykova Search on XML documents 2 Why not use google? Why are traditional

Don't just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case?
