Search engines. Børge Svingen Chief Technology Officer, Open AdExchange


Transcription

1 Search engines Børge Svingen Chief Technology Officer, Open AdExchange

2 Information retrieval (IR) IR: Looking for information in data. Much research in IR since the 60s. Late 90s: The first Internet search engines: Excite, FPTSearch, AltaVista, AllTheWeb, Inktomi, HotBot.

3 The difference between IR and search engines Information retrieval: Small data sets. Homogeneous data sets. High precision. Expert users. Search engines: Large data sets. Heterogeneous data sets. Speed. Now: IR and search engines are merging.

4 Applications of search engines Search engines are now used for three main purposes: Web search E-commerce: Typically online shopping sites. Enterprise search: Searching internal data in businesses.

5 Information retrieval models Three main groups of IR methodologies: Set theoretic models: Documents represented as sets of words. Boolean and fuzzy models are most common. Algebraic models: Documents represented as algebraic constructs, typically vectors. Probabilistic models: Documents are retrieved based on probabilistic estimates of relevance.

6 Relevancy Relevancy: The degree to which a document is relevant to a query. Some definitions: A query $q$. A set of documents $D$. A relevancy function $R(d, q)$ for a document $d$ and query $q$. A relevancy cut-off value $R_0$: $R(d, q) > R_0$ is good enough. A set of relevant documents $D_{rel}$, where $R(d, q) > R_0$ for all $d \in D_{rel}$. A result set $D_{res}$.

7 Recall and precision Two important performance measures: Recall tells us how large a share of the relevant documents is in the result set: $\text{recall} = \frac{|D_{rel} \cap D_{res}|}{|D_{rel}|}$. Precision tells us how large a share of the documents in the result set is relevant: $\text{precision} = \frac{|D_{rel} \cap D_{res}|}{|D_{res}|}$.
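The two measures can be computed directly from the sets defined on the previous slide. The following is a minimal sketch; the function name and the sample document identifiers are made up for illustration.

```python
def recall_precision(relevant, result):
    """Return (recall, precision) for a relevant set and a result set."""
    relevant, result = set(relevant), set(result)
    hits = relevant & result  # relevant documents that were actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(result) if result else 0.0
    return recall, precision


d_rel = {"d1", "d2", "d3", "d4"}  # relevant documents
d_res = {"d3", "d4", "d5"}        # documents returned by the engine
recall, precision = recall_precision(d_rel, d_res)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the four relevant documents are returned, and two of the three returned documents are relevant.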

8 Expanding scope Increasing types of information are used by search engines to calculate relevancy: Traditional IR: Relevancy decided by document content. Web search: Started including other information about documents, e.g. link graphs. Mobile search: Takes into account the device from which a search is being performed, in order to return only content that can be used by the device. Personalized search: Uses personal interests and behavioral history to give different results to different users.

9 PageRank Use the link graph to estimate document quality. Query independent. Looks at the graph of links between documents. Assumption: Better documents are linked to by more documents. Disadvantage: Really measures popularity, not quality.
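The slide does not give a formula, so the following is only a sketch of the commonly described power-iteration form of PageRank. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy graph: a and b both link to c, and c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The score depends only on the link graph, not on any query, which is what makes it query independent.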

10 Components of a search engine Crawler Connectors Data preparation Indexes Query processing Result processing

11 Crawling The purpose of the crawler is to retrieve content from the web. The problem: No centralized catalog of all web pages. The solution: Start with a number of seed URLs. Retrieve the web pages. Analyze web pages for links, giving new URLs. Repeat... Crawling is difficult...
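The loop described above can be sketched as a breadth-first traversal with a frontier queue and a set of already seen URLs. To keep the example self-contained, the web is a small in-memory dictionary, and `fetch_links` is a hypothetical stand-in for downloading a page and extracting its links.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(dict.fromkeys(seed_urls))  # de-duplicated seeds
    seen = set(frontier)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)             # "retrieve the web page"
        for link in fetch_links(url):   # "analyze the page for links"
            if link not in seen:
                seen.add(link)
                frontier.append(link)   # new URLs to visit later
    return visited


# Toy web: each URL maps to the URLs it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
pages = crawl(["a"], lambda url: toy_web.get(url, []))
print(pages)
```

A real crawler also has to deal with politeness, duplicate content, traps and revisit scheduling, which is part of why crawling is difficult.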

12 Connectors Retrieves content from different sources: Applications Content management systems servers Databases Etc. Responsible for keeping content up to date.

13 Data preparation Prepare data for indexing: Normalize content Metadata enrichment Categorization Linguistic analysis Etc.

14 Query and result processing Prepares queries for searching, and results for presentation: Normalize content Metadata enrichment Categorization Linguistic analysis Etc. Same as for content...

15 Why indexes How to search terabytes of data? Linear search takes too long. Answer: Use an index. An index is a mapping from a term to the set of documents containing the term.

16 How to index How to choose the right type of index? Many types of indexes are available. Different index types have different space and time complexities. Different index types perform differently for different types of queries. The choice of index type depends on the application: What data to search. What queries will be used. Level of expertise of the users.

17 Suffix trees A suffix tree is a compact suffix trie. A suffix trie is a trie containing all suffixes of a string. Basic observation: Every substring is the prefix of a suffix. Can be built in linear time. Main advantage: Suffix trees allow substring matching. Disadvantage: A suffix tree can take considerably more space than the original data.

18 Suffix tree example [Figure: suffix tree for the example string ABCABC$, with edge labels ABC, BC, C, $ and ABC$.]

19 Suffix arrays Suffix arrays are a more space-efficient alternative to suffix trees. Example string: ABCABC$. Suffixes in sorted order: ABC$, ABCABC$, BC$, BCABC$, C$, CABC$.
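A suffix array stores the start positions of all suffixes in sorted order, so a substring query becomes a binary search for suffixes that begin with the pattern. The sketch below uses a naive construction by sorting, which is not linear time, and the function names are illustrative. With Python's string ordering the one-character suffix `$` sorts first and is included in the array.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern, found by binary search on the array."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])  # matching suffixes are adjacent
        lo += 1
    return sorted(hits)


text = "ABCABC$"
suffix_array = build_suffix_array(text)
print([text[i:] for i in suffix_array])
print(find_occurrences(text, suffix_array, "BC"))
```

Every substring is a prefix of some suffix, so all occurrences of a pattern sit next to each other in the sorted order.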

20 Inverted files Inverted files look at individual terms. Each term points to the documents containing the term (with positions). Advantage: Creates a smaller index than the original data set. Disadvantage: Limited queries, no substrings. Inverted files often refer to a dictionary instead of actual terms.

21 Inverted files, example Two documents: doc1: This is a test. doc2: So is this. Index: a → (doc1,2); is → (doc1,1), (doc2,1); so → (doc2,0); test → (doc1,3); this → (doc1,0), (doc2,2).
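The example index can be reproduced with a few lines of code. This is a minimal sketch that assumes lower-casing and splitting on word characters as the only normalization; the function name is illustrative.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        terms = re.findall(r"\w+", text.lower())
        for position, term in enumerate(terms):
            index[term].append((doc_id, position))
    return dict(index)


docs = {"doc1": "This is a test.", "doc2": "So is this."}
index = build_inverted_file(docs)
for term in sorted(index):
    print(term, index[term])
```

A lookup is then a single dictionary access per term, rather than a scan of every document.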

22 Scaling search engines Search engines need to handle huge scaling requirements. There are two main dimensions in which to scale: Data volume Query volume

23 Scaling linearly In the following, a linearly scalable search architecture will be described. Required hardware is O(data volume) Required hardware is O(query volume)

24 Data partitioning A data collection $D$ is given. On this collection an equivalence relation $\sim$ is defined. The equivalence classes of $\sim$ form a partition $P = \{D_i\}$. This means that $\forall D_i, D_j\ (D_i \in P \wedge D_j \in P \wedge D_i \neq D_j \Rightarrow D_i \cap D_j = \emptyset)$ and $\bigcup_{D_i \in P} D_i = D$. On the subsets of $D$, a function $\sigma : \mathcal{P}(D) \to \mathbb{N}$ gives a measure of the actual data size.

25 Equivalence relation properties Being an equivalence relation, $\sim$ fulfills the following requirements: $d \sim d$ for all $d \in D$ (reflexivity). $d_1 \sim d_2 \Rightarrow d_2 \sim d_1$ (symmetry). $d_1 \sim d_2$ and $d_2 \sim d_3 \Rightarrow d_1 \sim d_3$ (transitivity).
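One simple way to obtain such an equivalence relation is to call two documents equivalent when a key function gives them the same value. The sketch below partitions a collection this way and uses total character count as a stand-in for the size measure σ; both the key function and that choice of σ are assumptions for illustration.

```python
from collections import defaultdict


def partition(documents, key):
    """Group documents into equivalence classes: d1 ~ d2 iff key(d1) == key(d2)."""
    classes = defaultdict(list)
    for document in documents:
        classes[key(document)].append(document)
    return list(classes.values())


def sigma(subset):
    """Toy size measure: total number of characters in the subset."""
    return sum(len(document) for document in subset)


docs = ["apple", "avocado", "banana", "blueberry", "cherry"]
parts = partition(docs, key=lambda d: d[0])  # equivalent if same first letter
print(parts)
print([sigma(part) for part in parts])
```

Because equality of keys is reflexive, symmetric and transitive, the resulting classes are disjoint and together cover the whole collection.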

26 Query definitions A set of queries $Q$ is given. Each $q \in Q$ is of the form $q = \{q', P'\}$, where $q'$ is a query representation (for instance, a query string) and $P' \subseteq P$ is the subset of $P$ that is of relevance to the query, so that $\bigcup_{D_i \in P'} D_i \subseteq D$.

27 Time distribution of queries It is assumed that the arrival of the queries in $Q$ follows a Poisson distribution characterized by the average rate $\lambda$. This means that the probability of $k$ queries arriving during a time unit is $P(k) = \frac{e^{-\lambda} \lambda^k}{k!}$. The numbers of queries arriving in non-overlapping intervals are therefore considered independent.
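The arrival probability can be evaluated directly from the formula. The sketch below is illustrative only; the rate of 3 queries per time unit is an arbitrary example value.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k arrivals in one time unit at average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 3.0  # assumed average number of queries per time unit
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```

Summing the probabilities over all k gives 1, and the most likely counts cluster around the average rate.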

28 Types of nodes We have four types of nodes: Processing nodes. Query distribution nodes. Result accumulation nodes. Data preprocessing nodes.

29 Distributed architecture

30 Processing nodes The set of nodes $N_{proc}$ is used to solve the set of queries $Q$. A function $\varphi : N_{proc} \to P$ specifies how the data set $D$ is distributed to the set of nodes.

31 Query distribution nodes The set of nodes $N_{distr}$ distributes the query $q = \{q', P'\}$ to the set of processing nodes used to process the query. This set is given by the function $\delta : Q \to \mathcal{P}(N_{proc})$.

32 Result accumulation nodes Upon completion of the query processing, the results are accumulated by the set of nodes $N_{acc}$.

33 Data preprocessing nodes In some cases the data $D$ on which the queries will work needs to be preprocessed. A set of nodes $N_{pre}$ will serve this task. These nodes will not be discussed further.

34 Problem solving steps To evaluate a query $q = \{q', P'\}$, the following steps are performed: 1. Distribution. The query $q$ is distributed to the subset $\delta(q)$ of $N_{proc}$. $\delta$ is chosen so that $\{\varphi(N) : N \in \delta(q)\} \supseteq P'$. 2. Parallel evaluation. The query $q$ will be evaluated in parallel on the processing nodes $\delta(q)$. 3. Result accumulation and merging. Upon completion of the parallel solving process, the results from the processing nodes are accumulated and merged into the final result.

35 Performance specifications Each processing node $N \in N_{proc}$ is assumed to have the following performance specifications: An average of $k_{proc}$ queries can be handled in a time unit. Up to a data amount of $\sigma_{max}$ can be handled. It is assumed that the equivalence relation $\sim$ is chosen so that $\max_{D_i \in P} \sigma(D_i) \le \sigma_{max}$. Each query distribution node $N \in N_{distr}$ is assumed to be able to distribute queries to up to $k_{distr}$ other nodes. Each result accumulation node $N \in N_{acc}$ is assumed to be able to accumulate results from up to $k_{acc}$ other nodes.

36 Two-dimensional scalability Processing nodes organized in a matrix: Each column contains a full replica of all data. All nodes in a row hold the same data partition. The distribution and accumulation nodes are organized as trees: Queries are first distributed to columns. Each query goes to a single column. Queries are then distributed to rows. Each query goes to all rows in a column.

37 Data distribution tree

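38 The matrix $\begin{pmatrix} N_{proc,1,1} & N_{proc,1,2} & \cdots & N_{proc,1,c} \\ N_{proc,2,1} & N_{proc,2,2} & \cdots & N_{proc,2,c} \\ \vdots & \vdots & \ddots & \vdots \\ N_{proc,r,1} & N_{proc,r,2} & \cdots & N_{proc,r,c} \end{pmatrix}$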
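The routing through the matrix can be sketched in a few lines: a query is sent to one column, evaluated on every row of that column, and the partial results are merged. The class below is a toy model under stated assumptions. Round-robin column selection, the substring-free word match and all names are made up for illustration, and the distribution and accumulation trees are collapsed into plain loops.

```python
import itertools


class SearchMatrix:
    """Toy r x c matrix of processing nodes; nodes[row][column] holds documents."""

    def __init__(self, partitions, columns):
        # One row per data partition; every column holds a copy of every row.
        self.nodes = [[list(part) for _ in range(columns)] for part in partitions]
        self._columns = itertools.cycle(range(columns))

    def search(self, term):
        column = next(self._columns)  # each query goes to a single column
        partial = [
            [doc for doc in row[column] if term in doc.lower().split()]
            for row in self.nodes     # ...and to all rows in that column
        ]
        merged = sorted(doc for hits in partial for doc in hits)  # accumulate
        return column, merged


matrix = SearchMatrix(
    partitions=[["fast search", "web crawler"], ["search index", "link graph"]],
    columns=3,
)
print(matrix.search("search"))
print(matrix.search("search"))
```

Adding rows lets the system hold more data, and adding columns lets it answer more queries per time unit, which is the two-dimensional scaling described above.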

39 Fault tolerance General principles: It is not acceptable that some of the data in $D$ is not available. It is acceptable that the performance goes down until the error is corrected. Simple strategy: Don't use columns with faulty nodes. (Complex topic...)

40 Linear scalability $r = \frac{\sigma(D)}{\sigma_{max}}$, $c = \frac{\lambda}{k_{proc}}$, $|N_{proc}| = cr$, $|N_{distr}| = |N_{proc}| - 1$, $|N_{acc}| = |N_{proc}| - 1$. Somewhat simplified... Assumes worst case conditions, no extra fault tolerance.

41 Linear scalability, proof $|N| = |N_{proc}| + |N_{distr}| + |N_{acc}| = |N_{proc}| + |N_{proc}| - 1 + |N_{proc}| - 1 = 3|N_{proc}| - 2 = 3\,\frac{\sigma(D)}{\sigma_{max}}\,\frac{\lambda}{k_{proc}} - 2$
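The node count can be worked through numerically. The sketch below follows the formulas on the two slides above, with one added assumption: the row and column counts are rounded up so that they are whole numbers of nodes. The input figures are arbitrary example values.

```python
import math


def node_counts(data_size, sigma_max, query_rate, k_proc):
    """Node counts for the matrix architecture (rounding up is an assumption)."""
    rows = math.ceil(data_size / sigma_max)   # r: partitions needed for the data
    columns = math.ceil(query_rate / k_proc)  # c: replicas needed for the queries
    n_proc = rows * columns
    n_distr = n_proc - 1                      # worst-case distribution tree
    n_acc = n_proc - 1                        # worst-case accumulation tree
    return {
        "rows": rows,
        "columns": columns,
        "proc": n_proc,
        "distr": n_distr,
        "acc": n_acc,
        "total": n_proc + n_distr + n_acc,
    }


base = node_counts(data_size=1000, sigma_max=100, query_rate=500, k_proc=50)
doubled = node_counts(data_size=2000, sigma_max=100, query_rate=500, k_proc=50)
print(base["total"], doubled["total"])
```

With a fixed query rate, doubling the data volume roughly doubles the total node count, and the same holds for the query rate at a fixed data volume.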

42 Pattern Matching Chip (PMC)

43 PMC overview [Figure: processing pipeline: Data → Data distribution → Pattern matching → Result processing → Match reports.]

44 The Comparison Element [Figure: comparison element built from >, =, NOT, OR and MUX blocks.]

45 Searching for a string [Figure: chain of comparison elements with "=" tests for the characters a, b and c.]

46 Searching for a regular expression [Figure: chains of comparison elements with "=" tests for the characters a, b and c, configured for several patterns over a and b.]

47 The Processing Element [Figure: processing element with signals sc, res[i], res[i-1], ff_out[i], ff_out[i+1], ff_in and D.]

48 Binary distribution tree [Figure: binary distribution tree. Legend: M[i] is the MUX for PE[i]; node types shown are MUXed distribution nodes, simple distribution nodes, MUXed PEs and PEs shifting data. Labels: data source, from neighbouring tree, M[0], M[2], M[4], M[6], results.]

49 Implementation of binary distribution tree [Figure: eight processing elements, numbered 0 to 7, each with a MUX (M), feeding RESULTS.]

50 Binary distribution tree with sequence control [Figure: the same eight-element tree with sequence control (sc); signals res[i], res[i-1], ff_out[i], ff_out[i+1] and ff_in.]

51 Larger binary distribution tree with sequence control [Figure: larger tree of MUX (M) and sequence control (sc) blocks feeding RESULTS; signals res[i], res[i-1], ff_out[i], ff_out[i+1] and ff_in.]

52 The result selector [Figure: result selector with signals res1, eq1, res2, eq2, sel, doc, res and eq.]

53 Result selector operations
COMPARE [C]: Performs alphabetical/numerical comparison
(L == R) [==]: Compares L and R
(L > R) [>]: Compares L and R
(L ≥ R) [≥]: Compares L and R
L + R [+]: Adds L and R
((L+R) ≥ C) [≥C]: Compares (L + R) to C
((L+R) ≤ C) [≤C]: Compares (L + R) to C
((L+R) == C) [==C]: Compares (L + R) to C

54 Implementing all boolean functions 1/2
F0 (0): A [≥3] B. Null.
F1 (A AND B): A [≥2] B.
F2 (A AND NOT B): A [>] B.
F3 (Transfer A): A [+] B. B subtree not used, always generates 0.
F4 (A NOR NOT B): B [>] A. Subtrees swapped.
F5 (Transfer B): A [+] B. A subtree not used, always generates 0.
F6 (A XOR B): A [==1] B.
F7 (A OR B): A [≥1] B.

55 Implementing all boolean functions 2/2
F8 (A NOR B): A [≤0] B.
F9 (A XNOR B): A [==] B. Equivalence.
F10 (NOT B): A [≤0] B. A subtree not used, always generates 0.
F11 (A OR NOT B): A [≥] B. Implication: if B then A else true.
F12 (NOT A): A [≤0] B. B subtree not used, always generates 0.
F13 (NOT A OR B): B [≥] A. Implication, subtrees swapped; see F11.
F14 (A NAND B): A [≤1] B.
F15 (1): A [≥0] B. Identity.
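The mapping from Boolean functions to result selector operations can be checked by brute force over the four input combinations. The sketch below models the inputs as 0 or 1 and an unused subtree as the constant 0. It assumes that the comparison operators lost in the transcription are ≥ and ≤, which is the reading under which every row matches its named function.

```python
from itertools import product

# name: (result selector expression, reference Boolean function)
# Inputs a and b are 0 or 1; an unused subtree is modelled as the constant 0.
SELECTORS = {
    "F0 0": (lambda a, b: a + b >= 3, lambda a, b: False),
    "F1 A AND B": (lambda a, b: a + b >= 2, lambda a, b: a and b),
    "F2 A AND NOT B": (lambda a, b: a > b, lambda a, b: a and not b),
    "F3 Transfer A": (lambda a, b: a + 0, lambda a, b: a),
    "F4 A NOR NOT B": (lambda a, b: b > a, lambda a, b: not (a or not b)),
    "F5 Transfer B": (lambda a, b: 0 + b, lambda a, b: b),
    "F6 A XOR B": (lambda a, b: a + b == 1, lambda a, b: a != b),
    "F7 A OR B": (lambda a, b: a + b >= 1, lambda a, b: a or b),
    "F8 A NOR B": (lambda a, b: a + b <= 0, lambda a, b: not (a or b)),
    "F9 A XNOR B": (lambda a, b: a == b, lambda a, b: not (a != b)),
    "F10 NOT B": (lambda a, b: 0 + b <= 0, lambda a, b: not b),
    "F11 A OR NOT B": (lambda a, b: a >= b, lambda a, b: a or not b),
    "F12 NOT A": (lambda a, b: a + 0 <= 0, lambda a, b: not a),
    "F13 NOT A OR B": (lambda a, b: b >= a, lambda a, b: not a or b),
    "F14 A NAND B": (lambda a, b: a + b <= 1, lambda a, b: not (a and b)),
    "F15 1": (lambda a, b: a + b >= 0, lambda a, b: True),
}


def mismatches():
    """Return every (name, a, b) where selector and reference disagree."""
    bad = []
    for name, (selector, reference) in SELECTORS.items():
        for a, b in product((0, 1), repeat=2):
            if bool(selector(a, b)) != bool(reference(a, b)):
                bad.append((name, a, b))
    return bad


print(mismatches())
```

An empty list means that addition followed by a comparison against a constant, together with direct comparison and subtree swapping, covers all sixteen two-input Boolean functions.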

56 Data distribution tree with result selectors [Figure: eight MUX nodes, numbered 0 to 7, connected through seven result selectors (sel).]

57 Data distribution and result gathering [Figure: comparison elements CE0 to CE7, each with sc and lat blocks, feeding a tree of res, sc and lat blocks.]

58 The end.

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

Advances In Industrial Logic Synthesis

Advances In Industrial Logic Synthesis Advances In Industrial Logic Synthesis Luca Amarù, Patrick Vuillod, Jiong Luo Design Group, Synopsys Inc., Sunnyvale, California, USA Design Group, Synopsys, Grenoble, FR Logic Synthesis Y

More information

Introduction to Computer Architecture

Introduction to Computer Architecture Boolean Operators The Boolean operators AND and OR are binary infix operators (that is, they take two arguments, and the operator appears between them.) A AND B D OR E We will form Boolean Functions of

More information

QUESTION BANK FOR TEST

QUESTION BANK FOR TEST CSCI 2121 Computer Organization and Assembly Language PRACTICE QUESTION BANK FOR TEST 1 Note: This represents a sample set. Please study all the topics from the lecture notes. Question 1. Multiple Choice

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:

More information

GC03 Boolean Algebra

GC03 Boolean Algebra Why study? GC3 Boolean Algebra Computers transfer and process binary representations of data. Binary operations are easily represented and manipulated in Boolean algebra! Digital electronics is binary/boolean

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Exam IST 441 Spring 2014

Exam IST 441 Spring 2014 Exam IST 441 Spring 2014 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.

More information

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Final Group Projects Groups of variable

More information

Exam IST 441 Spring 2011

Exam IST 441 Spring 2011 Exam IST 441 Spring 2011 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

Exam IST 441 Spring 2013

Exam IST 441 Spring 2013 Exam IST 441 Spring 2013 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.

More information

CS8803: Advanced Digital Design for Embedded Hardware

CS8803: Advanced Digital Design for Embedded Hardware CS883: Advanced Digital Design for Embedded Hardware Lecture 2: Boolean Algebra, Gate Network, and Combinational Blocks Instructor: Sung Kyu Lim (limsk@ece.gatech.edu) Website: http://users.ece.gatech.edu/limsk/course/cs883

More information

Contents. Chapter 3 Combinational Circuits Page 1 of 34

Contents. Chapter 3 Combinational Circuits Page 1 of 34 Chapter 3 Combinational Circuits Page of 34 Contents Contents... 3 Combinational Circuits... 2 3. Analysis of Combinational Circuits... 2 3.. Using a Truth Table... 2 3..2 Using a Boolean unction... 4

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Unit 4: Formal Verification

Unit 4: Formal Verification Course contents Unit 4: Formal Verification Logic synthesis basics Binary-decision diagram (BDD) Verification Logic optimization Technology mapping Readings Chapter 11 Unit 4 1 Logic Synthesis & Verification

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Boolean model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Browsing boolean vector probabilistic

More information

Propositional Calculus: Boolean Algebra and Simplification. CS 270: Mathematical Foundations of Computer Science Jeremy Johnson

Propositional Calculus: Boolean Algebra and Simplification. CS 270: Mathematical Foundations of Computer Science Jeremy Johnson Propositional Calculus: Boolean Algebra and Simplification CS 270: Mathematical Foundations of Computer Science Jeremy Johnson Propositional Calculus Topics Motivation: Simplifying Conditional Expressions

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Document indexing, similarities and retrieval in large scale text collections

Document indexing, similarities and retrieval in large scale text collections Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

20489: Developing Microsoft SharePoint Server 2013 Advanced Solutions

20489: Developing Microsoft SharePoint Server 2013 Advanced Solutions 20489: Developing Microsoft SharePoint Server 2013 Advanced Solutions Length: 5 days Audience: Developers Level: 300 OVERVIEW This course provides SharePoint developers the information needed to implement

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

What is database? Types and Examples

What is database? Types and Examples What is database? Types and Examples Visit our site for more information: www.examplanning.com Facebook Page: https://www.facebook.com/examplanning10/ Twitter: https://twitter.com/examplanning10 TABLE

More information

ELCT201: DIGITAL LOGIC DESIGN

ELCT201: DIGITAL LOGIC DESIGN ELCT201: DIGITAL LOGIC DESIGN Dr. Eng. Haitham Omran, haitham.omran@guc.edu.eg Dr. Eng. Wassim Alexan, wassim.joseph@guc.edu.eg Lecture 3 Following the slides of Dr. Ahmed H. Madian ذو الحجة 1438 ه Winter

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards This time: Spell correction Soundex Index construction Index

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt

More information

VLSI System Design Part II : Logic Synthesis (1) Oct Feb.2007

VLSI System Design Part II : Logic Synthesis (1) Oct Feb.2007 VLSI System Design Part II : Logic Synthesis (1) Oct.2006 - Feb.2007 Lecturer : Tsuyoshi Isshiki Dept. Communications and Integrated Systems, Tokyo Institute of Technology isshiki@vlsi.ss.titech.ac.jp

More information

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap to to Information Retrieval Index Construct Ruixuan Li Huazhong University of Science and Technology http://idc.hust.edu.cn/~rxli/ October, 2012 1 2 How to construct index? Computerese term document docid

More information

Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results

Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results DFRWS 2007 Department of Information Systems & Technology Management The

More information

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

Data Structures and Algorithms(12)

Data Structures and Algorithms(12) Ming Zhang "Data s and Algorithms" Data s and Algorithms(12) Instructor: Ming Zhang Textbook Authors: Ming Zhang, Tengjiao Wang and Haiyan Zhao Higher Education Press, 28.6 (the "Eleventh Five-Year" national

More information

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries

More information

4. Suffix Trees and Arrays

4. Suffix Trees and Arrays 4. Suffix Trees and Arrays Let T = T [0..n) be the text. For i [0..n], let T i denote the suffix T [i..n). Furthermore, for any subset C [0..n], we write T C = {T i i C}. In particular, T [0..n] is the

More information

1. Fill in the entries in the truth table below to specify the logic function described by the expression, AB AC A B C Z

1. Fill in the entries in the truth table below to specify the logic function described by the expression, AB AC A B C Z CS W3827 05S Solutions for Midterm Exam 3/3/05. Fill in the entries in the truth table below to specify the logic function described by the expression, AB AC A B C Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Information Retrieval Tutorial 1: Boolean Retrieval

Information Retrieval Tutorial 1: Boolean Retrieval Information Retrieval Tutorial 1: Boolean Retrieval Professor: Michel Schellekens TA: Ang Gao University College Cork 2012-10-26 Boolean Retrieval 1 / 19 Outline 1 Review 2 Boolean Retrieval 2 / 19 Definition

More information
