Query Session Detection as a Cascade

Size: px
Start display at page:

Download "Query Session Detection as a Cascade"

Transcription

1 Query Session Detection as a Cascade Matthias Hagen Benno Stein Tino Rüb Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de SIR 211 Dublin, Ireland April 18, 211 Hagen, Stein, Rüb Query Session Detection as a Cascade 1

2 It s quiz time! Introduction Motivation Hagen, Stein, Rüb Query Session Detection as a Cascade 2

3 It s quiz time! Introduction Motivation What is the user searching? paris hilton Hagen, Stein, Rüb Query Session Detection as a Cascade 2

4 Without context... Introduction Motivation paris hilton source: [ Hilton 3 Crop.jpg] Hagen, Stein, Rüb Query Session Detection as a Cascade 3

5 Introduction Motivation What if you knew the previous queries? paris hotels paris marriott paris hyatt paris hilton Hagen, Stein, Rüb Query Session Detection as a Cascade 4

6 Introduction Motivation What if you knew the previous queries? paris hotels paris marriott paris hyatt paris hilton sources: [ hotel paris 2.jpg] [ [ mk logo hiltonbrandlogo.jpg] Hagen, Stein, Rüb Query Session Detection as a Cascade 4

7 Introduction Motivation Query sessions: same information need The benefits Improved understanding of user intent Improved retrieval performance via session knowledge Hagen, Stein, Rüb Query Session Detection as a Cascade 5

8 Introduction Motivation Query sessions: same information need The benefits Improved understanding of user intent Improved retrieval performance via session knowledge The minor issue Users do not announce when querying for a new information need. Hagen, Stein, Rüb Query Session Detection as a Cascade 5

9 A typical query log Introduction Motivation User Query Click domain + Click rank Time 773 istanbul en.wikipedia.org :34: istanbul archeology :2: istanbul archeology :3: istanbul archeology :24:7 773 constantinople ::4 773 constantinople :1:2 773 hurling :3:1 773 hurling en.wikipedia.org :3:5 773 liam mccarthy cup :33:4 773 liam mccarthy cup :33: liam mccarthy cup starbets.ie :42:48 Hagen, Stein, Rüb Query Session Detection as a Cascade 6

10 Introduction Motivation How to determine the break points? User Query Click domain + Click rank Time 773 istanbul en.wikipedia.org :34: istanbul archeology :2: istanbul archeology :3: istanbul archeology :24:7 773 constantinople ::4 773 constantinople :1:2 773 hurling :3:1 773 hurling en.wikipedia.org :3:5 773 liam mccarthy cup :33:4 773 liam mccarthy cup :33: liam mccarthy cup starbets.ie :42:48 Hagen, Stein, Rüb Query Session Detection as a Cascade 7

11 The key is... Introduction The Problem Automatic query session detection Hagen, Stein, Rüb Query Session Detection as a Cascade 8

12 Introduction The Problem Automatic query session detection Usual technique Check for consecutive queries whether same/new information need. Example 773 istanbul :34:17 same 773 istanbul archeology :24:7 same 773 constantinople :1:2 new 773 hurling :3:5 Hagen, Stein, Rüb Query Session Detection as a Cascade 9

13 Typical features Introduction Related Work Temporal thresholds 5 minutes [Silverstein et al., 1999] 1 15 minutes [He and Göker, 2] 3 minutes [Downey et al., 27] user specific [Murray et al., 26] Lexical similarity n-gram overlap [Zhang and Moffat, 26] Levenshtein distance [Jones and Klinkner, 28] Semantic similarity Search results [Radlinski and Joachims, 25] ESA [Lucchese et al., 211] Hagen, Stein, Rüb Query Session Detection as a Cascade 1

14 Introduction Related Work Previous methods Observations Temporal thresholds: fast but bad accuracy Feature combinations: more accurate One of the best: Geometric method (time + lexical) [Gayo-Avello, 29] Hagen, Stein, Rüb Query Session Detection as a Cascade 11

15 Previous methods Introduction Related Work Observations Temporal thresholds: fast but bad accuracy Feature combinations: more accurate One of the best: Geometric method (time + lexical) [Gayo-Avello, 29] Shortcomings All features evaluated simultaneously runtime Geometric method ignores semantics accuracy Examples Subset test suffices hurling same hurling gaa Geometric method fails hurling same mccarthy cup Hagen, Stein, Rüb Query Session Detection as a Cascade 11

16 Cascading Method The Framework We address the shortcomings in a cascade... source: [ Hagen, Stein, Ru b Query Session Detection as a Cascade 12

17 Cascading Method The Framework... well... a small 4-step cascade source: [ Cascade 4 Tier GreenL.jpg] Hagen, Stein, Ru b Query Session Detection as a Cascade 13

18 Cascading Method The Framework... well... a small 4-step cascade Step 1: Subset tests & Step 2: Geometric method & Step 3: ESA similarity. Step 4: Search results source: [ Cascade 4 Tier GreenL.jpg] Basic Idea Increased feature cost (runtime) from step to step. Expensive features only if previous steps unreliable. Hagen, Stein, Ru b Query Session Detection as a Cascade 13

19 Cascading Method Simple string comparison Step 1: Subset tests Criterion Consecutive queries q and q in same session if q sub- or superset of q. Else: Goto Step 2. Example Remarks: Repetition, specialization, or generalization. Time gap = continuing a pending session. Repetition Specialization Generalization hurling same hurling same hurling gaa same hurling hurling gaa hurling Hagen, Stein, Rüb Query Session Detection as a Cascade 14

20 Cascading Method Step 2: Geometric method Combination of temporal and lexical features [Gayo-Avello, 29] For consecutive queries q and q f temp = maximum of and 1 t 24h t is time between q and q f lex = cosine similarity of 3- to 5-grams of q and s s is session of q Hagen, Stein, Rüb Query Session Detection as a Cascade 15

21 Cascading Method Step 2: Geometric method Combination of temporal and lexical features [Gayo-Avello, 29] For consecutive queries q and q f temp = maximum of and 1 t 24h t is time between q and q f lex = cosine similarity of 3- to 5-grams of q and s s is session of q 1. Criterion (original).8 Nearly identical queries at long temporal distance Same session Consecutive queries q and q in same session if ftemp 2 + f lex 2 1. Lexical similarity New session Different queries with no temporal distance Temporal similarity Hagen, Stein, Rüb Query Session Detection as a Cascade 15

22 Cascading Method Step 2: Geometric method Performs well on standard test corpus Lexical similarity.6.4 Lexical similarity Temporal similarity Temporal similarity Same session New session Hagen, Stein, Rüb Query Session Detection as a Cascade 16

23 Cascading Method Step 2: Geometric method... but has some problems on the edge Lexical similarity Temporal similarity Major problems Similar queries, time gap (upper left) Merely a matter of opinion Diff. queries, same semantics (lower right) Incorporate semantics Hagen, Stein, Rüb Query Session Detection as a Cascade 17

24 Cascading Method Step 2: Geometric method... but has some problems on the edge Lexical similarity Temporal similarity Major problems Similar queries, time gap (upper left) Merely a matter of opinion Diff. queries, same semantics (lower right) Incorporate semantics Criterion (adapted) Original geometric method if f temp <.8 or f lex >.4. Else: Goto Step 3. Hagen, Stein, Rüb Query Session Detection as a Cascade 17

25 Cascading Method Step 3: Explicit Semantic Analysis How ESA works [Gabrilovich and Markovitch, 27] Preprocessing tf idf -weighted inverted index of Wikipedia articles term-document matrix M For consecutive queries q and q f esa = cosine similarity of M T q and M T s s is session of q Criterion Consecutive queries q and q in same session if f esa.35. Else: Goto Step 4. Hagen, Stein, Rüb Query Session Detection as a Cascade 18

26 Cascading Method Step 4: Search results Even more semantics Idea Enrich the short query strings with the results of some web search engine. Criterion Consecutive queries q and q in same session iff they share at least one of the top 1 search results. Hagen, Stein, Rüb Query Session Detection as a Cascade 19

27 Cascading Method Step 4: Search results Even more semantics Idea Enrich the short query strings with the results of some web search engine. Criterion Consecutive queries q and q in same session iff they share at least one of the top 1 search results. Remark If q and q share no top 1 result, decision should be not sure. Hagen, Stein, Rüb Query Session Detection as a Cascade 19

28 Cascading Method Experimental Results That s the complete cascade Step 1: Subset tests & Step 2: Geometric method & Step 3: ESA similarity. Step 4: Search results source: [ Cascade 4 Tier GreenL.jpg] Hagen, Stein, Ru b Query Session Detection as a Cascade 2

29 Cascading Method Experimental Results That s the complete cascade Step 1: Subset tests & Step 2: Geometric method & Step 3: ESA similarity. Step 4: Search results source: [ Cascade 4 Tier GreenL.jpg] What about accuracy and performance? Hagen, Stein, Ru b Query Session Detection as a Cascade 2

30 Accuracy and runtime Cascading Method Experimental Results Accuracy on Gayo-Avello s corpus (11 queries, 2.7 per session) Precision Recall F-Measure (β = 1.5) Geometric Cascading Performance per step on Gayo-Avello s corpus affected F-Measure time factor Step % ms 1. Step % ms 2.5 Step 3 2.5% ms 3.4 Step 4.85% ms Hagen, Stein, Rüb Query Session Detection as a Cascade 21

31 Cascading Method Experimental Results Goal: high quality session test data Our own use case Sample sessions from the AOL log as test data. AOL log (cleaned): 35.4 million interactions from 47 users. Some figures Step 4 involved on 22.5% 8 million web queries 3 ms per search 1 month Hagen, Stein, Rüb Query Session Detection as a Cascade 22

32 Cascading Method Experimental Results Goal: high quality session test data Our own use case Sample sessions from the AOL log as test data. AOL log (cleaned): 35.4 million interactions from 47 users. Some figures Step 4 involved on 22.5% 8 million web queries 3 ms per search 1 month Way out Drop Step 4 and the sessions on which it would have been invoked Remaining sessions: F-Measure =.9755 Cleaned AOL log: 27 minutes Hagen, Stein, Rüb Query Session Detection as a Cascade 22

33 Conclusion Almost the end: The take-away messages! Hagen, Stein, Rüb Query Session Detection as a Cascade 23

34 Conclusion What we have done Results Cascading method Cheap features first Beats geometric Future Work Postprocessing for multi-tasking Postprocessing for goals/missions 3 step version: simple, fast, high quality sessions Hagen, Stein, Rüb Query Session Detection as a Cascade 24

35 Conclusion What we have (not) done Results Cascading method Cheap features first Beats geometric Future Work Postprocessing for multi-tasking Postprocessing for goals/missions 3 step version: simple, fast, high quality sessions Hagen, Stein, Rüb Query Session Detection as a Cascade 24

36 Conclusion What we have (not) done Results Cascading method Cheap features first Beats geometric 3 step version: simple, fast, high quality sessions Future Work Postprocessing for multi-tasking Postprocessing for goals/missions Thank you Hagen, Stein, Rüb Query Session Detection as a Cascade 24

Query Session Detection as a Cascade

Query Session Detection as a Cascade Query Session Detection as a Cascade Matthias Hagen Benno Stein Tino Rüb Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de CIKM 2011 Glasgow, Scotland October 25, 2011 Hagen, Stein, Rüb Query Session

More information

Query Session Detection as a Cascade

Query Session Detection as a Cascade Query Session Detection as a Cascade Extended Abstract Matthias Hagen, Benno Stein, and Tino Rüb Bauhaus-Universität Weimar .@uni-weimar.de Abstract We propose a cascading method

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Webis at the TREC 2012 Session Track

Webis at the TREC 2012 Session Track Webis at the TREC 2012 Session Track Extended Abstract for the Conference Notebook Matthias Hagen, Martin Potthast, Matthias Busse, Jakob Gomoll, Jannis Harder, and Benno Stein Bauhaus-Universität Weimar

More information

Query Expansion using Wikipedia and DBpedia

Query Expansion using Wikipedia and DBpedia Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Supporting Scholarly Search with Keyqueries

Supporting Scholarly Search with Keyqueries Supporting Scholarly Search with Keyqueries Matthias Hagen Anna Beyer Tim Gollub Kristof Komlossy Benno Stein Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de @matthias_hagen ECIR 2016 Padova, Italy

More information

Candidate Document Retrieval for Web-scale Text Reuse Detection

Candidate Document Retrieval for Web-scale Text Reuse Detection Candidate Document Retrieval for Web-scale Text Reuse Detection Matthias Hagen Benno Stein Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de SPIRE 2011 Pisa, Italy October 19, 2011 Matthias Hagen,

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

New Issues in Near-duplicate Detection

New Issues in Near-duplicate Detection New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Overview of the 6th International Competition on Plagiarism Detection

Overview of the 6th International Competition on Plagiarism Detection Overview of the 6th International Competition on Plagiarism Detection Martin Potthast, Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, and Benno Stein Bauhaus-Universität Weimar www.webis.de

More information

Overview of the 5th International Competition on Plagiarism Detection

Overview of the 5th International Competition on Plagiarism Detection Overview of the 5th International Competition on Plagiarism Detection Martin Potthast, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, and Benno Stein Bauhaus-Universität Weimar www.webis.de

More information

Word Embeddings in Search Engines, Quality Evaluation. Eneko Pinzolas

Word Embeddings in Search Engines, Quality Evaluation. Eneko Pinzolas Word Embeddings in Search Engines, Quality Evaluation Eneko Pinzolas Neural Networks are widely used with high rate of success. But can we reproduce those results in IR? Motivation State of the art for

More information

Exploratory Search Missions for TREC Topics

Exploratory Search Missions for TREC Topics Exploratory Search Missions for TREC Topics EuroHCIR 2013 Dublin, 1 August 2013 Martin Potthast Matthias Hagen Michael Völske Benno Stein Bauhaus-Universität Weimar www.webis.de Dataset for studying text

More information

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 494: Information Retrieval, Mining and Integration on the Internet CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:

More information

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document

More information

68A8 Multimedia DataBases Information Retrieval - Exercises

68A8 Multimedia DataBases Information Retrieval - Exercises 68A8 Multimedia DataBases Information Retrieval - Exercises Marco Gori May 31, 2004 Quiz examples for MidTerm (some with partial solution) 1. About inner product similarity When using the Boolean model,

More information

Identifying Web Search Query Reformulation using Concept based Matching

Identifying Web Search Query Reformulation using Concept based Matching Identifying Web Search Query Reformulation using Concept based Matching Ahmed Hassan Microsoft Research One Microsoft Way Redmond, WA 98053, USA hassanam@microsoft.com Abstract Web search users frequently

More information

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, and Gabriela Vulcu Unit for Natural Language Processing Insight-centre, National University

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

End-to-End Neural Ad-hoc Ranking with Kernel Pooling End-to-End Neural Ad-hoc Ranking with Kernel Pooling Chenyan Xiong 1,Zhuyun Dai 1, Jamie Callan 1, Zhiyuan Liu, and Russell Power 3 1 :Language Technologies Institute, Carnegie Mellon University :Tsinghua

More information

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 Name: You may consult your notes and/or your textbook. This is a 75 minute, in class exam. If there is information missing in any of the question

More information

Available online at ScienceDirect. Procedia Computer Science 35 (2014 )

Available online at   ScienceDirect. Procedia Computer Science 35 (2014 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 35 (2014 ) 474 483 18 th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Efficient query processing

Efficient query processing Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4 Query Languages Berlin Chen 2005 Reference: 1. Modern Information Retrieval, chapter 4 Data retrieval Pattern-based querying The Kinds of Queries Retrieve docs that contains (or exactly match) the objects

More information

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO INDEX Proposal Recap Implementation Evaluation Future Works Proposal Recap Keyword Visualizer (chrome

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian. Document Clustering. Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

A Security Model for Multi-User File System Search. in Multi-User Environments

A Security Model for Multi-User File System Search. in Multi-User Environments A Security Model for Full-Text File System Search in Multi-User Environments Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada December 15, 2005 1 Introduction and Motivation 2 3 4 5

More information

Analyzing a Large Corpus of Crowdsourced Plagiarism

Analyzing a Large Corpus of Crowdsourced Plagiarism Analyzing a Large Corpus of Crowdsourced Plagiarism Master s Thesis Michael Völske Fakultät Medien Bauhaus-Universität Weimar 06.06.2013 Supervised by: Prof. Dr. Benno Stein Prof. Dr. Volker Rodehorst

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Exploiting Symmetry in Relational Similarity for Ranking Relational Search Results

Exploiting Symmetry in Relational Similarity for Ranking Relational Search Results Exploiting Symmetry in Relational Similarity for Ranking Relational Search Results Tomokazu Goto, Nguyen Tuan Duc, Danushka Bollegala, and Mitsuru Ishizuka The University of Tokyo, Japan {goto,duc}@mi.ci.i.u-tokyo.ac.jp,

More information

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Automatic Cluster Number Selection using a Split and Merge K-Means Approach

Automatic Cluster Number Selection using a Split and Merge K-Means Approach Automatic Cluster Number Selection using a Split and Merge K-Means Approach Markus Muhr and Michael Granitzer 31st August 2009 The Know-Center is partner of Austria's Competence Center Program COMET. Agenda

More information

Session 2: Models, part 1

Session 2: Models, part 1 Cerca i Anàlisi d'informació Massiva (CAIM) Grau en Enginyeria Informàtica (GEI) Session 2: Models, part 1 Exercise List, Fall 2016 Basic comprehension questions. Check that you can answer them before

More information

Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts

Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, Jiawei Han University of Illinois, at Urbana Champaign MicrosoD

More information

Analysis of Trail Algorithms for User Search Behavior

Analysis of Trail Algorithms for User Search Behavior Analysis of Trail Algorithms for User Search Behavior Surabhi S. Golechha, Prof. R.R. Keole Abstract Web log data has been the basis for analyzing user query session behavior for a number of years. Web

More information

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura

More information

Information Retrieval

Information Retrieval Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

A Distributional Approach for Terminological Semantic Search on the Linked Data Web

A Distributional Approach for Terminological Semantic Search on the Linked Data Web A Distributional Approach for Terminological Semantic Search on the Linked Data Web André Freitas Digital Enterprise Research Institute (DERI) National University of Ireland, Galway andre.freitas@deri.org

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Recommender Systems 6CCS3WSN-7CCSMWAL

Recommender Systems 6CCS3WSN-7CCSMWAL Recommender Systems 6CCS3WSN-7CCSMWAL http://insidebigdata.com/wp-content/uploads/2014/06/humorrecommender.jpg Some basic methods of recommendation Recommend popular items Collaborative Filtering Item-to-Item:

More information

Context Sensitive Search Engine

Context Sensitive Search Engine Context Sensitive Search Engine Remzi Düzağaç and Olcay Taner Yıldız Abstract In this paper, we use context information extracted from the documents in the collection to improve the performance of the

More information

Learning to Reweight Terms with Distributed Representations

Learning to Reweight Terms with Distributed Representations Learning to Reweight Terms with Distributed Representations School of Computer Science Carnegie Mellon University August 12, 215 Outline Goal: Assign weights to query terms for better retrieval results

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

A Dynamic Bayesian Network Click Model for Web Search Ranking

A Dynamic Bayesian Network Click Model for Web Search Ranking A Dynamic Bayesian Network Click Model for Web Search Ranking Olivier Chapelle and Anne Ya Zhang Apr 22, 2009 18th International World Wide Web Conference Introduction Motivation Clicks provide valuable

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries

More information

Using Temporal Bayesian Networks to Model user Profile Evolution

Using Temporal Bayesian Networks to Model user Profile Evolution Using Temporal Bayesian Networks to Model user Profile Evolution Farida Achemoukh 1, Rachid Ahmed-Ouamer 2 1 LARI Laboratory, Department of Computer Science, University of Mouloud Mammeri 15000 Tizi-Ouzou,

More information

Interpreting Document Collections with Topic Models. Nikolaos Aletras University College London

Interpreting Document Collections with Topic Models. Nikolaos Aletras University College London Interpreting Document Collections with Topic Models Nikolaos Aletras University College London Acknowledgements Mark Stevenson, Sheffield Tim Baldwin, Melbourne Jey Han Lau, IBM Research Talk Outline Introduction

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Linking FRBR Entities to LOD through Semantic Matching

Linking FRBR Entities to LOD through Semantic Matching Linking FRBR Entities to through Semantic Matching Naimdjon Takhirov, Fabien Duchateau, Trond Aalberg Department of Computer and Information Science Norwegian University of Science and Technology Theory

More information

AI for Smart Cities Workshop AI*IA th Year Anniversary XIII Conference Turin (Italy), December 5, 2013

AI for Smart Cities Workshop AI*IA th Year Anniversary XIII Conference Turin (Italy), December 5, 2013 Fedelucio Narducci*, Matteo Palmonari*, Giovanni Semeraro *DISCO, University of Milan-Bicocca, Italy Department of Computer Science, University of Bari Aldo Moro, Italy!!"#$%&' "!('' #&&!))'*''''''''''''''''''

More information

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian. Document Clustering. Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Preference Elicitation for Single Crossing Domain

Preference Elicitation for Single Crossing Domain Preference Elicitation for Single Crossing Domain joint work with Neeldhara Misra (IIT Gandhinagar) March 6, 2017 Appeared in IJCAI 2016 Motivation for Preference Elicitation One often wants to learn how

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Efficient Top-k Algorithms for Fuzzy Search in String Collections

Efficient Top-k Algorithms for Fuzzy Search in String Collections Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

Semantic Estimation for Texts in Software Engineering

Semantic Estimation for Texts in Software Engineering Semantic Estimation for Texts in Software Engineering 汇报人 : Reporter:Xiaochen Li Dalian University of Technology, China 大连理工大学 2016 年 11 月 29 日 Oscar Lab 2 Ph.D. candidate at OSCAR Lab, in Dalian University

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Query-Free News Search

Query-Free News Search Query-Free News Search by Monika Henzinger, Bay-Wei Chang, Sergey Brin - Google Inc. Brian Milch - UC Berkeley presented by Martin Klein, Santosh Vuppala {mklein, svuppala}@cs.odu.edu ODU, Norfolk, 03/21/2007

More information

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese

More information

Scalable Mobile Video Retrieval with Sparse Projection Learning and Pseudo Label Mining

Scalable Mobile Video Retrieval with Sparse Projection Learning and Pseudo Label Mining Scalable Mobile Video Retrieval with Sparse Projection Learning and Pseudo Label Mining Guan-Long Wu, Yin-Hsi Kuo, Tzu-Hsuan Chiu, Winston H. Hsu, and Lexing Xie 1 Abstract Retrieving relevant videos from

More information

TABLE OF CONTENTS PAGE TITLE NO.

TABLE OF CONTENTS PAGE TITLE NO. TABLE OF CONTENTS CHAPTER PAGE TITLE ABSTRACT iv LIST OF TABLES xi LIST OF FIGURES xii LIST OF ABBREVIATIONS & SYMBOLS xiv 1. INTRODUCTION 1 2. LITERATURE SURVEY 14 3. MOTIVATIONS & OBJECTIVES OF THIS

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search 1 / 33 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Bernd Wittefeld Supervisor Markus Löckelt 20. July 2012 2 / 33 Teaser - Google Web History http://www.google.com/history

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Lexical and Syntax Analysis. Bottom-Up Parsing

Lexical and Syntax Analysis. Bottom-Up Parsing Lexical and Syntax Analysis Bottom-Up Parsing Parsing There are two ways to construct derivation of a grammar. Top-Down: begin with start symbol; repeatedly replace an instance of a production s LHS with

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

Intro to Peer-to-Peer Search

Intro to Peer-to-Peer Search Intro to Peer-to-Peer Search (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Outline Peer-to-peer historical perspective Problem definition Local client data processing Ranking functions Metadata copying

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running

More information

Models for Document & Query Representation. Ziawasch Abedjan

Models for Document & Query Representation. Ziawasch Abedjan Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview

More information

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.

More information

Faster Or-join Enactment for BPMN 2.0

Faster Or-join Enactment for BPMN 2.0 Faster Or-join Enactment for BPMN 2.0 Hagen Völzer, IBM Research Zurich Joint work with Beat Gfeller and Gunnar Wilmsmann Contribution: BPMN Diagram Enactment Or-join Tokens define the control state Execution

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information