Analytic Knowledge Discovery Models for Information Retrieval and Text Summarization

Similar documents
Information Retrieval. (M&S Ch 15)

Introduction to Information Retrieval

Information Retrieval Using Context Based Document Indexing and Term Graph

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS 6320 Natural Language Processing

Improving Difficult Queries by Leveraging Clusters in Term Graph

What is this Song About?: Identification of Keywords in Bollywood Lyrics

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

CS54701: Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

Clustering. Bruno Martins. 1 st Semester 2012/2013

Information Retrieval and Web Search

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Search Engines. Information Retrieval in Practice

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Mining Web Data. Lijun Zhang

R 2 D 2 at NTCIR-4 Web Retrieval Task

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Text Analytics (Text Mining)

Information Retrieval

Chapter 6: Information Retrieval and Web Search. An introduction

A Semantic Based Search Engine for Open Architecture Requirements Documents

Text Analytics (Text Mining)

Midterm Exam Search Engines ( / ) October 20, 2015

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

Question Answering Systems

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval and Web Search Engines

Introduction to Information Retrieval

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

Boolean Model. Hongning Wang

Exploring Ancient Texts with a User Driven Concept Search

Information Retrieval

Learning to Reweight Terms with Distributed Representations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Considerations of Analysis of Healthcare Claims Data

Mining Web Data. Lijun Zhang

Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks

Lecture 7: Relevance Feedback and Query Expansion

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

Query Refinement and Search Result Presentation

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Aggregation for searching complex information spaces. Mounia Lalmas

CLUSTER BASED AND GRAPH BASED METHODS OF SUMMARIZATION: SURVEY AND APPROACH

Entity and Knowledge Base-oriented Information Retrieval

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Vector Space Models: Theory and Applications

Part I: Data Mining Foundations

Content-based Recommender Systems

Reading group on Ontologies and NLP:

TEXT MINING APPLICATION PROGRAMMING

International Journal of Video& Image Processing and Network Security IJVIPNS-IJENS Vol:10 No:02 7

An Exploration of Query Term Deletion

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Query Phrase Expansion using Wikipedia for Patent Class Search

Boolean Queries. Keywords combined with Boolean operators:

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)"

modern database systems lecture 4 : information retrieval

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Manning Chapter: Text Retrieval (Selections) Text Retrieval Tasks. Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniques

Information Retrieval

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Chapter 6. Queries and Interfaces

Northeastern University in TREC 2009 Million Query Track

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014

A Multiple-stage Approach to Re-ranking Clinical Documents

Survey on Automated Text Documents Summarization Tools

Retrieval of Highly Related Documents Containing Gene-Disease Association

Extracting Conceptual Relationships from Specialized Documents

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22

Keyword Extraction by KNN considering Similarity among Features

A Language Independent Author Verifier Using Fuzzy C-Means Clustering

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

NATURAL LANGUAGE PROCESSING

CSE 494: Information Retrieval, Mining and Integration on the Internet

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Multimodal Information Spaces for Content-based Image Retrieval

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

A Study of MatchPyramid Models on Ad hoc Retrieval

Automatic Summarization

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

Information Retrieval. Techniques for Relevance Feedback

10-701/15-781, Fall 2006, Final

A Search Relevancy Tuning Method Using Expert Results Content Evaluation

Information Retrieval. hussein suleman uct cs

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Retrieval Evaluation

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

QUERY EXPANSION USING WORDNET WITH A LOGICAL MODEL OF INFORMATION RETRIEVAL

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

Transcription:

Analytic Knowledge Discovery Models for Information Retrieval and Text Summarization Pawan Goyal Team Sanskrit INRIA Paris Rocquencourt Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 1 / 23

Outline 1 Ad-Hoc Information Retrieval 2 Text Summarization 3 Research Problems 4 Query Representation Model 5 Neighborhood Based Document Smoothing Model 6 A Context Based Word Indexing Model 7 Results 8 Conclusions Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 2 / 23

Ad-Hoc Information Retrieval User Query Text Documents collelction Representation Representation Function Function Matching Function Inverted Index Results An Ad-Hoc Information Retrieval System Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 3 / 23

Information Retrieval: Datasets and Evaluation Criteria TREC Dataset: Dataset used in Text REtrieval Conferences 741,686 documents, query topics 101-150 (TREC-2) and 151-200 (TREC-3). 524,000 documents, query topics 351-400 (TREC-7). Query example: Topic 169: cost of garbage trash removal. Evaluation Criteria Precision at various points are computed. P5= N Rel(5) 5 N Rel (x): Number of relevant documents in the top x documents returned by the system. Mean Averaged Precision (MAP) is the mean of the precision value at all recall points. Student s t-test is used to compare if the difference in results is statistically significant. (* : p<0.05, ** : p<0.01) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 4 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? The representation allows real time search to be computationally efficient. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? The representation allows real time search to be computationally efficient. The representation minimizes the information loss. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? The representation allows real time search to be computationally efficient. The representation minimizes the information loss. Relevance measure: To what extent a document is relevant to a query? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? The representation allows real time search to be computationally efficient. The representation minimizes the information loss. Relevance measure: To what extent a document is relevant to a query? System evaluation: To what degree does the relevance measure reflect the human judgment? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Ad-Hoc Information Retrieval Problems that Ad-Hoc Information Retrieval addresses: Document and Query indexing: How to best represent their contents? The representation allows real time search to be computationally efficient. The representation minimizes the information loss. Relevance measure: To what extent a document is relevant to a query? System evaluation: To what degree does the relevance measure reflect the human judgment? Most Widely Used Approaches: Keyword based indexing to represent a document and a query Similarity measures such as Cosine similarity for relevance measures Precision and Recall measures for system evaluation Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 5 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? F f ij Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? F f ij Document length ( D i ): How many terms appear in the document? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? F f ij Document length ( D i ): How many terms appear in the document? F 1 D i Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? F f ij Document length ( D i ): How many terms appear in the document? F 1 D i Document Frequency (N j ): Number of documents in which a term appears. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval Essential factors for Keyword-based Indexing function F Term frequency (f ij ): How many times a term appear in a document? F f ij Document length ( D i ): How many terms appear in the document? F 1 D i Document Frequency (N j ): Number of documents in which a term appears. F 1 N j Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 6 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Inverted Index: computation {(D 1,0.3),(D 2,0.4),...} Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Inverted Index: computation {(D 1,0.3),(D 2,0.4),...} User query, weighted by the system: q {(computation, 1),(information, 1),(processing, 1)} Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Inverted Index: computation {(D 1,0.3),(D 2,0.4),...} User query, weighted by the system: q {(computation, 1),(information, 1),(processing, 1)} Sim(D 1,q)>Sim(D 2,q) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Inverted Index: computation {(D 1,0.3),(D 2,0.4),...} User query, weighted by the system: q {(computation, 1),(information, 1),(processing, 1)} Sim(D 1,q)>Sim(D 2,q) Evaluation Let D 1 be relevant to q and D 2 be non-relevant as per human judgment. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Information Retrieval: An Example Result of indexing Indexing: D 1 {(computation,0.3),(information,0.4),...}, D 2 {(formal,0.5),(computation,0.4),...},... Inverted Index: computation {(D 1,0.3),(D 2,0.4),...} User query, weighted by the system: q {(computation, 1),(information, 1),(processing, 1)} Sim(D 1,q)>Sim(D 2,q) Evaluation Let D 1 be relevant to q and D 2 be non-relevant as per human judgment. P1=1.0, P2=0.5, MAP=1.0 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 7 / 23

Text Summarization Why Text Summarization? Information Retrieval gives a list of documents, assumed to be relevant to the user query. There is still a vast volume of information in the retrieved documents. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 8 / 23

Text Summarization Why Text Summarization? Information Retrieval gives a list of documents, assumed to be relevant to the user query. There is still a vast volume of information in the retrieved documents. Text summarization reduces this information into a short set of words or paragraph. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 8 / 23

Text Summarization Why Text Summarization? Information Retrieval gives a list of documents, assumed to be relevant to the user query. There is still a vast volume of information in the retrieved documents. Text summarization reduces this information into a short set of words or paragraph. Genres of Summary Extract vs. Abstract...lists fragments of text vs. re-phrases content coherently. Single document v/s Multi-document...based on one text vs. fuses together many texts. Generic v/s Query-oriented...provides author s view vs. reflects user s interest. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 8 / 23

Text Summarization Why Text Summarization? Information Retrieval gives a list of documents, assumed to be relevant to the user query. There is still a vast volume of information in the retrieved documents. Text summarization reduces this information into a short set of words or paragraph. Genres of Summary Extract vs. Abstract...lists fragments of text vs. re-phrases content coherently. Single document v/s Multi-document...based on one text vs. fuses together many texts. Generic v/s Query-oriented...provides author s view vs. reflects user s interest. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 8 / 23

Text Summarization Generic Single document Extractive Summary Document is indexed along the same lines as for Information Retrieval. Similarity between sentences is represented using Cosine similarity. Important sentences are selected using PageRank based algorithm/eigen-values. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 9 / 23

Text Summarization Generic Single document Extractive Summary Document is indexed along the same lines as for Information Retrieval. Similarity between sentences is represented using Cosine similarity. Important sentences are selected using PageRank based algorithm/eigen-values. Evaluation Criteria DUC datasets: Various news articles used in Document Understanding Conferences. Manually created summaries are provided for each document. System generated summary is compared to the manually created summary. ROGUE toolkit is used for the evaluation. ROGUE N = S {RefSum} n gram S Count match (n gram) S {RefSum} n gram S Count(n gram) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 9 / 23

Text Summarization: Example Sentence Graph S1 S2 S3 S4 S5 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 10 / 23

Text Summarization: Example Sentence Graph S1 M12 S2 S3 M34 S4 S5 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 10 / 23

Text Summarization: Example Sentence Graph S3 S1 M12 M34 S2 S4 M= 0.0 0.5 0.0 0.4 0.1 0.5 0.0 0.5 0.0 0.0 0.0 0.5 0.0 0.5 0.0 0.4 0.0 0.4 0.0 0.2 0.3 0.0 0.0 0.7 0.0 S5 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 10 / 23

Text Summarization: Example Sentence Graph S3 S1 M12 M34 S2 S4 M= 0.0 0.5 0.0 0.4 0.1 0.5 0.0 0.5 0.0 0.0 0.0 0.5 0.0 0.5 0.0 0.4 0.0 0.4 0.0 0.2 0.3 0.0 0.0 0.7 0.0 S5 Solving using Page-Rank based algorithm iteratively for sentence centrality vector I: I j = µ I k M j,k + 1 µ k j S I = [ 0.22 0.18 0.2 0.3 0.1 ] Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 10 / 23

Research Problems Term Mismatch Stems from the word independence assumption User query: insurance cover which pays for long term care. A relevant document may contain terms different from the actual user query. Some relevant words concerning this query: {medicare, premiums, insurers} Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 11 / 23

Research Problems Term Mismatch Stems from the word independence assumption User query: insurance cover which pays for long term care. A relevant document may contain terms different from the actual user query. Some relevant words concerning this query: {medicare, premiums, insurers} Existing Solutions Manually constructed ontologies such as Wordnet Relevance feedback Co-occurrence models such as mutual information Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 11 / 23

Research Problems Context Independent Word Indexing Information Retrieval D 1 D 2 = {robot, healthcare, mobile, autonomous, research} = {fifa, soccer, germany, played, robot} Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 12 / 23

Research Problems Context Independent Word Indexing Information Retrieval D 1 D 2 = {robot, healthcare, mobile, autonomous, research} = {fifa, soccer, germany, played, robot} Sentence Extraction D 1 : S 11 ={started, career, engineering} : S 12 ={shifted, engineering, humanities} D 2 : S 21 ={engineering, application, scientific, principles} : S 22 ={engineering, design, build, machines} Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 12 / 23

Research Problems Context Independent Word Indexing Information Retrieval D 1 D 2 = {robot, healthcare, mobile, autonomous, research} = {fifa, soccer, germany, played, robot} Sentence Extraction D 1 : S 11 ={started, career, engineering} : S 12 ={shifted, engineering, humanities} D 2 : S 21 ={engineering, application, scientific, principles} : S 22 ={engineering, design, build, machines} Existing Solutions Document Clustering Latent Semantic Analysis Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 12 / 23

Knowledge Discovery: A Potential solution Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 13 / 23

Knowledge Discovery: A Potential solution Distributional Hypothesis You know a word by the company it keeps. (Firth, 1957) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 13 / 23

Knowledge Discovery: A Potential solution Distributional Hypothesis You know a word by the company it keeps. (Firth, 1957) Words that occur in the same contexts tend to have similar meanings. (Zellig Harris, 1968) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 13 / 23

Knowledge Discovery: A Potential solution Distributional Hypothesis You know a word by the company it keeps. (Firth, 1957) Words that occur in the same contexts tend to have similar meanings. (Zellig Harris, 1968) Semantically similar words tend to have similar distributional patterns. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 13 / 23

Knowledge Discovery: A Potential solution Distributional Hypothesis You know a word by the company it keeps. (Firth, 1957) Words that occur in the same contexts tend to have similar meanings. (Zellig Harris, 1968) Semantically similar words tend to have similar distributional patterns. My Approach: Specific Objectives Using distributional hypothesis to analyze the research problems from a theoretical perspective. To empirically evaluate the proposed analytic knowledge discovery models with respect to the existing approaches. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 13 / 23

Query Representation Model: Basic Idea Distributional Hypothesis Words are not independent of each other computation provides more information about algorithm and programming than about petroleum or environment. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 14 / 23

Query Representation Model: Basic Idea Distributional Hypothesis Words are not independent of each other computation provides more information about algorithm and programming than about petroleum or environment. Distributional pattern of terms is used to find the terms, related to the query words. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 14 / 23

Query Representation Model: Basic Idea Distributional Hypothesis Words are not independent of each other computation provides more information about algorithm and programming than about petroleum or environment. Distributional pattern of terms is used to find the terms, related to the query words. Compositional Model Problem of polysemy mouse can provide information about{keyboard, monitor} in one context and about {animals,food} in other context. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 14 / 23

Query Representation Model: Basic Idea Distributional Hypothesis Words are not independent of each other computation provides more information about algorithm and programming than about petroleum or environment. Distributional pattern of terms is used to find the terms, related to the query words. Compositional Model Problem of polysemy mouse can provide information about{keyboard, monitor} in one context and about {animals,food} in other context. Combined effect of all query terms is used to avoid polysemy : {mouse,wireless} can disambiguate the two usages of mouse. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 14 / 23

Query Representation Model for Term Mismatch problem The parametric model A: Matrix that captures the distributional pattern Let a user query be represented as Q=q.A Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 15 / 23

Query Representation Model for Term Mismatch problem The parametric model A: Matrix that captures the distributional pattern Let a user query be represented as Q=q.A What choice of A ij will give an enhanced retrieval performance? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 15 / 23

Query Representation Model for Term Mismatch problem The parametric model A: Matrix that captures the distributional pattern Let a user query be represented as Q=q.A What choice of A ij will give an enhanced retrieval performance? ( ) i A ij = f j t ij k (δ ki t kj ) k t ki k t kj : Derived using system relevance criteria and an empirical evidence from user relevance criteria. f = log(x + 0.05) : Fixed through sensitivity analysis. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 15 / 23

Query Representation Model for Term Mismatch problem The parametric model A: Matrix that captures the distributional pattern Let a user query be represented as Q=q.A What choice of A ij will give an enhanced retrieval performance? ( ) i A ij = f j t ij k (δ ki t kj ) k t ki k t kj : Derived using system relevance criteria and an empirical evidence from user relevance criteria. f = log(x + 0.05) : Fixed through sensitivity analysis. The only query expansion model with a relevance based justification. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 15 / 23

Results TREC Topic 104: catastrophic health insurance Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 16 / 23

Results TREC Topic 104: catastrophic health insurance Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69 Broad expansion terms: medicare, beneficiaries, premiums... Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 16 / 23

Results TREC Topic 104: catastrophic health insurance Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69 Broad expansion terms: medicare, beneficiaries, premiums... Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services) TREC Topic 355: ocean remote sensing Query Representation: radiometer:1.0 landsat:0.97 ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81 meterology:0.81 cartography:0.78 geostationary:0.78 doppler:0.78 oceanographic:0.76 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 16 / 23

Results TREC Topic 104: catastrophic health insurance Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69 Broad expansion terms: medicare, beneficiaries, premiums... Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services) TREC Topic 355: ocean remote sensing Query Representation: radiometer:1.0 landsat:0.97 ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81 meterology:0.81 cartography:0.78 geostationary:0.78 doppler:0.78 oceanographic:0.76 Broad expansion terms: radiometer, landsat, ionosphere... Specific domain terms: CNES (Centre National détudes Spatiales) and NASDA (National Space Development Agency of Japan) Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 16 / 23

Neighborhood Based Document Smoothing (NBDS) Model Context Sensitive Document Indexing D 1 ={robot, healthcare, mobile, autonomous, research} D 2 ={fifa, soccer, germany, played, robot} Content-carrying (Topical) terms should be given higher weights than the background terms. Topical terms are supposed to have higher association with each other, when computed on a large corpora. t N ij = βt ij + γ k (A jk t ik ) : Proposed model to redistribute the indexing weights. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 17 / 23

Neighborhood Based Document Smoothing (NBDS) Model Context Sensitive Document Indexing D 1 ={robot, healthcare, mobile, autonomous, research} D 2 ={fifa, soccer, germany, played, robot} Content-carrying (Topical) terms should be given higher weights than the background terms. Topical terms are supposed to have higher association with each other, when computed on a large corpora. t N ij = βt ij + γ k (A jk t ik ) : Proposed model to redistribute the indexing weights. NBDS Model: Main Features The model does not cause any extra computational burden at run-time. The only model which provides a mathematical framework with a relevance-based justification. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 17 / 23

A Context Based Word Indexing Model for Text summarization Bernoulli model of co-occurrence for lexical association Consider the distribution of terms t i and t j in a corpus of N documents. N i, N j : Number of documents in which t i and t j occur respectively. N ij : Number of documents in which t i and t j co-occur. Probability p i of the term t i appearing in an arbitrary document: p i = N i N Term t i occurs in N ij documents out of these N j documents and does not occur in N j N ij documents. Using Bernoulli distribution: p(n ij )= ( N j ) pi N ij q N j N ij N ij i Using Shannon s self-information notion: Inf(N ij )= log 2 (p(n ij )) Stirling s approximation: n!= 2πn ( ) n n e Inf(N ij ) is used to modify the indexing weights iteratively. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 18 / 23

Results Comparison of Query Representation over the Language Model Dataset LM CQE MCTM QR (Improvements %) TREC-2 MAP 0.183 0.192 0.185 0.203 (+10.9**,+5.7,+9.7*) P30 0.386 0.393 0.392 0.415 (+7.5,+5.6,+5.9) TREC-7 MAP 0.179 0.184 0.184 0.2 (+11.7**,+8.7**,+8.7*) P30 0.289 0.284 0.291 0.315 (+9.0**,+10.9*,+8.2*) Comparison of NBDS Model applied to the Language model Dataset LM LM+NBDS Improvement (%) TREC 2 MAP 0.183 0.199 +8.7** P10 0.448 0.462 +3.1 TREC 3 MAP 0.197 0.212 +7.6** P10 0.474 0.53 +11.8** Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 19 / 23

Results Sentence Extraction Experiments System DUC01 DUC02 ROGUE-1 ROGUE-2 ROGUE-1 ROGUE-2 IntraLink 0.439 0.172 0.45 0.19 IntraLink+bern 0.447 0.184 0.461 0.202 UniformLink 0.438 0.173 0.458 0.199 UniformLink+bern 0.443 0.183 0.462 0.205 Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 20 / 23

Conclusions The problems of term mismatch and context independent document indexing have been addressed using distributional hypothesis. A proper mathematical framework has been provided to the query expansion and document smoothing techniques. The proposed knowledge discovery models have been shown to perform significantly superior to the traditional retrieval frameworks. Being developed in the generalized retrieval framework, these models are applicable to all of the retrieval frameworks. The proposed models for document smoothing do not cause any extra computational burden at run-time. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 21 / 23

Publications Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, Query Representation through Lexical Association for Information Retrieval (Preprint) IEEE Transactions on Knowledge and Data Engineering, vol. PP, July 2011. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, A Novel Neighborhood Based Document Smoothing Model for Information Retrieval, Revised and Submitted to Information Retrieval, Springer. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, A Context Based Word Indexing Model for Text Summarization Submitted to IEEE Transactions on Knowledge and Data Engineering. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, Application of Bayesian Framework in Natural Language Understanding, IETE Technical Review, Volume 25, Issue 5, pp 251-269, 2008. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, Entailment of Causal Queries in Narratives Using Action Language. In the proceedings of KDIR 2009, October 04-06, Funchal, Portugal, pp 112-118. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, An Information Retrieval Model Based On Automatically Learnt Concept Hierarchies. In the proceedings of IEEE ICSC 2009, Berkeley, CA, USA, pp 458-465. Pawan Goyal, Laxmidhar Behera and T. M. McGinnity, An Information Retrieval Approach Based on Semantically Adapted Vector Space Model. In the proceedings of ICON, 2009, December 14-17, Hyderabad, INDIA. Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 22 / 23

Questions? Pawan Goyal (http://www.inria.fr/) Analytic Knowledge Discovery Models March 20, 2012 23 / 23