A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University


1 A Machine Learning Approach for Information Retrieval Applications Luo Si Department of Computer Science Purdue University

2 Why Information Retrieval: Information Overload. Since the introduction of digital libraries and the Web, humans have accumulated more digital information than they can absorb.

3 Why Information Retrieval: Information Overload. In 2008, Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 10,845 trillion words and 3.6 zettabytes (1 zettabyte = $10^{21}$ bytes), corresponding to 100,500 words and 34 gigabytes for an average person on an average day. The information came from more than 20 different sources, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video). Extracted from "How Much Information? 2009 Report on American Consumers" by Roger E. Bohn and James E. Short.

4 Why Information Retrieval:
Narrow Sense: Information retrieval ranks a collection of documents for user queries according to degree of relevance (i.e., ad hoc search).
Broad Sense: Information retrieval provides solutions for the acquisition, storage, organization, retrieval and analysis of information.
Information retrieval mainly studies unstructured data: text in Web pages or emails; images; audio; video; protein sequences.
Web search is one of the most popular information retrieval applications.

5 IR Applications
Information Retrieval: a gold mine of applications
Web Search
Information Organization: text categorization; document clustering
Information Recommendation: by content or by collaborative information
Information Extraction: deep analysis of the surface text data
Question Answering: find the answer directly
Federated Search: explore the hidden Web
Multimedia Information Retrieval: image, video
Information Visualization: let users understand the results in the best way

6 IR and other disciplines
[Diagram relating Information Retrieval to neighboring areas: Natural Language Processing, Image Understanding, Machine Learning, Pattern Recognition, Statistical Learning, Information Extraction, Text Mining, Database, Knowledge Mining, Visualization, Library & Info Science, and Security & Privacy. The areas are grouped under Theory, Deep Analysis, System Applications and System Support.]

7 Information Retrieval Models (for Ad-Hoc Retrieval)
Ad-Hoc Retrieval: satisfy users' short-term information needs, expressed as queries (e.g., text).
The need is short and temporary (e.g., info about a movie).
The information source is relatively static while user queries change.
Users pull information from information sources (e.g., the Web).
Application examples: Web search, library search, entity search.

8 Information Retrieval Models (for Ad-Hoc Retrieval)
Estimate the relevance of a document to a user query.
Similarity Based: $Sim(Rep(q), Rep(d))$, with different representations and similarity measurements. Vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89).
Probabilistic Approach: $P(d \mid q)$, $P(q \mid d)$; probability of relevance $P(r = 1 \mid q, d)$, $r \in \{0, 1\}$.
Generative Model, doc generation: classical prob. model (Robertson & Sparck Jones, 76).
Generative Model, query generation or inference: language modeling approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a); inference network model (Turtle & Croft, 91).
Discriminative Model: learning the probability of relevance (recent work on Learning to Rank).

9 Information Retrieval Models (for Ad-Hoc Retrieval)
Vector Space Model: Doc and Qry are vectors in a vector space. Vectors are represented in weighted form (e.g., term frequency and inverse document frequency). The closeness of the Doc and Qry vectors determines relevance.
[Figure: documents D1, D2, D3 and the query drawn as vectors over the term axes Java, Sun and Starbucks.]
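As a rough illustration of the vector space model on this slide, the following Python sketch weights terms by TF-IDF and ranks documents by the cosine similarity between each document vector and the query vector. The function names (`tfidf_vectors`, `cosine`, `rank`), the `log(N/df) + 1` IDF variant and the toy documents are assumptions made for this example; they are not taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so terms in every doc keep a small weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Rank document indices by decreasing cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scores = [cosine(q, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Hypothetical toy collection echoing the Java / Sun / Starbucks axes of the figure.
DOCS = [
    ["java", "sun", "java"],
    ["starbucks", "java", "coffee"],
    ["sun", "solar"],
]
QUERY = ["java", "sun"]
```

Calling `rank(QUERY, DOCS)` places the first document on top because it is the only one that shares both query terms.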

10 Information Retrieval Models (for Ad-Hoc Retrieval)
Vector Space Model:
Advantages: Provides an intuitive solution for retrieval. Easy to implement.
Disadvantages: The vector representation is heuristic, without solid justification. Difficult to incorporate complex features (e.g., PageRank).

11 Information Retrieval Models (for Ad-Hoc Retrieval)
Statistical Language Modeling: Treat Doc and Qry as language models, whose words are generated by multinomial distributions.
Rank documents by query generation probability:
$$\log p(Q \mid Doc) = \sum_{q_w \in Q} \log p(q_w \mid Doc)$$
The document language model is smoothed by the whole collection:
$$P(q_w \mid Doc) = \lambda\, P_{MLE}(q_w \mid Doc) + (1 - \lambda)\, P_{MLE}(q_w \mid Collection)$$
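The query-likelihood scoring on this slide can be sketched in a few lines of Python: each query word's probability is an interpolation of the document's maximum-likelihood estimate and the collection's. The function name `query_log_likelihood`, the parameter name `lam` for the interpolation weight, its default of 0.5 and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """log p(Q | Doc) with the document model smoothed by the collection model.

    query, doc     : lists of tokens
    collection_tf  : Counter of term frequencies over the whole collection
    collection_len : total number of tokens in the collection
    lam            : interpolation weight on the document model (assumed default)
    """
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc) if doc else 0.0   # MLE from the document
        p_col = collection_tf[w] / collection_len       # MLE from the collection
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            # The word occurs nowhere in the collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical two-document collection.
D1 = ["a", "b"]
D2 = ["b", "c"]
COLLECTION_TF = Counter(D1 + D2)
COLLECTION_LEN = len(D1) + len(D2)
```

Because of the collection term, a document that lacks a query word still receives a finite score as long as the word occurs somewhere in the collection.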

12 Information Retrieval Models (for Ad-Hoc Retrieval)
Statistical Language Modeling:
Advantages: Provides a formal method for modeling text data. Requires less parameter tuning.
Disadvantages: Not optimal, due to the gap between query generation probability and relevance. Difficult to incorporate complex features (e.g., PageRank) due to the generative process.

13 Information Retrieval Models (for Ad-Hoc Retrieval)
Learning to Rank: Given a pair of a user query (Qry) and a document (Doc), directly model the relevance of the document to the query.
Use features about the query and document, such as the language modeling retrieval score, the PageRank value, etc.
Use different algorithms for learning relevance (e.g., logistic regression); parameters are learned from training queries and judgments:
$$P(rel \mid Qry, Doc) = \frac{\exp f(Qry, Doc)}{1 + \exp f(Qry, Doc)}$$
The learned model can be used for predicting the relevance of documents for test queries.
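A minimal sketch of the logistic-regression variant mentioned on this slide, assuming that f(Qry, Doc) is a linear function of a small feature vector. The feature layout (bias, retrieval score, PageRank value), the learning rate, the number of epochs and the toy judgments are all made up for the example; real systems use far richer features and training procedures.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict(w, x):
    """P(rel | Qry, Doc) for feature vector x, assuming a linear f = w . x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(X, y, lr=0.5, epochs=2000):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = label - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical training pairs: [bias, retrieval score, PageRank value] and judgments.
X_TRAIN = [
    [1.0, 0.9, 0.8],
    [1.0, 0.8, 0.3],
    [1.0, 0.2, 0.4],
    [1.0, 0.1, 0.1],
]
Y_TRAIN = [1, 1, 0, 0]
```

After `w = train(X_TRAIN, Y_TRAIN)`, `predict(w, x)` gives the estimated probability of relevance for an unseen query-document feature vector `x`, and documents can be ranked by that value.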

14 Information Retrieval Models (for Ad-Hoc Retrieval)
Learning to Rank:
Advantages: Explicitly optimizes retrieval performance by fitting model parameters with training data. Provides a solid foundation for modeling relevance. Successfully used in many commercial search engines. Pairwise and listwise modeling have been used successfully for learning to rank.
Can the success of the machine learning approach be generalized from ad hoc retrieval to other information retrieval applications? Yes! But this requires intelligent algorithms tailored to different complex information retrieval applications.

15 Some IR Applications
Question Answering: QA aims at finding answers to natural language questions from a large collection of documents.
Example question: What is the city in China with the largest population? Answer: Shanghai.
[Diagram: Question -> Question Analysis -> Keywords -> Document Retrieval (over the Text Collection) -> Relevant Docs -> Answer Extraction -> Answer candidates -> Answer Selection -> Answer]

16 Some IR Applications
Federated Search (a.k.a. distributed information retrieval): Information hidden behind the search engines of independent sources (e.g., the hidden Web) may not be searchable by traditional search engines.
Hidden Web contents are estimated to be larger (e.g., 2 times larger) than the visible Web contents searchable by traditional search engines.
Over a set of independent engines (Engine 1, Engine 2, Engine 3, ..., Engine N), federated search involves: (1) Source Representation, (2) Source Selection, (3) Results Merging.

17 IR Applications: Expertise Search
Expertise Search: In the information age, the most important thing may not be what you know, but who you know. Expert search aims at finding the right people with the desired expertise; it searches for people instead of documents.
[Diagram: a user query is matched against evidence such as Web pages (e.g., homepages), publications and research projects to decide who is an expert.]

18 Research Questions for Complex Information Retrieval Applications
1. Breaking Isolation: from Isolated Information Items to Connected Information Items
Answer Selection: Answers are related to each other (e.g., similar contents); modeling answer relationships can improve accuracy and reduce answer redundancy.
Source Selection (Federated Search): Sources are related to each other (e.g., links, citations, etc.); a source B related to a relevant source A also tends to be relevant.
Our Approach: A Joint Probabilistic Approach that Models Available Information Items and Their Relationships

19 Research Questions for Complex Information Retrieval Applications
2. Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge
Expertise Search: Relevance judgments are available for each expert but not for the specific documents associated with the expert.
Our Approach: An Integrated Learning Approach that Explicitly Models Incomplete Knowledge

20 Research Questions for Complex Information Retrieval Applications
3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources
Expertise Search: Evidence of expertise comes from heterogeneous information sources (e.g., homepages, supervised Ph.D. dissertations, research projects).
Our Approach: A Mixture Model Probabilistic Approach that Intelligently Combines Evidence from Heterogeneous Sources for Different Types of Information Needs

21 Breaking Isolation: from Isolated Information Items to Connected Information Items
Answer Selection (Question Answering): Select the most relevant and most unique answers for each question.
Traditional methods rely on knowledge databases (e.g., WordNet, gazetteers) for identifying relevant answers with heuristic rules.
Independent Classification Models: supervised classification for predicting the relevance of each answer with features from knowledge databases.
Joint Probabilistic Classification: model the relevance of individual answers and their relationships; select unique answers by conditional probability of relevance (SIGIR 2007, Ko, Si, et al.) (ACM TOIS 2010, Ko, Si, et al.).

22 Breaking Isolation: from Isolated Information Items to Connected Information Items
Answer Selection (Question Answering), Joint Classification:
$S = \{S_1, S_2, \ldots, S_n\}$: correctness judgments for all answers.
$F = \{f_1, f_2, \ldots, f_n\}$: feature vectors of all answers for a question.
$$P(S \mid F) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{n} S_i\, \beta^{\top} f_i + \sum_{i,j\,(i \neq j)} \sum_{k} a_k\, \mathrm{sim}_k(c_i, c_j)\, S_i S_j \Big)$$
The first term models the relevance of individual answers; the second term models similarity relationships across multiple knowledge databases. This form is also called a Boltzmann machine or Ising model.
Select relevant and unique answers with conditional probability:
$$\mathrm{Score}(A_j) = P(S_j = 1 \mid F) - \max_{A_i \in \mathrm{SelectedAnswers}} P(S_j = 1 \mid S_i = 1, F)$$

23 Breaking Isolation: from Isolated Information Items to Connected Information Items
Answer Selection (Question Answering): Select relevant and unique answers with conditional probability.
Example question: Who were the U.S. presidents in the 1990s?
P(correct(William J. Clinton)) = 0.8
P(correct(Bill Clinton)) = ...
P(correct(George W. Bush)) = 0.6
P(correct(Bill Clinton) | correct(William J. Clinton)) = ...
P(correct(George W. Bush) | correct(William J. Clinton)) = 0.5
Score(Bill Clinton) = 0.7 - ... = ...
Score(George W. Bush) = 0.6 - 0.5 = 0.1
Empirical studies with two answer extractors.
[Table: Top 3 Accuracy and Mean Reciprocal Rank of the Baseline, Ind (independent) and Jnt (joint) models for Information Extractor 1 and Information Extractor 2; the values are not preserved in this transcription.]
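The selection rule in this example can be sketched as a greedy loop: pick the answer with the highest marginal probability first, then repeatedly pick the answer whose marginal exceeds, by the largest margin, its conditional probability given any already selected answer. The subtraction form follows the arithmetic of the George W. Bush line (0.6, 0.5, 0.1). The function name `select_answers` is an assumption, and the two "Bill Clinton" values below are hypothetical placeholders because the slide's values did not survive; only the 0.8, 0.6 and 0.5 figures come from the slide.

```python
def score(answer, selected, marginal, conditional):
    """Marginal probability minus the largest conditional given a selected answer."""
    if not selected:
        return marginal[answer]
    return marginal[answer] - max(conditional[(answer, s)] for s in selected)

def select_answers(marginal, conditional, k):
    """Greedily select up to k relevant and mutually non-redundant answers."""
    selected = []
    remaining = sorted(marginal)  # sorted so ties are broken deterministically
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(a, selected, marginal, conditional))
        selected.append(best)
        remaining.remove(best)
    return selected

MARGINAL = {
    "William J. Clinton": 0.8,
    "Bill Clinton": 0.7,        # hypothetical value, not recoverable from the slide
    "George W. Bush": 0.6,
}
# conditional[(a, b)] = P(correct(a) | correct(b))
CONDITIONAL = {
    ("Bill Clinton", "William J. Clinton"): 0.9,    # hypothetical value
    ("George W. Bush", "William J. Clinton"): 0.5,
}
```

With these numbers, `select_answers(MARGINAL, CONDITIONAL, 2)` keeps "William J. Clinton" and then prefers "George W. Bush" over the near-duplicate "Bill Clinton", which is the redundancy-reducing behavior the slide describes. Selecting a third answer would additionally require conditionals given "George W. Bush".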

24 Breaking Isolation: from Isolated Information Items to Connected Information Items
Source Selection (Federated Search): Select a few most relevant sources for each user query.
Big Doc Approach: treat sample docs from different sources as big documents and calculate and rank relevance scores (e.g., with the vector space model).
Independent Classification: supervised classification for predicting the relevance of each source, $P(V_i = 1 \mid f_i)$.
Joint Probabilistic Classification: model the relevance of individual sources and their relationships in a joint model, $P(V \mid F)$ (SIGIR 2010, Hong, Si, et al.).

25 Breaking Isolation: from Isolated Information Items to Connected Information Items
Source Selection (Federated Search): Empirical studies for selecting up to 5 sources from about 100 sources in two TREC (Text REtrieval Conference) collections and a collection of real-world digital libraries.
[Table: accuracy of selecting relevant sources by source rank for the Ind (independent) and Jnt (joint) models on TREC123, TREC4 and DIGLIB; the values are not preserved in this transcription.]

26 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge
Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association
Traditional expertise search approaches use a generative approach for estimating the query generation probability with heuristics. The query generation probability given an expert is
$$P(q \mid e) = \sum_{t=1}^{n} P(q \mid d_t)\, P(d_t \mid e)$$
where $P(q \mid d_t)$ comes from the doc language model and $P(d_t \mid e)$ from the frequency of name $e$ occurring in $d_t$.
Our approach: use some training data on experts for queries (no judgments for individual docs associated with experts); a discriminative learning model that integrates doc evidence and doc-candidate associations.
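The generative baseline on this slide sums, over the documents associated with an expert, the document's query likelihood times the strength of the document-expert association. The sketch below assumes the association is simply the expert's name count in each document normalized over that expert's documents; the slide only says it is based on name frequency, so this normalization, the function names and the numbers are assumptions.

```python
def association_from_name_counts(name_counts):
    """Turn per-document counts of an expert's name into association weights.

    Assumed heuristic: normalize the counts so that they sum to one.
    """
    total = sum(name_counts.values())
    if total == 0:
        return {}
    return {doc_id: count / total for doc_id, count in name_counts.items()}

def expert_query_likelihood(p_q_given_d, p_d_given_e):
    """Sum over documents of (query likelihood) * (document-expert association)."""
    return sum(p_q_given_d[doc_id] * p for doc_id, p in p_d_given_e.items())

# Hypothetical inputs: query likelihoods from a document language model,
# and how often the expert's name occurs in each document.
P_Q_GIVEN_D = {"d1": 0.02, "d2": 0.10}
NAME_COUNTS = {"d1": 3, "d2": 1}
```

Here `expert_query_likelihood(P_Q_GIVEN_D, association_from_name_counts(NAME_COUNTS))` evaluates 0.75 * 0.02 + 0.25 * 0.10, and experts would be ranked by that value.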

27 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge
Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association
$$P(r = 1 \mid e, q) = \sum_{t=1}^{n} P(r_1 = 1 \mid q, d_t)\, P(r_2 = 1 \mid e, d_t)\, P(d_t)$$
Here $P(r_1 = 1 \mid q, d_t)$ is the probability that the doc matches the query and $P(r_2 = 1 \mid e, d_t)$ is the probability that the doc supports the expert:
$$P(r_1 = 1 \mid q, d_t) = \sigma\Big(\sum_{i=1}^{N_f} \alpha_i f_i(q, d_t)\Big), \qquad P(r_2 = 1 \mid e, d_t) = \sigma\Big(\sum_{j=1}^{N_g} \beta_j g_j(e, d_t)\Big)$$
$\sigma$ is the standard logistic function; $f_i(q, d)$ denotes a doc feature (e.g., doc retrieval score; PageRank value); $g_j$ denotes a document-expert association feature (e.g., exact name match, last name match); $\alpha_i$ and $\beta_j$ are the learned feature weights.
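A small sketch of how the discriminative score on this slide could be evaluated once the weights are known: for each document associated with the expert, a logistic function of the query-document features is multiplied by a logistic function of the expert-document features and by the document prior, and the products are summed. The weight names `alpha` and `beta`, the dictionary layout of a document and the example values are assumptions for illustration; learning the weights from expert-level judgments is not shown.

```python
import math

def sigmoid(x):
    """Numerically stable standard logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def expert_relevance(docs, alpha, beta):
    """Sum over docs of P(doc matches query) * P(doc supports expert) * P(doc).

    docs  : list of dicts with keys "f" (query-doc features),
            "g" (expert-doc association features) and "prior" (document prior)
    alpha : weights for the query-doc features (assumed name)
    beta  : weights for the association features (assumed name)
    """
    return sum(
        sigmoid(dot(alpha, d["f"])) * sigmoid(dot(beta, d["g"])) * d["prior"]
        for d in docs
    )

# Hypothetical documents: f = [retrieval score, PageRank], g = [exact name match].
DOCS = [
    {"f": [0.9, 0.2], "g": [1.0], "prior": 0.5},
    {"f": [0.1, 0.7], "g": [0.0], "prior": 0.5},
]
```

The individual documents never need their own relevance labels at prediction time, which matches the slide's point that only expert-level judgments are available.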

28 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge
Expert Search: Empirical studies on two enterprise corpora, one from the World Wide Web Consortium (W3C) and one from an Australian organization (CERC).
[Tables: Top 5 results and Mean Average Precision of the Generative Model and the Discriminative Model on W3C and CERC; the values are not preserved in this transcription.]

29 Information Integration: Combining Evidence of Information Items from Heterogeneous Sources
Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources (e.g., homepages, supervised Ph.D. dissertations, research projects) for Different Types of Information Needs
The traditional expertise search approach uses weighted votes, with the importance of different sources specified by intuition.
Our approach: intelligently combine evidence from heterogeneous sources by learning the combination weights.
The weights should depend on experts, e.g., some senior faculty do not have homepages; some junior faculty do not have supervised Ph.D. dissertations.
The weights should depend on queries. For the query "cancer", research projects from NIH should carry more evidence than homepages.

30 Information Integration: Combining Evidence of Information Items from Heterogeneous Sources
Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources
$S_i(e, q)$: evidence score from the $i$th information source; $z_e$: latent variable for expert class; $z_q$: latent variable for query topic.
$$P(r = 1 \mid e, q) = \sum_{z_q = 1}^{N_{z_q}} \sum_{z_e = 1}^{N_{z_e}} P(z_e \mid e; \alpha)\, P(z_q \mid q; \beta)\, \sigma\Big(\sum_{i=1}^{K} \omega_{z_e z_q i}\, S_i(e, q)\Big)$$
$\omega_{z_e z_q i}$ are the combination weights for an expert class and a query topic; $\alpha$ and $\beta$ parameterize the class and topic membership distributions; $\sigma$ is the standard logistic function.
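To make the mixture idea concrete, the sketch below combines the K evidence scores with a different weight vector for every pair of expert class and query topic, and averages the resulting predictions using the class and topic membership probabilities. Passing the combined score through a logistic function is an assumption made here so that the result is a probability; the function name `mixture_relevance`, the argument layout and all numbers are likewise hypothetical, and estimating the memberships and weights is not shown.

```python
import math

def sigmoid(x):
    """Numerically stable standard logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def mixture_relevance(p_ze, p_zq, weights, scores):
    """Mixture over expert classes and query topics of weighted evidence.

    p_ze    : membership probabilities of the expert over expert classes
    p_zq    : membership probabilities of the query over query topics
    weights : weights[ze][zq] is the weight vector over the K evidence sources
    scores  : the K evidence scores S_i(e, q)
    """
    total = 0.0
    for ze, pe in enumerate(p_ze):
        for zq, pq in enumerate(p_zq):
            combined = sum(w * s for w, s in zip(weights[ze][zq], scores))
            total += pe * pq * sigmoid(combined)
    return total

# Hypothetical setup: 2 expert classes, 2 query topics, 3 evidence sources
# (e.g., homepages, dissertations, research projects).
P_ZE = [0.7, 0.3]
P_ZQ = [0.5, 0.5]
WEIGHTS = [
    [[1.0, 0.0, 0.5], [0.2, 0.0, 2.0]],
    [[0.0, 1.5, 0.5], [0.0, 0.5, 2.0]],
]
SCORES = [0.4, 0.0, 0.9]
```

An expert class whose weight on homepage evidence is zero, for example, is not penalized for lacking a homepage, which is the adaptivity the slides argue for.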

31 Information Integration: Combining Evidence of Information Items from Heterogeneous Sources
Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources
Empirical studies on INDURE (INdiana Database of University Research Expertise).
[Table: precision at top ranks (P@Top) for expCombSUM, a single learned model with fixed weights, and the mixture model with adaptive weights; the values are not preserved in this transcription.]

32 A Machine Learning Approach for Information Retrieval Applications
1. Breaking Isolation: from Isolated Information Items to Connected Information Items
2. Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge
3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources

33 A Machine Learning Approach for Information Retrieval Applications
Federated Search: source representation (SIGIR 2003; CIKM 2004); source selection (SIGIR 2003, 2005; CIKM 2002, 2004, 2009); results merging (SIGIR 2002; TOIS 2003; IRJ 2009).
Expertise Search: mixture model for integrating expertise evidence (IRJ 2010a); integrated model for combining doc evidence and doc-candidate association (SIGIR 2010); joint homepage discovery (IRJ 2010b).
Question Answering: independent answer selection (HLT 2007, IPM 2009); joint answer selection (SIGIR 2007, TOIS 2010); multilingual answer selection (TOIS 2010).
Machine Learning Techniques: multiple instance learning (IJCAI 2009), manifold learning (AAAI 2010), collaborative recommendation (ICML 2003, 2005; UAI 2004), active learning (UAI 2004)...

34 Acknowledgement Graduate Students: Suleyman Cetintas, Yi Fang, Dan Zhang, Dzung Hong Collaboration: Dr. Aditya Mathur; Dr. Jeongwoo Ko; Dr. Eric Nyberg Research support: National Science Foundation, State of Indiana, Purdue University, Google, Yahoo! and BGI


More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Federated Text Search

Federated Text Search CS54701 Federated Text Search Luo Si Department of Computer Science Purdue University Abstract Outline Introduction to federated search Main research problems Resource Representation Resource Selection

More information

UMass at TREC 2006: Enterprise Track

UMass at TREC 2006: Enterprise Track UMass at TREC 2006: Enterprise Track Desislava Petkova and W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts, Amherst, MA 01003 Abstract

More information

Outline. Morning program Preliminaries Semantic matching Learning to rank Entities

Outline. Morning program Preliminaries Semantic matching Learning to rank Entities 112 Outline Morning program Preliminaries Semantic matching Learning to rank Afternoon program Modeling user behavior Generating responses Recommender systems Industry insights Q&A 113 are polysemic Finding

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Graph Data & Introduction to Information Retrieval Huan Sun, CSE@The Ohio State University 11/21/2017 Slides adapted from Prof. Srinivasan Parthasarathy @OSU 2 Chapter 4

More information

Learning to Rank. Tie-Yan Liu. Microsoft Research Asia CCIR 2011, Jinan,

Learning to Rank. Tie-Yan Liu. Microsoft Research Asia CCIR 2011, Jinan, Learning to Rank Tie-Yan Liu Microsoft Research Asia CCIR 2011, Jinan, 2011.10 History of Web Search Search engines powered by link analysis Traditional text retrieval engines 2011/10/22 Tie-Yan Liu @

More information

Information Retrieval (Part 1)

Information Retrieval (Part 1) Information Retrieval (Part 1) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Università di Padova Anno Accademico 2008/2009 1 Bibliographic References Copies of slides Selected

More information

A Study of Collection-based Features for Adapting the Balance Parameter in Pseudo Relevance Feedback

A Study of Collection-based Features for Adapting the Balance Parameter in Pseudo Relevance Feedback A Study of Collection-based Features for Adapting the Balance Parameter in Pseudo Relevance Feedback Ye Meng 1, Peng Zhang 1, Dawei Song 1,2, and Yuexian Hou 1 1 Tianjin Key Laboratory of Cognitive Computing

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

It s time for a semantic engine!

It s time for a semantic engine! It s time for a semantic engine! Ido Dagan Bar-Ilan University, Israel 1 Semantic Knowledge is not the goal it s a primary mean to achieve semantic inference! Knowledge design should be derived from its

More information

Link Prediction in Relational Data

Link Prediction in Relational Data Link Prediction in Relational Data Alexandra Chouldechova STATS 319, March 1, 2011 Motivation for Relational Models Quantifying trends in social interaction Improving document classification Inferring

More information

Social Media Computing

Social Media Computing Social Media Computing Lecture 4: Introduction to Information Retrieval and Classification Lecturer: Aleksandr Farseev E-mail: farseev@u.nus.edu Slides: http://farseev.com/ainlfruct.html At the beginning,

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Re-ranking Documents Based on Query-Independent Document Specificity

Re-ranking Documents Based on Query-Independent Document Specificity Re-ranking Documents Based on Query-Independent Document Specificity Lei Zheng and Ingemar J. Cox Department of Computer Science University College London London, WC1E 6BT, United Kingdom lei.zheng@ucl.ac.uk,

More information

Document Clustering for Mediated Information Access The WebCluster Project

Document Clustering for Mediated Information Access The WebCluster Project Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at

More information

Ph.D. in Computer Science & Technology, Tsinghua University, Beijing, China, 2007

Ph.D. in Computer Science & Technology, Tsinghua University, Beijing, China, 2007 Yiqun Liu Associate Professor & Department co-chair Department of Computer Science and Technology Email yiqunliu@tsinghua.edu.cn URL http://www.thuir.org/group/~yqliu Phone +86-10-62796672 Fax +86-10-62796672

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Towards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation

Towards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation Question ing Overview and task definition History Open-domain question ing Basic system architecture Watson s architecture Techniques Predictive indexing methods Pattern-matching methods Advanced techniques

More information

Topic-Level Random Walk through Probabilistic Model

Topic-Level Random Walk through Probabilistic Model Topic-Level Random Walk through Probabilistic Model Zi Yang, Jie Tang, Jing Zhang, Juanzi Li, and Bo Gao Department of Computer Science & Technology, Tsinghua University, China Abstract. In this paper,

More information

Attentive Neural Architecture for Ad-hoc Structured Document Retrieval

Attentive Neural Architecture for Ad-hoc Structured Document Retrieval Attentive Neural Architecture for Ad-hoc Structured Document Retrieval Saeid Balaneshin 1 Alexander Kotov 1 Fedor Nikolaev 1,2 1 Textual Data Analytics Lab, Department of Computer Science, Wayne State

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

State of the Art and Trends in Search Engine Technology. Gerhard Weikum State of the Art and Trends in Search Engine Technology Gerhard Weikum (weikum@mpi-inf.mpg.de) Commercial Search Engines Web search Google, Yahoo, MSN simple queries, chaotic data, many results key is

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:

More information

Graph Classification in Heterogeneous

Graph Classification in Heterogeneous Title: Graph Classification in Heterogeneous Networks Name: Xiangnan Kong 1, Philip S. Yu 1 Affil./Addr.: Department of Computer Science University of Illinois at Chicago Chicago, IL, USA E-mail: {xkong4,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Entity and Knowledge Base-oriented Information Retrieval

Entity and Knowledge Base-oriented Information Retrieval Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061

More information

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important

More information

Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks

Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks Wei Zheng Hui Fang Department of Electrical and Computer Engineering University of Delaware

More information

A Study of Methods for Negative Relevance Feedback

A Study of Methods for Negative Relevance Feedback A Study of Methods for Negative Relevance Feedback Xuanhui Wang University of Illinois at Urbana-Champaign Urbana, IL 61801 xwang20@cs.uiuc.edu Hui Fang The Ohio State University Columbus, OH 43210 hfang@cse.ohiostate.edu

More information

Citation Prediction in Heterogeneous Bibliographic Networks

Citation Prediction in Heterogeneous Bibliographic Networks Citation Prediction in Heterogeneous Bibliographic Networks Xiao Yu Quanquan Gu Mianwei Zhou Jiawei Han University of Illinois at Urbana-Champaign {xiaoyu1, qgu3, zhou18, hanj}@illinois.edu Abstract To

More information

Machine Learning for Information Discovery

Machine Learning for Information Discovery Machine Learning for Information Discovery Thorsten Joachims Cornell University Department of Computer Science (Supervised) Machine Learning GENERAL: Input: training examples design space Training: automatically

More information

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 17 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(17), 2014 [9562-9566] Research on data mining clustering algorithm in cloud

More information

CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS

CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS Weiyi Meng SUNY, BINGHAMTON Clement Yu UNIVERSITY OF ILLINOIS, CHICAGO Introduction Text Retrieval System Architecture Document Representation Document-Query

More information

Finding Topic-centric Identified Experts based on Full Text Analysis

Finding Topic-centric Identified Experts based on Full Text Analysis Finding Topic-centric Identified Experts based on Full Text Analysis Hanmin Jung, Mikyoung Lee, In-Su Kang, Seung-Woo Lee, Won-Kyung Sung Information Service Research Lab., KISTI, Korea jhm@kisti.re.kr

More information

COS 513: Foundations of Probabilistic Modeling. Lecture 5

COS 513: Foundations of Probabilistic Modeling. Lecture 5 COS 513: Foundations of Probabilistic Modeling Young-suk Lee 1 Administrative Midterm report is due Oct. 29 th. Recitation is at 4:26pm in Friend 108. Lecture 5 R is a computer language for statistical

More information

Faculty of Science and Technology MASTER S THESIS

Faculty of Science and Technology MASTER S THESIS Faculty of Science and Technology MASTER S THESIS Study program/ Specialization: Master of Science in Computer Science Spring semester, 2016 Open Writer: Shuo Zhang Faculty supervisor: (Writer s signature)

More information

Welcome to the class of Web Information Retrieval!

Welcome to the class of Web Information Retrieval! Welcome to the class of Web Information Retrieval! Tee Time Topic Augmented Reality and Google Glass By Ali Abbasi Challenges in Web Search Engines Min ZHANG z-m@tsinghua.edu.cn April 13, 2012 Challenges

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

CS674 Natural Language Processing

CS674 Natural Language Processing CS674 Natural Language Processing A question What was the name of the enchanter played by John Cleese in the movie Monty Python and the Holy Grail? Question Answering Eric Breck Cornell University Slides

More information

Two-Stage Language Models for Information Retrieval

Two-Stage Language Models for Information Retrieval Two-Stage Language Models for Information Retrieval ChengXiang hai School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 John Lafferty School of Computer Science Carnegie Mellon University

More information

ONTOPARK: ONTOLOGY BASED PAGE RANKING FRAMEWORK USING RESOURCE DESCRIPTION FRAMEWORK

ONTOPARK: ONTOLOGY BASED PAGE RANKING FRAMEWORK USING RESOURCE DESCRIPTION FRAMEWORK Journal of Computer Science 10 (9): 1776-1781, 2014 ISSN: 1549-3636 2014 doi:10.3844/jcssp.2014.1776.1781 Published Online 10 (9) 2014 (http://www.thescipub.com/jcs.toc) ONTOPARK: ONTOLOGY BASED PAGE RANKING

More information

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline Introduction to Information Retrieval (COSC 488) Spring 2012 Nazli Goharian nazli@cs.georgetown.edu Course Outline Introduction Retrieval Strategies (Models) Retrieval Utilities Evaluation Indexing Efficiency

More information

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information