A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University


1 A Machine Learning Approach for Information Retrieval Applications Luo Si Department of Computer Science Purdue University

2 Why Information Retrieval: Information Overload: Since the introduction of digital libraries and the Web, humans have accumulated more digital information than they can absorb

3 Why Information Retrieval: Information Overload: In 2008, Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 10,845 trillion words and 3.6 zettabytes (10^21 bytes), corresponding to 100,500 words and 34 gigabytes for an average person on an average day. The information came from more than 20 different sources, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video). Extracted from "How Much Information? 2009 Report on American Consumers" by Roger E. Bohn and James E. Short

4 Why Information Retrieval: Narrow Sense: Information retrieval ranks a collection of documents for user queries according to degree of relevance (i.e., ad hoc search). Broad Sense: Information retrieval provides solutions for the acquisition, storage, organization, retrieval, and analysis of information. Information retrieval mainly studies unstructured data: text in Web pages or emails; image; audio; video; protein sequences. Web search is one of the most popular information retrieval applications.

5 IR Applications Information Retrieval: a gold mine of applications. Web Search. Information Organization: text categorization; document clustering. Information Recommendation: by content or by collaborative information. Information Extraction: deep analysis of the surface text data. Question-Answering: find the answer directly. Federated Search: explore the hidden Web. Multimedia Information Retrieval: image, video. Information Visualization: let users understand the results in the best way.

6 IR and other disciplines [Diagram: Information Retrieval shown alongside Natural Language Processing, Image Understanding, Machine Learning, Pattern Recognition, Statistical Learning, Information Extraction, Text Mining, Database, Knowledge Mining, Visualization, Library & Info Science, and Security & Privacy, with the labels Theory, Deep Analysis, System Applications, and System Support]

7 Information Retrieval Models (for Ad-Hoc Retrieval) Ad-Hoc Retrieval: Satisfy users' short-term information needs, expressed as queries (e.g., text). The need is short and temporary (e.g., info about a movie). The information source is relatively static while user queries change. Users pull information from information sources (e.g., the Web). Application examples: Web search, library search, entity search.

8 Information Retrieval Models (for Ad-Hoc Retrieval) Estimate document relevance to a user query.
Similarity based: Sim(Rep(q), Rep(d)), with different representations and similarity measurements: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89).
Probabilistic approach: P(d|q), P(q|d); probability of relevance P(r=1|q,d), r ∈ {0,1}.
Generative model: doc generation, classical prob. model (Robertson & Sparck Jones, 76); query generation or inference, language modeling approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a) and inference network model (Turtle & Croft, 91).
Discriminative model: learning probability of relevance (recent work on learning to rank).

9 Information Retrieval Models (for Ad-Hoc Retrieval) Vector Space Model: Doc and Qry are vectors in a vector space. Vectors are represented in weighted form (e.g., term frequency and inverse document frequency). Closeness of the Doc and Qry vectors determines relevance. [Figure: documents D1, D2, D3 and the Query drawn as vectors along the term axes Java, Sun, and Starbucks]
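The vector space idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the three toy documents, the smoothed idf formula, and the function names `tfidf_vectors`, `cosine`, and `rank` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the idf table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumed variant) so terms found in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Rank document indices by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy collection (hypothetical), echoing the Java / Sun / Starbucks axes of the figure.
docs = [
    "java sun programming language".split(),  # D1
    "java coffee starbucks".split(),          # D2
    "sun weather forecast".split(),           # D3
]
order, scores = rank("java starbucks".split(), docs)
```

Here D2 shares both query terms and ranks first, while D3 shares none and gets a similarity of zero.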

10 Information Retrieval Models (for Ad-Hoc Retrieval) Vector Space Model: Advantages: provides an intuitive solution for retrieval; easy to implement. Disadvantages: the vector representation is heuristic, without solid justification; difficult to incorporate complex features (e.g., PageRank).

11 Information Retrieval Models (for Ad-Hoc Retrieval) Statistical Language Modeling: Treat Doc and Qry as language models, in which words are generated by multinomial distributions. Rank documents by query generation probability:
$$\log p(Q \mid Doc) = \sum_{q_w \in Q} \log p(q_w \mid Doc)$$
The document language model is smoothed by the whole collection:
$$P(q_w \mid Doc) = \lambda \, P_{MLE}(q_w \mid Doc) + (1 - \lambda) \, P_{MLE}(q_w \mid Collection)$$
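A small Python sketch of query-likelihood ranking with this kind of linear smoothing may help make the formula concrete. The toy documents, the mixing weight `lam = 0.5`, and the function name `query_log_likelihood` are assumptions for illustration only.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """log p(Q|Doc): sum over query words of the log of the smoothed word probability."""
    doc_counts = Counter(doc)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc) if doc else 0.0   # maximum likelihood in the document
        p_coll = collection_counts[w] / collection_len     # maximum likelihood in the collection
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # the word never occurs anywhere in the collection
        score += math.log(p)
    return score

# Hypothetical toy collection.
docs = [
    "java sun programming language".split(),
    "java coffee starbucks".split(),
    "sun weather forecast".split(),
]
collection_counts = Counter(w for d in docs for w in d)
collection_len = sum(len(d) for d in docs)

query = "java coffee".split()
scores = [query_log_likelihood(query, d, collection_counts, collection_len) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
```

Because of the collection term, a document that lacks a query word still receives a finite score, so it is ranked lower rather than eliminated.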

12 Information Retrieval Models (for Ad-Hoc Retrieval) Statistical Language Modeling: Advantages: provides a formal method for modeling text data; less parameter tuning. Disadvantages: not optimal due to the gap between query generation probability and relevance; difficult to incorporate complex features (e.g., PageRank) due to the generative process.

13 Information Retrieval Models (for Ad-Hoc Retrieval) Learning to Rank: Given a pair of a user query (Qry) and a document (Doc), directly model the relevance of the document to the query. Use features of the query and document such as the language modeling retrieval score, PageRank value, etc. Use different algorithms for learning relevance (e.g., logistic regression); parameters are learned from training queries and judgments:
$$P(rel \mid Qry, Doc) = \frac{\exp f(Qry, Doc)}{1 + \exp f(Qry, Doc)}$$
The learned model can be used for predicting the relevance of documents for test queries.
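As a rough sketch of the pointwise, logistic-regression flavor of learning to rank described here, the following Python fits a linear f(Qry, Doc) by gradient ascent on the log-likelihood. Everything concrete in it is assumed for illustration: the two features, the six training pairs, the learning rate, and the names `train` and `predict`.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, features):
    """P(rel | Qry, Doc) with f(Qry, Doc) taken to be a linear function of the features."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(examples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient ascent on the log-likelihood."""
    n_features = len(examples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(examples, labels):
            err = y - predict(weights, bias, x)
            grad_b += err
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
        bias += lr * grad_b / len(examples)
        weights = [w + lr * g / len(examples) for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical (query, document) pairs: [retrieval score, PageRank-like value], scaled to [0, 1].
train_x = [[0.9, 0.8], [0.8, 0.3], [0.7, 0.9], [0.2, 0.4], [0.3, 0.1], [0.1, 0.7]]
train_y = [1, 1, 1, 0, 0, 0]  # relevance judgments for the training queries
weights, bias = train(train_x, train_y)

# Rank unseen documents for a test query by predicted probability of relevance.
test_docs = {"doc_a": [0.85, 0.5], "doc_b": [0.15, 0.5]}
ranking = sorted(test_docs, key=lambda d: predict(weights, bias, test_docs[d]), reverse=True)
```

Pairwise and listwise methods change the loss function but keep the same idea of learning f from judged training queries.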

14 Information Retrieval Models (for Ad-Hoc Retrieval) Learning to Rank: Advantages: explicitly optimizes retrieval performance by fitting model parameters to training data; provides a solid foundation for modeling relevance; successfully used in many commercial search engines; pairwise/listwise modeling successfully used for learning to rank. Can the success of the machine learning approach be generalized from ad hoc retrieval to other information retrieval applications? Yes! But this requires intelligent algorithms for different complex information retrieval applications.

15 Some IR Applications Question Answering: QA aims at finding answers to natural language questions from a large collection of documents. Example question: What is the city in China with the largest population? [Figure: pipeline from Question through Question Analysis (keywords), Document Retrieval over the Text Collection (relevant docs), Answer Extraction (answer candidates), and Answer Selection to the Answer: Shanghai]

16 Some IR Applications Federated Search (a.k.a. distributed information retrieval): Information hidden behind the search engines of independent sources (e.g., the hidden Web) may not be searchable by traditional search engines. Hidden Web contents are estimated to be larger (e.g., 2 times larger) than the visible Web contents searchable by traditional search engines. [Figure: Engine 1, Engine 2, Engine 3, ..., Engine N, with the three steps (1) Source Representation, (2) Source Selection, (3) Results Merging]

17 IR Applications: Expertise Search Expertise Search: In the information age, the most important thing may not be what you know, but who you know. Expert search aims at finding the right people with the desired expertise; search for people instead of documents. [Figure: a User Query is matched against Web Pages (e.g., homepage), Publications, and Research Projects to answer "Expert?"]

18 Research Questions for Complex Information Retrieval Applications 1. Breaking Isolation: from Isolated Information Items to Connected Information Items. Answer Selection: answers are related to each other (e.g., similar contents); modeling answer relationships can improve accuracy and reduce answer redundancy. Source Selection (Federated Search): sources are related to each other (e.g., links, citations, etc.); a source B related to a relevant source A also tends to be relevant. Our Approach: A Joint Probabilistic Approach that Models Available Information Items and Their Relationships

19 Research Questions for Complex Information Retrieval Applications 2. Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise Search: relevance judgments are available for each expert but not for the specific documents associated with the expert. Our Approach: An Integrated Learning Approach that Explicitly Models Incomplete Knowledge

20 Research Questions for Complex Information Retrieval Applications 3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources. Expertise Search: evidence of expertise comes from heterogeneous information sources (e.g., homepages, supervised Ph.D. dissertations, research projects). Our Approach: A Mixture Model Probabilistic Approach that Intelligently Combines Evidence from Heterogeneous Sources for Different Types of Information Needs

21 Breaking Isolation: from Isolated Information Items to Connected Information Items Answer Selection (Question Answering): Select the most relevant and most unique answers for each question. Traditional methods rely on knowledge databases (e.g., WordNet, gazetteers) for identifying relevant answers with heuristic rules. Independent Classification Models: supervised classification for predicting the relevance of each answer with features from knowledge databases. Joint Probabilistic Classification: model the relevance of individual answers and their relationships; select unique answers by conditional probability of relevance (SIGIR 2007, Ko, Si, et al.) (ACM TOIS 2010, Ko, Si, et al.)

22 Breaking Isolation: from Isolated Information Items to Connected Information Items Answer Selection (Question Answering): Joint Classification:
$S = \{S_1, S_2, \ldots, S_n\}$: correctness judgments for all answers
$F = \{f_1, f_2, \ldots, f_n\}$: feature vectors of all answers for a question
$$P(S \mid F) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{n} S_i \, \theta^{\top} f_i + \sum_{i,j \, (i \neq j)} \sum_{k} a_k \, \mathrm{sim}_k(c_i, c_j) \, S_i S_j \Big)$$
The first term models the relevance of individual answers; the second term models similarity relationships across multiple knowledge databases. This form is also called a Boltzmann machine or Ising model.
Select relevant and unique answers with conditional probability:
$$\mathrm{Score}(A_j) = P(S_j = 1 \mid F) - \max_{A_i \in \mathrm{SelectedAnswers}} P(S_j = 1 \mid S_i = 1, F)$$

23 Breaking Isolation: from Isolated Information Items to Connected Information Items Answer Selection (Question Answering): Select relevant and unique answers with conditional probability. Example question: Who were the U.S. presidents in the 1990s?
P(correct(William J. Clinton)) = 0.8
P(correct(Bill Clinton)) = …
P(correct(George W. Bush)) = 0.6
P(correct(Bill Clinton) | correct(William J. Clinton)) = …
P(correct(George W. Bush) | correct(William J. Clinton)) = 0.5
Score(Bill Clinton) = 0.7 - … = …
Score(George W. Bush) = 0.6 - 0.5 = 0.1
Empirical studies with two answer extractors. [Table: Baseline, Ind, and Jnt compared on Top 3 Accuracy and Mean Reciprocal Rank for Information Extractor 1 and Information Extractor 2; values not recoverable]
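The selection rule in this example (marginal probability of correctness minus the largest conditional probability given an already selected answer) can be sketched as a greedy loop. This is an illustrative reading, not the authors' code: the function names are invented, the probabilities are passed in as plain dictionaries rather than inferred from a joint model, and the two values used for Bill Clinton (0.7 and 0.9) are hypothetical stand-ins.

```python
def redundancy_score(answer, selected, marginal, conditional):
    """Score(A_j) = P(correct(A_j)) - max over selected A_i of P(correct(A_j) | correct(A_i))."""
    if not selected:
        return marginal[answer]  # nothing selected yet: rank by marginal probability
    # A missing pair falls back to the marginal, i.e. the two answers are treated as independent.
    penalty = max(conditional.get((answer, s), marginal[answer]) for s in selected)
    return marginal[answer] - penalty

def select_answers(marginal, conditional, k):
    """Greedily pick k answers that are likely correct and not redundant with earlier picks."""
    selected, remaining = [], set(marginal)
    while remaining and len(selected) < k:
        best = max(sorted(remaining),
                   key=lambda a: redundancy_score(a, selected, marginal, conditional))
        selected.append(best)
        remaining.remove(best)
    return selected

marginal = {
    "William J. Clinton": 0.8,
    "Bill Clinton": 0.7,       # hypothetical value
    "George W. Bush": 0.6,
}
conditional = {
    ("Bill Clinton", "William J. Clinton"): 0.9,   # hypothetical value
    ("George W. Bush", "William J. Clinton"): 0.5,
}
picked = select_answers(marginal, conditional, k=2)
```

With these numbers the near-duplicate "Bill Clinton" gets a negative score once "William J. Clinton" has been chosen, so "George W. Bush" is selected second even though its marginal probability is lower.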

24 Breaking Isolation: from Isolated Information Items to Connected Information Items Source Selection (Federated Search): Select a few most relevant sources for each user query. Big Doc Approach: treat sample docs from different sources as big documents and calculate/rank relevance scores (e.g., vector space model). Independent Classification: supervised classification for predicting the relevance of each source, $P(V_i = 1 \mid f_i)$. Joint Probabilistic Classification: model the relevance of individual sources and their relationships in a joint model, $P(V \mid F)$ (SIGIR 2010, Hong, Si, et al.)

25 Breaking Isolation: from Isolated Information Items to Connected Information Items Source Selection (Federated Search): Empirical studies for selecting up to 5 sources from about 100 sources in two TREC (Text REtrieval Conference) collections and a collection of real-world digital libraries. Measurement: accuracy of selecting relevant sources. [Table: Ind and Jnt compared by source rank on TREC123, TREC4, and DIGLIB; values not recoverable]

26 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association. Traditional expertise search approaches use a generative approach for estimating query generation probability with heuristics. Query generation probability given an expert:
$$P(q \mid e) = \sum_{t=1}^{n} P(q \mid d_t) \, P(d_t \mid e)$$
where $P(q \mid d_t)$ comes from the doc language model and $P(d_t \mid e)$ from the frequency of name e occurring in $d_t$. Our approach: use some training data on experts for queries (no judgments for individual docs associated with experts); a discriminative learning model integrating doc evidence and doc-candidate associations

27 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association
$$P(r = 1 \mid e, q) = \sum_{t=1}^{n} P(r_1 = 1 \mid q, d_t) \, P(r_2 = 1 \mid e, d_t) \, P(d_t)$$
where $P(r_1 = 1 \mid q, d_t)$ is the probability that the doc matches the query and $P(r_2 = 1 \mid e, d_t)$ is the probability that the doc supports the expert:
$$P(r_1 = 1 \mid q, d_t) = \sigma\Big( \sum_{i=1}^{N_f} \alpha_i f_i(q, d_t) \Big), \qquad P(r_2 = 1 \mid e, d_t) = \sigma\Big( \sum_{j=1}^{N_g} \beta_j g_j(e, d_t) \Big)$$
σ is the standard logistic function; $f_i(q, d)$ denotes the doc feature (e.g., doc retrieval score; PageRank value); $g_j$ denotes the document-expert association feature (e.g., exact name match, last name match)
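To make the structure of this discriminative model concrete, here is a minimal Python sketch that scores an expert by summing, over the associated documents, the product of the two logistic factors and the document prior. The feature values, the weight vectors `alpha` and `beta`, the constant bias feature, and the function name `expert_relevance` are assumptions for the example; in the slide's approach the weights would be learned from expert-level judgments.

```python
import math

def sigmoid(z):
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def expert_relevance(doc_feats, assoc_feats, doc_priors, alpha, beta):
    """P(r=1|e,q) = sum over docs of P(r1=1|q,d_t) * P(r2=1|e,d_t) * P(d_t)."""
    total = 0.0
    for f, g, prior in zip(doc_feats, assoc_feats, doc_priors):
        p_doc_matches_query = sigmoid(sum(a * x for a, x in zip(alpha, f)))
        p_doc_supports_expert = sigmoid(sum(b * x for b, x in zip(beta, g)))
        total += p_doc_matches_query * p_doc_supports_expert * prior
    return total

# Assumed weights; the last entry of each vector multiplies a constant bias feature of 1.0.
alpha = [2.0, 1.0, -1.0]   # [doc retrieval score, PageRank-like value, bias]
beta = [3.0, 1.0, -2.0]    # [exact name match, last name match, bias]

doc_feats = [[0.9, 0.5, 1.0], [0.2, 0.1, 1.0]]  # f(q, d_t) for two hypothetical documents
doc_priors = [0.5, 0.5]                          # uniform P(d_t)

assoc_a = [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]     # expert A: full name appears in the first doc
assoc_b = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # expert B: no match in the first doc

score_a = expert_relevance(doc_feats, assoc_a, doc_priors, alpha, beta)
score_b = expert_relevance(doc_feats, assoc_b, doc_priors, alpha, beta)
```

Expert A scores higher because the document that best matches the query is also the one most strongly associated with A.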

28 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge Expert Search: Empirical studies on two enterprise corpora, for the World Wide Web Consortium (W3C) and an organization in Australia (CERC). [Tables: Generative Model and Discriminative Model compared on Top 5 and on Mean Average Precision for W3C and CERC; values not recoverable]

29 Information Integration: Combining Evidence of Information Items from Heterogeneous Sources Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources (e.g., homepages, supervised Ph.D. dissertations, research projects) for Different Types of Information Needs. The traditional expertise search approach uses weighted votes to specify the importance of different sources based on intuition. Our approach: intelligently combine evidence from heterogeneous sources by learning the combination weights. The weights should depend on experts: e.g., some senior faculty do not have homepages; some junior faculty do not have supervised Ph.D. dissertations. The weights should depend on queries: for the query "cancer", research projects from NIH should carry more evidence than homepages

30 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources
$S_i(e, q)$: evidence score from the ith information source; $z_e$: latent variable for expert class; $z_q$: latent variable for query topic
$$P(r = 1 \mid e, q) = \sum_{z_q = 1}^{N_{z_q}} \sum_{z_e = 1}^{N_{z_e}} P(z_e \mid e; \alpha) \, P(z_q \mid q; \beta) \, \sigma\Big( \sum_{i=1}^{K} \lambda_{z_e z_q i} \, S_i(e, q) \Big)$$
where $\lambda_{z_e z_q i}$ are the combination weights for an expert class and a query topic
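The mixture idea (combination weights that depend on a latent expert class and a latent query topic) can be sketched as follows. This is only an illustration under stated assumptions: the class and topic membership probabilities are passed in as fixed lists instead of being modeled and learned, the logistic link and all numeric values are assumed, and `mixture_relevance` is a name invented for the example.

```python
import math

def sigmoid(z):
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mixture_relevance(p_ze, p_zq, lam, scores):
    """Sum over expert classes and query topics of
    P(ze|e) * P(zq|q) * sigmoid(sum_i lam[ze][zq][i] * S_i(e, q))."""
    total = 0.0
    for ze, p_e in enumerate(p_ze):
        for zq, p_q in enumerate(p_zq):
            combined = sum(l * s for l, s in zip(lam[ze][zq], scores))
            total += p_e * p_q * sigmoid(combined)
    return total

# Hypothetical setup: 2 expert classes, 2 query topics, 3 evidence sources
# (homepage, supervised dissertations, research projects).
p_ze = [0.7, 0.3]            # assumed P(ze | e)
p_zq = [0.2, 0.8]            # assumed P(zq | q)
lam = [                      # lam[ze][zq] = combination weights over the 3 sources
    [[0.5, 2.0, 1.0], [0.2, 1.0, 3.0]],
    [[2.0, 0.1, 1.0], [1.5, 0.1, 2.5]],
]
scores = [0.1, 0.6, 0.8]     # assumed S_i(e, q) for the 3 sources

p_rel = mixture_relevance(p_ze, p_zq, lam, scores)
```

A single model with fixed weights is the special case with one expert class and one query topic, which is what the adaptive mixture is compared against.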

31 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources. Empirical studies on INDURE (INdiana Database of University Research Expertise). [Table: expCombSUM, a single learned model with fixed weights, and the mixture model with adaptive weights compared by precision at top ranks (P@); values not recoverable]

32 A Machine Learning Approach for Information Retrieval Applications 1. Breaking Isolation: from Isolated Information Items to Connected Information Items 2. Working with Incomplete Knowledge: from full Knowledge to Partially Observed Knowledge 3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources

33 A Machine Learning Approach for Information Retrieval Applications Federated Search: source representation (SIGIR 2003; CIKM 2004); source selection (SIGIR 2003, 2005; CIKM 2002, 2004, 2009); results merging (SIGIR 2002; TOIS 2003; IRJ 2009). Expertise Search: mixture model for integrating expertise evidence (IRJ 2010a); integrated model for combining doc evidence and doc-candidate association (SIGIR 2010); joint homepage discovery (IRJ 2010b). Question Answering: independent answer selection (HLT 2007, IPM 2009); joint answer selection (SIGIR 2007, TOIS 2010); multilingual answer selection (TOIS 2010). Machine Learning Techniques: multiple instance learning (IJCAI 2009), manifold learning (AAAI 2010), collaborative recommendation (ICML 2003, 2005; UAI 2004), active learning (UAI 2004)...

34 Acknowledgement Graduate Students: Suleyman Cetintas, Yi Fang, Dan Zhang, Dzung Hong Collaboration: Dr. Aditya Mathur; Dr. Jeongwoo Ko; Dr. Eric Nyberg Research support: National Science Foundation, State of Indiana, Purdue University, Google, Yahoo! and BGI
