A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University
2 Why Information Retrieval: Information Overload: Since the introduction of digital libraries and the Web, humans have accumulated more digital information than they can absorb.
3 Why Information Retrieval: Information Overload: In 2008, Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 10,845 trillion words and 3.6 zettabytes (1 zettabyte = 10^21 bytes), corresponding to 100,500 words and 34 gigabytes for an average person on an average day. The information came from more than 20 different sources, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video). Extracted from "How Much Information? 2009 Report on American Consumers" by Roger E. Bohn and James E. Short.
4 Why Information Retrieval: Narrow Sense: Information retrieval ranks a collection of documents for user queries according to degree of relevance (i.e., ad hoc search). Broad Sense: Information retrieval provides solutions for the acquisition, storage, organization, retrieval and analysis of information. Information retrieval mainly studies unstructured data: text in Web pages or emails; images; audio; video; protein sequences. Web search is one of the most popular information retrieval applications.
5 IR Applications: Information Retrieval: a gold mine of applications. Web Search. Information Organization: text categorization; document clustering. Information Recommendation: by content or by collaborative information. Information Extraction: deep analysis of the surface text data. Question Answering: find the answer directly. Federated Search: explore the hidden Web. Multimedia Information Retrieval: image, video. Information Visualization: let users understand the results in the best way.
6 IR and other disciplines: [Figure: Information Retrieval shown among related disciplines. Theory: Machine Learning, Pattern Recognition, Statistical Learning. Deep Analysis: Natural Language Processing, Image Understanding. Applications: Information Extraction, Text Mining, Knowledge Mining, Visualization, Library & Info Science. System Support: Database, Security & Privacy, System.]
7 Information Retrieval Models (for Ad-Hoc Retrieval): Ad-Hoc Retrieval: Satisfy users' short-term information needs expressed as queries (e.g., text). The need is short and temporary (e.g., info about a movie). The information source is relatively static while user queries change. Users pull information from information sources (e.g., the Web). Application examples: Web search, library search, entity search.
8 Information Retrieval Models (for Ad-Hoc Retrieval): Estimate the relevance of a document to a user query.
Similarity based: Sim(Rep(q), Rep(d)), with different representations and similarity measurements: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89).
Probabilistic approach: P(d | q), P(q | d); probability of relevance P(r=1 | q, d), r ∈ {0,1}.
Generative model, doc generation: classical prob. model (Robertson & Sparck Jones, 76).
Generative model, query generation or inference: language modeling approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a); inference network model (Turtle & Croft, 91).
Discriminative model: learning probability of relevance (recent work on learning to rank).
9 Information Retrieval Models (for Ad-Hoc Retrieval): Vector Space Model: Doc and Qry are vectors in a vector space. Vectors are represented in weighted form (e.g., term frequency and inverse document frequency). Closeness of the Doc and Qry vectors determines relevance. [Figure: documents D1, D2, D3 and the Query plotted in a space whose axes are the terms Java, Sun and Starbucks.]
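The vector space idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy documents, the query, the `log(N/df) + 1` idf variant, and the function names `tfidf_vectors` and `cosine` are all made up for this example and are not taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return ({term: tf*idf} per tokenized doc, idf table). Toy weighting, assumed here."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # "+ 1" keeps terms that occur in many documents from dropping to zero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical documents built from the axis terms shown on the slide.
docs = [
    "java coffee starbucks".split(),
    "java sun programming".split(),
    "sun starbucks coffee coffee".split(),
]
vectors, idf = tfidf_vectors(docs)

query = "java sun".split()
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

scores = [cosine(q_vec, d) for d in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, most similar to the query first
```

The second document contains both query terms, so it ends up closest to the query vector; the other two each share only one term.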
10 Information Retrieval Models (for Ad-Hoc Retrieval): Vector Space Model: Advantages: provides an intuitive solution for retrieval; easy to implement. Disadvantages: the vector representation is heuristic without solid justification; difficult to incorporate complex features (e.g., PageRank).
11 Information Retrieval Models (for Ad-Hoc Retrieval): Statistical Language Modeling: Treat Doc and Qry as language models, which are associated with words generated by multinomial distributions. Rank documents based on query generation probability:
$$\log p(Q \mid Doc) = \sum_{q_w \in Q} \log p(q_w \mid Doc)$$
The document language model is smoothed by the whole collection:
$$P(q_w \mid Doc) = \lambda\, P_{MLE}(q_w \mid Doc) + (1 - \lambda)\, P_{MLE}(q_w \mid Collection)$$
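As a rough sketch of the query-likelihood ranking described on this slide, the Python below scores a document by the log probability of generating the query from a document model that is linearly interpolated with the collection model. The function name `query_log_likelihood`, the toy documents, and the default `lam=0.5` are assumptions for illustration only.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """log p(Q | Doc) with linear interpolation smoothing (lam is an assumed value)."""
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc) if doc else 0.0   # MLE from the document
        p_col = collection_tf[w] / collection_len      # MLE from the collection
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            return float("-inf")  # the word never occurs anywhere in the collection
        score += math.log(p)
    return score

# Hypothetical two-document collection.
docs = [
    "java coffee starbucks".split(),
    "java sun programming".split(),
]
collection_tf = Counter(w for d in docs for w in d)
collection_len = sum(len(d) for d in docs)

query = "java sun".split()
lm_scores = [query_log_likelihood(query, d, collection_tf, collection_len) for d in docs]
print(lm_scores)
```

Because of smoothing, the first document still receives a finite score even though it does not contain "sun"; it simply ranks below the document that contains both query words.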
12 Information Retrieval Models (for Ad-Hoc Retrieval): Statistical Language Modeling: Advantages: provides a formal method for modeling text data; less parameter tuning. Disadvantages: not optimal due to the gap between query generation probability and relevance; difficult to incorporate complex features (e.g., PageRank) due to the generative process.
13 Information Retrieval Models (for Ad-Hoc Retrieval): Learning to Rank: Given a pair of a user query (Qry) and a document (Doc), directly model the relevance of the document to the query. Use features about the query and document such as the language modeling retrieval score, PageRank value, etc. Use different algorithms for learning relevance (e.g., logistic regression); parameters are learned from training queries and judgments:
$$P(rel \mid Qry, Doc) = \frac{\exp f(Qry, Doc)}{1 + \exp f(Qry, Doc)}$$
The learned model can be used for predicting the relevance of documents for test queries.
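A minimal pointwise sketch of this idea, assuming f(Qry, Doc) is a linear function of two features and that the parameters are fitted by plain stochastic gradient descent on the logistic loss. The feature values, labels, learning rate and function names below are invented for illustration; commercial learning-to-rank systems typically use richer features and pairwise or listwise objectives.

```python
import math

def sigmoid(z):
    """exp(z) / (1 + exp(z)), the form used on the slide."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, epochs=500, lr=0.5):
    """Fit weights w and bias b so that sigmoid(w.x + b) approximates P(rel | Qry, Doc)."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the logistic loss with respect to the score
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def p_rel(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical training judgments: ([normalized LM score, PageRank value], relevant?).
train = [
    ([0.9, 0.8], 1),
    ([0.8, 0.3], 1),
    ([0.2, 0.9], 0),
    ([0.1, 0.1], 0),
]
w, b = train_logistic(train)

p_good = p_rel(w, b, [0.85, 0.5])  # strong text match
p_bad = p_rel(w, b, [0.15, 0.5])   # weak text match, same PageRank
print(p_good, p_bad)
```

At test time, documents for a new query would be sorted by `p_rel` in decreasing order.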
14 Information Retrieval Models (for Ad-Hoc Retrieval): Learning to Rank: Advantages: explicitly optimizes retrieval performance by fitting model parameters to training data; provides a solid foundation for modeling relevance; successfully used in many commercial search engines; pairwise/listwise modeling successfully used for learning to rank. Can the success of the machine learning approach be generalized from ad hoc retrieval to other information retrieval applications? Yes! But this requires intelligent algorithms for different complex information retrieval applications.
15 Some IR Applications: Question Answering: QA aims at finding answers to natural language questions from a large collection of documents. Example question: What is the city in China with the largest population? Pipeline: Question → Question Analysis → keywords → Document Retrieval (over the text collection) → relevant docs → Answer Extraction → answer candidates → Answer Selection → Answer (Shanghai).
16 Some IR Applications: Federated Search (a.k.a. distributed information retrieval): Information hidden behind the search engines of independent sources (e.g., the hidden Web) may not be searchable by traditional search engines. Hidden Web contents are estimated to be larger (e.g., 2 times larger) than the visible Web contents searchable by traditional search engines. [Figure: search engines 1 to N, with the three steps of federated search: (1) source representation, (2) source selection, (3) results merging.]
17 IR Applications: Expertise Search: In the information age, the most important thing may not be what you know, but who you know. Expert search aims at finding the right people with the desired expertise; it searches for people instead of documents. [Figure: a user query is matched against Web pages (e.g., homepages), publications and research projects to decide who is an expert.]
18 Research Questions for Complex Information Retrieval Applications: 1. Breaking Isolation: from Isolated Information Items to Connected Information Items. Answer Selection: answers are related to each other (e.g., similar contents); modeling answer relationships can improve accuracy and reduce answer redundancy. Source Selection (Federated Search): sources are related to each other (e.g., links, citations, etc.); a source B related to a relevant source A also tends to be relevant. Our Approach: A Joint Probabilistic Approach that Models Available Information Items and Their Relationships.
19 Research Questions for Complex Information Retrieval Applications: 2. Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise search: relevance judgments are available for each expert but not for the specific documents associated with the expert. Our Approach: An Integrated Learning Approach that Explicitly Models Incomplete Knowledge.
20 Research Questions for Complex Information Retrieval Applications: 3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources. Expertise Search: evidence of expertise comes from heterogeneous information sources (e.g., homepages, supervised Ph.D. dissertations, research projects). Our Approach: A Mixture Model Probabilistic Approach that Intelligently Combines Evidence from Heterogeneous Sources for Different Types of Information Needs.
21 Breaking Isolation: from Isolated Information Items to Connected Information Items. Answer Selection (Question Answering): select the most relevant and most unique answers for each question. Traditional methods rely on knowledge databases (e.g., WordNet, gazetteers) for identifying relevant answers with heuristic rules. Independent Classification Models: supervised classification for predicting the relevance of each answer with features from knowledge databases. Joint Probabilistic Classification: model the relevance of individual answers and their relationships; select unique answers by conditional probability of relevance (SIGIR 2007, Ko, Si, et al.) (ACM TOIS 2010, Ko, Si, et al.).
22 Breaking Isolation: from Isolated Information Items to Connected Information Items. Answer Selection (Question Answering): Joint Classification. $S = \{S_1, S_2, \ldots, S_n\}$: correctness judgments for all answers; $F = \{f_1, f_2, \ldots, f_n\}$: feature vectors of all answers for a question.
$$P(S \mid F) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{n} S_i\, \beta^{\top} f_i + \sum_{i,j\,(i \neq j)} \Big( \sum_{k} a_k\, \mathrm{sim}_k(c_i, c_j) \Big) S_i S_j \Big)$$
The first sum models the relevance of individual answers (with weight vector $\beta$); the second sum models similarity relationships across multiple knowledge databases (with weights $a_k$). This form is also called a Boltzmann machine or Ising model. Select relevant and unique answers with conditional probability:
$$\mathrm{Score}(A_j) = P(S_j = 1 \mid F) - \max_{A_i \in \mathrm{SelectedAnswers}} P(S_j = 1 \mid S_i = 1, F)$$
23 Breaking Isolation: from Isolated Information Items to Connected Information Items. Answer Selection (Question Answering): select relevant and unique answers with conditional probability. Example: Question: Who were the U.S. presidents in the 1990s?
P(correct(William J. Clinton)) = 0.8
P(correct(Bill Clinton)) = …
P(correct(George W. Bush)) = 0.6
P(correct(Bill Clinton) | correct(William J. Clinton)) = …
P(correct(George W. Bush) | correct(William J. Clinton)) = 0.5
Score(Bill Clinton) = 0.7 - … = …
Score(George W. Bush) = 0.6 - 0.5 = 0.1
Empirical studies with two answer extractors. [Table: Top 3 Accuracy and Mean Reciprocal Rank for the Baseline, Ind (independent) and Jnt (joint) methods on Information Extractor 1 and Information Extractor 2; numeric values not preserved.]
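The selection rule in this example can be sketched as a greedy loop: repeatedly pick the answer whose marginal probability, minus its largest conditional probability given an already selected answer, is highest. The sketch below makes two assumptions that are not stated on the slides: the first answer is chosen by marginal probability alone, and the two values for "Bill Clinton" are placeholders, since the transcription does not preserve them. The function name `select_answers` is also hypothetical.

```python
def select_answers(marginal, conditional, k):
    """Greedy selection of k relevant, non-redundant answers.

    marginal[a]         ~ P(S_a = 1 | F)
    conditional[(a, s)] ~ P(S_a = 1 | S_s = 1, F)
    """
    selected = []
    remaining = set(marginal)
    while remaining and len(selected) < k:
        def score(a):
            if not selected:  # assumption: first pick uses the marginal only
                return marginal[a]
            return marginal[a] - max(conditional[(a, s)] for s in selected)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

marginal = {
    "William J. Clinton": 0.8,  # from the slide
    "George W. Bush": 0.6,      # from the slide
    "Bill Clinton": 0.7,        # placeholder value (assumed)
}
conditional = {
    ("George W. Bush", "William J. Clinton"): 0.5,  # from the slide
    ("Bill Clinton", "William J. Clinton"): 0.9,    # placeholder value (assumed)
}

top2 = select_answers(marginal, conditional, k=2)
print(top2)
```

With these numbers, "Bill Clinton" is penalized as a near-duplicate of the already selected "William J. Clinton", so "George W. Bush" (score 0.6 - 0.5 = 0.1) is selected second.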
24 Breaking Isolation: from Isolated Information Items to Connected Information Items. Source Selection (Federated Search): select a few most relevant sources for each user query. Big Doc Approach: treat sample docs from different sources as big documents and calculate/rank relevance scores (e.g., vector space model). Independent Classification: supervised classification for predicting the relevance of each source, $P(V_i = 1 \mid f_i)$. Joint Probabilistic Classification: model the relevance of individual sources and their relationships in a joint model, $P(V \mid F)$ (SIGIR 2010, Hong, Si, et al.).
25 Breaking Isolation: from Isolated Information Items to Connected Information Items. Source Selection (Federated Search): empirical studies for selecting up to 5 sources from about 100 sources in two TREC (Text REtrieval Conference) collections and a collection of real-world digital libraries. [Table: accuracy at each source rank for the Ind (independent) and Jnt (joint) methods on TREC123, TREC4 and DIGLIB; numeric values not preserved.] Measurement: accuracy of selecting relevant sources.
26 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association. Traditional expertise search approaches use a generative approach for estimating query generation probability with heuristics. Query generation probability given an expert:
$$P(q \mid e) = \sum_{t=1}^{n} P(q \mid d_t)\, P(d_t \mid e)$$
where $P(q \mid d_t)$ comes from the doc language model and $P(d_t \mid e)$ from the frequency of name $e$ occurring in $d_t$. Our approach: use some training data on experts for queries (no judgments for individual docs associated with experts); a discriminative learning model that integrates doc evidence and doc-candidate associations.
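The generative baseline on this slide amounts to a weighted sum over the documents associated with an expert. In the sketch below, the document-expert association is obtained by normalizing name-occurrence counts so that they sum to one; that normalization, the function name `generative_expert_score` and the numbers are assumptions made for illustration.

```python
def generative_expert_score(query_probs, name_counts):
    """Sum over docs of P(q | d_t) * P(d_t | e).

    query_probs[t] ~ P(q | d_t) from a document language model
    name_counts[t] ~ how often the expert's name occurs in d_t
    """
    total = sum(name_counts)
    if total == 0:
        return 0.0  # the expert is not mentioned in any document
    return sum(pq * c / total for pq, c in zip(query_probs, name_counts))

# Hypothetical expert mentioned 3 times in doc 0 and once in doc 1.
score = generative_expert_score(query_probs=[0.2, 0.05], name_counts=[3, 1])
print(score)  # 0.2 * 0.75 + 0.05 * 0.25
```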
27 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise Search: A Unified Model that Integrates Document Evidence and Document-Candidate Association.
$$P(r = 1 \mid e, q) = \sum_{t=1}^{n} P(r_1 = 1 \mid q, d_t)\, P(r_2 = 1 \mid e, d_t)\, P(d_t)$$
Here $P(r_1 = 1 \mid q, d_t)$ is the probability that doc $d_t$ matches the query and $P(r_2 = 1 \mid e, d_t)$ is the probability that doc $d_t$ supports the expert:
$$P(r_1 = 1 \mid q, d_t) = \sigma\Big( \sum_{i=1}^{N_f} \alpha_i f_i(q, d_t) \Big), \qquad P(r_2 = 1 \mid e, d_t) = \sigma\Big( \sum_{j=1}^{N_g} \beta_j g_j(e, d_t) \Big)$$
$\sigma$ is the standard logistic function; $f_i(q, d)$ denotes a doc feature (e.g., doc retrieval score, PageRank value); $g_j$ denotes a document-expert association feature (e.g., exact name match, last name match); $\alpha_i$ and $\beta_j$ are the learned feature weights.
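A small sketch of how the discriminative score could be evaluated once the weights are known. Training is omitted; the weight vectors `alpha` and `beta`, the feature values and the function name `expert_relevance` are hypothetical, and the document priors are assumed to sum to one.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expert_relevance(doc_feats, assoc_feats, doc_priors, alpha, beta):
    """Sum over docs of P(r1=1 | q, d_t) * P(r2=1 | e, d_t) * P(d_t)."""
    total = 0.0
    for f, g, prior in zip(doc_feats, assoc_feats, doc_priors):
        p_match = sigmoid(dot(alpha, f))    # doc matches the query
        p_support = sigmoid(dot(beta, g))   # doc supports the expert
        total += p_match * p_support * prior
    return total

# Two hypothetical documents.
doc_feats = [[2.0, 0.5], [0.1, 0.2]]    # e.g. retrieval score, PageRank value
assoc_feats = [[1.0, 1.0], [0.0, 1.0]]  # e.g. exact name match, last name match
doc_priors = [0.5, 0.5]

p = expert_relevance(doc_feats, assoc_feats, doc_priors, alpha=[1.0, 1.0], beta=[2.0, 0.5])
print(p)
```

Unlike the generative baseline, both factors here are learned functions of features, so signals such as PageRank or partial name matches can be added without changing the model structure.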
28 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expert Search: empirical studies on two enterprise corpora, one for the World Wide Web Consortium (W3C) and one for an Australian organization (CERC). [Tables: Top 5 and Mean Average Precision results for the Generative Model and the Discriminative Model on W3C and CERC; numeric values not preserved.]
29 Information Integration: Combining Evidence of Information Items from Heterogeneous Sources. Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources (e.g., homepages, supervised Ph.D. dissertations, research projects) for Different Types of Information Needs. The traditional expertise search approach uses weighted votes to specify the importance of different sources based on intuition. Our approach: intelligently combine evidence from heterogeneous sources by learning the combination weights. The weights should depend on experts: e.g., some senior faculty do not have homepages; some junior faculty do not have supervised Ph.D. dissertations. The weights should depend on queries: for the query "cancer", research projects from NIH should carry more evidence than homepages.
30 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources. $S_i(e, q)$: evidence score from the $i$-th information source; $z_e$: latent variable for expert class; $z_q$: latent variable for query topic.
$$P(r = 1 \mid e, q) = \sum_{z_q=1}^{N_{z_q}} \sum_{z_e=1}^{N_{z_e}} P(z_e \mid e; \alpha)\, P(z_q \mid q; \beta)\, \sigma\Big( \sum_{i=1}^{K} w_{z_e z_q i}\, S_i(e, q) \Big)$$
$P(z_e \mid e; \alpha)$ is the distribution of the latent expert class, $P(z_q \mid q; \beta)$ the distribution of the latent query topic, and $w_{z_e z_q i}$ the combination weights for an expert class and a query topic.
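The following sketch shows how such a mixture could be evaluated for one expert-query pair, assuming the class and topic distributions and the per-(class, topic) weights have already been learned. All names (`mixture_relevance`, `p_ze`, `p_zq`, `weights`) and all numbers are hypothetical, and the use of a logistic function around the weighted sum is an assumption carried over from the discriminative model on the earlier slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mixture_relevance(p_ze, p_zq, weights, scores):
    """Mix source-combination models over latent expert classes and query topics.

    p_ze[a]       ~ P(z_e = a | e)
    p_zq[b]       ~ P(z_q = b | q)
    weights[a][b] ~ combination weights over the K sources for class a and topic b
    scores[i]     ~ S_i(e, q), evidence score from the i-th source
    """
    total = 0.0
    for a, pe in enumerate(p_ze):
        for b, pq in enumerate(p_zq):
            combined = sum(w * s for w, s in zip(weights[a][b], scores))
            total += pe * pq * sigmoid(combined)
    return total

# Hypothetical setup: 2 expert classes, 2 query topics, 3 sources
# (homepage, dissertations, research projects).
p_ze = [0.7, 0.3]
p_zq = [0.4, 0.6]
weights = [
    [[1.0, 0.5, 0.2], [0.2, 0.5, 1.5]],
    [[0.0, 1.0, 0.5], [0.0, 0.3, 2.0]],
]
scores = [0.8, 0.1, 0.6]

p_rel = mixture_relevance(p_ze, p_zq, weights, scores)
print(p_rel)
```

With a single expert class and a single query topic this reduces to one fixed set of combination weights, which corresponds to the "single model with fixed weights" baseline mentioned in the evaluation.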
31 Working with Incomplete Knowledge: from Full Knowledge to Partially Observed Knowledge. Expertise Search: A Mixture Model Probabilistic Approach for Combining Evidence from Heterogeneous Sources. Empirical studies on INDURE (INdiana Database of University Research Expertise). [Table: precision at top ranks (P@) for expCombSUM, a single learned model with fixed weights, and the mixture model with adaptive weights; numeric values not preserved.]
32 A Machine Learning Approach for Information Retrieval Applications 1. Breaking Isolation: from Isolated Information Items to Connected Information Items 2. Working with Incomplete Knowledge: from full Knowledge to Partially Observed Knowledge 3. Information Integration: Combining Evidence of Information Items from Heterogeneous Sources
33 A Machine Learning Approach for Information Retrieval Applications. Federated Search: source representation (SIGIR 2003; CIKM 2004); source selection (SIGIR 2003, 2005; CIKM 2002, 2004, 2009); results merging (SIGIR 2002; TOIS 2003; IRJ 2009). Expertise Search: mixture model for integrating expertise evidence (IRJ 2010a); integrated model for combining doc evidence and doc-candidate association (SIGIR 2010); joint homepage discovery (IRJ 2010b). Question Answering: independent answer selection (HLT 2007, IPM 2009); joint answer selection (SIGIR 2007, TOIS 2010); multilingual answer selection (TOIS 2010). Machine Learning Techniques: multiple instance learning (IJCAI 2009), manifold learning (AAAI 2010), collaborative recommendation (ICML 2003, 2005; UAI 2004), active learning (UAI 2004)...
34 Acknowledgement Graduate Students: Suleyman Cetintas, Yi Fang, Dan Zhang, Dzung Hong Collaboration: Dr. Aditya Mathur; Dr. Jeongwoo Ko; Dr. Eric Nyberg Research support: National Science Foundation, State of Indiana, Purdue University, Google, Yahoo! and BGI
More informationCOMP6237 Data Mining Searching and Ranking
COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001
More informationVannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17
Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision
More informationImproving Difficult Queries by Leveraging Clusters in Term Graph
Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu
More informationRe-ranking Documents Based on Query-Independent Document Specificity
Re-ranking Documents Based on Query-Independent Document Specificity Lei Zheng and Ingemar J. Cox Department of Computer Science University College London London, WC1E 6BT, United Kingdom lei.zheng@ucl.ac.uk,
More informationDocument Clustering for Mediated Information Access The WebCluster Project
Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at
More informationPh.D. in Computer Science & Technology, Tsinghua University, Beijing, China, 2007
Yiqun Liu Associate Professor & Department co-chair Department of Computer Science and Technology Email yiqunliu@tsinghua.edu.cn URL http://www.thuir.org/group/~yqliu Phone +86-10-62796672 Fax +86-10-62796672
More informationUniversity of Delaware at Diversity Task of Web Track 2010
University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments
More informationWEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS
1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,
More informationTowards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation
Question ing Overview and task definition History Open-domain question ing Basic system architecture Watson s architecture Techniques Predictive indexing methods Pattern-matching methods Advanced techniques
More informationTopic-Level Random Walk through Probabilistic Model
Topic-Level Random Walk through Probabilistic Model Zi Yang, Jie Tang, Jing Zhang, Juanzi Li, and Bo Gao Department of Computer Science & Technology, Tsinghua University, China Abstract. In this paper,
More informationAttentive Neural Architecture for Ad-hoc Structured Document Retrieval
Attentive Neural Architecture for Ad-hoc Structured Document Retrieval Saeid Balaneshin 1 Alexander Kotov 1 Fedor Nikolaev 1,2 1 Textual Data Analytics Lab, Department of Computer Science, Wayne State
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationState of the Art and Trends in Search Engine Technology. Gerhard Weikum
State of the Art and Trends in Search Engine Technology Gerhard Weikum (weikum@mpi-inf.mpg.de) Commercial Search Engines Web search Google, Yahoo, MSN simple queries, chaotic data, many results key is
More informationCSEP 573: Artificial Intelligence
CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:
More informationGraph Classification in Heterogeneous
Title: Graph Classification in Heterogeneous Networks Name: Xiangnan Kong 1, Philip S. Yu 1 Affil./Addr.: Department of Computer Science University of Illinois at Chicago Chicago, IL, USA E-mail: {xkong4,
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,
More informationEntity and Knowledge Base-oriented Information Retrieval
Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061
More informationSome Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing
Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important
More informationAxiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks
Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks Wei Zheng Hui Fang Department of Electrical and Computer Engineering University of Delaware
More informationA Study of Methods for Negative Relevance Feedback
A Study of Methods for Negative Relevance Feedback Xuanhui Wang University of Illinois at Urbana-Champaign Urbana, IL 61801 xwang20@cs.uiuc.edu Hui Fang The Ohio State University Columbus, OH 43210 hfang@cse.ohiostate.edu
More informationCitation Prediction in Heterogeneous Bibliographic Networks
Citation Prediction in Heterogeneous Bibliographic Networks Xiao Yu Quanquan Gu Mianwei Zhou Jiawei Han University of Illinois at Urbana-Champaign {xiaoyu1, qgu3, zhou18, hanj}@illinois.edu Abstract To
More informationMachine Learning for Information Discovery
Machine Learning for Information Discovery Thorsten Joachims Cornell University Department of Computer Science (Supervised) Machine Learning GENERAL: Input: training examples design space Training: automatically
More informationAn Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS
[Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 17 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(17), 2014 [9562-9566] Research on data mining clustering algorithm in cloud
More informationCHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS
CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS Weiyi Meng SUNY, BINGHAMTON Clement Yu UNIVERSITY OF ILLINOIS, CHICAGO Introduction Text Retrieval System Architecture Document Representation Document-Query
More informationFinding Topic-centric Identified Experts based on Full Text Analysis
Finding Topic-centric Identified Experts based on Full Text Analysis Hanmin Jung, Mikyoung Lee, In-Su Kang, Seung-Woo Lee, Won-Kyung Sung Information Service Research Lab., KISTI, Korea jhm@kisti.re.kr
More informationCOS 513: Foundations of Probabilistic Modeling. Lecture 5
COS 513: Foundations of Probabilistic Modeling Young-suk Lee 1 Administrative Midterm report is due Oct. 29 th. Recitation is at 4:26pm in Friend 108. Lecture 5 R is a computer language for statistical
More informationFaculty of Science and Technology MASTER S THESIS
Faculty of Science and Technology MASTER S THESIS Study program/ Specialization: Master of Science in Computer Science Spring semester, 2016 Open Writer: Shuo Zhang Faculty supervisor: (Writer s signature)
More informationWelcome to the class of Web Information Retrieval!
Welcome to the class of Web Information Retrieval! Tee Time Topic Augmented Reality and Google Glass By Ali Abbasi Challenges in Web Search Engines Min ZHANG z-m@tsinghua.edu.cn April 13, 2012 Challenges
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationEffect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
More informationCS674 Natural Language Processing
CS674 Natural Language Processing A question What was the name of the enchanter played by John Cleese in the movie Monty Python and the Holy Grail? Question Answering Eric Breck Cornell University Slides
More informationTwo-Stage Language Models for Information Retrieval
Two-Stage Language Models for Information Retrieval ChengXiang hai School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 John Lafferty School of Computer Science Carnegie Mellon University
More informationONTOPARK: ONTOLOGY BASED PAGE RANKING FRAMEWORK USING RESOURCE DESCRIPTION FRAMEWORK
Journal of Computer Science 10 (9): 1776-1781, 2014 ISSN: 1549-3636 2014 doi:10.3844/jcssp.2014.1776.1781 Published Online 10 (9) 2014 (http://www.thescipub.com/jcs.toc) ONTOPARK: ONTOLOGY BASED PAGE RANKING
More informationIntroduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline
Introduction to Information Retrieval (COSC 488) Spring 2012 Nazli Goharian nazli@cs.georgetown.edu Course Outline Introduction Retrieval Strategies (Models) Retrieval Utilities Evaluation Indexing Efficiency
More informationLiangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*
Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More information