Beyond Content-Based Retrieval:

Similar documents
Support Vector Machines 290N, 2015

Machine Learning in IR. Machine Learning

Overview of Web Mining Techniques and its Application towards Web

Introduction to Text Mining. Hongning Wang

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

: Semantic Web (2013 Fall)

Information Retrieval (Part 1)

Weight adjustment schemes for a centroid based classifier

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Chapter 27 Introduction to Information Retrieval and Web Search

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Search Engines. Information Retrieval in Practice

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

KDD 10 Tutorial: Recommender Problems for Web Applications. Deepak Agarwal and Bee-Chung Chen Yahoo! Research

Naïve Bayes for text classification

21. Search Models and UIs for IR

Information Retrieval and Web Search Engines

SVM cont d. Applications face detection [IEEE INTELLIGENT SYSTEMS]

Machine Learning for Information Discovery

Information Retrieval

Reading group on Ontologies and NLP:

Data Mining An Overview ITEV, F /18

Welcome to the class of Web Information Retrieval!

Relevance Feature Discovery for Text Mining

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Information Retrieval

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Weight Adjustment Schemes For a Centroid Based Classifier. Technical Report

Optimizing Search Engines using Click-through Data

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

DATA MINING - 1DL105, 1DL111

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Part I: Data Mining Foundations

Support Vector Machines 290N, 2014

The Security Role for Content Analysis

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Dynamic Visualization of Hubs and Authorities during Web Search

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Sec. 8.7 RESULTS PRESENTATION

Social Search Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

An Introduction to Search Engines and Web Navigation

Introduction to Information Retrieval. Hongning Wang

Term Graph Model for Text Classification

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Oleksandr Kuzomin, Bohdan Tkachenko

Introduction to Data Mining and Data Analytics

Content-based Recommender Systems

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

KNOWLEDGE GRAPH: FROM METADATA TO INFORMATION VISUALIZATION AND BACK. Xia Lin College of Computing and Informatics Drexel University Philadelphia, PA

TEXT MINING APPLICATION PROGRAMMING

Weka VotedPerceptron & Attribute Transformation (1)

CSCI 5417 Information Retrieval Systems. Jim Martin!

DATA MINING II - 1DL460. Spring 2014"

Information Retrieval and Organisation

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Using Text Learning to help Web browsing

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

CSEP 573: Artificial Intelligence

Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing

Retrieval Evaluation. Hongning Wang

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

Machine Learning in Action

PV211: Introduction to Information Retrieval

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

CS371R: Final Exam Dec. 18, 2017

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Part 7: Evaluation of IR Systems Francesco Ricci

INTRODUCTION. Chapter GENERAL

MetaData for Database Mining

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Text Categorization (I)

Beyond Ten Blue Links Seven Challenges

Search Engine Architecture II

SOCIAL MEDIA MINING. Data Mining Essentials

Automated Online News Classification with Personalization

Text Mining. Representation of Text Documents

Domain-specific Concept-based Information Retrieval System

Introduction to Information Retrieval

Deep Web Crawling and Mining for Building Advanced Search Application

HARMONY: Efficiently Mining the Best Rules for Classification

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

Topics du jour CS347. Centroid/NN. Example

Machine Learning Classifiers and Boosting

Unstructured Data. CS102 Winter 2019

CAP 6412 Advanced Computer Vision

Based on Raymond J. Mooney s slides

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

CS6322: Information Retrieval Sanda Harabagiu. Lecture 13: Evaluation

Transcription:

Beyond Content-Based Retrieval: Modeling Domains, Users, and Interaction Susan Dumais Microsoft Research http://research.microsoft.com/~sdumais IEEE ADL 99 - May 21, 1998

Research in IR at MS Microsoft Research (http://research.microsoft.com) Adaptive Systems and Interaction - IR/UI Machine Learning and Applied Statistics Data Mining Natural Language Processing Collaboration and Education Database MSR Cambridge; MSR Beijing Microsoft Product Groups many IR-related

IR Themes & Directions Improvements in matching algorithms and representation Probabilistic/Bayesian models p(relevant Document), p(concept Words) NLP: Truffle, MindNet Beyond content-matching User/Task modeling Domain/Object modeling WWW8 Panel Finding Anything in the Billion-Page Web: Are Algorithms the Answer? Advances in presentation and manipulation Moderator: Prabhakar Raghavan, IBM Almaden

Traditional View of IR Query Words Ranked List

What s Missing? User Modeling Query Words Ranked List Domain Modeling Information Use

Domain/Obj Modeling Not all objects are equal potentially big win A priori importance Information use ( readware ; collab filtering) Inter-object relationships Link structure / hypertext Subject categories - e.g., text categorization. text clustering Metadata E.g., reliability, recency, cost -> combining

User/Task Modeling Demographics Task -- What s the user s goal? e.g., Lumiere Short and long-term content interests e.g., Implicit queries Interest model = f(content_similarity, time, interest) e.g., Letiza, WebWatcher, Fab

Information Use Beyond batch IR model ( query->results ) Consider larger task context Human attention is critical resource no Moore s Law for human capabilities Techniques for automatic information summarization, organization, discover, filtering, mining, etc. Advanced UIs and interaction techniques E.g, tight coupling of search, browsing to support information management

The Broader View of IR User Modeling Query Words Ranked List Domain Modeling Information Use

Beyond Content Matching Domain/Object modeling A priori importance Text classification and clustering User/Task modeling Implicit queries and Lumiere Advances in presentation and manipulation Combining structure and search (e.g., DM)

Example: Web query for Microsoft Research

Estimating Priors Link Structure Google - Brin & Page Clever (Hubs/Authorities) - Kleinberg et al. Web Archeologist - Bharat, Henzinger similarities to citation analysis Information Use Access Counts - e.g., DirectHit Collaborative Filtering

New Relevance Ranking Relevance ranking can include: content-matching of couse page/site popularity <external link count; proxy stats> page quality <site quality, dates, depth> spam or porn or other downweighting etc. Combining these - relative weighting of these factors tricky and subjective

Text Classification Text Classification: assign objects to one or more of a predefined set of categories using text features E.g., News feeds, Web data, OHSUMED, Email - spam/no-spam Approaches: Human classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol) Hand-crafted knowledge engineered systems (e.g., CONSTRUE) Inductive learning methods (Semi-) automatic classification

Inductive Learning Methods Supervised learning from examples Examples are easy for domain experts to provide Models easy to learn, update, and customize Example learning algorithms Relevance Feedback, Decision Trees, Naïve Bayes, Bayes Nets, Support Vector Machines (SVMs) Text representation Large vector of features (words, phrases, hand-crafted)

Support Vector Machine Optimization Problem Find hyperplane, h, separating positive and negative examples 2 Optimization for maximum margin: min w, w x b 1, w x b 1 Classify new items using: w f ( w x) support vectors

Reuters Data Set (21578 - ModApte split) 9603 training articles; 3299 test articles Example interest article 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. REUTER Average article 200 words long

Example: Reuters news 118 categories (article can be in more than one category) Most common categories (#train, #test) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189) Overall Results Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Linear SVM most accurate: 87% precision at 87% recall

ROC for Category - Grain 1 0.8 Precision 0.6 0.4 0.2 0 Linear SVM Decision Tree Naïve Bayes Find Similar 0 0.2 0.4 0.6 0.8 1 Recall Recall: % labeled in category among those stories that are really in category Precision: % really in category among those stories labeled in category

Text Categ Summary Accurate classifiers can be learned automatically from training examples Linear SVMs are efficient and provide very good classification accuracy Widely applicable, flexible, and adaptable representations Email spam/no-spam, Web, Medical abstracts, TREC

Beyond Content Matching Domain/Object modeling A priori importance Text classification and clustering User/Task modeling Implicit queries and Lumiere Advances in presentation and manipulation Combining structure and search (e.g., DM)

Lumiere Inferring beliefs about user s goals and ideal actions under uncertainty Pr(Goals, Needs) User query User activity User profile Data structures * E. Horvitz, et al.

Lumiere

Lumiere Office Assistant

Visualizing Implicit Queries Explicit queries: Search is a separate, discrete task Results not well integrated into larger task context Implicit queries: Search as part of normal information flow Ongoing query formulation based on user activities Non-intrusive results display

Figure 2: Data Mountain with 100 web pages.

Data Mountain with Implicit Query results shown (highlighted pages to left of selected page).

IQ Study: Experimental Details Store 100 Web pages 50 popular Web pages; 50 random pages With or without Implicit Query highlighting IQ0: No IQ IQ1: Co-occurrence based IQ - best case IQ2: Content-based IQ Retrieve 100 Web pages Title given as retrieval cue -- e.g., CNN Home Page No implicit query highlighting at retrieval

Results: Information Storage Filing strategies Filing Strategy IQ Condition Semantic Alphabetic No Org IQ0: No IQ 11 3 1 IQ1: Co-occur based 8 1 0 IQ2: Content-based 10 1 0

Results: Information Storage Number of categories (for semantic organizers) IQ Condition Average Number of Categories (std error) IQ0: No IQ 10.0 (3.6) IQ1: Co-occur based 15.8 (5.8) IQ2: Content-based 13.6 (5.9) 7 Categ 23 Categ

Average RT (seconds) Results: Retrieval Time Web Page Retrieval Time 14 12 10 8 6 4 IQ 0 IQ 1 IQ 2 2 0 Im plicit Query Condition

Results: Retrieval Time Large variability across users min: 3.1 secs max: 39.1 secs Large variability across queries min: 4.9 secs (NASA home page) max: 24.3 secs (Welcome to Mercury Center) Popularity of Web pages did not matter Top50: 12.9 secs Random50: 12.8 secs

Implicit Query Highlights IQ built by observing user s reading behavior No explicit search required Good matches returned IQ user model: Combines present context (+ previous interests) Results presented in the context of a userdefined organization

Summary Improving content-matching And, beyond... Domain/Object Models User/Task Models Information Presentation and Use Also non-text, multi-lingual, distributed http://research.microsoft.com/~sdumais

The View from Wired, May 1996 information retrieval seemed like the easiest place to make progress information retrieval is really only a problem for people in library science -- if some computer scientists were to put their heads together, they d probably have it solved before lunchtime [GS]