Computer-Aided Interaction. Lecture: Information Retrieval 2.

Transcription:

Lecture: Information Retrieval 2. Florian Metze, Usability Department, winter semester 2008/2009, 08.01.2009
Time: Thursdays 10:15-11:45; TEL20, Auditorium

Date        No.  Topic (Remark)
16.10.2008       Introduction, Q&U Lab
23.10.2008  1    Statistics
30.10.2008  2    Classification
06.11.2008  3    Fundamentals and ASR
13.11.2008  4    ASR applications and systems
20.11.2008  5    Future ASR
27.11.2008  6    Fundamentals and rule-based translation
04.12.2008  7    Statistical translation (10:15-11:45)
04.12.2008  8    Speech translation systems (12:15-13:45)
11.12.2008  9    (Spoken) dialog systems (10:15-11:45)
11.12.2008  10   Multimodal interfaces (12:15-13:45)
18.12.2008  11   Fusion/Fission: Audio, Video, Keyboard, Touch, ... (10:15-11:45)
18.12.2008  12   Applications & review (12:15-13:45)
08.01.2009  13   Information Retrieval, document search (10:15-11:45)
08.01.2009  14   Information Retrieval 2, expert search (12:15-13:45)

VL CGI FMe 13 - IR2.ppt X 1

Human Computer Interfaces: Example Information Retrieval.
- Introduction
- Conceptual model
- Relationship of IR and HCI and HCC
- Latent Semantic Indexing
- The ESP Game
- Assessing the retrieval
- Future Directions

HCI: Information Retrieval Model.
- Content-centered retrieval as matching document representations to query representations
- A powerful paradigm that has driven IR R&D for half a century.
- Evaluation metric is the effectiveness of the match (e.g., recall and precision).
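The two evaluation metrics named on the IR model slide can be sketched in a few lines. This is a minimal illustration, not part of the lecture material: the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements for a single query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found, so precision and recall differ even though the number of hits is the same.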

HCI-IR: Content Trend.
- Content features (queries too)
  - Not only text: statistics, images, music, code, streams, bio-chemical
  - Multimedia, multilingual
  - Dynamic
    - Temporal (e.g., blogs, wikis, sensor streams)
    - Conditional (e.g., computed links, recommendations)
- Content relationships
  - Hyperlinks, new metadata, aggregations
  - Digital libraries, personal collections
- Content acquires history context retrieval

HCI-IR: Responses to Content Trend.
- Link analysis
- Multiple sources of evidence (fusion)
  - Authors' words (e.g., full-text IR)
  - Indexer/abstractor words (e.g., OPACs)
  - Authors' citations/links (e.g., Google)
  - Readers' search paths (e.g., recommenders, opinion miners: "collaborative filtering")
  - Machine-generated features and relationships ("mining")
- Three key challenges:
  - How do we generate references?
  - What new relationships can we leverage (human and machine)?
  - How can we integrate multiple sources of evidence?

HCI-IR: User Trend.
- Technical advances and technical literacy allow us to leverage information seeker intelligence
- Rather than sole dependence on matching algorithms, focus on the flow of representations and actions in situ as people think with these new tools and information resources
- To leverage human intelligence and effort, people must assume responsibilities: beyond the two-word, single query
- Web and TV remotes have legitimized browsing as human-controlled information seeking
- Aim at understanding rather than retrieval
- Responses to User Trend:
  - Adapt techniques to WWW
  - Relevance feedback
  - Query expansion
  - User modeling/profiles, SDI services
  - Recommender systems: explicit and implicit models
  - Capture everything (e.g., Lifebits)
  - User interfaces: dynamic queries, agile views, tuning of IR systems

HCI: HCC Model of HCI.
- A user-oriented model that has driven R&D.
- Evaluation based on user time, accuracy, and satisfaction.

HCI: WWW Trends.
- First decade of WWW as great equalizer (we all get impoverished, but we admit MANY more people)
- Universal access
- Platform independence (lots of devices)
- Enhanced browsers, specialized browsers
- Interface servers
- Social awareness (user is not alone)

HCI-IR: An Expanded Model.
Think of IR from the perspective of an active human with information needs, information skills, powerful IR resources (that include other humans), and situated in global and local connected communities, all of which evolve over time.
- Get people closer to the information they need
  - Closer to the backend
  - Closer to the meaning
- Involve information professionals as integral to the IR system
- Increase responsibility as well as control
- Leverage more demanding and knowledgeable installed base
- Consider ubiquity, digital libraries, e-commerce as extended memories and tools (personal and shared)

HCI-IR: Key Challenges.
- Linking conceptual interface to system backend
- Metadata generation
- Alternative representations and control mechanisms
- Raising user literacy and involvement
- Engaging without insulting or annoying
- Adding human intelligence to the system
- Moving beyond retrieval to understanding
- Context

HCI Example 1: WordNet.
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet's structure makes it a useful tool for computational linguistics and natural language processing. WordNet relations can be expressed in OWL, RDFS or other ontology markup languages.
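To make the synset idea concrete, the sketch below models a handful of synsets as plain dictionaries and walks one kind of relation (hypernymy, "is a kind of"). The synset IDs, the lemma lists and all function names are hand-made assumptions for illustration; they are not an extract from WordNet and not its API.

```python
# Toy synsets, invented for illustration only (not real WordNet data).
# Each synset groups synonymous lemmas and points to one hypernym.
SYNSETS = {
    "car.n": {"lemmas": {"car", "auto", "automobile"}, "hypernym": "motor_vehicle.n"},
    "truck.n": {"lemmas": {"truck", "motortruck"}, "hypernym": "motor_vehicle.n"},
    "motor_vehicle.n": {"lemmas": {"motor vehicle"}, "hypernym": "vehicle.n"},
    "vehicle.n": {"lemmas": {"vehicle"}, "hypernym": None},
}


def synsets_for(word):
    """Return the IDs of all synsets that contain the word."""
    return sorted(sid for sid, s in SYNSETS.items() if word in s["lemmas"])


def hypernym_path(synset_id):
    """Follow hypernym links from a synset up to the root concept."""
    path = []
    while synset_id is not None:
        path.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return path


def are_synonyms(a, b):
    """Two words are synonyms here if some synset contains both."""
    return any(a in s["lemmas"] and b in s["lemmas"] for s in SYNSETS.values())


print(synsets_for("auto"))
print(hypernym_path("car.n"))
print(are_synonyms("car", "automobile"), are_synonyms("car", "truck"))
```

In this toy network "car" and "truck" are not synonyms, but they share the hypernym `motor_vehicle.n`, which is the kind of relation a retrieval system can use for query expansion.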

HCI Example 2: The ESP Game.
- How to label images? On the web?
- Clever way to automate metadata generation
- Image annotation/recognition is very difficult
- "Labeling the Web using Human Computation"
- Two-player game on the web
- Players get points for generating keywords describing a picture, if the other player agrees
- Taboo words exist, too
- Accuracy assured by over-sampling
- Social aspect ("become top labeler") and fun as motivation
- Funded by NSA, conceived by Luis von Ahn at CMU. Now sold to Google.

HCI Example 3: Latent Semantic Indexing (LSA).
How LSA works:
- LSA uses a term-document matrix which describes the occurrences of terms in documents.
- It is a sparse matrix whose rows correspond to terms (typically stemmed words) and whose columns correspond to documents; the matrix elements are tf-idf weights.
- LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents. Thus the terms and documents are now indirectly related through the concepts.
- LSA finds a low-rank approximation to the term-document matrix. The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
  {(car), (truck), (flower)} -> {(1.3452 * car + 0.2828 * truck), (flower)}

The new concept space typically can be used to:
- Compare the documents in the concept space (data clustering, document classification).
- Find similar documents across languages, after analyzing a base set of translated documents (cross language retrieval).
- Find relations between terms (synonymy and polysemy).
- Given a query of terms, translate it into the concept space, and find matching documents (information retrieval).

Synonymy and polysemy are fundamental problems in natural language processing:
- Synonymy is the phenomenon where different words describe the same idea.
- Polysemy is the phenomenon where the same word has multiple meanings.

Principal Component Analysis (PCA) in term space
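The low-rank approximation described on the LSA slide is usually written with a truncated singular value decomposition. The notation below is an assumption introduced for this sketch (the slide gives no symbols): A is the m x n term-document matrix with terms as rows, k is the number of concepts kept, and q is a query expressed as a term vector.

```latex
% Full SVD of the term-document matrix (m terms, n documents)
A = U \Sigma V^{\top}, \qquad
U \in \mathbb{R}^{m \times r},\;
\Sigma = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_r > 0),\;
V \in \mathbb{R}^{n \times r}

% Keep only the k largest singular values: the best rank-k
% approximation of A in the Frobenius norm
A_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll r

% Map a document column a_j or a query q into the k-dimensional concept space
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} a_j, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Rank documents by cosine similarity in concept space
\operatorname{sim}(\hat{q}, \hat{d}_j) =
\frac{\hat{q}^{\top} \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Each column of U_k is one "concept", a weighted combination of terms such as the car/truck mixture in the slide's example; terms that co-occur across documents end up loading on the same concept, which is how LSA addresses synonymy.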

HCI and Computer Aided Interaction.
- Automatic classification works best when its application is supported by humans with knowledge of the domain and the techniques at hand. (Gary Marchionini)
- Computers should learn!
- The Relation Browser tool for metadata mining

HCI: The Relation Browser.
- A general purpose dynamic query interface for databases with a small number of facets (~10) and a small number of categories in each facet (~10).
- Easy to look ahead (overviews and previews)
- Couples interactive partitioning/exploration with string query
- Semi-automatic category generation and webpage classification
- Mousing over "Coal" reveals the distribution of coal-related web pages in the other categories
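The mouse-over preview described for the Relation Browser amounts to counting, for one selected category, how the matching pages spread over the categories of every other facet. The sketch below shows that counting step only; the facet names, the page records and the function `preview` are hypothetical and do not come from the Relation Browser itself.

```python
from collections import Counter

# Hypothetical page metadata: one category per facet for each page.
PAGES = [
    {"topic": "Coal", "region": "US", "year": "2005"},
    {"topic": "Coal", "region": "World", "year": "2006"},
    {"topic": "Coal", "region": "US", "year": "2006"},
    {"topic": "Nuclear", "region": "US", "year": "2005"},
    {"topic": "Solar", "region": "World", "year": "2006"},
]


def preview(pages, facet, category):
    """Count pages in `category` across the categories of all other facets."""
    selected = [p for p in pages if p[facet] == category]
    other_facets = [f for f in pages[0] if f != facet] if pages else []
    return {f: Counter(p[f] for p in selected) for f in other_facets}


coal = preview(PAGES, "topic", "Coal")
for facet_name, counts in coal.items():
    print(facet_name, dict(counts))
```

Because the counts are computed from metadata alone, such a preview can be shown before the user commits to a query, which is the "look ahead" behaviour the slide refers to.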

HCI: The Relation Browser.
1) Acquire data:
- Crawl sites/Internet
  - Formats?
  - Mirror locally?
- Clean data
  - Remove non-alphabeticals
  - Lowercase all
  - WordNet-validate words
  - Stem or not stem
- Select data to include
  - Pages to include/exclude
  - ASCII text from titles, link anchors, metadata tags
2) Build representation:
- Build raw term-document matrix
  - Pages as rows (observations)
  - Terms as columns (variables)
  - Frequencies or TF-IDF weights in cells

HCI: The Relation Browser.
3) Filter data:
- Stop word lists
  - General terms
  - Domain specific terms
  - Web and navigation terms
  - Iteratively developed/refined
- Term discrimination filters (various)
  - .01-.1 doc frequency interval
  - Interval augmented by 100 top freq
  - Empirical threshold (e.g., > 5 docs)
4) Project data onto lower dimensional space:
- First N principal components
- 50-100 latent semantic dimensions
- 50-100 independent components
- Reduces to narrower term-doc matrix
- Still kind of experimental
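Steps 1 to 3 above can be sketched with the standard library alone. This is an illustrative simplification, not the Relation Browser pipeline: the stop word list, the sample pages, the parameter names `min_df` and `max_df_ratio`, and the plain `tf * log(N / df)` weighting are all assumptions, and stemming and WordNet validation are left out.

```python
import math
import re
from collections import Counter

# Illustrative stop words mixing general and web/navigation terms.
STOP_WORDS = {"the", "and", "of", "home", "click"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_matrix(pages, min_df=1, max_df_ratio=1.0):
    """Return (vocab, matrix) with pages as rows and TF-IDF weights in cells.

    Terms are kept only if their document frequency lies in
    [min_df, max_df_ratio * number_of_pages].
    """
    counts = [Counter(tokenize(p)) for p in pages]
    n = len(pages)
    df = Counter(term for c in counts for term in c)
    vocab = sorted(t for t, d in df.items() if d >= min_df and d / n <= max_df_ratio)
    matrix = [[c[t] * math.log(n / df[t]) for t in vocab] for c in counts]
    return vocab, matrix


# Hypothetical page texts (titles, anchors and metadata already extracted).
PAGES = [
    "Coal production and coal prices",
    "Coal exports: click the home page",
    "Solar prices 2008",
]
vocab, tfidf = build_matrix(PAGES, max_df_ratio=0.9)
print(vocab)
for row in tfidf:
    print([round(w, 2) for w in row])
```

Raising `min_df` plays the role of the empirical threshold on the slide: `build_matrix(PAGES, min_df=2)` keeps only terms that occur in at least two pages, which narrows the matrix before any projection step.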

HCI: The Relation Browser.
5) Cluster documents:
- K-means, e.g., with k << 100
- EM yields a probability distribution for each document over the clusters (so a document has some probability of belonging to each cluster)
6) Evaluate clusters and name topics:
- Create usable output
  - A web page with the clusters and number of documents in each
  - For each cluster, a list of the top 10 most frequently occurring terms; a list of the top 10 log-odds ratio terms; and links to all the pages in that cluster
- Eyeball the terms, pick a cluster (topic) name (names); else iterate previous steps

HCI: The Relation Browser.
7) Assign pages to topics:
- For every page, compute the probability distribution (using EM model) over each cluster/topic
- Select a threshold for placing pages into topics (most easily go into only one topic)
8) Create other facets (views) and display:
- Use a set of heuristic rules to place pages into geographic categories
- Use a set of heuristic rules to place pages into temporal categories (ad hoc at present)
- Map the files onto the RB relational scheme
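Steps 5 and 6 can be illustrated with a small k-means implementation and a "top terms per cluster" listing. This is a sketch under stated assumptions: it uses hard k-means only (not the EM variant mentioned above), initialises the centroids with the first k vectors so the run is reproducible, and ranks terms by summed weight rather than by log-odds ratio. The vectors and vocabulary are toy data.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns (labels, centroids)."""
    # Assumption: deterministic start from the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each centroid; keep it if the cluster is empty.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def top_terms(vectors, labels, vocab, cluster, n=10):
    """Terms with the largest summed weight in one cluster."""
    members = [v for v, label in zip(vectors, labels) if label == cluster]
    totals = [sum(col) for col in zip(*members)]
    ranked = sorted(zip(vocab, totals), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, total in ranked[:n] if total > 0]


# Toy page-by-term matrix (pages as rows) over a two-term vocabulary.
VOCAB = ["coal", "solar"]
DOCS = [[2, 0], [0, 2], [3, 0], [0, 3], [2, 1]]
labels, centroids = kmeans(DOCS, k=2)
print(labels)
for cluster in range(2):
    print(cluster, top_terms(DOCS, labels, VOCAB, cluster))
```

The term lists are what a person would "eyeball" to name each topic; if the lists do not suggest a clear name, the slide's advice is to go back and adjust the earlier filtering or projection steps.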

HCI: Interaction Principles and Caveats (Incomplete).
Principles
- Look ahead without penalty
- Minimize scrolling and clicking
- Alternative ways to slice and dice
- Closely couple search, browse, and examine
- Continuous engagement useful attractors
- Treasures to surface
Caveats
- Scalability (getting metadata to client side)
- Metadata crucial: e.g., working on automatically creating partitions
- Increasing expectations about useful results (answers!)

HCI: Long-term IR paradigm.
Information interaction as core life cycle process:
- Examples represent early ways to get the information seeker more involved in the information seeking process; there is plenty more to do.
- Like eating, we have varying expectations, invest different levels of effort, and use diverse and ubiquitous infrastructures.
- Key challenge is to span boundaries between cyberinfrastructure and the real world.
Coda:
- Our hopes that we can create systems (solutions) that do IR for us are unreasonable.
- Our expectations that people can find and understand information without thinking and investing effort are unreasonable.
- Aim to develop systems that involve people and machines continuously learning and changing together.
- Google would not work as well next month if there were not a large group of employees tuning the system, adding new spam filters, and crawlers checking out pages and links continuously.

Backup. 08.01.2009