Lecture: Information Retrieval 2. Florian Metze, Usability Department, WS 2008/2009, 08.01.2009
Time: Thursdays 10:15-11:45; TEL 20, Auditorium

Schedule:
16.10.2008      Introduction, Q&U Lab
23.10.2008  1   Statistics
30.10.2008  2   Classification
06.11.2008  3   Fundamentals and ASR
13.11.2008  4   ASR Applications and Systems
20.11.2008  5   Future ASR
27.11.2008  6   Fundamentals and Rule-Based Translation
04.12.2008  7   Statistical Translation (10:15-11:45)
04.12.2008  8   Speech Translation Systems (12:15-13:45)
11.12.2008  9   (Spoken) Dialog Systems (10:15-11:45)
11.12.2008  10  Multimodal Interfaces (12:15-13:45)
18.12.2008  11  Fusion/Fission: Audio, Video, Keyboard, Touch, ... (10:15-11:45)
18.12.2008  12  Applications & Review (12:15-13:45)
08.01.2009  13  Information Retrieval, Document Search (10:15-11:45)
08.01.2009  14  Information Retrieval 2, Expert Search (12:15-13:45)

VL CGI FMe 13 - IR2.ppt
Human Computer Interfaces: Example Information Retrieval.
- Introduction
- Conceptual model
- Relationship of IR, HCI, and HCC
- Latent Semantic Indexing
- The ESP Game
- Assessing the retrieval
- Future directions

HCI: Information Retrieval Model.
- Content-centered retrieval as matching document representations to query representations
- A powerful paradigm that has driven IR R&D for half a century
- Evaluation metric is the effectiveness of the match (e.g., recall and precision)
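The effectiveness measures mentioned above can be made concrete in a short sketch; the document IDs below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Effectiveness of a match: precision is the fraction of retrieved
    documents that are relevant; recall is the fraction of relevant
    documents that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 actually relevant,
# 2 of the retrieved documents are among the relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d3", "d5", "d6", "d7"])
# p = 2/4 = 0.5, r = 2/5 = 0.4
```

Note the usual trade-off: retrieving more documents can only raise recall, but typically lowers precision.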
HCI-IR: Content Trend.
- Content features (queries too)
  - Not only text: statistics, images, music, code, streams, bio-chemical data
  - Multimedia, multilingual
  - Dynamic: temporal (e.g., blogs, wikis, sensor streams), conditional (e.g., computed links, recommendations)
- Content relationships
  - Hyperlinks, new metadata, aggregations
  - Digital libraries, personal collections
- Content acquires history: context retrieval

HCI-IR: Responses to Content Trend.
- Link analysis
- Multiple sources of evidence (fusion)
  - Authors' words (e.g., full-text IR)
  - Indexer/abstractor words (e.g., OPACs)
  - Authors' citations/links (e.g., Google)
  - Readers' search paths (e.g., recommenders, opinion miners: collaborative filtering)
- Machine-generated features and relationships ("mining")
- Three key challenges:
  - How do we generate references?
  - What new relationships can we leverage (human and machine)?
  - How can we integrate multiple sources of evidence?
HCI-IR: User Trend.
- Technical advances and technical literacy allow us to leverage information seekers' intelligence
- Rather than sole dependence on matching algorithms, focus on the flow of representations and actions in situ as people think with these new tools and information resources
- To leverage human intelligence and effort, people must assume responsibilities beyond the two-word, single query
- The Web and TV remotes have legitimized browsing as human-controlled information seeking
- Aim at understanding rather than retrieval

HCI-IR: Responses to User Trend.
- Adapt techniques to the WWW
- Relevance feedback
- Query expansion
- User modeling/profiles, SDI services
- Recommender systems: explicit and implicit models
- Capture everything (e.g., Lifebits)
- User interfaces: dynamic queries, agile views, tuning of IR systems

HCI: HCC Model of HCI.
- A user-oriented model that has driven R&D
- Evaluation based on user time, accuracy, and satisfaction
HCI: WWW Trends.
- First decade of the WWW as great equalizer (we all get impoverished, but we admit MANY more people)
- Universal access
- Platform independence (lots of devices)
- Enhanced browsers, specialized browsers
- Interface servers
- Social awareness (the user is not alone)

HCI-IR: An Expanded Model.
- Think of IR from the perspective of an active human with information needs, information skills, and powerful IR resources (that include other humans), situated in global and local connected communities, all of which evolve over time
- Get people closer to the information they need: closer to the backend, closer to the meaning
- Involve information professionals as integral to the IR system
- Increase responsibility as well as control
- Leverage a more demanding and knowledgeable installed base
- Consider ubiquity, digital libraries, and e-commerce as extended memories and tools (personal and shared)
HCI-IR: Key Challenges.
- Linking the conceptual interface to the system backend
- Metadata generation
- Alternative representations and control mechanisms
- Raising user literacy and involvement
- Engaging without insulting or annoying
- Adding human intelligence to the system
- Moving beyond retrieval to understanding
- Context

HCI Example 1: WordNet.
- WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
- Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser.
- WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
- WordNet relations can be expressed in OWL, RDFS, or other ontology markup languages.
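The synset-and-relation structure described above can be sketched with a toy graph; the three entries below are invented miniatures, not real WordNet data (the real database is available, e.g., through NLTK's WordNet corpus reader).

```python
# Toy model of WordNet-style synsets: each synset groups synonymous
# lemmas and links to its hypernym (the more general concept).
# The entries are an invented sample for illustration only.
synsets = {
    "car.n.01":           {"lemmas": ["car", "auto", "automobile"],
                           "hypernym": "motor_vehicle.n.01"},
    "motor_vehicle.n.01": {"lemmas": ["motor vehicle"],
                           "hypernym": "vehicle.n.01"},
    "vehicle.n.01":       {"lemmas": ["vehicle"],
                           "hypernym": None},
}

def hypernym_path(name):
    """Walk the hypernym links up to the root, the same navigation a
    WordNet browser offers (car -> motor vehicle -> vehicle)."""
    path = []
    while name is not None:
        path.append(name)
        name = synsets[name]["hypernym"]
    return path

print(hypernym_path("car.n.01"))
# ['car.n.01', 'motor_vehicle.n.01', 'vehicle.n.01']
```

An IR system can exploit exactly this structure for query expansion: the lemmas of a term's synset supply synonyms, and the hypernym chain supplies broader terms.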
HCI Example 2: The ESP Game.
- How to label images on the web? A clever way to automate metadata generation
- Image annotation/recognition is very difficult: label the web using human computation
- Two-player game on the web: players get points for generating keywords describing a picture, if the other player agrees
- Taboo words exist, too
- Accuracy assured by over-sampling
- Social aspect ("become top labeler") and fun as motivation
- Funded by the NSF, conceived by Luis von Ahn at CMU; now sold to Google

HCI Example 3: Latent Semantic Indexing (LSA).
How LSA works:
- LSA uses a term-document matrix, which describes the occurrences of terms in documents. It is a sparse matrix whose rows correspond to terms (typically stemmed words) and whose columns correspond to documents; the matrix elements are tf-idf weights.
- LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents. Thus the terms and documents are now indirectly related through the concepts.
- LSA finds a low-rank approximation to the term-document matrix. The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
  {(car), (truck), (flower)} -> {(1.3452 * car + 0.2828 * truck), (flower)}
- The new concept space can typically be used to:
  - Compare documents in the concept space (data clustering, document classification)
  - Find similar documents across languages, after analyzing a base set of translated documents (cross-language retrieval)
  - Find relations between terms (synonymy and polysemy)
  - Given a query of terms, translate it into the concept space and find matching documents (information retrieval)
- Synonymy and polysemy are fundamental problems in natural language processing: synonymy is the phenomenon where different words describe the same idea; polysemy is the phenomenon where the same word has multiple meanings.
[Figure: Principal Component Analysis (PCA) in term space]
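The low-rank approximation described above is usually computed with a truncated singular value decomposition. A minimal sketch, using an invented 3x3 term-document matrix with raw counts standing in for tf-idf weights:

```python
import numpy as np

# Tiny term-document matrix: rows = terms, columns = documents.
# Counts are made up; in practice the cells would hold tf-idf weights.
terms = ["car", "truck", "flower"]
A = np.array([[2.0, 1.0, 0.0],   # car
              [1.0, 2.0, 0.0],   # truck
              [0.0, 0.0, 3.0]])  # flower

# Low-rank approximation via SVD: keep only k concept dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rows of U_k show how terms load on concepts; here "car" and "truck"
# share a concept dimension, as in the {car, truck} example above.
# Documents in concept space: columns of diag(s_k) @ Vt_k.
docs_concept = (np.diag(s_k) @ Vt_k).T

def cos(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The car-heavy and truck-heavy documents end up close together in
# concept space, while the flower document stays far away.
```

Comparing `cos(docs_concept[0], docs_concept[1])` (near 1) with `cos(docs_concept[0], docs_concept[2])` (near 0) shows how the rank lowering pulls synonymous-use terms onto shared concepts.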
HCI and Computer-Aided Interaction.
- "Automatic classification works best when its application is supported by humans with knowledge of the domain and the techniques at hand." (Gary Marchionini)
- Computers should learn!
- The Relation Browser tool for metadata mining:

HCI: The Relation Browser.
- A general-purpose dynamic query interface for databases with a small number of facets (~10) and a small number of categories in each facet (~10)
- Easy to look ahead (overviews and previews)
- Couples interactive partitioning/exploration with string queries
- Semi-automatic category generation and webpage classification
- Mousing over "Coal" reveals the distribution of coal-related web pages in the other categories
HCI: The Relation Browser.
1) Acquire data:
- Crawl sites/the Internet (formats? mirror locally?)
- Clean data: remove non-alphabetical characters, lowercase all, validate words against WordNet, stem or not stem
- Select data to include: pages to include/exclude; ASCII text from titles, link anchors, metadata tags

2) Build representation:
- Build the raw term-document matrix
- Pages as rows (observations), terms as columns (variables)
- Frequencies or TF-IDF weights in the cells

HCI: The Relation Browser.
3) Filter data:
- Stop word lists: general terms, domain-specific terms, web and navigation terms; iteratively developed/refined
- Term discrimination filters (various): .01-.1 document frequency interval, interval augmented by the 100 top-frequency terms, empirical threshold (e.g., > 5 docs)

4) Project data onto a lower-dimensional space:
- First N principal components
- 50-100 latent semantic dimensions
- 50-100 independent components
- Reduces to a narrower term-document matrix
- Still somewhat experimental
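Steps 2 and 3 above can be sketched in a few lines. The page texts and the filtering thresholds below are invented for illustration (a real crawl would use the .01-.1 document frequency interval mentioned above, not these toy bounds):

```python
import math

# Hypothetical cleaned page texts (output of step 1): lowercased,
# non-alphabetic characters already removed.
pages = ["coal power mining",
         "coal power plant",
         "solar power research"]

docs = [p.split() for p in pages]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

# Document frequency per term, used for the filtering in step 3:
# drop terms that occur in every page (too general to discriminate).
df = {t: sum(t in d for d in docs) for t in vocab}
kept = [t for t in vocab if 1 <= df[t] < N]

def tfidf(term, doc):
    """TF-IDF weight of a term in one document (step 2 cell value)."""
    return doc.count(term) * math.log(N / df[term])

# Term-document matrix: terms as rows, pages as columns.
matrix = [[tfidf(t, d) for d in docs] for t in kept]
```

Here "power" occurs in all three pages, so the document-frequency filter removes it; "coal" and "solar" survive and carry most of the discriminating weight, which is exactly what the projection in step 4 relies on.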
HCI: The Relation Browser.
5) Cluster documents:
- K-means, e.g., with k << 100
- EM yields a probability distribution for each document over the clusters (so a document has some probability of belonging to each cluster)

6) Evaluate clusters and name topics:
- Create usable output: a web page with the clusters and the number of documents in each
- For each cluster: a list of the top 10 most frequently occurring terms, a list of the top 10 log-odds ratio terms, and links to all the pages in that cluster
- Eyeball the terms and pick a cluster (topic) name; else iterate the previous steps

HCI: The Relation Browser.
7) Assign pages to topics:
- For every page, compute the probability distribution (using the EM model) over each cluster/topic
- Select a threshold for placing pages into topics (most easily go into only one topic)

8) Create other facets (views) and display:
- Use a set of heuristic rules to place pages into geographic categories
- Use a set of heuristic rules to place pages into temporal categories (ad hoc at present)
- Map the files onto the RB relational scheme
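Steps 5 and 6 can be sketched with plain k-means on a toy set of page vectors; the term weights below are invented, and soft EM assignments are replaced by hard labels for brevity:

```python
import numpy as np

# Toy document vectors (rows = pages) over three terms; the values are
# invented tf-idf-like weights standing in for the filtered matrix.
terms = ["coal", "mine", "solar"]
X = np.array([[2.0, 1.0, 0.0],
              [1.5, 1.2, 0.1],
              [0.0, 0.1, 2.0],
              [0.1, 0.0, 1.8]])

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (step 5) with hard assignments; EM with soft
    assignments would instead give each document a probability
    distribution over the clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each page to its nearest cluster center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Recompute centers; keep the old center if a cluster empties.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)

# Step 6: name each cluster by its highest-weight term (a stand-in for
# the top-10 frequent-term and log-odds-ratio lists described above).
top_terms = {j: terms[int(np.argmax(centers[j]))] for j in range(2)}
```

With this data the coal-heavy pages and the solar-heavy pages separate cleanly, and `top_terms` suggests "coal" and "solar" as candidate topic names for a human to confirm or refine.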
HCI: Interaction Principles and Caveats (Incomplete).
Principles:
- Look ahead without penalty
- Minimize scrolling and clicking
- Alternative ways to slice and dice
- Closely couple search, browse, and examine
- Continuous engagement: useful attractors, treasures to surface

Caveats:
- Scalability (getting metadata to the client side)
- Metadata is crucial: e.g., working on automatically creating partitions
- Increasing expectations about useful results (answers!)

HCI: Long-Term IR Paradigm.
- Information interaction as a core life-cycle process: these examples represent early ways to get the information seeker more involved in the information-seeking process; there is plenty more to do.
- Like eating, we have varying expectations, invest different levels of effort, and use diverse and ubiquitous infrastructures.
- The key challenge is to span the boundaries between cyberinfrastructure and the real world.

Coda:
- Our hopes that we can create systems (solutions) that do IR for us are unreasonable.
- Our expectations that people can find and understand information without thinking and investing effort are unreasonable.
- Aim to develop systems that involve people and machines continuously learning and changing together.
- Google would not work as well next month if there were not a large group of employees tuning the system, adding new spam filters, and running crawlers that continuously check pages and links.