CS490D: Introduction to Data Mining Prof. Chris Clifton. Data Mining in Text
CS490D: Introduction to Data Mining
Prof. Chris Clifton
March 26, 2004
Text Mining

Data Mining in Text
- Association search in text corpuses provides suggestive information
  - Groups of related entities
  - Clusters that identify topics
- Flexibility is crucial
  - Describe what an interesting pattern would look like
  - What causes items to be considered associated: same document, sequential associations, ...?
  - Choice of techniques to rank the results
- Integrate with Information Retrieval systems
  - Common base preprocessing (e.g., natural language processing)
  - Need IR system to explore/understand text mining results
Why Text is Hard
- Lack of structure
  - Hard to preselect only data relevant to questions asked
  - Lots of irrelevant data (words that don't correspond to interesting concepts)
- Errors in information
  - Misleading/wrong information in text
  - Synonyms/homonyms: concept identification hard
  - Difficult to parse meaning: "I believe X is a key player" vs. "I doubt X is a key player"
- Sheer volume of patterns
  - Need ability to focus on user needs
- Consequence for results:
  - False associations
  - Vague, dull associations

What About Existing Products? Data Mining Tools
- Designed for particular types of analysis on structured data
  - Structure of data helps define known relationship
  - Small, inflexible set of pattern templates
- Text is free flow of ideas, tough to capture precise meaning
  - Many patterns exist that aren't relevant to problem
- Experiments with COTS products on tagged text corpuses demonstrate these problems
  - "Discovery overload": many irrelevant patterns, density of actionable items too low
  - Lack of integration with Information Retrieval systems makes further exploration/understanding of results difficult
What About Existing Products? "Text Mining" Information Retrieval Tools
- "Text Mining" is (mis?)used to mean information retrieval
  - IBM TextMiner (now called "IBM Text Search Engine")
  - DataSet
- These are Information Retrieval products
  - Goal is to get the right document
- May use data mining technology (clustering, association)
  - Used to improve retrieval, not discover associations among concepts
- No capability to discover patterns among concepts in the documents
- May incorporate technologies such as concept extraction that ease integration with a Knowledge Discovery in Text system

What About Existing Products? Concept Visualization
- Goal: Visualize concepts in a corpus
  - SemioMap
  - SPIRE rch/spire.html
  - Aptex Convectis
- High-level concept visualization
  - Good for major trends, patterns
- Find concepts related to a particular query
  - Helps find patterns if you know some of the instances of the pattern
- Hard to visualize "rare event" patterns
What About Existing Products? Corpus-Specific Text Mining
- Some Knowledge Discovery in Text products
  - Technology Watch (patent office) hwatch.htm
  - TextSmart (survey responses)
- Provide limited types of analyses
  - Fixed "questions" to be answered
  - Primarily high-level (similar to concept visualization)
- Domain-specific
  - Designed for specific corpus and task
  - Substantial development to extend to new domain or corpus

What About Existing Products? Text Mining Tools
- True "Text Mining" just beginning to come to market
  - Associations: ClearForest
  - Semantic Networks: Megaputer's TextAnalyst
  - IBM Intelligent Miner for Text (toolkit)
- Currently limited capabilities (but improving)
  - Further research needed
  - Directed research will ensure the right problems are solved
- Major Problem: Flood of Information
  - Analyzing results as bad as reading the documents
Scenario: Find Active Leaders in a Region
- Goal: Identify people to negotiate with prior to relief effort
  - Want "general picture" of a region
  - No expert that already knows the situation is available
- Problems:
  - No clear "central authority"; problems are regional
  - Many claim power/control, few have it for long
  - Must include all key players in a region
- Solution: Find key players over time
  - Who is key today?
  - Past players (may make a comeback)

Example: Association Rules in News Stories
- Goal: Find related (competing or cooperating) players in regions
- Simple association rules (any associated concepts) give too many results
- Flexible search for associations allows us to specify what we want: gives fewer, more appropriate results

Simple association rules:

Person1            Person2          Support
Natalie Allen      Linden Soles     117
Leon Harris        Joie Chen        53
Ron Goldman        Nicole Simpson
Mobotu Sese Seko   Laurent Kabila   10

Flexible search:

Person1            Person2          Place      Support
Mobuto Sese Seko   Laurent Kabila   Kinshasa   7
Conventional Data Mining System Architecture
[Figure: Data Repository -> Data Mining Tool -> Patterns]

Using Conventional Tools: Text Mining System Architecture
[Figure: Text Corpus -> Concept Extraction -> Repository -> Association Rule Product]
- Goal: Find Cooperating/Combating Leaders in a territory

Person1            Person2          Support
Natalie Allen      Linden Soles     117
Leon Harris        Joie Chen        53
Ron Goldman        Nicole Simpson
Mobotu Sese Seko   Laurent Kabila   10

Too Many Results
Flexible Text Mining System Architecture
[Figure: Text Corpus -> Concept Extraction -> Repository; Pattern Description -> Query Flocks -> DBMS]
- Pattern Description: Person1 and Person2 at Place

Person1            Person2          Place      Support
Mobuto Sese Seko   Laurent Kabila   Kinshasa   7

Flexible: Adapts to new tasks
[Figure: Text Corpus -> Concept Extraction -> Repository; Pattern Description -> Query Flocks -> DBMS]
- Pattern Description: Person1 then Person2 at Place in 1 month

Person1            Person2          Place      Avg. Days
Mobuto Sese Seko   Laurent Kabila   Kinshasa   12
Example of Flexible Association Search
[Figure: Broadcast News Navigator Concept Correlation Tool query form]
- News Source (default all): 60 Minutes, ABC News Nightline, ABC World News, British Foreign, CBS Evening News, CNN Early Edition, CNN Early Prime, CNN Headline News
- Broadcast Dates: All Dates, From Date / To Date, Last Week, Last Month, Last Year
- Query: Find correlations between [Person | Location | Organization], [Person | Location | Organization], and [Person | Location | Organization], reported in connection with the same location, within 2 days, having a minimum of 10 co-occurrences. Rank results by deviation from expected value.

Text Databases and IR
- Text databases (document databases)
  - Large collections of documents from various sources: news articles, research papers, books, digital libraries, messages, Web pages, library databases, etc.
  - Data stored is usually semi-structured
  - Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
- Information retrieval
  - A field developed in parallel with database systems
  - Information is organized into (a large number of) documents
  - Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Information Retrieval
- Typical IR systems
  - Online library catalogs
  - Online document management systems
- Information retrieval vs. database systems
  - Some DB problems are not present in IR, e.g., update, transaction management, complex objects
  - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Basic Measures for Text Retrieval
[Figure: Venn diagram of the Relevant and Retrieved sets within All Documents; their overlap is Relevant & Retrieved]
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

  $$ precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|} $$

- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

  $$ recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|} $$
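The two measures above can be sketched as set operations. This is a minimal illustration, not part of the original slides; the function name `precision_recall` and the sample document ids are assumptions chosen for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 relevant documents, 3 retrieved, 2 in common.
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).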
Information Retrieval Techniques (1)
- Basic Concepts
  - A document can be described by a set of representative keywords called index terms.
  - Different index terms have varying relevance when used to describe document contents.
  - This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, tf-idf).
- DBMS Analogy
  - Index Terms <-> Attributes
  - Weights <-> Attribute Values

Information Retrieval Techniques (2)
- Index Terms (Attribute) Selection:
  - Stop list
  - Word stem
  - Index terms weighting methods
- Terms x Documents Frequency Matrices
- Information Retrieval Models:
  - Boolean Model
  - Vector Model
  - Probabilistic Model
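As an illustration of index-term weighting, the sketch below computes one common tf-idf variant: relative term frequency times log(N / document frequency). The slides name tf-idf but do not give a formula, so this particular variant, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times inverse document frequency.
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical two-document collection.
w = tf_idf([["car", "repair"], ["car", "insurance"]])
```

A term that appears in every document ("car" here) gets weight 0, while a term confined to one document gets a positive weight.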
Boolean Model
- Consider that index terms are either present or absent in a document
- As a result, the index term weights are assumed to be all binary
- A query is composed of index terms linked by three connectives: not, and, and or
  - e.g., "car and repair", "plane or airplane"
- The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query

Boolean Model: Keyword-Based Retrieval
- A document is represented by a string, which can be identified by a set of keywords
- Queries may use expressions of keywords
  - E.g., "car and repair shop", "tea or coffee", "DBMS but not Oracle"
  - Queries and retrieval should consider synonyms, e.g., repair and maintenance
- Major difficulties of the model
  - Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
  - Polysemy: The same keyword may mean different things in different contexts, e.g., mining
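A minimal sketch of Boolean retrieval follows. It assumes documents are already reduced to keyword sets and that queries are written as nested tuples; the query encoding, the names `matches` and `boolean_search`, and the sample documents are hypothetical, not taken from the slides.

```python
# Hypothetical collection: document id -> set of index terms.
docs = {
    1: {"car", "repair", "shop"},
    2: {"car", "insurance"},
    3: {"plane", "repair"},
}


def matches(terms, query):
    """Evaluate a query against one document's term set.

    A query is either a keyword string or a tuple such as
    ("and", q1, q2), ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return query in terms  # binary weight: term present or absent
    op, *args = query
    if op == "and":
        return all(matches(terms, a) for a in args)
    if op == "or":
        return any(matches(terms, a) for a in args)
    if op == "not":
        return not matches(terms, args[0])
    raise ValueError(f"unknown connective: {op}")


def boolean_search(query):
    """Return the ids of all documents the query judges relevant."""
    return sorted(d for d, terms in docs.items() if matches(terms, query))
```

For example, `boolean_search(("and", "car", "repair"))` selects only document 1, and `boolean_search(("and", "repair", ("not", "car")))` selects only document 3. Each document is simply in or out; there is no ranking, which is one limitation the vector model addresses.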
Similarity-Based Retrieval in Text Databases
- Finds similar documents based on a set of common keywords
- Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc.
- Basic techniques
  - Stop list
    - Set of words that are deemed "irrelevant", even though they may appear frequently
    - E.g., a, the, of, for, to, with, etc.
    - Stop lists may vary when document set varies

Similarity-Based Retrieval in Text Databases (2)
- Word stem
  - Several words are small syntactic variants of each other since they share a common word stem
  - E.g., drug, drugs, drugged
- A term frequency table
  - Each entry frequent_table(i, j) = # of occurrences of the word t_i in document d_j
  - Usually, the ratio instead of the absolute number of occurrences is used
- Similarity metrics: measure the closeness of a document to a query (a set of keywords)
  - Relative term occurrences
  - Cosine distance:

    $$ sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|} $$
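The stop list, term frequency table, and cosine measure can be combined in a short sketch. The stop list is the example list from the slide; the names `term_vector` and `cosine` are assumptions, and stemming is deliberately left out to keep the example small.

```python
import math
from collections import Counter

# Example stop list from the slide; a real one depends on the document set.
STOP_WORDS = {"a", "the", "of", "for", "to", "with"}


def term_vector(text):
    """Term frequency vector of a text after stop-word removal (no stemming)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)


def cosine(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|), or 0.0 if either vector is empty."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0
```

With the stop words removed, "the repair of the car" and "car repair" have identical term vectors, so their similarity is 1 (up to floating-point rounding); texts with no terms in common score 0.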
CS490D: Introduction to Data Mining
Prof. Chris Clifton
April 2, 2004
Text Mining

Indexing Techniques
- Inverted index
  - Maintains two hash- or B+-tree indexed tables:
    - document_table: a set of document records <doc_id, postings_list>
    - term_table: a set of term records <term, postings_list>
  - Answer query: Find all docs associated with one or a set of terms
  - (+) Easy to implement
  - (-) Does not handle synonymy and polysemy well, and posting lists could be too long (storage could be very large)
- Signature file
  - Associate a signature with each document
  - A signature is a representation of an ordered list of terms that describe the document
  - Order is obtained by frequency analysis, stemming and stop lists
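A minimal in-memory sketch of the term_table side of an inverted index follows. It uses a Python dict in place of the hash- or B+-tree indexed tables the slide mentions; the function names and the whitespace tokenization are assumptions made for the illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids} (postings lists)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def lookup(index, terms):
    """Ids of documents containing every given term (postings-list intersection)."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical three-document collection.
index = build_inverted_index(
    {1: "car repair shop", 2: "car insurance", 3: "plane repair"}
)
```

A query for "car" and "repair" intersects the postings lists [1, 2] and [1, 3]. The index matches exact terms only, which is why synonymy and polysemy remain a problem.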
Vector Model
- Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection.
- The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors that represent them, using measures such as the Euclidean distance or the cosine of the angle between these two vectors.

Latent Semantic Indexing (1)
- Basic idea
  - Similar documents have similar word frequencies
  - Difficulty: the size of the term frequency matrix is very large
  - Use singular value decomposition (SVD) techniques to reduce the size of the frequency table
  - Retain the K most significant rows of the frequency table
- Method
  - Create a term x document weighted frequency matrix A
  - SVD construction: $A = U S V^{T}$
  - Define K and obtain $U_k$, $S_k$, and $V_k$
  - Create query vector q
  - Project q into the term-document space: $D_q = q \, U_k \, S_k^{-1}$
  - Calculate similarities: $\cos \alpha = \frac{D_q \cdot D}{\lVert D_q \rVert \, \lVert D \rVert}$
Latent Semantic Indexing (2)
[Figure: weighted frequency matrix; query terms: Insulation, Joint]

Probabilistic Model
- Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set)
- Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made
- This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve the first set of documents
- An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set
Types of Text Data Mining
- Keyword-based association analysis
- Automatic document classification
- Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
- Link analysis: unusual correlation between entities
- Sequence analysis: predicting a recurring event
- Anomaly detection: find information that violates usual patterns
- Hypertext analysis
  - Patterns in anchors/links
  - Anchor text correlations with linked objects

Keyword-Based Association Analysis
- Motivation
  - Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them
- Association Analysis Process
  - Preprocess the text data by parsing, stemming, removing stop words, etc.
  - Invoke association mining algorithms
    - Consider each document as a transaction
    - View a set of keywords in the document as a set of items in the transaction
- Term level association mining
  - No need for human effort in tagging documents
  - The number of meaningless results and the execution time are greatly reduced
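The "document as transaction" view can be sketched by counting keyword pairs that co-occur in at least a minimum number of documents. This brute-force pair count stands in for a real association mining algorithm such as Apriori; the function name `frequent_pairs` and the sample keyword sets are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count keyword pairs per document and keep those with enough support.

    transactions: iterable of keyword collections, one per document.
    min_support: minimum number of documents a pair must co-occur in.
    """
    counts = Counter()
    for keywords in transactions:
        # Sort so each pair has one canonical order; set() drops repeats.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical keyword sets, already stemmed and stop-word filtered.
documents = [
    {"arafat", "netanyahu", "israel"},
    {"arafat", "netanyahu"},
    {"iraq", "albright"},
]
pairs = frequent_pairs(documents, min_support=2)
```

Only the pair ("arafat", "netanyahu") reaches a support of 2 here. Counting every pair is quadratic in the number of keywords per document, which is one reason practical systems prune candidates by support.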
Text Classification (1)
- Motivation
  - Automatic classification for the large number of on-line text documents (Web pages, e-mails, corporate intranets, etc.)
- Classification Process
  - Data preprocessing
  - Definition of training set and test sets
  - Creation of the classification model using the selected classification algorithm
  - Classification model validation
  - Classification of new/unknown text documents
- Text document classification differs from the classification of relational data
  - Document databases are not structured according to attribute-value pairs

Text Classification (2)
- Classification Algorithms:
  - Support Vector Machines
  - K-Nearest Neighbors
  - Naïve Bayes
  - Neural Networks
  - Decision Trees
  - Association rule-based
  - Boosting
Document Clustering
- Motivation
  - Automatically group related documents based on their contents
  - No predetermined training sets or taxonomies
  - Generate a taxonomy at runtime
- Clustering Process
  - Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
  - Hierarchical clustering: compute similarities applying clustering algorithms
  - Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)

Goal: Automatically Identify Recurring Topics in a News Corpus
- Started with a user problem: Geographic analysis of news
- Idea: Segment news into ongoing topics/stories
  - How do we do this?
- What we need:
  - Topics
  - "Mnemonic" for describing/remembering the topic
  - Mapping from news articles to topics
- Other goals:
  - Gain insight into collection that couldn't be had from skimming a few documents
  - Identify key players in a story/topic
User Problem: Geographic News Analysis
- TopCat identified separate topics for U.S. embassy bombing and counterstrike.
[Figure: list of topics, with the Bombing and Counterstrike topics marked]

A Data Mining Based Solution: Idea in Brief
- A topic often contains a number of recurring players/concepts
  - Identified highly correlated named entities (frequent itemsets)
  - Can easily tie these back to the source documents
  - But there were too many to be useful
- Frequent itemsets often overlap
  - Used this to cluster the correlated entities
  - But the link to the source documents is no longer clear
  - Used "topic" (list of entities) as a query to find relevant documents to compare with known mappings
- Evaluated against manually-categorized "ground truth" set
  - Six months of print, video, and radio news: 65,583 stories
  - 100 topics manually identified (covering 6941 documents)
TopCat Process
- Identify named entities (person, location, organization) in text
  - Alembic natural language processing system
- Find highly correlated named entities (entities that occur together with unusual frequency)
  - Query Flocks association rule mining technique
  - Results filtered based on strength of correlation and number of appearances
- Cluster similar associations
  - Hypergraph clustering based on hMETIS graph partitioning algorithm (based on (Han et al. 1997))
  - Groups entities that may not appear together in a single broadcast, but are still closely related

Preprocessing
- Identify named entities (person, location, organization) in text
  - Alembic Natural Language Processing system
- Data Cleansing:
  - Coreference Resolution
    - Used intra-document coreference from NLP system
    - Heuristic to choose "global best name" from different choices in a document
  - Eliminate composite stories
    - Heuristic: same headline monthly or more often
  - High Support Cutoff (5%)
    - Eliminate overly frequent named entities (only provide "common knowledge" topics)
Named Entities vs. Full Text
- Corpus contained about 65,000 documents.
- Full text resulted in almost 5 million unique word-document pairs vs. about 740,000 for named entities.
- Prototype was unable to generate frequent itemsets at support thresholds lower than 2% for full text.
  - At 2% support, one week of full text data took 30 times longer to process than the named entities at 0.05% support.
- For one week:
  - 91 topics were generated with the full text, most of which aren't readily identifiable.
  - 33 topics were generated with the named entities.

Frequent Itemsets
Israel State West Bank Netanyahu Albright Arafat Iraq State Albright 479 Israel Jerusalem West Bank Netanyahu Arafat Gaza Netanyahu 39 Ramallah Authority West Bank Iraq Israel U.N. 39
- Query Flocks association rule mining technique
  - frequent itemsets with 0.05% support
- Results filtered based on strength of correlation and support
  - Cuts to 3129 frequent itemsets
- Ignored subsets when superset with higher correlation found
  - 449 total itemsets, at most 12 items (most 2-4)
Clustering
- Cluster similar associations
  - Hypergraph clustering based on hMETIS graph partitioning algorithm (adapted from (Han et al. 1997))
  - Groups entities that may not appear together in a single broadcast, but are still closely related
[Figure: hypergraph over the entities Ramallah, Gaza, Authority, Arafat, West Bank, Netanyahu, Jerusalem, U.N., Israel, Albright, Iraq, State]

  $$ \frac{\sum_{i=1}^{n} Weight(cut\_edges_i)}{\sum_{j=1}^{n} Weight(original\_edges_j)} \qquad \frac{|\{v \in P\} \cap \{v \in e\}|}{|\{v \in e\}|} $$

Mapping to Documents
- Mapping Documents to Frequent Itemsets easy
  - Itemset with support k has exactly k documents containing all of the items in the set.
- Topic clusters harder
  - Topic may contain partial itemsets
- Solution: Information Retrieval
  - Treat items as "keys" to search for
  - Use Term Frequency/Inverse Document Frequency (TF-IDF) as distance metric between document and topic
- Multiple ways to interpret ranking
  - Cutoff: Document matches a topic if distance within threshold
  - Best match: Document only matches closest topic
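The two ways of interpreting the ranking, cutoff and best match, can be sketched as follows. The sketch assumes the document-to-topic distances have already been computed (for instance from TF-IDF weights); the function name `assign_topics`, the input layout, and the sample numbers are hypothetical and do not come from the TopCat system.

```python
def assign_topics(distances, threshold=None):
    """Map each document to topics given precomputed distances.

    distances: {doc_id: {topic: distance}}, smaller meaning closer.
    With a threshold, a document matches every topic within it (cutoff).
    Without one, a document matches only its closest topic (best match).
    """
    result = {}
    for doc_id, by_topic in distances.items():
        if threshold is not None:
            result[doc_id] = sorted(t for t, d in by_topic.items() if d <= threshold)
        else:
            result[doc_id] = [min(by_topic, key=by_topic.get)] if by_topic else []
    return result


# Hypothetical distances between two stories and two topics.
distances = {
    "story1": {"iraq": 0.2, "israel": 0.7},
    "story2": {"iraq": 0.4, "israel": 0.3},
}
cutoff = assign_topics(distances, threshold=0.5)
best = assign_topics(distances)
```

Under the cutoff rule, story2 matches both topics; under best match, it is assigned to "israel" only. Documents that match several topics under the cutoff rule are the overlap that topic merging relies on.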
Merging
- Topics still too fine-grained for TDT
  - Adjusting clustering parameters didn't help
  - Problem was sub-topics
- Solution: Overlap in documents
  - Documents often matched multiple topics
  - Used this to further identify related topics
- Marriage:

  $$ \frac{\sum_{i \in documents} TFIDF_{ia} \, TFIDF_{ib} / N}{\left( \sum_{i \in documents} TFIDF_{ia} / N \right) \left( \sum_{i \in documents} TFIDF_{ib} / N \right)} $$

- Parent/Child:

  $$ \frac{\sum_{i \in documents} TFIDF_{ip} \, TFIDF_{ic} / N}{\sum_{i \in documents} TFIDF_{ic} / N} $$

TopCat: Examples from Broadcast News

LOCATION      Baghdad
PERSON        Saddam Hussein
PERSON        Kofi Annan
ORGANIZATION  United Nations
PERSON        Annan
ORGANIZATION  Security Council
LOCATION      Iraq

LOCATION      Israel
PERSON        Yasser Arafat
PERSON        Walter Rodgers
PERSON        Netanyahu
LOCATION      Jerusalem
LOCATION      West Bank
PERSON        Arafat
TopCat Evaluation
- Tested on Topic Detection and Tracking Corpus
  - Six months of print, video, and radio news sources
  - 65,583 documents
  - 100 topics manually identified (covering 6941 documents)
- Evaluation results (on evaluation corpus, last two months)
  - Identified over 80% of human-defined topics
  - Detected 83% of stories within human-defined topics
  - Misclassified 0.2% of stories
- Results comparable to "official" Topic Detection and Tracking participants
  - Slightly different problem: retrospective detection
  - Provides "mnemonic" for topic (TDT participants only produce list of documents)

Project Participants
- MITRE Corporation
  - Modeling intelligence text analysis problems
  - Integration with information retrieval systems
  - Technology transfer to Intelligence Community through existing MITRE contracts with potential developers/first users
- Stanford University
  - Computational issues
  - Integration with database/data mining
  - Technology transfer to vendors collaborating with Stanford on other data mining work
- Visitors:
  - Robert Cooley (University of Minnesota, Summer 1998)
  - Jason Rennie (MIT, Summer 1999)
Where we're going now: Use of the Prototype
- MITRE internal:
  - Broadcast News Navigator
  - GeoNODE
- External Use:
  - Both Broadcast News Navigator and GeoNODE planned for testing at various sites
  - GeoNODE working with NIMA as test site
  - Incorporation in DARPA-sponsored TIDES Portal for Strong Angel/RIMPAC exercise this summer

Exercise Strong Angel, June 2000, Hawaii
- The scenario: Humanitarian Assistance
  - Increasing violence against Green minority in Orange
  - Green minority refugees massing in border mountains
  - Ethnic Green crossing into Green, though Orange citizens
  - Live bomblets found near roads
  - Basics in short supply: water, shelter, medical care
What We've Learned: Recommendations/Thoughts for Further Work
- Want flexibility in describing patterns
  - What lends support to an association (e.g., across hyperlink; combining sequential, standard associations)
  - Type of associated entity important in describing pattern
- Major risk: density of "good stuff" in results too low
  - Problem isn't wrong results, but uninteresting results
  - Simple support/confidence rarely appropriate for text
  - Support a range of metrics: no single proper measure
- Cleaning and Mining as part of same process
  - Human cost of pre-mining cleansing too high
  - Human feedback on mining results (may alter results)

What we see in the Future: COTS support for Data Mining in Text
- Working with vendors to incorporate query flocks technology in DBMS systems
  - Stanford University working with IBM Almaden Research
- Working with vendors to incorporate text mining in information retrieval systems
  - MITRE discussing technology transition with Manning & Napier Information Services, Cartia
- More research needed
  - What are the types of analyses that should be supported?
  - What are the right relevance measures to find interesting patterns, and how do we optimize these?
  - What additional capabilities are needed from concept extraction?
Potential Applications
- Topic Identification
  - Identify by different "types" of entities (person / organization / location / event / ?)
  - Hierarchically organize topics (in progress)
- Support for link analysis on Text
  - Tools exist for visualizing / analyzing links (e.g., NetMap)
  - Text mining detects links -- giving link analysis tools something to work with
- Support for Natural Language Processing / Document Understanding
  - Synonym recognition -- A and B may not appear together, but they each appear with X, Y, and Z -- A and B may be synonyms
- Prediction: Sequence analysis (in progress)

Similarity Search in Multimedia Data
- Description-based retrieval systems
  - Build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation
  - Labor-intensive if performed manually
  - Results are typically of poor quality if automated
- Content-based retrieval systems
  - Support retrieval based on the image content, such as color histogram, texture, shape, objects, and wavelet transforms
Queries in Content-Based Retrieval Systems
- Image sample-based queries
  - Find all of the images that are similar to the given image sample
  - Compare the feature vector (signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database
- Image feature specification queries
  - Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
  - Match the feature vector with the feature vectors of the images in the database
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationChapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction
CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle
More informationTable Of Contents: xix Foreword to Second Edition
Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data
More informationDocument Clustering for Mediated Information Access The WebCluster Project
Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationOverview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer
Data Mining George Karypis Department of Computer Science Digital Technology Center University of Minnesota, Minneapolis, USA. http://www.cs.umn.edu/~karypis karypis@cs.umn.edu Overview Data-mining What
More informationData Mining and Warehousing
Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationClustering. Bruno Martins. 1 st Semester 2012/2013
Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts
More informationOutlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
More informationMultimodal Information Spaces for Content-based Image Retrieval
Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationQ: Given a set of keywords how can we return relevant documents quickly?
Keyword Search Traditional B+index is good for answering 1-dimensional range or point query Q: What about keyword search? Geo-spatial queries? Q: Documents on Computer Science? Q: Nearby coffee shops?
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationWeb Information Retrieval using WordNet
Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationMultimedia Information Systems
Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More information68A8 Multimedia DataBases Information Retrieval - Exercises
68A8 Multimedia DataBases Information Retrieval - Exercises Marco Gori May 31, 2004 Quiz examples for MidTerm (some with partial solution) 1. About inner product similarity When using the Boolean model,
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationDesign and Implementation of Search Engine Using Vector Space Model for Personalized Search
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,
More informationComponent ranking and Automatic Query Refinement for XML Retrieval
Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge
More informationCross Reference Strategies for Cooperative Modalities
Cross Reference Strategies for Cooperative Modalities D.SRIKAR*1 CH.S.V.V.S.N.MURTHY*2 Department of Computer Science and Engineering, Sri Sai Aditya institute of Science and Technology Department of Information
More informationExam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
More informationCS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp
CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as
More informationText Categorization (I)
CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization
More informationCHAPTER-23 MINING COMPLEX TYPES OF DATA
CHAPTER-23 MINING COMPLEX TYPES OF DATA 23.1 Introduction 23.2 Multidimensional Analysis and Descriptive Mining of Complex Data Objects 23.3 Generalization of Structured Data 23.4 Aggregation and Approximation
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationText Mining: A Burgeoning technology for knowledge extraction
Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.
More informationUbiquitous Computing and Communication Journal (ISSN )
A STRATEGY TO COMPROMISE HANDWRITTEN DOCUMENTS PROCESSING AND RETRIEVING USING ASSOCIATION RULES MINING Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Amman, Jordan, 2011. Alaa_hamami@yahoo.com
More informationInternational Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani
LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models
More informationContents. Foreword to Second Edition. Acknowledgments About the Authors
Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationCreating a Recommender System. An Elasticsearch & Apache Spark approach
Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationText Mining. Representation of Text Documents
Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,
More informationCSE 158. Web Mining and Recommender Systems. Midterm recap
CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158
More information70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing
70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY 2004 ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing Jianping Fan, Ahmed K. Elmagarmid, Senior Member, IEEE, Xingquan
More informationApproaches to Mining the Web
Approaches to Mining the Web Olfa Nasraoui University of Louisville Web Mining: Mining Web Data (3 Types) Structure Mining: extracting info from topology of the Web (links among pages) Hubs: pages pointing
More information2. LITERATURE REVIEW
2. LITERATURE REVIEW CBIR has come long way before 1990 and very little papers have been published at that time, however the number of papers published since 1997 is increasing. There are many CBIR algorithms
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationData Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationOutlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More informationSemantic Technologies in a Chemical Context
Semantic Technologies in a Chemical Context Quick wins and the long-term game Dr. Heinz-Gerd Kneip BASF SE The International Conference on Trends for Scientific Information Professionals ICIC 2010, 24.-27.10.2010,
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More informationOverview of Web Mining Techniques and its Application towards Web
Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous
More informationData Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining
More information