Text Analytics (Text Mining)

Similar documents
Text Analytics (Text Mining)

Text Analytics (Text Mining)

Text Analytics (Text Mining)

Analytics Building Blocks

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Mining Concepts & Tasks

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Data Mining Concepts & Tasks

Classification Key Concepts

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Data Collection, Simple Storage (SQLite) & Cleaning

Information Retrieval. (M&S Ch 15)

Chapter 6: Information Retrieval and Web Search. An introduction

Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Introduction to Information Retrieval

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Information Retrieval: Retrieval Models

CS 6320 Natural Language Processing

Document Clustering: Comparison of Similarity Measures

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Clustering. Bruno Martins. 1 st Semester 2012/2013

Information Retrieval and Web Search Engines

Information Retrieval. hussein suleman uct cs

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Mining Web Data. Lijun Zhang

Chapter 27 Introduction to Information Retrieval and Web Search

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS/INFO 1305 Information Retrieval

Problem 1: Complexity of Update Rules for Logistic Regression

CS/INFO 1305 Summer 2009

Feature selection. LING 572 Fei Xia

Introduction to Information Retrieval

CSE 494: Information Retrieval, Mining and Integration on the Internet

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Mining Web Data. Lijun Zhang

Information Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Clustering Results. Result List Example. Clustering Results. Information Retrieval

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Chapter 2. Architecture of a Search Engine

Information Retrieval

Information Retrieval and Web Search Engines

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Developing Focused Crawlers for Genre Specific Search Engines

Classification Key Concepts

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING

Reading group on Ontologies and NLP:

Introduction to Information Retrieval

Behavioral Data Mining. Lecture 18 Clustering

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Instructor: Stefan Savev

Last Week: Visualization Design II

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Information Retrieval

vector space retrieval many slides courtesy James Amherst

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

60-538: Information Retrieval

Multimedia Information Systems

Clustering: Overview and K-means algorithm

Lecture 8 May 7, Prabhakar Raghavan

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

modern database systems lecture 4 : information retrieval

Text Documents clustering using K Means Algorithm

Clustering: Overview and K-means algorithm

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

Introduction to Information Retrieval

Information Retrieval

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

CS54701: Information Retrieval

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Boolean Model. Hongning Wang

Clustering (COSC 416) Nazli Goharian. Document Clustering.

CSE 5243 INTRO. TO DATA MINING

Introduction to Data Mining

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

SOCIAL MEDIA MINING. Data Mining Essentials

Information Retrieval and Web Search

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Based on Raymond J. Mooney s slides

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Information Retrieval

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

Transcription:

CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Text is everywhere We use documents as primary information artifact in our lives" Our access to documents has grown tremendously in recent years due to networking infrastructure" WWW: webpages, Twitter, Facebook, Wikipedia, Blogs,..." Digital libraries: Google books, ACM, IEEE,..." Lyrics, closed caption... (youtube)" 2

Big (Research) Questions... in understanding and gathering information from text and document collections" establish authorship, authenticity; plagiarism detection" finding patterns in human genome" classification of genres for narratives (e.g., books, articles)" tone classification; sentiment analysis (online reviews, twitter, social media)" code: syntax analysis (e.g., find common bugs from students answers)" " " " " 3

Outline Storage (full text storage and full text search in SQLite, MySQL)! Preprocessing (e.g., stemming)! Document representation: bag-of-words model! Word importance (e.g., word count, TF-IDF)! Word disambiguation/entity resolution! Document importance (e.g., PageRank)! Document similarity (e.g., cosine similarity, Apolo/Belief Propagation, etc.)! Retrieval (Latent Semantic Indexing)!!! Prof. Jacob Eisenstein s NLP class has far more complete info on this topic. 4

Bags-of-words model Represent each document as a bag of words, ignoring words ordering.! Why do this?! Unstructured text -> a vector of numbers! e.g., docs: I like visualization, I like data.! I : 1,! like : 2,! data : 3,! visualization : 4! I like visualization -> [1, 1, 0, 1]! I like data -> [1, 1, 1, 0]! Simplified computation! 5

TF-IDF (importance score for a word in a document) When and where to use it?! Everywhere you use word count, you can use TF-IDF.! TF: term frequency! IDF: inverse document frequency! score = TF/IDF 6

Stemming Reduce words to their stems (or base forms)! Words: compute, computing, computer,...! Stem: comput! Several classes of algorithms to do this:! Stripping suffixes, lookup-based, etc.! 7

Vector Space Model and Clustering keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for similar vectors

main idea: Vector Space Model and Clustering document...data... indexing aaron data zoo V (= vocabulary size)

Vector Space Model and Clustering Then, group nearby vectors together Q1: cluster search? Q2: cluster generation?! Two significant contributions ranked output relevance feedback

Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively MD TRs CS TRs

Vector Space Model and Clustering ranked output: easy! MD TRs CS TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio 73] MD TRs CS TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio 73] How? MD TRs CS TRs

Vector Space Model and Clustering How? A: by adding the good vectors and subtracting the bad ones MD TRs CS TRs

main idea cluster search cluster generation evaluation Outline - detailed

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity

Cluster generation Q1: document-to-document similarity (recall: bag of words representation) D1: { data, retrieval, system } D2: { lung, pulmonary, system } distance/similarity functions?

Cluster generation A1: # of words in common A2:... normalized by the vocabulary sizes A3:... etc! About the same performance - prevailing one: cosine similarity

Cluster generation cosine similarity: similarity(d1, D2) = cos(θ) = sum(v 1,i * v 2,i ) / [len(v 1 ) * len(v 2 )] D1 θ D2

Cluster generation cosine similarity - observations: related to the Euclidean distance weights v i,j : according to tf/idf D1 θ D2

Cluster generation tf ( term frequency ) high, if the term appears very often in this document.! idf ( inverse document frequency ) penalizes common words, that appear in almost every document

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity?

Cluster generation A1: min distance ( single-link ) A2: max distance ( all-link ) A3: avg distance (gives same cluster ranking as A4, but different values) A4: distance to centroid?

Cluster generation A1: min distance ( single-link ) leads to elongated clusters A2: max distance ( all-link ) many, small, tight clusters A3: avg distance in between the above A4: distance to centroid fast to compute

Cluster generation We have document-to-document similarity document-to-cluster similarity! Q: How to group documents into natural clusters

Cluster generation A: *many-many* algorithms - in two groups [VanRijsbergen]: theoretically sound (O(N^2)) independent of the insertion order iterative (O(N), O(N log(n))

Cluster generation - sound methods Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cut 0.8 cat tiger horse cow 0.3 0.1

Cluster generation - sound methods Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers) like k-means but how to decide k?

Cluster generation - sound methods Approach#3: Graph theoretic [Zahn]: build MST; delete edges longer than 3* std of the local average

Cluster generation - sound methods Result: why 3? variations Complexity?

Cluster generation - iterative methods General outline: Choose seeds (how?) assign each vector to its closest seed (possibly adjusting cluster centroid) possibly, re-assign some vectors to improve clusters Fast and practical, but unpredictable

Cluster generation one way to estimate # of clusters k: the cover coefficient [Can+] ~ SVD

LSI - Detailed outline LSI problem definition main idea experiments

Information Filtering + LSI [Foltz+, 92] Goal: users specify interests (= keywords) system alerts them, on suitable news-documents Major contribution: LSI = Latent Semantic Indexing latent ( hidden ) concepts

Information Filtering + LSI Main idea map each document into some concepts map each term into some concepts! Concept :~ a set of terms, with weights, e.g. DBMS_concept: data (0.8), system (0.5), retrieval (0.6)

Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI Pictorially: concept-document matrix and...

Information Filtering + LSI... and concept-term matrix

Information Filtering + LSI Q: How to search, e.g., for system?

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus:! we may retrieve documents that DON T have the term system, but they contain almost everything else ( data, retrieval )

LSI - Discussion - Conclusions Great idea, to derive concepts from documents to build a statistical thesaurus automatically to reduce dimensionality (down to few concepts ) How exactly SVD works? (Details, next)