Unstructured Data. CS102 Winter 2019

Big Data Tools and Techniques
- Basic Data Manipulation and Analysis: performing well-defined computations or asking well-defined questions ("queries")
- Data Mining: looking for patterns in data
- Machine Learning: using data to make inferences or predictions
- Data Visualization: graphical depiction of data
- Data Collection and Preparation
Over unstructured data (text, images, video)

Analyzing Text
Much of the world's data is in the form of free text
- Different sizes of text elements
  - Small: tweets
  - Medium: emails, product reviews
  - Large: documents
  - Very large: books
- Different sizes of text corpuses
  - Ultimate text corpus: the web

Types of Text Processing
1. Search (text analytics, text mining)
   - For specific words, phrases, patterns
   - Find common patterns, phrase completions
2. Understand (natural language processing/understanding)
   - Parts of speech, parse trees
   - Entity recognition: people, places, organizations
   - Disambiguation: "ford", "jaguar"
   - Sentiment analysis: happy, sad, angry, ...

Applications of Text Analytics / NLP
- Search engines
- Spam classification
- News feed management
- Document summarization
- Language translation
- Speech-to-text
- Many, many more

Information Retrieval
Long-standing subfield of computer science, since the 1950s!
Goal: retrieve information (typically text) that matches a search (typically words or phrases)
1. Text corpus usually too large to scan
2. Return most relevant answers first
Some techniques and tools
- Inverted indices
- N-grams
- TF-IDF
- Regular expressions
Seem familiar?

Inverted Indices
Goal: find all documents in a corpus containing a specific word (e.g., web search)
- Impossible to scan all documents for the word
- Scan the corpus in advance to build an "inverted index"; keep it up to date as content changes

Word      Document IDs
the       1, 2, 3, 5, 7, 8, 9, ...
quick     2, 5, 7, 13, ...
brown     1, 2, 7, 16, 17, ...
fox       7, 9, 11, 13, ...
jumped    3, 4, 6, 7, ...
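
The index-building step can be sketched in a few lines of Python. This is a minimal illustration, not the method any particular search engine uses: the function name `build_inverted_index` and the three-document toy corpus are made up for the example, and words are split on whitespace only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy corpus: document ID -> text
docs = {
    1: "the brown bear",
    2: "the quick brown fox",
    3: "the dog jumped",
}

index = build_inverted_index(docs)
print(index["brown"])  # IDs of the documents that contain "brown"
```

A lookup is then a single dictionary access rather than a scan over every document, which is the point of building the index in advance.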

N-grams
Goal: find all documents in a corpus containing a specific phrase (e.g., web search)
- Cannot create inverted index for all phrases
- Create indexes of n-grams: phrases of length n

3-gram             Document IDs
the quick brown    2, 7, 15, 23, ...
fox jumped over    7, 23, 35
the lazy dog       23, 35, 87, 88, ...
over the lazy      5, 9, 23, 88, ...

Note: the index on the previous slide was 1-grams
Also helps with longer phrases
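
One way a 3-gram index can help with longer phrases is to intersect the document sets of every 3-gram in the phrase. The sketch below assumes whitespace tokenization and uses hypothetical helper names (`ngrams`, `build_ngram_index`, `find_phrase`) and a made-up corpus. The result is a candidate set: a complete system would still confirm that the 3-grams appear consecutively in each candidate.

```python
from collections import defaultdict

def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_ngram_index(docs, n=3):
    """Map each n-gram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index

def find_phrase(index, phrase, n=3):
    """Candidate documents for a phrase of at least n words:
    the intersection of the document sets of all its n-grams."""
    grams = ngrams(phrase.lower().split(), n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Hypothetical toy corpus: document ID -> text
docs = {
    2: "the quick brown fox jumped over the lazy dog",
    7: "the quick brown cat",
    23: "a fox jumped over the lazy dog",
}

index = build_ngram_index(docs)
print(find_phrase(index, "the quick brown"))
print(find_phrase(index, "jumped over the lazy dog"))
```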

N-grams
N-gram frequency counts, typically over an entire text corpus or language, are useful for:
- Search auto-completion, spelling correction
- Summarization
- Speech recognition, machine translation

3-gram             Frequency
quick brown fox    1500
quick brown dog    800
the lazy dog       2500
the lazy person    50
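
As an illustration of auto-completion, the frequency table can be used to pick the most likely third word after a two-word prefix. This is only a sketch: the counts are the four from the slide, and the function name `complete` is hypothetical.

```python
# 3-gram frequency counts taken from the slide's example table
trigram_counts = {
    ("quick", "brown", "fox"): 1500,
    ("quick", "brown", "dog"): 800,
    ("the", "lazy", "dog"): 2500,
    ("the", "lazy", "person"): 50,
}

def complete(prefix, counts):
    """Return the most frequent word following a two-word prefix, or None."""
    prefix = tuple(prefix)
    candidates = {gram[2]: n for gram, n in counts.items() if gram[:2] == prefix}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(complete(("quick", "brown"), trigram_counts))
print(complete(("the", "lazy"), trigram_counts))
```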

TF-IDF
Goal: estimate how important a given keyword is to a document (e.g., for keyword search ranking)
1. Use word frequency:
   - "the" occurs very frequently
   - "jaguar" occurs much less frequently
   - Which word is more important in a document about wild cats?
2. Incorporate word specialization:
   TF-IDF = Term Frequency / Document Frequency
   (IDF = "Inverse Document Frequency")

TF-IDF
TF-IDF = Term Frequency / Document Frequency
- Term frequency: # occurrences / # total words
- Document frequency: # documents with at least one occurrence
(The examples below use the raw number of occurrences as the term frequency.)
Ex: "the" occurs 50 times in document, occurs in 1000 documents in corpus: TF-IDF = 50/1000 = .05
Ex: "jaguar" occurs 10 times in document, occurs in 30 documents in corpus: TF-IDF = 10/30 = .33
Words with very high document and term frequency are typically stopwords (and, at, but, if, that, the, ...), often excluded from text analysis
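
The two examples can be checked with a few lines of Python. This is a minimal sketch of the simplified ratio used on these slides, with the raw occurrence count standing in for term frequency (which is what the example numbers imply); the function name `tf_idf` is made up for illustration. Many formulations used elsewhere instead multiply term frequency by the logarithm of the inverse document frequency.

```python
def tf_idf(term_count, doc_frequency):
    """Simplified score from the slides: occurrences in the document
    divided by the number of corpus documents containing the term."""
    return term_count / doc_frequency

print(round(tf_idf(50, 1000), 2))  # "the":    0.05
print(round(tf_idf(10, 30), 2))    # "jaguar": 0.33
```

The rarer word gets the higher score even though it occurs fewer times in the document.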

Regular Expressions
Goal: search for text that matches a certain pattern
Sample of regular expression constructs:

Regular expression   Matches
.                    any single character
exp?                 zero or one instances of exp
exp*                 zero or more instances of exp
exp+                 one or more instances of exp
exp1 | exp2          either exp1 or exp2

May be composed for arbitrarily complex expressions

Regular Expressions

Regular expression   Matches
.                    any single character
exp?                 zero or one instances of exp
exp*                 zero or more instances of exp
exp+                 one or more instances of exp
exp1 | exp2          either exp1 or exp2

Examples:
((Prof|Dr|Ms)\. )?Widom
(.+)@(.+)\.(edu|com)
http://(.*)stanford(.*)
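
The example patterns can be tried with Python's `re` module. The sketch assumes the alternation bars (`|`) and escaped dots (`\.`) that the slide text appears to have lost in extraction, and the test strings are hypothetical.

```python
import re

# Patterns as reconstructed from the slide (alternation bars and escaped dots assumed)
name_pattern = r"((Prof|Dr|Ms)\. )?Widom"
email_pattern = r"(.+)@(.+)\.(edu|com)"
url_pattern = r"http://(.*)stanford(.*)"

# fullmatch succeeds only if the whole string matches the pattern
print(bool(re.fullmatch(name_pattern, "Prof. Widom")))
print(bool(re.fullmatch(name_pattern, "Widom")))
print(bool(re.fullmatch(name_pattern, "Mr. Widom")))

# Parenthesized groups capture the pieces of a match
match = re.fullmatch(email_pattern, "jane@example.edu")  # hypothetical address
print(match.group(1), match.group(2), match.group(3))

print(bool(re.fullmatch(url_pattern, "http://cs.stanford.edu/people")))
```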

Precision and Recall
Measures of correctness or quality of results:
- Basic data analysis: expect the correct answer
- Data mining: results based on support and confidence thresholds
- Machine learning: accuracy of predicted values or labels (average error, percent correct)
- Text and image analysis: precision and recall

Precision and Recall
Example: search for documents about cats
Assume "ground truth": every document is either about cats or it's not
- Precision: fraction of returned documents that are about cats
- Recall: fraction of documents about cats in the corpus that are returned by the search

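Precision and Recall
[Venn diagram: "Documents about cats" overlapping "Documents returned"]
- A: documents about cats that are returned (the overlap)
- B: documents about cats that are not returned
- C: returned documents that are not about cats

Recall = A / (B + A)
Precision = A / (A + C)

- Return one correct document: 100% precision
- Return all documents: 100% recall
- Challenge is to achieve high values for both
More complex when results are ranked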
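
Both measures reduce to set arithmetic over document IDs. The following sketch uses a hypothetical ten-document corpus in which documents 1 to 4 are about cats; the function name `precision_recall` is made up for the example.

```python
def precision_recall(relevant, returned):
    """Precision and recall of a returned set against the ground-truth relevant set."""
    hits = len(relevant & returned)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

corpus = set(range(1, 11))  # hypothetical document IDs 1..10
about_cats = {1, 2, 3, 4}   # assumed ground truth

p, r = precision_recall(about_cats, {3, 4, 5})
print(p, r)

# The two extremes from the slide
print(precision_recall(about_cats, {3}))     # one correct document
print(precision_recall(about_cats, corpus))  # every document
```

Returning a single correct document gives perfect precision but low recall, and returning the whole corpus gives perfect recall but low precision, which is why the challenge is achieving both.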

Hands-On Text Analysis
- Dataset: wine reviews (200 in class, 10K in assignment)
- Python packages
  - Regular expressions (re)
  - Natural Language Toolkit (nltk)
- Text-scanning style only, no indices
- For help while working with nltk:
  - Tutorials and help pages (website: Course Materials)
  - Web search

Image Analysis
- Find images in a corpus based on specification
  - E.g., color, shape, object, focus
- Rank images based on specification
- Classify or label images (machine learning)

Image Analysis: Progression
- First approach: human labeling
- Second approach: shape recognition
- Third approach: machine learning

Video Analysis
Video = series of images + audio stream
Video search can take different forms
- Like image search (colors, shapes, objects) but with added time element
- Like text search over audio stream
- Combine the two
Example: find all videos with people in more than 50% of the frames and occurrences of both "Stanford" and "MIT" in the audio track

Hands-on Image Analysis
- Dataset: 206 country flags
- Simple image-by-image color analysis
  - All images are a grid of pixels, e.g., 200 x 200
  - Each pixel is one color
  - Color represented as RGB [0-255, 0-255, 0-255]
- Python package: Python Imaging Library (PIL)
  - Newer version: Pillow
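
A pixel grid of RGB triples can be analyzed with plain Python, as in the sketch below. The tiny two-color "flag" and the function name `color_fractions` are hypothetical, and the grid is built by hand so the example runs with the standard library alone; with Pillow, the pixels would instead be read from an image file.

```python
from collections import Counter

def color_fractions(pixels):
    """Given a grid (list of rows) of (r, g, b) tuples, return the
    fraction of the image covered by each distinct color."""
    counts = Counter(color for row in pixels for color in row)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}

RED, WHITE = (255, 0, 0), (255, 255, 255)

# Hypothetical 4 x 6 "flag": top half red, bottom half white
flag = [[RED] * 6] * 2 + [[WHITE] * 6] * 2

fractions = color_fractions(flag)
print(fractions[RED], fractions[WHITE])
```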