Data Collection & Data Preprocessing

Similar documents
Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Chapter 6: Information Retrieval and Web Search. An introduction

CS47300: Web Information Search and Management

Crawling Tweets & Pre- Processing

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

CS47300: Web Information Search and Management

Information Retrieval and Web Search

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Introduction to Information Retrieval

Information Retrieval. Lecture 10 - Web crawling

CS 572: Information Retrieval

Information Extraction

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

How Does a Search Engine Work? Part 1

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Information Retrieval Spring Web retrieval

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Information Retrieval

Information Retrieval May 15. Web retrieval

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

CS6200 Information Retreival. Crawling. June 10, 2015

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

Search Engines. Charles Severance

DATA MINING II - 1DL460. Spring 2014"

Search Engine Optimization (SEO) & Your Online Success

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

Information Retrieval

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali

Text Mining for Software Engineering

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Chapter 27 Introduction to Information Retrieval and Web Search

DATA MINING - 1DL105, 1DL111

Search Engines. Information Retrieval in Practice

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Large Crawls of the Web for Linguistic Purposes

World Wide Web has specific challenges and opportunities

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

Web scraping and social media scraping introduction

Glossary of on line marketing terms

Crawling Assignment. Introduction to Information Retrieval CS 150 Donald J. Patterson

CS47300: Web Information Search and Management

Mining Web Data. Lijun Zhang

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Information Extraction Techniques in Terrorism Surveillance

CS101 Lecture 30: How Search Works and searching algorithms.

DATA MINING II - 1DL460. Spring 2017

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Collection Building on the Web. Basic Algorithm

Chapter 4. Processing Text

CRM-to-CRM Data Migration. CRM system. The CRM systems included Know What Data Will Map...3

Efficient Algorithms for Preprocessing and Stemming of Tweets in a Sentiment Analysis System

An Approach To Web Content Mining

Prior Art Retrieval Using Various Patent Document Fields Contents

12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

Digital Libraries: Language Technologies

Answer Extraction. NLP Systems and Applications Ling573 May 13, 2014

HOW DOES A SEARCH ENGINE WORK?

Collective Intelligence in Action

TIC: A Topic-based Intelligent Crawler

Introduction to Text Mining. Hongning Wang

Web Search. Web Spidering. Introduction

Web Crawling and Basic Text Analysis. Hongning Wang

Mining Web Data. Lijun Zhang

Search Like a Pro. How Search Engines Work. Comparison Search Engine. Comparison Search Engine. How Search Engines Work

A Multilingual Social Media Linguistic Corpus

URLs excluded by REP may still appear in a search engine index.

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

Designing is the most important phase of software development. It requires

Information Retrieval

Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger. Mahmoud El-Haj Paul Rayson Scott Piao Jo Knight

Web Page Similarity Searching Based on Web Content

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Deliverable D1.4 Report Describing Integration Strategies and Experiments

Chapter 2: Literature Review

Advertising Network Affiliate Marketing Algorithm Analytics Auto responder autoresponder Backlinks Blog

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

3 Follows two- and three-part oral directions. TE: 44, 57, 58 59, , PE: 57, 58 59, ,

Performance Evaluation of a Regular Expression Crawler and Indexer

Question Answering Using XML-Tagged Documents

NLP - Based Expert System for Database Design and Development

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

We Push Buttons. SEO Glossary

Performance Assessment using Text Mining

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Module 3: GATE and Social Media. Part 4. Named entities

A Survey on Web Information Retrieval Technologies

Transcription:

Data Collection & Data Preprocessing Bayu Distiawan Natural Language Processing & Text Mining Short Course Pusat Ilmu Komputer UI 22 26 Agustus 2016

DATA COLLECTION Fakultas Ilmu Komputer Universitas Indonesia

Text Mining Process Task Definition Data Collection Data Preparation Data Processing Evaluation

Data Collection Collect from internal source. Collaboration with partner Pay for the data Collect public data

Collecting public data Available corpus http://qwone.com/~jason/20newsgroups/ http://www.daviddlewis.com/resources/testcolle ctions/reuters21578/ https://dumps.wikimedia.org/ http://schwa.org/projects/resources/wiki/wikin er Available Data The Internet!

Collecting Data From Internet Crawler Spider Robot (or bot) Web agent Wanderer, worm, And famous instances: googlebot, scooter, slurp, msnbot,

7 Sec. 20.2 Basic crawler operation Begin with known seed URLs Fetch and parse them Extract URLs they point to Place the extracted URLs on a queue Fetch each URL on the queue and repeat

8 Sec. 20.2 Crawling picture URLs crawled and parsed Unseen Web Web Seed pages URLs frontier

9 Sec. 20.1.1 Simple picture complications Web crawling isn t feasible with one machine All of the above steps distributed Malicious pages Spam pages Spider traps incl dynamically generated Even non-malicious pages pose challenges Latency/bandwidth to remote servers vary Webmasters stipulations How deep should you crawl a site s URL hierarchy? Site mirrors and duplicate pages Politeness don t hit a server too often

10 Sec. 20.1.1 What any crawler must do Be Polite: Respect implicit and explicit politeness considerations Only crawl allowed pages Respect robots.txt (more on this shortly) Be Robust: Be immune to spider traps and other malicious behavior from web servers

After Crawling???

Extracting Information Gather information from unstructured data Creating relational like data Entity Extraction [Pogba, EPL] Entity coreference [Pogba = EPL Player] Time Identification [August, 2016] RelationExtraction [MU is located in England] Event Mention Extraction and Event Coreference Resolution [Pogba is transferred to MU in August 2016]

DATA PREPARATION/PREPROCESSING Fakultas Ilmu Komputer Universitas Indonesia

Data Preprocessing Depends on the task Some preprocessing: Sentence Splitting Filtering Stemming Normalization POS Tagging NP Chunking Parsing Etc.

Sentence Splitting Split paragraph/article into sentences Manchester United have agreed a world record deal to sign Paul Pogba for 110 million, Goal understands. Officials from the Premier League club met with their Juventus counterparts earlier on Wednesday to discuss a deal to bring Pogba back to Old Trafford. It is now understood that United have settled on a fee of 110m for Pogba, which eclipses the previous record set when Real Madrid paid 100m for Gareth Bale in 2013. Manchester United have agreed a world record deal to sign Paul Pogba for 110 million, Goal understands. Officials from the Premier League club met with their Juventus counterparts earlier on Wednesday to discuss a deal to bring Pogba back to Old Trafford. It is now understood that United have settled on a fee of 110m for Pogba, which eclipses the previous record set when Real Madrid paid 100m for Gareth Bale in 2013.

Filtering/Stop Word Removal Many of the most frequently used words in English are useless in IR and text mining these words are called stop words. the, of, and, to,. Typically about 400 to 500 such words For an application, an additional domain specific stopwords list may be constructed Why do we need to remove stopwords? Reduce indexing (or data) file size stopwords accounts 20-30% of total word counts. Improve efficiency and effectiveness stopwords are not useful for searching or text mining they may also confuse the retrieval system.

Filtering/Stop Word Removal Word distribution frequency

Filtering/Stop Word Removal Top 50 words from AP89 corpus

Stemming Techniques used to find out the root/stem of a word. E.g., user engineering users engineered used engineer using stem: use engineer Usefulness: improving effectiveness of IR and text mining matching similar words Mainly improve recall reducing indexing size combing words with same roots may reduce indexing size as much as 40-50%. 19

Basic stemming methods Using a set of rules. E.g., remove ending if a word ends with a consonant other than s, followed by an s, then delete s. if a word ends in es, drop the s. if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.... transform words if a word ends with ies but not eies or aies then ies --> y. 20

Normalization Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens (Stanford IR Book). U.S.A, USA usa windows,windows, window, Window windows

Lexical Normalization in Social Media User creativity on social media creates a problem for NLP Processing. I love u -> i love you tmrw -> tomorrow 4eva -> forever

Lexical Normalization in Social Media Technique: Using dictionary Other technique (Han & Baldwin, 2011): Machine learning, features: Edit distance value Prefix substring Suffix substring Longest common subsequence (LCS)

Linguistic Pre-processing Advanced preprocessing task POS Tagging Budi/NN eats/vb bakso/nn NP Chunking [NP The most expensive footballer] wearing [NP a red shirt] Named Entity Recognition <PERS>President Joko Widodo</PERS> meets Ahok in <LOC>Istana Negara</LOC> Parsing

Thank you