Lecture 4: Data Collection and Munging

Similar documents
CS109 Data Science Data Munging

Web Scraping with Python

Web scraping tools, a real life application

HTML CS 4640 Programming Languages for Web Applications

Index. Autothrottling,

Data Acquisition and Processing

ECPR Methods Summer School: Automated Collection of Web and Social Data. github.com/pablobarbera/ecpr-sc103

BEFORE CLASS. If you haven t already installed the Firebug extension for Firefox, download it now from

Using Development Tools to Examine Webpages

webdriverplus Release 0.1

Introduction to Corpora

Web scraping and social media scraping handling JS

Web scraping. with Scrapy

HTML, XHTML, and CSS. Sixth Edition. Chapter 1. Introduction to HTML, XHTML, and

Part 1: jquery & History of DOM Scripting

A network is a group of two or more computers that are connected to share resources and information.

The course also includes an overview of some of the most popular frameworks that you will most likely encounter in your real work environments.

Web Scrapping. (Lectures on High-performance Computing for Economists X)

The Structure of the Web. Jim and Matthew

Web scraping and social media scraping introduction

DATA COLLECTION. Slides by WESLEY WILLETT 13 FEB 2014

Best Practices Chapter 5

Index. alt, 38, 57 class, 86, 88, 101, 107 href, 24, 51, 57 id, 86 88, 98 overview, 37. src, 37, 57. backend, WordPress, 146, 148

Assignment: Seminole Movie Connection

This document is for informational purposes only. PowerMapper Software makes no warranties, express or implied in this document.

Graphiq Reality. Product Requirement Document. By Team Graphiq Content. Vincent Duong Kevin Mai Navdeep Sandhu Vincent Tan Xinglun Xu Jiapei Yao

HTML5 MOCK TEST HTML5 MOCK TEST I

INTRODUCTION TO HTML5! HTML5 Page Structure!

Network Programming in Python. What is Web Scraping? Server GET HTML

DATABASE SYSTEMS. Database programming in a web environment. Database System Course, 2016

Patrick Downes Rutgers University-New Brunswick School of Management and Labor Relations WEB SCRAPING FOR RESEARCH

COMS W3101: SCRIPTING LANGUAGES: JAVASCRIPT (FALL 2018)

BeautifulSoup: Web Scraping with Python

Contents. 1. Using Cherry 1.1 Getting started 1.2 Logging in

Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms

2nd Year PhD Student, CMU. Research: mashups and end-user programming (EUP) Creator of Marmite

Table Basics. The structure of an table

Data. Notes. are required reading for the week. textbook reading and a few slides on data formats and data cleaning

Document Object Model. Overview

Web Site Development with HTML/JavaScrip

Introduction to Web Scraping with Python

S. Rinzivillo DATA VISUALIZATION AND VISUAL ANALYTICS

Integration Test Plan

Website SEO Checklist

HTML + CSS. ScottyLabs WDW. Overview HTML Tags CSS Properties Resources

CIS192 Python Programming

Frontend guide. Everything you need to know about HTML, CSS, JavaScript and DOM. Dejan V Čančarević

This course is designed for web developers that want to learn HTML5, CSS3, JavaScript and jquery.

Review of HTML. Chapter Pearson. Fundamentals of Web Development. Randy Connolly and Ricardo Hoar

[paf Wj] open source. Selenium 1.0 Testing Tools. Beginner's Guide. using the Selenium Framework to ensure the quality

Spade Documentation. Release 0.1. Sam Liu

Web Site Design and Development. CS 0134 Fall 2018 Tues and Thurs 1:00 2:15PM

CIS192 Python Programming

2/6/2012. Rich Internet Applications. What is Ajax? Defining AJAX. Asynchronous JavaScript and XML Term coined in 2005 by Jesse James Garrett

ThingLink User Guide. Andy Chen Eric Ouyang Giovanni Tenorio Ashton Yon

,

Friday, January 18, 13. : Now & in the Future

AJAX ASYNCHRONOUS JAVASCRIPT AND XML. Laura Farinetti - DAUIN

Create web pages in HTML with a text editor, following the rules of XHTML syntax and using appropriate HTML tags Create a web page that includes

footer, header, nav, section. search. ! Better Accessibility.! Cleaner Code. ! Smarter Storage.! Better Interactions.

CS50 Quiz Review. November 13, 2017

LAST WEEK ON IO LAB. Install Firebug and Greasemonkey. Complete the online skills assessment. Join the mailing list.

NEW WEBMASTER HTML & CSS FOR BEGINNERS COURSE SYNOPSIS

LECTURE 13. Intro to Web Development

Web Development and HTML. Shan-Hung Wu CS, NTHU

The Benefits of CSS. Less work: Change look of the whole site with one edit

Understanding this structure is pretty straightforward, but nonetheless crucial to working with HTML, CSS, and JavaScript.

LECTURE 13. Intro to Web Development

XML: some structural principles

Jquery Manually Set Checkbox Checked Or Not


jquery Tutorial for Beginners: Nothing But the Goods

TIME SCHEDULE MODULE TOPICS PERIODS. HTML Document Object Model (DOM) and javascript Object Notation (JSON)

data analysis - basic steps Arend Hintze

Proper_Name Final Exam December 21, 2005 CS-081/Vickery Page 1 of 4

welcome to BOILERCAMP HOW TO WEB DEV

XML. Jonathan Geisler. April 18, 2008

DATABASE SYSTEMS. Database programming in a web environment. Database System Course,

Web Computing. Revision Notes

The Black Magic of Flash SEO

Table of Contents. Navigate the Management Menu. 911 Management Page

week8 Tommy MacWilliam week8 October 31, 2011

Introduction to Automation. What is automation testing Advantages of Automation Testing How to learn any automation tool Types of Automation tools

Comp 426 Midterm Fall 2013

Index LICENSED PRODUCT NOT FOR RESALE

Full file at New Perspectives on HTML and CSS 6 th Edition Instructor s Manual 1 of 13. HTML and CSS

Creating an Online Catalogue Search for CD Collection with AJAX, XML, and PHP Using a Relational Database Server on WAMP/LAMP Server

Using the Force of Python and SAS Viya on Star Wars Fan Posts

CSC309 Winter Lecture 2. Larry Zhang

Brief Intro to Firebug Sukwon Oh CSC309, Summer 2015

CS7026 CSS3. CSS3 Graphics Effects

Student, Perfect Midterm Exam March 24, 2006 Exam ID: 3193 CS-081/Vickery Page 1 of 5

Website Report for facebook.com

Website Development (WEB) Lab Exercises

Data Visualization (CIS 468)

Technical SEO in 2018

Web Design and Application Development

Collecting Data from the Programmable Web

Web Design Course Syllabus and Course Outline

Web Scraping. Juan Riaza.!

Transcription:

Lecture 4: Data Collection and Munging Instructor:

Outline 1 Data Collection and Scraping 2 Web Scraping basics

In-Class Quizzes URL: http://m.socrative.com/ Room Name: 4f2bb99e

Data Collection

What you wish data looked like?

What does data really look like?

What does data really look like?

What Do Analysts Do?

What Do Analysts Do? I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I m lucky if I get to do any analysis. Most of the time once you transform the data you just do an average... the insights can be scarily obvious. It s fun when you get to do something somewhat analytical.

Data Collection Ethics OK: Public, non-sensitive, anonymized, fully referenced information (always cite sources) If in doubt, don t! Be a good web citizen: Honor robots.txt Obey rate limits. Do not overload servers Obey relevant copyright/license restrictions Know about fair use and its restrictions

Data Collection: 4 Things You Should Have 1 Raw data 2 Tidy data set 3 A code/process book describing each variable and its values in tidy data set 4 A explicit and exact recipe you used to go from 1 to 2, 3.

Tidy Data 1 Each variable you measure should be in one column 2 Each different observation of that variable should be in a different row 3 There should be one table for each kind of variable 4 If you have multiple tables, they should include a column in the table that allows them to be linked Other Tips: Include a row at top of each file with variable names Make variable names human readable AgeAtDiagnosis instead of AgeDx In general, data should be saved in one file per table

Why is the instruction list important?

Why is the instruction list important? 1 1 http://telegraph.co.uk

Data Access Schemes Bulk downloads: Wikipedia, IMDB, Million Song Database, etc. See list of data web sites on the Resources page API access: NY Times, Twitter, Facebook, Foursquare, Google,... Web scraping: For everything else

Data Formats Delimited Values Comma Separated Values (CSV) Tab Separated Values (TSV) Markup Languages Hypertext Markup Language (HTML / XML) JavaScript Object Notation (JSON) Hierarchical Data Format (HDF5) Ad Hoc Formations Graph edge lists, voting records, fixed width files,...

JSON Light weight text-based interchange format Language independent Based on Javascript Very easy to use and manipulate All browsers and major languages have great support

JSON A collection of name/value pairs: { a : 1, b : 2, c : 3, d : 4, } An ordered list of values: [1, 2, 3, blah ] Or a combination: { a : [1, 2, 3, 4], b : [5, 6, 7, 8], c : 4}

XML

HTML

Tags & Elements

Attributes

Classes and IDs

DOM

CSS

Referencing CSS

Web Scraping

Web Scraping 2 Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. 2 Wikipedia

Inspecting Web Pages Chrome: DevTools Firefox: Developer Tools, Firebug Safari and Internet Explorer

Chrome Developer Tools Demo

Web Scraping in Python HTTP Requests: urllib2, requests HTML Parsing: lxml, beautifulsoup, pattern Crawling: Scrapy Controlling Browsers: Selenium/WebDriver Headless Browsers: PhantomJS

CSS Selectors 3 * : Selects all elements p : Select all p tags #myid : Select element with id=myid.myclass : Select elements with class=myclass p #myid.myclass : Union of all the selections div code : Find all code tags inside a div li > ul : Select all ul inside li (first level only) strong + em : Select all em that is immediately preceded by strong prev siblings : Selects all sibling elements that follow after the prev element, have the same parent, and match the filtering siblings selector. 3 http://codylindley.com/jqueryselectors/

CSS Selectors 4 li[class] : all li with attribute class [a= b ]: All elements with attribute with name a with value b li:first : First element of li li:first-child : First child of li li:even : Even elements of li :text : All text boxes 4 http://codylindley.com/jqueryselectors/

Crawling and Spiders wget -mk -w 20 http://www.example.com/ -m : Mirror a website -k : convert urls to point to local files -w : Delay between requests. 20 = 20 seconds, 20m => 20 minutes and so on -r : recursive download -p : download all files that are necessary to properly display a given HTML page. -c : continue a incomplete download tries : Number of retries reject, -A : File types to reject and accept

Scrapy 5 from scrapy import Spider, Item, Field class Post(Item): title = Field() class BlogSpider(Spider): name, start_urls = blogspider, [ http://blog.scrapinghub.com ] def parse(self, response): return [Post(title=e.extract()) for e in response.css("h2 a::text")] 5 http://scrapy.org/

Web Driver Originally a tool for automating testing of web applications Now a W3C standard (http: //w3c.github.io/webdriver/webdriver-spec.html) Interface to Selenimum and WebDriver in Python

Selenium WebDriver 6 from selenium import webdriver from selenium.webdriver.common.keys import Keys driver = webdriver.firefox() driver.get("http://www.python.org") elem = driver.find_element_by_name("q") elem.send_keys("pycon") elem.send_keys(keys.return) 6 https://selenium-python.readthedocs.org/

PhantomJS 7 Scriptable Headless WebKit Automating web related workflow Use Cases: Headless web testing Page automation. Screen capture. Network monitoring. 7 https://github.com/ariya/phantomjs

CasperJS 8 Navigation scripting and testing utility Built for PhantomJS Easy to define full navigation scenarios Syntactic sugar to make life very easy 8 http://casperjs.org/

IMDB Web Scraping Demo

Facebook Web Scraping Demo

Summary Major Concepts: Data Collection and Scraping Tools and Techniques

Slide Material References Slides from Harvard CS 109 (2013) Slides by Jeff Leek Also see slide footnotes