

Network Programming in Python
Charles Severance
www.dr-chuck.com

Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/
Copyright 2009, Charles Severance

What is Web Scraping?
- When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information and then looks at more web pages.
- Search engines scrape web pages. We call this "spidering the web" or "web crawling".
- http://en.wikipedia.org/wiki/web_scraping
- http://en.wikipedia.org/wiki/web_crawler

[Figure: the program sends a GET request to the server, and the server sends back HTML.]
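To make the GET/HTML exchange a little more concrete, here is a minimal sketch of the text a program sends when it "pretends to be a browser". The helper name build_get_request is made up for this illustration, and real browsers send many more headers than the two shown; only the request line and the Host header are essential in HTTP/1.1.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for url.

    Hypothetical helper for illustration; it only builds the text and
    does not open a network connection.
    """
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    lines = [
        "GET " + path + " HTTP/1.1",  # request line: method, path, version
        "Host: " + parts.netloc,      # which site on the server we want
        "Connection: close",          # ask the server to close when done
        "",                           # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


print(build_get_request("http://en.wikipedia.org/wiki/web_scraping"))
```

The server's reply to a request like this is a status line, some headers, a blank line, and then the HTML of the page.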

Why Scrape?
- Pull data, particularly social data: who links to whom?
- Get your own data back out of some system that has no export capability
- Monitor a site for new information
- Spider the web to make a database for a search engine

Scraping Web Pages
- There is some controversy about web page scraping, and some sites are a bit snippy about it.
- Google: facebook scraping block
- Republishing copyrighted information is not allowed
- Violating terms of service is not allowed
- http://www.facebook.com/terms.php

HTML and HTTP in Python
Using urllib

Using urllib to retrieve web pages
- http://docs.python.org/lib/module-urllib.html
- You get the entire web page when you do f.read(); lines are separated by a newline character
- We can split the contents into lines using the split() function

- Splitting the contents on the newline character gives us a nice list where each entry is a single line
- We can easily write a for loop to look through the lines

    >>> print len(contents)
    95328
    >>> lines = contents.split("\n")
    >>> print len(lines)
    2244
    >>> print lines[3]
    <style type="text/css">
    >>> for ln in lines:
    ...     # Do something for each line

A Simple Web Browser: returl.py

    import urllib

    url = raw_input("Enter a URL: ")
    print "Retrieving:", url
    f = urllib.urlopen(url)        # open the URL; f behaves like a file
    contents = f.read()            # read the whole page into one string
    print "Retrieved", len(contents), "characters"
    f.close()
    lines = contents.split('\n')   # one list entry per line of the page
    print "Retrieved", len(lines), "lines"

    python returl.py
    Enter a URL: http://www.dr-chuck.com/
    Retrieving: http://www.dr-chuck.com/
    Retrieved 104867 characters
    Retrieved 2401 lines

    python returl.py
    Enter a URL: http://www.appenginelearn.com/
    Retrieving: http://www.appenginelearn.com/
    Retrieved 6769 characters
    Retrieved 175 lines
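returl.py is written for Python 2 (raw_input, the print statement, urllib.urlopen). A rough Python 3 equivalent might look like the sketch below. The function names ensure_scheme, fetch, summarize and main are made up for this illustration, and decoding the page as UTF-8 is an assumption that will not hold for every site. ensure_scheme is there because urlopen needs a full URL such as http://www.dr-chuck.com/, not a bare host name.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def ensure_scheme(url, default="http"):
    """Prepend a scheme when only a bare host such as www.dr-chuck.com is given."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    return default + "://" + url


def fetch(url):
    """Download a page and return it as text (UTF-8 is an assumption)."""
    with urlopen(ensure_scheme(url)) as f:
        return f.read().decode("utf-8", errors="replace")


def summarize(contents):
    """Return (number of characters, number of lines) for a page's text."""
    lines = contents.split("\n")
    return len(contents), len(lines)


def main():
    """Interactive version, mirroring the prompts of returl.py."""
    url = input("Enter a URL: ")
    print("Retrieving:", url)
    chars, lines = summarize(fetch(url))
    print("Retrieved", chars, "characters")
    print("Retrieved", lines, "lines")
```

Calling main() prompts for a URL and prints the same two counts as returl.py. Keeping summarize separate from fetch means the counting logic can be tried on any string without touching the network.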

    python returl.py
    Enter a URL: www.dr-chuck.com
    Retrieving: www.dr-chuck.com
    Traceback (most recent call last):
      File "returl.py", line 6, in <module>
      2.6/lib/python2.6/urllib.py", line 87, in urlopen
      2.6/lib/python2.6/urllib.py", line 203, in open
      2.6/lib/python2.6/urllib.py", line 461, in open_file
      2.6/lib/python2.6/urllib.py", line 475, in open_local_file
    IOError: [Errno 2] No such file or directory: 'www.dr-chuck.com'

The URL was typed without the http:// prefix, so urllib treated it as the name of a local file.

Parsing HTML

Counting Anchor Tags
- We want to look through a web page and see how many lines have the string <a
- We don't care if there is more than one per line; we are looking for a rough number
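To see why the number is only rough, here is a small sketch that compares three ways of counting on a made-up three-line page. The SAMPLE text and the function names lines_with and occurrences_of are invented for this illustration. The sample contains three real anchor tags, all on one line, plus an <abbr> tag that also happens to start with <a.

```python
# Made-up page text for illustration, not taken from a real site.
SAMPLE = "\n".join([
    '<p><a href="one.htm">One</a>, <a href="two.htm">Two</a>, '
    '<a href="three.htm">Three</a></p>',
    '<abbr title="HyperText Markup Language">HTML</abbr>',
    '<p>No links on this line.</p>',
])


def lines_with(text, needle="<a"):
    """Count lines that contain needle at least once."""
    return sum(1 for line in text.split("\n") if line.find(needle) >= 0)


def occurrences_of(text, needle="<a"):
    """Count every occurrence of needle, including several on one line."""
    return text.count(needle)


print(lines_with(SAMPLE), "lines contain <a")
print(occurrences_of(SAMPLE), "occurrences of <a")
print(occurrences_of(SAMPLE, "<a "), "occurrences of <a followed by a space")
```

Counting lines undercounts the line with three links and is fooled by <abbr>; counting every "<a" fixes the first problem but not the second. Searching for "<a " with a trailing space skips <abbr> here, though it would miss a tag whose attributes start on the next line. For a quick estimate any of these is fine.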

    <body>
    <div id="header">
    <h1><a href="index.htm">appenginelearn</a></h1>
    <ul>
    <li><a href="/ezlaunch.htm">ez-launch</a></li>
    <li><a href=http://oreilly.com/catalog/9780596800697/ target=_new> Book</a></li>
    <li><a href=http://www.dr-chuck.com/ target=_new> Author</a></li>
    <li><a href=http://www.pythonlearn.com/>python</a></li>
    <li><a href=http://code.google.com/appengine/ target=_new> App Engine</a></li>
    </ul>
    </div>

hrefs.py

    import urllib

    url = raw_input("Enter a URL: ")
    print "Retrieving:", url
    f = urllib.urlopen(url)        # open the URL; f behaves like a file
    contents = f.read()
    print "Retrieved", len(contents), "characters"
    f.close()
    lines = contents.split('\n')
    print "Retrieved", len(lines), "lines"

    # Count the lines that contain at least one "<a"
    count = 0
    for line in lines:
        if line.find("<a") >= 0:
            count = count + 1
    print count, "lines with <a tag"

    python hrefs.py
    Enter a URL: http://www.dr-chuck.com/
    Retrieving: http://www.dr-chuck.com/
    Retrieved 104739 characters
    Retrieved 2405 lines
    63 lines with <a tag

    python hrefs.py
    Enter a URL: http://www.appenginelearn.com/
    Retrieving: http://www.appenginelearn.com/
    Retrieved 6769 characters
    Retrieved 175 lines
    33 lines with <a tag

Summary
- Python can easily retrieve data from the web and use its powerful string parsing capabilities to sift through the information and make sense of it
- We can build a simple directed web-spider for our own purposes
- Make sure that we do not violate the terms and conditions of a web site, and make sure not to use copyrighted material improperly
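As a closing sketch of the simple web-spider mentioned in the summary, the Python 3 code below retrieves a page, extracts its links, and then looks at more pages. Everything here is an illustration under stated assumptions: the pages live in a made-up in-memory dictionary (PAGES) so the example runs without a network, the names LinkCollector, extract_links and spider are invented, and link extraction uses the standard library's html.parser rather than the line-by-line string search shown in the slides.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pages kept in memory so the sketch runs without a network.
# A real spider would download each page instead of looking it up here.
PAGES = {
    "http://example.com/": '<a href="a.htm">A</a> <a href="/b.htm">B</a>',
    "http://example.com/a.htm": '<a href="b.htm">B</a> <a href="c.htm">C</a>',
    "http://example.com/b.htm": '<a href="http://example.com/">Home</a>',
    "http://example.com/c.htm": "<p>No links here.</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, href) for href in parser.hrefs]


def spider(start, get_page, max_pages=50):
    """Visit pages breadth-first from start; return URLs in visit order."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(url, get_page(url)):
            if link not in seen:      # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


for page in spider("http://example.com/", lambda url: PAGES.get(url, "")):
    print(page)
```

The get_page argument is where a real download function would be plugged in, and max_pages keeps the spider from wandering indefinitely.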