CIS192 Python Programming

Similar documents
CIS192 Python Programming

CIS192 Python Programming

CIS 192: Lecture 8 HTML Parsing

Scraping I: Introduction to BeautifulSoup

CIS192 Python Programming

Quick housekeeping Last Two Homeworks Extra Credit for demoing project prototypes Reminder about Project Deadlines/specifics Class on April 12th Resul

BeautifulSoup. Lab 16 HTML. Lab Objective: Learn how to load HTML documents into BeautifulSoup and navigate the resulting BeautifulSoup object

CIS192 Python Programming

Management Tools. Management Tools. About the Management GUI. About the CLI. This chapter contains the following sections:

Web Scraping. HTTP and Requests

Markup Language. Made up of elements Elements create a document tree

Web client programming

STARCOUNTER. Technical Overview

HTML + CSS. ScottyLabs WDW. Overview HTML Tags CSS Properties Resources

CHAPTER 1: GETTING STARTED WITH HTML CREATED BY L. ASMA RIKLI (ADAPTED FROM HTML, CSS, AND DYNAMIC HTML BY CAREY)

Network Programmability with Cisco Application Centric Infrastructure

CS109 Data Science Data Munging

Intro to XML. Borrowed, with author s permission, from:

This course is designed for web developers that want to learn HTML5, CSS3, JavaScript and jquery.

Web Development and HTML. Shan-Hung Wu CS, NTHU

Web Scrapping. (Lectures on High-performance Computing for Economists X)

Comp 426 Midterm Fall 2013

REST API OVERVIEW. Design and of Web APIs using the REST paradigm.

CIS 192: Lecture 9 Working with APIs

Website Report for colourways.com.au

UI Course HTML: (Html, CSS, JavaScript, JQuery, Bootstrap, AngularJS) Introduction. The World Wide Web (WWW) and history of HTML

HTML Processing in Python. 3. Construction of the HTML parse tree for further use and traversal.

UNIT I. A protocol is a precise set of rules defining how components communicate, the format of addresses, how data is split into packets

CSC9B1: Essential Skills WWW 1

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

15-388/688 - Practical Data Science: Data collection and scraping. J. Zico Kolter Carnegie Mellon University Spring 2017

Lecture 4: Data Collection and Munging

Lotus IT Hub. Module-1: Python Foundation (Mandatory)

Website Report for

MODULE 2 HTML 5 FUNDAMENTALS. HyperText. > Douglas Engelbart ( )

Website Report for facebook.com

Drexel Chatbot Requirements Specification

Uniform Resource Locators (URL)

GeneXus for Smart Devices course - Architecture of Smart Device Applications

An Introduction to Web Scraping with Python and DataCamp

Copyright 2014 Blue Net Corporation. All rights reserved

CSC Web Technologies, Spring Web Data Exchange Formats

REST in a Nutshell: A Mini Guide for Python Developers

CNIT 129S: Securing Web Applications. Ch 3: Web Application Technologies

Website Report for test.com

Introduction to Web Scraping with Python

Best Practices Chapter 5

Review of HTML. Chapter Pearson. Fundamentals of Web Development. Randy Connolly and Ricardo Hoar

Develop Mobile Front Ends Using Mobile Application Framework A - 2

5/19/2015. Objectives. JavaScript, Sixth Edition. Introduction to the World Wide Web (cont d.) Introduction to the World Wide Web

Introduction to Corpora

f5-icontrol-rest Documentation

Announcements. 1. Class webpage: Have you been reading the announcements? Lecture slides and coding examples will be posted

INTRODUCTION TO HTML5! HTML5 Page Structure!

A bit more on Testing

Web Engineering (Lecture 01)

What is a web site? Web editors Introduction to HTML (Hyper Text Markup Language)

DATABASE SYSTEMS. Database programming in a web environment. Database System Course, 2016

INLEDANDE WEBBPROGRAMMERING MED JAVASCRIPT INTRODUCTION TO WEB PROGRAMING USING JAVASCRIPT

ReST 2000 Roy Fielding W3C

OU Mashup V2. Display Page

ITP 140 Mobile Technologies. Mobile Topics

Varargs Training & Software Development Centre Private Limited, Module: HTML5, CSS3 & JavaScript

CSI 3140 WWW Structures, Techniques and Standards. Representing Web Data: XML

Large-Scale Networks

12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

11 TREES DATA STRUCTURES AND ALGORITHMS IMPLEMENTATION & APPLICATIONS IMRAN IHSAN ASSISTANT PROFESSOR, AIR UNIVERSITY, ISLAMABAD

Full Stack Web Developer Nanodegree Syllabus

How A Website Works. - Shobha

DATABASE SYSTEMS. Database programming in a web environment. Database System Course,

DNS and HTTP. A High-Level Overview of how the Internet works

CGI Architecture Diagram. Web browser takes response from web server and displays either the received file or error message.

Introduction to programming using Python

singly and doubly linked lists, one- and two-ended arrays, and circular arrays.

Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms

Proposal for TED. In recent years, as big data analysis has become a more common practice among

Introduction to Web Concepts & Technologies

linkgrabber Documentation

Announcements. 1. Class webpage: Have you been reading the announcements? Lecture slides and coding examples will be posted

Index LICENSED PRODUCT NOT FOR RESALE

Understanding this structure is pretty straightforward, but nonetheless crucial to working with HTML, CSS, and JavaScript.

Walk through previous lectures

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

Chapter 13 XML: Extensible Markup Language

HTML MIS Konstantin Bauman. Department of MIS Fox School of Business Temple University

Django-CSP Documentation

Data Visualization (DSC 530/CIS )

SITEFINITY TRAINING. Sign In

ITP 342 Mobile App Development. APIs

Website Report for

RESTful Services. Distributed Enabling Platform

Data Visualization (CIS/DSC 468)

DATA STRUCTURE AND ALGORITHM USING PYTHON

The Structure of the Web. Jim and Matthew

Web Component JSON Response using AppInventor

CSCE 120: Learning To Code

XML: some structural principles

Web Technology for Test and Automation Applications

Website Development with HTML5, CSS and Bootstrap

Service Oriented Architectures (ENCS 691K Chapter 2)

Transcription:

CIS192 Python Programming HTTP Requests and HTML Parsing Raymond Yin University of Pennsylvania October 12, 2016 Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 1 / 22

Outline 1 HTTP Requests HTTP Requests 2 HTML Parsing HTML Beautiful Soup Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 2 / 22

What s HTTP? The protocol that directs how exactly we send resources back and forth on the web. A protocol is a set of rules that determines which messages can be exchanged, and which messages are appropriate replies. Two roles: server and client Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 3 / 22

The Internet, TL;DR Network of computers that communicate via Internet protocol (IP) Internet service providers (ISP) direct traffic IP addresses are computers mailing addresses A Uniform Resource Locator (URL) refers to an IP address Domain Name System (DNS) resolves URLs to IP addresses HyperText Transfer Protocol (HTTP) is a way to talk via IP Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 4 / 22

Types of Requests GET: retrieve a representation of the specified resource Should not modify the state of the server HEAD: a GET request but without the body (only the header) POST: Supply the resource with the content of the POST The resource is an entity that can process data The content of the POST is the data to be processed PUT: Store this data at the resource Defines what the contents of the URI should be A GET to the resource should return what was PUT DELETE: deletes the resource Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 5 / 22

Outline 1 HTTP Requests HTTP Requests 2 HTML Parsing HTML Beautiful Soup Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 6 / 22

Making a Request First you must install requests pip3 install requests r = requests.get(url) will make an HTTP GET request Returns a Response object requests.{head post put delete}(url) r.text is the body of the response r.headers is the header of the response (Extra details, can usually ignore them) Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 7 / 22

HTTP Response Codes r.status_code is the HTTP status code of the response 1xx: Informational. Not the actual response but not an error 2xx: Everything is good 3xx: Redirection. Need to make a new request 4xx: Client Error: Didn t ask right, not allowed, doesn t exist 5xx: Server Error: Might be your fault but probably not Requests handles 1xx and 3xx for you. Can see in r.history r.raise_for_status() will raise an error for 4xx or 5xx Prefer over: if r.status_code...: raise Exception Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 8 / 22

Hard to Remember... Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 9 / 22

Arguments to GET and POST Parameters to a GET request go in the URL s query string http://www.example.com/test?a1=v1&a2=v2 GETs from the test page with a1=v1 and a2=v2 requests.get( http:.../test, params=p) If p = { a1 :v1, a2 :v2} the above are the same POST request data can be passed as a dict r = requests.post(url, data=d) GET and POST also support a headers dict as a kwarg Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 10 / 22

APIs Application Programming Interface Specifies how software components should interact On the web, lots of services/websites that provide data in a structured way to analyze! Facebook Google (Maps, Calendar, YouTube...) Twitter Yelp Insert your favorite website/app here Usually represented in JSON JavaScript Object Notation A standard lightweight data format, language-agnostic Use json module to parse JSON strings Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 11 / 22

Outline 1 HTTP Requests HTTP Requests 2 HTML Parsing HTML Beautiful Soup Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 12 / 22

HyperText Markup Language HTML is a standardized way of specifying the contents of a page It s composed of elements (<tags>) with contents and attributes <tag attribute="val">content</tag> Tags are supposed to specify semantics not style <p>a paragraph</p> Semantic grouping of page <b>bold</b> Style of text. Better to use <strong> or CSS The tags form a tree with <html> at the root Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 13 / 22

HTML Example <html> <p> This is the <strong>first</strong> paragraph <p> Sub paragraph </p> </p> <p> This is the <strong>second</strong> paragraph </p> </html> Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 14 / 22

HTML Example html p p This is the (strong) paragraph (p) This is the (strong) paragraph strong p Sub paragraph strong first second Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 15 / 22

Outline 1 HTTP Requests HTTP Requests 2 HTML Parsing HTML Beautiful Soup Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 16 / 22

Goals of Beautiful Soup Make searching through HTML easy (Beautiful) Build the tree from the raw text Provided methods for moving around the tree Provide methods for finding sets of elements Handle poorly formatted HTML (Tag Soup) Historically browsers have been lenient with HTML Un-closed tags and badly nested tags are common <html><p>first<p>second</html> <strong><p></strong></p>?? Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 17 / 22

Using Beautiful Soup Installation: pip3 install beautifulsoup4 Importing: from bs4 import BeautifulSoup Create the tree from a string or file handle soup = BeautifulSoup(r.text) soup = BeautifulSoup(html_string) soup = BeautifulSoup(open( html_file, r )) soup.<tag> returns the first element with that tag soup.p returns the first paragraph If there are no <tag>s, returns None The object soup.<tag> returns has type: bs4.element.tag Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 18 / 22

Tag Objects A tag represents <tag attribute="val">content</tag> t.name is the value within <> (tag in this case) t[ attribute ] looks up attribute in a dictionary t[key] () t.attrs[key] t.text will give a string of all text in the subtree rooted at t t.string returns a NavigableString Only if t has exactly one child and that child is a non-empty string Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 19 / 22

NavigableString Objects NavigableStrings support all operations of regular strings (str) tag.string.split(, ) Additionally, it knows where it is in the tree. You can move to a parent or sister tag Details of moving around are basically the same as Tags Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 20 / 22

Moving Around t.<tag> gets the first matching element below t in the tree t.children is an iterator over an elements immediate children t.descendants is an iterator over all elements under t Pre-order traversal t.strings is an iterator over all navigable strings under t t.parent is the parent of t in the tree t.(next_/previous_)sibling move to adjacent nodes t.(next_/previous_)element generalizes to the next node in the pre-order traversal Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 21 / 22

Searching the Tree Can search by matching with the following filters: Literal strings Compiled regular expressions any string in a list a function that returns True for tags you want True matches everything t.find_all(filter) returns all descendants with names that match t.find(...) is like t.find_all(...) but only first match kwargs match attributes against filters t.find(text=filter) matches against the.text of a tag t.find_(parents/next_siblings/all_next/previous To use Python keywords, append an _ t.find(class_=filter) Raymond Yin (University of Pennsylvania) CIS 192 October 12, 2016 22 / 22