Network Programming in Python
Charles Severance www.dr-chuck.com
Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/. Copyright 2009, Charles Severance

What is Web Scraping?
When a program or script pretends to be a browser and retrieves web pages, looks at those pages, extracts information, and then looks at more web pages.
Search engines scrape web pages - we call this spidering the web or web crawling.
http://en.wikipedia.org/wiki/web_scraping
http://en.wikipedia.org/wiki/web_crawler
Why Scrape?
Pull data - particularly social data - who links to whom?
Get your own data back out of some system that has no export capability.
Monitor a site for new information.
Spider the web to build a database for a search engine.

Scraping Web Pages
There is some controversy about web page scraping, and some sites are a bit snippy about it. Google: facebook scraping block
Republishing copyrighted information is not allowed.
Violating terms of service is not allowed.
http://www.facebook.com/terms.php

HTML and HTTP in Python
Using urllib
Using urllib to retrieve web pages
http://docs.python.org/lib/module-urllib.html
You get the entire web page when you do f.read() - lines are separated by a newline character.
We can split the contents into lines using the split() function.
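The slides use the Python 2 urllib module; in Python 3 the same pattern lives in urllib.request, and read() returns bytes that must be decoded. A minimal sketch of the retrieve-and-read pattern (the fetch helper name is ours, and a data: URL is used below only so the call can be demonstrated without touching the network):

```python
import urllib.request

def fetch(url):
    """Retrieve a URL and return its contents as a string."""
    f = urllib.request.urlopen(url)        # urllib.urlopen(url) in Python 2
    contents = f.read().decode("utf-8", errors="replace")  # bytes -> str
    f.close()
    return contents

# A data: URL carries its own payload, so no network access is needed.
print(fetch("data:text/plain,Hello%20web"))
```

Against a real URL such as http://www.dr-chuck.com/ the returned string is the full HTML of the page, exactly as in the slides.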
Splitting the contents on the newline character gives us a nice list where each entry is a single line.
We can easily write a for loop to look through the lines.

>>> print len(contents)
95328
>>> lines = contents.split("\n")
>>> print len(lines)
2244
>>> print lines[3]
<style type="text/css">
>>>

for line in lines:
    # Do something for each line

A Simple Web Browser
returl.py

import urllib
url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)
contents = f.read()
print "Retrieved",len(contents),"characters"
f.close()
lines = contents.split("\n")
print "Retrieved",len(lines),"lines"

python returl.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104867 characters
Retrieved 2401 lines

python returl.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines
python returl.py
Enter a URL: www.dr-chuck.com
Retrieving: www.dr-chuck.com
Traceback (most recent call last):
  File "returl.py", line 6, in <module>
  2.6/lib/python2.6/urllib.py", line 87, in urlopen
  2.6/lib/python2.6/urllib.py", line 203, in open
  2.6/lib/python2.6/urllib.py", line 461, in open_file
  2.6/lib/python2.6/urllib.py", line 475, in open_local_file
IOError: [Errno 2] No such file or directory: 'www.dr-chuck.com'

Without the "http://" prefix, urllib assumes the string is the name of a local file - and no such file exists.

Parsing HTML
Counting Anchor Tags
We want to look through a web page and see how many lines have the string <a.
We don't care if there is more than one per line - we are looking for a rough number.
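The traceback above is Python 2 behavior, where urllib falls back to treating a scheme-less string as a local file name; Python 3's urllib.request rejects it earlier with a ValueError before any network access. A sketch of one defensive convention - prepending a scheme when the user leaves it off (the normalize helper is our own illustration, not part of the slides' program):

```python
import urllib.request

def normalize(url):
    """Prepend a scheme when the user leaves it off, e.g. 'www.dr-chuck.com'."""
    if "://" not in url:
        return "http://" + url
    return url

# In Python 3, a scheme-less URL fails fast with ValueError,
# not with the IOError shown in the Python 2 traceback.
try:
    urllib.request.urlopen("www.dr-chuck.com")
except ValueError as e:
    print("Rejected:", e)

print(normalize("www.dr-chuck.com"))   # http://www.dr-chuck.com
```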
<body>
<div id="header">
<h1><a href="index.htm">appenginelearn</a></h1>
<ul>
<li><a href="/ezlaunch.htm">ez-launch</a></li>
<li><a href=http://oreilly.com/catalog/9780596800697/ target=_new>Book</a></li>
<li><a href=http://www.dr-chuck.com/ target=_new>Author</a></li>
<li><a href=http://www.pythonlearn.com/>python</a></li>
<li><a href=http://code.google.com/appengine/ target=_new>App Engine</a></li>
</ul>
</div>

import urllib
url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)
contents = f.read()
print "Retrieved",len(contents),"characters"
f.close()
lines = contents.split("\n")
print "Retrieved",len(lines),"lines"
count = 0
for line in lines:
    if line.find("<a") >= 0 :
        count = count + 1
print count, "lines with <a tag"
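The counting loop can be exercised on a small in-memory page rather than a live site; a Python 3 sketch (the sample string is made-up test data modeled on the HTML above):

```python
# A tiny stand-in for the contents of a fetched page.
sample = """<ul>
<li><a href="/ezlaunch.htm">ez-launch</a></li>
<li>No link here</li>
<li><a href=http://www.pythonlearn.com/>python</a></li>
</ul>"""

count = 0
for line in sample.split("\n"):
    if line.find("<a") >= 0:      # rough count: at most one hit per line
        count = count + 1
print(count, "lines with <a tag")   # 2 lines with <a tag
```

As the slides note, a line with two anchor tags is still counted once - this is a rough estimate, not an exact tag count.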
python hrefs.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104739 characters
Retrieved 2405 lines
63 lines with <a tag

python hrefs.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines
33 lines with <a tag

Summary
Python can easily retrieve data from the web and use its powerful string parsing capabilities to sift through the information and make sense of it.
We can build a simple directed web-spider for our own purposes.
Make sure that we do not violate the terms and conditions of a web site, and make sure not to use copyrighted material improperly.
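A natural next step toward the simple spider the summary mentions is extracting the link targets themselves, not just counting lines. A sketch using the re module (the pattern is a rough illustration that handles both quoted and unquoted href values, as in the sample HTML earlier - it is not a full HTML parser):

```python
import re

# A made-up fragment with one quoted and one unquoted href.
page = '<a href="index.htm">home</a> <a href=http://www.dr-chuck.com/ target=_new>Author</a>'

# Capture each href value up to the closing quote, space, or '>'.
links = re.findall(r'href=["\']?([^"\'> ]+)', page)
print(links)   # ['index.htm', 'http://www.dr-chuck.com/']
```

A spider would add each extracted URL to a list of pages still to retrieve - while respecting the terms of service noted above.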