Web Clients and Crawlers

1 Web Clients
    alternatives to web browsers
    opening a web page and copying its content
2 Scanning Files
    looking for strings between double quotes
    parsing URLs for the server location
3 Web Crawlers
    making requests recursively
    incremental development, modular design of code
4 Parsing HTML
    the HTMLParser module
    overriding methods in HTMLParser

MCS 507 Lecture 21
Mathematical, Statistical and Scientific Software
Jan Verschelde, 14 October 2013

Scientific Software (MCS 507 L-21) web clients and crawlers 14 October 2013

Web Clients

Usually we download files with a web browser, but we can also browse the web using scripts. Why do we want to do this?

1 more efficient: no overhead from a GUI
2 in control: request only what we need, update the most recent information
3 crawl the web: search recursively, operate like a search engine
4 the SymPy HTML documentation is interlinked: search the documentation files offline

web pages as files

Opening a web page with urllib.urlopen:

    from urllib import urlopen
    < object like file > = urlopen( < URL > )
    data = < object like file >.read( < size > )
    < object like file >.close()

We process web pages like we handle files.
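The slides use Python 2, where urlopen lives in the urllib module. As a hedged sketch of the same open, read, close pattern in Python 3 (where the function is urllib.request.urlopen and read returns bytes), the code below reads a page in chunks of at most 80 bytes. The helper name read_in_chunks and the temporary index.html are assumptions made for illustration; a file:// URL keeps the example offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_in_chunks(url, size=80):
    """Return the content at url as a list of byte chunks of at most size bytes."""
    chunks = []
    with urlopen(url) as page:      # the response object behaves like a file
        while True:
            data = page.read(size)
            if not data:            # empty bytes object: end of the page
                break
            chunks.append(data)
    return chunks


# Demonstration on a hypothetical local file, addressed through a file:// URL.
CONTENT = b"<html>" + b"x" * 100 + b"</html>"
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "index.html"
    path.write_bytes(CONTENT)
    CHUNKS = read_in_chunks(path.as_uri())

print(len(CHUNKS), "chunks,", sum(len(c) for c in CHUNKS), "bytes")
```

The with statement closes the page even if reading fails, which replaces the explicit close() call of the slide.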

opening a web page

The SymPy HTML documentation comes with the distribution of the sympy package.

    from urllib import urlopen
    URL = 'file:///users/jan/courses/mcs507/' \
        + 'sympy docs-html/index.html'
    PAGE = urlopen(URL)
    DATA = PAGE.read(80)
    print '80 characters :', DATA
    LINES = DATA.split('\n')
    print 'the lines :', LINES
    PAGE.close()

function to copy a web page to a file

    def copypage(url, filename):
        """
        Given the URL for the web page,
        a copy of its contents is written to filename.
        Both url and filename are strings.
        """
        import urllib
        copyfile = open(filename, 'w')
        webpage = urllib.urlopen(url)
        while True:
            data = webpage.read(80)
            print data
            if data == '':
                break
            copyfile.write(data)
        webpage.close()
        copyfile.close()

urlretrieve

With urllib.urlretrieve we copy a web page to file.

    >>> u = 'file:///users/jan/courses/mcs507/' \
    ... + 'sympy docs-html/index.html'
    >>> from urllib import urlretrieve
    >>> urlretrieve(u, 'mycopy.html')
    ('mycopy.html', <mimetools.Message instance at 0x1011cb878>)
    >>> f = open('mycopy.html', 'r')
    >>> d = f.readlines()
    >>> print d[:4]
    ['\r\n', '\r\n', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\r\n',
     '" ansitional.dtd">\r\n']
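In Python 3 the same function is available as urllib.request.urlretrieve, which the Python documentation lists among the legacy interfaces. The sketch below, under the assumption of a small temporary page (the file names index.html and mycopy.html are only illustrative), copies the page and reads the copy back.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "index.html"       # hypothetical local page
    source.write_text("<html>\n<body>hello</body>\n</html>\n")
    target = Path(folder) / "mycopy.html"
    # urlretrieve returns the name of the copy and the headers of the response
    name, headers = urlretrieve(source.as_uri(), str(target))
    LINES = target.read_text().splitlines()

print(LINES[:2])
```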

scanning HTML files

Applications to scan an HTML file:
1 search for particular information,
2 navigate to where the page refers to.

Example (1): download all .py files from jan/mcs507/main.html
Example (2): retrieve all URLs the page refers to.

What is common between these two examples: .py files and URLs appear between " and ".
We scan for all strings between double quotes.

scanning for double quoted strings

Input: a file, or object like a file.
Output: list of all strings between double quotes.

Recall that we read files with a fixed size buffer:

    ..."....".."."...."..".".."...

For double quoted strings which run across two buffers we need another buffer.

Two buffers: one for reading strings from file, one for buffering a double quoted string.

Two functions:
1 read buffered data from file,
2 scan the data buffer for double quoted strings.

reading strings from file

    def quoted_strings(page):
        """
        Given a file page, this function scans the page and returns
        a list of all strings on the file enclosed between double quotes.
        """
        result = []
        buff = ''
        while True:
            data = page.read(80)
            if data == '':
                break
            (result, buff) = update_qstrings(result, \
                buff, data)
        return result

processing the buffers

In L we store the double quoted strings. In b we buffer the double quoted strings.

[Trace of L and b while the data buffers are processed: L grows from the empty list to a list of four strings, b holds the part of a double quoted string read so far and is empty again at the end.]

    def update_qstrings(quoted, buff, sdat):
        """
        quoted is a list of double quoted strings,
        buff buffers a double quoted string, and
        sdat is the data string to be processed.
        Returns an update of (quoted, buff).
        """

code for update_qstrings

    def update_qstrings(quoted, buff, sdat):
        "..."
        newb = buff
        for i in range(0, len(sdat)):
            if newb == '':
                if sdat[i] == '\"':
                    newb = 'o'  # 'o' is for 'opened'
            else:
                if sdat[i] != '\"':
                    newb += sdat[i]
                else:  # do not store 'o'
                    quoted.append(newb[1:len(newb)])
                    newb = ''
        return (quoted, newb)
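To see that the second buffer indeed carries a quoted string across two reads, the following Python 3 sketch runs the two functions on an in-memory file with a deliberately small read size. It is a variant, not the code of the slides: instead of the marker 'o' it uses None for "no quote opened", and the size parameter and the sample text are assumptions added for the demonstration.

```python
import io


def update_qstrings(quoted, buff, sdat):
    """Scan sdat for double quoted strings. buff is None outside a quoted
    string, otherwise it holds the characters read since the opening quote."""
    for char in sdat:
        if buff is None:
            if char == '"':
                buff = ''            # opening quote: start buffering
        elif char == '"':
            quoted.append(buff)      # closing quote: store the string
            buff = None
        else:
            buff += char
    return quoted, buff


def quoted_strings(page, size=80):
    """Return all double quoted strings on the file-like object page."""
    result, buff = [], None
    while True:
        data = page.read(size)
        if data == '':
            break
        result, buff = update_qstrings(result, buff, data)
    return result


TEXT = 'a "first" and "second string" and "" done'
print(quoted_strings(io.StringIO(TEXT), size=5))
```

With a read size of 5 every quoted string in TEXT spans at least two reads, yet the result is the same as for one large read.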

the function main()

    def main():
        """
        Prompts the user for a file name and
        scans the file for double quoted strings.
        """
        print 'getting double quoted strings'
        name = raw_input('Give file name : ')
        datafile = open(name, 'r')
        qstr = quoted_strings(datafile)
        print qstr
        datafile.close()

    if __name__ == "__main__":
        main()

scanning web pages for URLs

Recall the second example application: list all URLs referred to at < URL >

    def main():
        """
        Prompts the user for a web page,
        and prints all URLs this page refers to.
        """
        print 'listing reachable locations'
        page = raw_input('Give URL : ')
        links = http_links(page)
        print 'found %d HTTP links' % len(links)
        show_locations(links)

filtering a double quoted string

    from scanquotes import update_qstrings

    def http_filter(names):
        """
        Returns from the list names only those
        strings which begin with http.
        """
        result = []
        for name in names:
            if len(name) > 4:
                if name[0:4] == 'http':
                    result.append(name)
        return result

scanning an HTML file for HTTP strings

    def http_links(url):
        """
        Given the URL for the web page,
        returns the list of all HTTP strings.
        """
        import urllib
        page = urllib.urlopen(url)
        result = []
        buff = ''
        while True:
            data = page.read(80)
            if data == '':
                break
            (result, buff) = update_qstrings(result, \
                buff, data)
        result = http_filter(result)
        page.close()
        return result

showing only server locations using the module urlparse

A URL consists of 6 parts:

    protocol://location/path;parameters?query#frag

Given URL u, urlparse.urlparse(u) returns a 6-tuple.

    def show_locations(locs):
        """
        Shows the locations of the URLs in locs.
        """
        from urlparse import urlparse
        for location in locs:
            tupl = urlparse(location)
            print tupl[1]
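As an illustration of the six parts, here is a short Python 3 sketch; in Python 3 the function moved to urllib.parse.urlparse. The sample URL is made up for the example, and the attribute names on the result are those of Python's standard library.

```python
from urllib.parse import urlparse

# A made-up URL that uses all six parts.
PARTS = urlparse('http://www.example.org:8080/docs/index.html;v=1?q=sympy#top')

print(PARTS.scheme)     # the protocol
print(PARTS.netloc)     # the location, the same as PARTS[1]
print(PARTS.path)
print(PARTS.params)
print(PARTS.query)
print(PARTS.fragment)
```

The result is still a 6-tuple, so the index 1 used in show_locations selects the same server location as the attribute netloc.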

web crawlers

Scanning HTML files and browsing:
1 given a URL, open a web page,
2 compute the list of all URLs in the page,
3 for all URLs in the list do:
    1 open the web page defined by location of URL,
    2 compute the list of all URLs on that page.
Continue recursively, crawling the web.

Things to consider:
1 remove duplicates from list of URLs,
2 do not turn back to pages visited before,
3 limit the levels of recursion,
4 some links will not work.

This is similar to finding a path in a maze.
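The steps and the four things to consider can be tried out without any network access. The Python 3 sketch below crawls a tiny in-memory "web", a dictionary that maps each page to the links on it. The dictionary WEB, the function names crawl and links_of, and the page names are all hypothetical; only the recursion scheme follows the list above.

```python
def crawl(page, depth, links_of, visited=None):
    """Return the list of pages found by following links from page,
    where depth limits the levels of recursion."""
    if visited is None:
        visited = []
    new = []
    for link in links_of(page):
        # remove duplicates and do not turn back to pages seen before
        if link not in visited and link not in new:
            new.append(link)
    visited = visited + new
    if depth > 0:
        for link in new:
            visited = crawl(link, depth - 1, links_of, visited)
    return visited


# A hypothetical web of four pages; the link on 'd' does not work.
WEB = {'a': ['b', 'c', 'b'], 'b': ['a', 'd'], 'c': ['d'], 'd': ['missing']}


def links_of(page):
    """Links on a page; a page that cannot be opened has no links."""
    return WEB.get(page, [])


print(crawl('a', 0, links_of))
print(crawl('a', 1, links_of))
```

Passing links_of as an argument keeps the recursion separate from the way links are obtained, so the same crawl function could later be given a function that scans real pages.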

modular design of web crawler

    # L-21 MCS 507 Mon 14 Oct 2013 : webcrawler.py
    """
    Prompts the user for a URL and the maximal depth
    of the recursion tree. Lists all locations of web
    servers that can be reached starting from the user given URL.
    """
    from scanhttplinks import http_links

Still left to write:
1 management of list of server locations,
2 recursive function to crawl the web.

retain only new locations

    def new_locations(urls, vis):
        """
        Given the list urls of new URLs and the list of
        already visited locations in vis, returns the list
        of new locations, locations not yet visited earlier.
        """
        from urlparse import urlparse
        result = []
        for name in urls:
            tup = urlparse(name)
            loc = tup[1]
            if not loc in result:
                if not loc in vis:
                    result.append(loc)
        return result

some links will not work...

Add an exception handler:

    def http_links(url):
        """
        Given the URL for the web page,
        returns the list of all HTTP strings.
        """
        import urllib
        try:
            print 'opening ' + url + ' ...'
            page = urllib.urlopen(url)
        except IOError:
            print 'opening ' + url + ' failed'
            return []
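A comparable handler in Python 3 can be sketched as follows. There urlopen raises urllib.error.URLError when a page cannot be opened, which is a subclass of OSError (IOError is an alias of OSError in Python 3), and it raises ValueError for a string that is not a URL at all. The function name open_page and the two sample arguments are assumptions for illustration.

```python
from urllib.request import urlopen


def open_page(url):
    """Return the bytes at url, or None if the page cannot be opened."""
    try:
        with urlopen(url) as page:
            return page.read()
    except (OSError, ValueError) as error:
        # URLError is an OSError; a malformed URL gives a ValueError
        print('opening', url, 'failed:', error)
        return None


# Both calls fail without stopping the script.
MISSING = open_page('file:///no/such/folder/page.html')
MALFORMED = open_page('not a url')
```

Returning None (or an empty list, as on the slide) lets the crawler continue with the next link.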

unparsing URLs

Recall that we only store the server locations. To open a web page we also need to specify the protocol. We apply urlparse.urlunparse:

    >>> from urlparse import urlunparse
    >>> urlunparse(('http', < location >, '', '', '', ''))

We must provide a 6-tuple as argument...
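A round trip makes the role of the 6-tuple concrete. The sketch below is written for Python 3, where both functions live in urllib.parse; the host www.example.org is a placeholder chosen for the example.

```python
from urllib.parse import urlparse, urlunparse

# Keep only the server location of a URL ...
LOCATION = urlparse('http://www.example.org/docs/index.html')[1]

# ... and rebuild a URL from it: protocol, location, and four empty parts.
URL = urlunparse(('http', LOCATION, '', '', '', ''))
print(URL)
```

The path, parameters, query, and fragment are lost when only the location is stored, so the rebuilt URL points at the top page of the server.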

running the crawler

    $ python webcrawler.py
    crawling the web...
    Give URL :
    give maximal depth : 2
    opening
    opening
    opening
    opening
    opening
    opening
    opening
    opening
    opening
    .. it takes a while ..
    total #locations : 538

the function main()

    def main():
        """
        Prompts the user for a web page,
        and prints all URLs this page refers to.
        """
        print 'crawling the web...'
        page = raw_input('Give URL : ')
        depth = input('give maximal depth : ')
        locs = crawler(page, depth, [])
        print 'reachable locations :', locs
        print 'total #locations :', len(locs)

    if __name__ == "__main__":
        main()

code for the crawler

    def crawler(url, kst, locs):
        """
        Returns the list locs updated with the list of locations
        reachable from the given url using kst steps.
        """
        from urlparse import urlunparse
        links = http_links(url)
        newlocs = new_locations(links, locs)
        result = locs + newlocs
        if kst == 0:
            return result
        else:
            for loc in newlocs:
                name = urlunparse(('http', loc, \
                    '', '', '', ''))
                result = crawler(name, kst-1, result)
            return result

the HTMLParser module to parse html code

In the standard Python distribution:

    >>> import HTMLParser
    >>> help(HTMLParser)

The class HTMLParser
    allows us to override the handlers of tags,
    provides a feed method; the feed method handles buffering.

using HTMLParser

    from HTMLParser import HTMLParser
    from urllib import urlopen

    class OurHTMLParser(HTMLParser):
        def __init__(self):
        def handle_starttag(self, tag, attrs):
        def handle_endtag(self, tag):

    def main():
        page = urlopen(url)
        pars = OurHTMLParser()
        while True:
            data = page.read(80)
            if data == '':
                break
            pars.feed(data)
        pars.close()

tallying the tags using the class HTMLParser

Gather basic statistics about a page:
1 what types of tags are used,
2 count number of occurrences for each tag.

At the end of each tag the tally is updated.

Data structure for the tally: dictionary.
    keys: string with type of tag
    values: natural number, counts #occurrences

The tally is an object data attribute.

the class Tagtally

    from HTMLParser import HTMLParser
    from urllib import urlopen

    class Tagtally(HTMLParser):
        """
        Makes a tally of ending tags.
        """
        def __init__(self):
            """
            Initializes the dictionary of tags.
            """
        def handle_endtag(self, tag):
            """
            Maintains a tally of the tags.
            """
        def show_tally(self):
            """
            Prints the tally to screen.
            """

the function main()

    def main():
        """
        Opens a web page and parses it.
        """
        url = < URL >
        print 'opening %s ...' % url
        page = urlopen(url)
        tags = Tagtally()
        while True:
            data = page.read(80)
            if data == '':
                break
            tags.feed(data)
        tags.close()
        print 'the tally of tags :'
        tags.show_tally()

constructor and show_tally

If overriding __init__, we must initialize the parent class.

        def __init__(self):
            """
            Initializes the dictionary of tags.
            """
            HTMLParser.__init__(self)
            self.tally = {}

        def show_tally(self):
            """
            Prints the tally to screen.
            """
            for each in self.tally:
                print each, ':', self.tally[each]

update of the tally

updating a dictionary

First check if there is already a tag...

        def handle_endtag(self, tag):
            """
            Maintains a tally of the tags.
            """
            if self.tally.has_key(tag):
                self.tally[tag] \
                    = self.tally[tag] + 1
            else:
                self.tally.update({tag:1})

If no tag is present, update the dictionary.
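For readers on Python 3, where the class is html.parser.HTMLParser and dictionaries no longer have has_key, the tally can be sketched as below. This is an adaptation, not the code of the slides: dict.get with a default of 0 replaces the explicit check, and the sample page and the chunk size of 8 are made up to show that feed copes with tags split across chunks.

```python
from html.parser import HTMLParser


class TagTally(HTMLParser):
    """Makes a tally of ending tags."""

    def __init__(self):
        super().__init__()       # initialize the parent class first
        self.tally = {}

    def handle_endtag(self, tag):
        # get returns 0 for a tag that is not yet in the dictionary
        self.tally[tag] = self.tally.get(tag, 0) + 1


PAGE = '<html><body><p>one</p><p>two</p><a href="x.html">link</a></body></html>'
TAGS = TagTally()
for start in range(0, len(PAGE), 8):    # feed small chunks, as read(80) would
    TAGS.feed(PAGE[start:start + 8])
TAGS.close()
print(TAGS.tally)
```

Several tags in PAGE are cut in two by the chunk boundaries; the parser keeps the incomplete part until the next call of feed.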

Summary + Exercises

Python scripts can read web pages as files.

1 Write a script to download all .py files from jan/mcs507/main.html
2 Limit the search of the crawler so that it only opens pages within the same domain. For example, if we start at a location ending with edu, we only open pages with locations ending with edu.
3 Adjust webcrawler.py to search for a path between two locations. The user is prompted for two URLs. Crawling stops if a path has been found.

more exercises

4 Use HTMLParser to write a shorter version of the web crawler in the script webcrawler.py.

The fourth homework is due on Friday 18 October, 9AM:
    exercises 3 and 6 of Lecture 13;
    exercises 2 and 4 of Lecture 14;
    exercises 1 and 3 of Lecture 15;
    exercises 2 and 5 of Lecture 16;
    exercises 3 and 4 of Lecture 17.


More information

Vebra Search Integration Guide

Vebra Search Integration Guide Guide Introduction... 2 Requirements... 2 How a Vebra search is added to your site... 2 Integration Guide... 3 HTML Wrappers... 4 Page HEAD Content... 4 CSS Styling... 4 BODY tag CSS... 5 DIV#s-container

More information

Multithreaded Servers

Multithreaded Servers Multithreaded Servers 1 Serving Multiple Clients avoid to block clients with waiting using sockets and threads 2 Waiting for Data from 3 Clients running a simple multithreaded server code for client and

More information

Graphical User Interfaces

Graphical User Interfaces Graphical User Interfaces 1 User Interfaces GUIs in Python with Tkinter object oriented GUI programming 2 Mixing Colors specification of the GUI the widget Scale 3 Simulating a Bouncing Ball layout of

More information

Outline. tallying the votes global and local variables call by value or call by reference. of variable length using keywords for optional arguments

Outline. tallying the votes global and local variables call by value or call by reference. of variable length using keywords for optional arguments Outline 1 Histograms tallying the votes global and local variables call by value or call by reference 2 Arguments of Functions of variable length using keywords for optional arguments 3 Functions using

More information

STAXDoc User's Guide. Contents. Introduction. Requirements. STAX Documentation Generator (STAXDoc) User's Guide Version

STAXDoc User's Guide. Contents. Introduction. Requirements. STAX Documentation Generator (STAXDoc) User's Guide Version STAXDoc User's Guide STAX Documentation Generator (STAXDoc) User's Guide Version 1.0.4 February 26, 2008 Contents Introduction Requirements Syntax Source Files STAX Xml Files Package Comment Files Overview

More information

HTML Processing in Python. 3. Construction of the HTML parse tree for further use and traversal.

HTML Processing in Python. 3. Construction of the HTML parse tree for further use and traversal. .. DATA 301 Introduction to Data Science Alexander Dekhtyar.. HTML Processing in Python Overview HTML processing in Python consists of three steps: 1. Download of the HTML file from the World Wide Web.

More information

Data Abstraction. UW CSE 160 Spring 2015

Data Abstraction. UW CSE 160 Spring 2015 Data Abstraction UW CSE 160 Spring 2015 1 What is a program? What is a program? A sequence of instructions to achieve some particular purpose What is a library? A collection of functions that are helpful

More information

Advanced CGI Scripts. personalized web browsing using cookies: count number of visits. secure hash algorithm using cookies for login data the scripts

Advanced CGI Scripts. personalized web browsing using cookies: count number of visits. secure hash algorithm using cookies for login data the scripts Advanced CGI Scripts 1 Cookies personalized web browsing using cookies: count number of visits 2 Password Encryption secure hash algorithm using cookies for login data the scripts 3 Authentication via

More information

Branching and Enumeration

Branching and Enumeration Branching and Enumeration 1 Booleans and Branching computing logical expressions computing truth tables with Sage if, else, and elif 2 Timing Python Code try-except costs more than if-else 3 Recursive

More information

WWW. HTTP, Ajax, APIs, REST

WWW. HTTP, Ajax, APIs, REST WWW HTTP, Ajax, APIs, REST HTTP Hypertext Transfer Protocol Request Web Client HTTP Server WSGI Response Connectionless Media Independent Stateless Python Web Application WSGI : Web Server Gateway Interface

More information

Web scraping and social media scraping crawling

Web scraping and social media scraping crawling Web scraping and social media scraping crawling Jacek Lewkowicz, Dorota Celińska University of Warsaw March 21, 2018 What will we be working on today? We should already known how to gather data from a

More information

File I/O, Benford s Law, and sets

File I/O, Benford s Law, and sets File I/O, Benford s Law, and sets Matt Valeriote 11 February 2019 Benford s law Benford s law describes the (surprising) distribution of first digits of many different sets of numbers. Read it about it

More information

Random Walks & Cellular Automata

Random Walks & Cellular Automata Random Walks & Cellular Automata 1 Particle Movements basic version of the simulation vectorized implementation 2 Cellular Automata pictures of matrices an animation of matrix plots the game of life of

More information

Problem Set 4 Due: Start of class, October 5

Problem Set 4 Due: Start of class, October 5 CS242 Computer Networks Handout # 8 Randy Shull September 28, 2017 Wellesley College Problem Set 4 Due: Start of class, October 5 Reading: Kurose & Ross, Sections 2.5, 2.6, 2.7, 2.8 Problem 1 [4]: Free-riding

More information

get set up for today s workshop

get set up for today s workshop get set up for today s workshop Please open the following in Firefox: 1. Poll: bit.ly/iuwim25 Take a brief poll before we get started 2. Python: www.pythonanywhere.com Create a free account Click on Account

More information

c122jan2714.notebook January 27, 2014

c122jan2714.notebook January 27, 2014 Internet Developer 1 Start here! 2 3 Right click on screen and select View page source if you are in Firefox tells the browser you are using html. Next we have the tag and at the

More information

HOWTO Fetch Internet Resources Using The urllib Package Release 3.5.1

HOWTO Fetch Internet Resources Using The urllib Package Release 3.5.1 HOWTO Fetch Internet Resources Using The urllib Package Release 3.5.1 Guido van Rossum and the Python development team February 24, 2016 Python Software Foundation Email: docs@python.org Contents 1 Introduction

More information

Graphical User Interfaces

Graphical User Interfaces to visualize Graphical User Interfaces 1 2 to visualize MCS 507 Lecture 12 Mathematical, Statistical and Scientific Software Jan Verschelde, 19 September 2011 Graphical User Interfaces to visualize 1 2

More information

Seshat Documentation. Release Joshua P Ashby

Seshat Documentation. Release Joshua P Ashby Seshat Documentation Release 1.0.0 Joshua P Ashby Apr 05, 2017 Contents 1 A Few Minor Warnings 3 2 Quick Start 5 2.1 Contributing............................................... 5 2.2 Doc Contents...............................................

More information

PYTHON CONTENT NOTE: Almost every task is explained with an example

PYTHON CONTENT NOTE: Almost every task is explained with an example PYTHON CONTENT NOTE: Almost every task is explained with an example Introduction: 1. What is a script and program? 2. Difference between scripting and programming languages? 3. What is Python? 4. Characteristics

More information

Web Scraping. HTTP and Requests

Web Scraping. HTTP and Requests 1 Web Scraping Lab Objective: Web Scraping is the process of gathering data from websites on the internet. Since almost everything rendered by an internet browser as a web page uses HTML, the rst step

More information

making connections general transit feed specification stop names and stop times storing the connections in a dictionary

making connections general transit feed specification stop names and stop times storing the connections in a dictionary making connections 1 CTA Tables general transit feed specification stop names and stop times storing the connections in a dictionary 2 CTA Schedules finding connections between stops sparse matrices in

More information

Outline. general information policies for the final exam

Outline. general information policies for the final exam Outline 1 final exam on Tuesday 5 May 2015, at 8AM, in BSB 337 general information policies for the final exam 2 some example questions strings, lists, dictionaries scope of variables in functions working

More information

High Level Parallel Processing

High Level Parallel Processing High Level Parallel Processing 1 GPU computing with Maple enabling CUDA in Maple 15 stochastic processes and Markov chains 2 Multiprocessing in Python scripting in computational science the multiprocessing

More information

This course is designed for web developers that want to learn HTML5, CSS3, JavaScript and jquery.

This course is designed for web developers that want to learn HTML5, CSS3, JavaScript and jquery. HTML5/CSS3/JavaScript Programming Course Summary Description This class is designed for students that have experience with basic HTML concepts that wish to learn about HTML Version 5, Cascading Style Sheets

More information

Intermediate Python 3.x

Intermediate Python 3.x Intermediate Python 3.x This 4 day course picks up where Introduction to Python 3 leaves off, covering some topics in more detail, and adding many new ones, with a focus on enterprise development. This

More information

examples from first year calculus (continued), file I/O, Benford s Law

examples from first year calculus (continued), file I/O, Benford s Law examples from first year calculus (continued), file I/O, Benford s Law Matt Valeriote 5 February 2018 Grid and Bisection methods to find a root Assume that f (x) is a continuous function on the real numbers.

More information

CSCE 463/612: Networks and Distributed Processing Homework 1 Part 2 (25 pts) Due date: 1/29/19

CSCE 463/612: Networks and Distributed Processing Homework 1 Part 2 (25 pts) Due date: 1/29/19 CSCE 463/612: Networks and Distributed Processing Homework 1 Part 2 (25 pts) Due date: 1/29/19 1. Problem Description You will now expand into downloading multiple URLs from an input file using a single

More information

Homework. Announcements. Before We Start. Languages. Plan for today. Chomsky Normal Form. Final Exam Dates have been announced

Homework. Announcements. Before We Start. Languages. Plan for today. Chomsky Normal Form. Final Exam Dates have been announced Homework Homework #3 returned Homework #4 due today Homework #5 Pg 169 -- Exercise 4 Pg 183 -- Exercise 4c,e,i Pg 184 -- Exercise 10 Pg 184 -- Exercise 12 Pg 185 -- Exercise 17 Due 10 / 17 Announcements

More information

1 A file-based desktop search engine: FDSEE

1 A file-based desktop search engine: FDSEE A. V. Gerbessiotis CS 345 Fall 2015 HW 5 Sep 8, 2015 66 points CS 345: Homework 5 (Due: Nov 09, 2015) Rules. Teams of no more than three students as explained in syllabus (Handout 1). This is to be handed

More information

CME 193: Introduction to Scientific Python Lecture 4: Strings and File I/O

CME 193: Introduction to Scientific Python Lecture 4: Strings and File I/O CME 193: Introduction to Scientific Python Lecture 4: Strings and File I/O Nolan Skochdopole stanford.edu/class/cme193 4: Strings and File I/O 4-1 Contents Strings File I/O Classes Exercises 4: Strings

More information

Lecture 5. Defining Functions

Lecture 5. Defining Functions Lecture 5 Defining Functions Announcements for this Lecture Last Call Quiz: About the Course Take it by tomorrow Also remember the survey Readings Sections 3.5 3.3 today Also 6.-6.4 See online readings

More information

from Recursion to Iteration

from Recursion to Iteration from Recursion to Iteration 1 Quicksort Revisited using arrays partitioning arrays via scan and swap recursive quicksort on arrays 2 converting recursion into iteration an iterative version with a stack

More information

Introduction to Python

Introduction to Python Introduction to Python NaLette Brodnax The Institute for Quantitative Social Science Harvard University January 24, 2018 what we ll cover first 0 Python installation one-minute poll 1 introduction what

More information

Week 8: HyperText Transfer Protocol - Clients - HTML. Johan Bollen Old Dominion University Department of Computer Science

Week 8: HyperText Transfer Protocol - Clients - HTML. Johan Bollen Old Dominion University Department of Computer Science Week 8: HyperText Transfer Protocol - Clients - HTML Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen October 23, 2003 Page 1 MIDTERM

More information

PTN-202: Advanced Python Programming Course Description. Course Outline

PTN-202: Advanced Python Programming Course Description. Course Outline PTN-202: Advanced Python Programming Course Description This 4-day course picks up where Python I leaves off, covering some topics in more detail, and adding many new ones, with a focus on enterprise development.

More information

CS61A Lecture 15 Object Oriented Programming, Mutable Data Structures. Jom Magrotker UC Berkeley EECS July 12, 2012

CS61A Lecture 15 Object Oriented Programming, Mutable Data Structures. Jom Magrotker UC Berkeley EECS July 12, 2012 CS61A Lecture 15 Object Oriented Programming, Mutable Data Structures Jom Magrotker UC Berkeley EECS July 12, 2012 COMPUTER SCIENCE IN THE NEWS http://www.iospress.nl/ios_news/music to my eyes device converting

More information

Introduction to MATLAB

Introduction to MATLAB Introduction to MATLAB The Desktop When you start MATLAB, the desktop appears, containing tools (graphical user interfaces) for managing files, variables, and applications associated with MATLAB. The following

More information

CS1 Lecture 2 Jan. 16, 2019

CS1 Lecture 2 Jan. 16, 2019 CS1 Lecture 2 Jan. 16, 2019 Contacting me/tas by email You may send questions/comments to me/tas by email. For discussion section issues, sent to TA and me For homework or other issues send to me (your

More information

Scraping I: Introduction to BeautifulSoup

Scraping I: Introduction to BeautifulSoup 5 Web Scraping I: Introduction to BeautifulSoup Lab Objective: Web Scraping is the process of gathering data from websites on the internet. Since almost everything rendered by an internet browser as a

More information

Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL

Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL Chapter 9 Your First Multi-Page Scrape Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL In this tutorial, we will pick up from the detailed example from

More information

Lecture 4: Data Collection and Munging

Lecture 4: Data Collection and Munging Lecture 4: Data Collection and Munging Instructor: Outline 1 Data Collection and Scraping 2 Web Scraping basics In-Class Quizzes URL: http://m.socrative.com/ Room Name: 4f2bb99e Data Collection What you

More information

Custom Actions for argparse Documentation

Custom Actions for argparse Documentation Custom Actions for argparse Documentation Release 0.4 Hai Vu October 26, 2015 Contents 1 Introduction 1 2 Information 3 2.1 Folder Actions.............................................. 3 2.2 IP Actions................................................

More information

Babu Madhav Institute of Information Technology, UTU 2015

Babu Madhav Institute of Information Technology, UTU 2015 Five years Integrated M.Sc.(IT)(Semester 5) Question Bank 060010502:Programming in Python Unit-1:Introduction To Python Q-1 Answer the following Questions in short. 1. Which operator is used for slicing?

More information

Week - 01 Lecture - 04 Downloading and installing Python

Week - 01 Lecture - 04 Downloading and installing Python Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 01 Lecture - 04 Downloading and

More information

Spring 2008 June 2, 2008 Section Solution: Python

Spring 2008 June 2, 2008 Section Solution: Python CS107 Handout 39S Spring 2008 June 2, 2008 Section Solution: Python Solution 1: Jane Austen s Favorite Word Project Gutenberg is an open-source effort intended to legally distribute electronic copies of

More information

picrawler Documentation

picrawler Documentation picrawler Documentation Release 0.1.1 Ikuya Yamada October 07, 2013 CONTENTS 1 Installation 3 2 Getting Started 5 2.1 PiCloud Setup.............................................. 5 2.2 Basic Usage...............................................

More information

Large-Scale Networks

Large-Scale Networks Large-Scale Networks 3b Python for large-scale networks Dr Vincent Gramoli Senior lecturer School of Information Technologies The University of Sydney Page 1 Introduction Why Python? What to do with Python?

More information

Notes General. IS 651: Distributed Systems 1

Notes General. IS 651: Distributed Systems 1 Notes General Discussion 1 and homework 1 are now graded. Grading is final one week after the deadline. Contract me before that if you find problem and want regrading. Minor syllabus change Moved chapter

More information

Introduction to Python

Introduction to Python Introduction to Python Version 1.1.5 (12/29/2008) [CG] Page 1 of 243 Introduction...6 About Python...7 The Python Interpreter...9 Exercises...11 Python Compilation...12 Python Scripts in Linux/Unix & Windows...14

More information

(b) If a heap has n elements, what s the height of the tree?

(b) If a heap has n elements, what s the height of the tree? CISC 5835 Algorithms for Big Data Fall, 2018 Homework Assignment #4 1 Answer the following questions about binary tree and heap: (a) For a complete binary tree of height 4 (we consider a tree with just

More information