CS 111: Program Desig I Lecture 18: Web ad gettig text from it Robert H. Sloa & Richard Warer Uiversity of Illiois at Chicago October 25, 2016
Goals Lear about Iteret ad how to access it directly from a program Ad use this as fial piece to fiish program
Grace Hopper Celebratio of Wome i Computig 2016 Where I was with ~25 UIC wome CS studets last week With 15,000 other people Heard from Lataya Sweeey, Harvard Prof. ad formerly Chief Techologist of Federal Trade Commissio about our kid of issues, ad from Megha Smith, CTO of the US Ad from a former CS 111 studet about crawlig the web
Our aluma She took CS 111 i 2011 or 2012 Now works for Google Chicago She told me a full Google web crawl reuires 500 Petabytes to store Did't cover those at start of class because 1 Petabyte = 1000 Terabytes! (So petabytes do't come up much i cosumer use!) Ad that for jobs i YouTube part of Google key skill i Pytho
Wome i this CS 111 Keep your eyes ope i Sprig for your chace to atted Grace Hopper 2017
Thursday No Class: Atted Distiguished Lecture: Presidet Maria Klawe of Harvey Mudd College, oe of exceedigly few computer scietists who are college/ uiversity Presidets 605 SCE, 11am Gettig more wome ito tech careers ad why it matters
Opeig a URL ad readig it The 60 secod "stadig o oe foot" versio so you ca do the crawler homework import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") cotet = coectio.read() coectio.close() (We'll come back to this i 30-90 miutes)
Networks: 2 or more computers commuicatig Networks formed whe distict computers commuicate via some mechaism Iside oe computer, 0/1 voltages Almost ever o etwork: too hard to make work over distace More commoly freuecy I 2016 usually ot soud freuecy, but modems i the dialup iteret era did ideed use soud
Networks everywhere All o-aciet cars Moder car is a collectio of computers Cotrollig air flow, gas flow, makig airbags work that commuicate, ad happe to have 4 wheels attached. J Likewise plaes (Almost?) all of you have a etwork i your home or dorm
Networks orgaized i layers Recurrig theme throughout Computer Sciece: Recall hierarchical decompositio For etworks, bottom level is physical substrate What are sigals beig passed o? Radio? Wire? Middle levels: How data is ecoded Freuecies of soud waves or radio waves or??? for ecodig 0's ad 1's Trasmit bit at a time? Byte at a time? Larger packets of bytes?
Higher levels of etworks Protocol of commuicatio How do I address a particular computer I wat to talk to? Or may computers? How do I tell a computer that I wat to talk to it? That I'm startig to sed it data? What it's supposed to do with the data I sed it? Whe we're doe?
Iteret: Collectio of etworks Iteret: etwork of etworks Iside your home, probably have local etwork so your computers ca talk to oe aother Likely uses wireless base statio You ca probably reach priters ad/or copy files betwee your computers Ad you coect your home etwork (via a Iteret Service Provider (ISP)) to global iteret So your home etwork part of Iteret
Iteret based o agreemets o ecodigs Iteret built o set of agreemets about How computers addressed Set of four 1-byte umbers (IPv4), or eight 2-byte umbers (IPv6). E.g., 128.248.155.15 Way of associatig these umbers with domai ames, like www.uic.edu (which probably really resolves to a set of two umbers) usig domai ame servers How computers will commuicate That data will be split ito packets with various pieces i them That computers will format their data ad talk to oe aother usig TCP/IP protocol How packets are routed aroud etwork to fid their destiatio
High-level protocols All that just lets us pass data back ad forth What does data say? What does data do? Oe of earliest applicatios (ivolvig 2 highlevel protocols) was email Aother: File trasfer Protocol (FTP) allows computers to move files betwee each other Defies what oe side says to other whe copyig file over (e.g., "STO fileame") ad how file will be ecoded Not widely used i 2016, but big whe I took CS 1
Iteret vs. Web? Which is most correct? A. Iteret ad (World Wide) Web are very roughly syoymous B. The Web is oe service of the Iteret C. The Iteret is oe service of the Web D. There is oly a very loose coectio betwee the Iteret ad the Web
Iteret is ot ew Iteret web! Iteret origially set up for military applicatios Feature: Packets still reach their destiatio eve if part is destroyed, damaged, or cesored Iteret established i 1969 as military research project with four locatios, curret key protocol TCP/IP dates to 1975; declared uiversal stadard for military computig by US DoD i 1982
The Web (Web Iteret!) Prof. Sloa heavy user of Iteret 1985 oward Web, built o top of Iteret, oly came ito beig i December 1990, ad was extremely obscure util 1993 Web is (agai) set of agreemets, started by Sir Tim Berers-Lee O how to refer to everythig o Web: URL (Uiform Resource Locator) O how to serve documets: HTTP (HyperText Trasfer Protocol) How documets formatted: HTML (HyperText Markup Laguage)
Hypertext Noliear text that liks to other text ad graphics via liks You kow it well! Some limited implemetatios, e.g., Apple Hypercard, a little before web, but The read a little, click ad read some more, the click, the. use of hypertext did't exist whe your parets ad I were i school
HyperText Trasfer Protocol (HTTP) HTTP defies very simple protocol for how to exchage iformatio betwee computers (o top of iteret protocols) Defies pieces of the commuicatio Web server is waitig for cliet i listeig state What resource do you wat? Where is it? Okay, here's the type of thig it is (HTML, JPEG, whatever), ad here it is Ad words the computers say to oe aother: Simple, e.g., "GET", "PUT", ad "OK"
Uiform Resource Locators (URLs) URLs allow us to referece ay material aywhere o Iteret Strictly speakig, ay computer providig a protocol accessible via URL Just puttig your computer o Iteret does ot mea that all of your files are accessible to everyoe o Iteret URLs have four parts: 1. Protocol to use to reach this resource, 2. Domai ame of the computer where the resource is, 3. Path o the computer to the resource, 4. Ad the ame of the resource (optioal; otio of default file). 5. (Ad optioally after that there ca be parameters itroduced with a?)
Example URL E.g., http://www.cs.uic.edu/cs477 3 parts: 1. Protocol: http (usually http or https) 2. Host ame: www.cs.uic.edu 3. Path: /CS477 (locatio of files o host) https://www.cs.uic.edu/~sloa/papers/ SloaWhyCS2014.pdf 1. Protocol: https 2. same host ame 3. Path: /~sloa/papers 4. Fileame: SloaWhyCS2014.pdf
Path also optioal Web servers (programs that uderstad HTTP protocol) typically have special directory that they serve from Files i that special directory directly reachable without specifyig path
Web browsers are cliets Your Web browser is called a cliet accessig a Web server Programs like Chrome, Firefox, or Safari uderstad a lot about Iteret protocols They kow how to iterpret HTML ad display it graphically If HTML refereces other resources, like JPEG or PNG pictures, cliet fetches them ad displays them as appropriate Cliet also kows details of HTTP (ad perhaps mailto, ftp, ad some others)
Browser ot oly way to use Iteret Pytho (ad other laguages) have modules that allow you to use these protocols I Pytho, ca read ay URL as if it was a file I Pytho 3, key library is urllib.reuest (Other parts of urllib useful for, e.g., parsig webpage text)
Importig urllib.reuest Advice: If module has dot i its ame like urllib.reuest, matplotlib.pyplot, always import it as somethig. Else Pytho ca get cofused about multiple dots whe you go to use fuctios iside it. Thus: import urllib.reuest as ur
Opeig a URL ad readig it The 60 secod "stadig o oe foot" versio so you ca do the crawler homework import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") cotet = coectio.read() coectio.close() (We've come back!)
Status coectio retured from urllib is a "HTTPRespose" object ad has a status attribute, which should be 200 if all wet well I crawlig, if you ru ito difficulties, try: import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") if (coectio.status == 200): cotet = coectio.read() <rest of real work here> coectio.close()
Readig the coectio read() method of coectio works just like read() method of files Almost: this read will give you a seuece of bytes, which ca i may but ot all cases be treated as a strig. Ca force it to be 100% strig istead of merely pretty strig-like by cotet = str(coectio.read()) usig type coversio fuctio str()