CS 111: Program Design I Lecture 18: Web and getting text from it

Similar documents
CS 111: Program Design I Lecture 19: Networks, the Web, and getting text from the Web in Python

CS 111: Program Design I Lecture 20: Web crawling, HTML, Copyright

CS 111: Program Design I Lecture # 7: First Loop, Web Crawler, Functions

CS 111: Program Design I Lecture 21: Network Analysis. Robert H. Sloan & Richard Warner University of Illinois at Chicago April 10, 2018

CS 111: Program Design I Lecture 16: Module Review, Encodings, Lists

CS 111: Program Design I Lecture 15: Modules, Pandas again. Robert H. Sloan & Richard Warner University of Illinois at Chicago March 8, 2018

CS 111: Program Design I Lecture 15: Objects, Pandas, Modules. Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CS 111: Program Design I Lecture #26: Heat maps, Nothing, Predictive Policing

CS 111 Green: Program Design I Lecture 27: Speed (cont.); parting thoughts

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components

CS 111: Program Design I Lecture 14: Encodings & Files concluded; Pandas, Modules, legal data analytics

CS 111: Program Design I Lecture 20: Web crawling, HTML, Copyright

Python Programming: An Introduction to Computer Science

Exceptions. Your computer takes exception. The Exception Class. Causes of Exceptions

Basic Design Principles

CS 111: Program Design I Lecture # 7: Web Crawler, Functions; Open Access

Lecture 9: Exam I Review

Message Integrity and Hash Functions. TELE3119: Week4

CS 11 C track: lecture 1

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #2: Randomized MST and MST Verification January 14, 2015

CSE 111 Bio: Program Design I Class 11: loops

Python Programming: An Introduction to Computer Science

MOTIF XF Extension Owner s Manual

Computers and Scientific Thinking

n Some thoughts on software development n The idea of a calculator n Using a grammar n Expression evaluation n Program organization n Analysis

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

CHAPTER IV: GRAPH THEORY. Section 1: Introduction to Graphs

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Customer Portal Quick Reference User Guide

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

Location Steps and Paths

Module 8-7: Pascal s Triangle and the Binomial Theorem

top() Applications of Stacks

K-NET bus. When several turrets are connected to the K-Bus, the structure of the system is as showns

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Lecture 7 7 Refraction and Snell s Law Reading Assignment: Read Kipnis Chapter 4 Refraction of Light, Section III, IV

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Elementary Educational Computer

Course Information. Details. Topics. Network Examples. Overview. Walrand Lecture 1. EECS 228a. EECS 228a Lecture 1 Overview: Networks

Recursion. Computer Science S-111 Harvard University David G. Sullivan, Ph.D. Review: Method Frames

IS-IS in Detail. ISP Workshops

Threads and Concurrency in Java: Part 2

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Term Project Report. This component works to detect gesture from the patient as a sign of emergency message and send it to the emergency manager.

Interface Changes. What s New. User Interface Themes IN THIS CHAPTER

CSE 111 Bio: Program Design I Lecture 17: software development, list methods

Table 2 GSM, UMTS and LTE Coverage Levels

Classes and Objects. Again: Distance between points within the first quadrant. José Valente de Oliveira 4-1

Intro to Scientific Computing: Solutions

Using the Keyboard. Using the Wireless Keyboard. > Using the Keyboard

Chapter 4 The Datapath

Goals of the Lecture UML Implementation Diagrams

TUTORIAL Create Playlist Helen Doron Course

CS 111: Program Design I Lecture 5: US Law when others have encryption keys; if, for

27 Refraction, Dispersion, Internal Reflection

L I N U X. Unit 6 S Y S T E M DHCP & DNS (BIND) A D M I N I S T R A T I O n DPW

Examples and Applications of Binary Search

CS 111: Program Design I Lecture 25: Social networks, nothingness. Robert H. Sloan & Richard Warner University of Illinois at Chicago April 24, 2018

One advantage that SONAR has over any other music-sequencing product I ve worked

IMP: Superposer Integrated Morphometrics Package Superposition Tool

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

The isoperimetric problem on the hypercube

Workflow model GM AR. Gumpy. Dynagump. At a very high level, this is what gump does. We ll be looking at each of the items described here seperately.

Chapter 10. Defining Classes. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Bike MS: 2013 Participant Center guide

Bike MS: 2014 Participant Center guide

Sharing Collections. Share a Collection via . Share a Collection via Google Classroom. Quick Reference Guide

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Web OS Switch Software

Guide to Applying Online

Deploying 32-bit ASNs

n Based on unrealistic growth forecast n Overcapacity: Fiber 5x100 in three years n Wireless: Expensive spectrum licenses n Fibers

Network Time Protocol (NTP)

BGP Attributes and Path Selection. ISP Training Workshops

1.2 Binomial Coefficients and Subsets

Global Support Guide. Verizon WIreless. For the BlackBerry 8830 World Edition Smartphone and the Motorola Z6c

SD vs. SD + One of the most important uses of sample statistics is to estimate the corresponding population parameters.

Ones Assignment Method for Solving Traveling Salesman Problem

Overview. Chapter 18 Vectors and Arrays. Reminder. vector. Bjarne Stroustrup

10/23/18. File class in Java. Scanner reminder. Files. Opening a file for reading. Scanner reminder. File Input and Output

Graphs ORD SFO LAX DFW

BIKE MS: 2015 PARTICIPANT CENTER GUIDE

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today

University of North Carolina at Charlotte ECGR-6185 ADVANCED EMBEDDED SYSTEMS SMART CARDS. Sravanthi Chalasani

3D Model Retrieval Method Based on Sample Prediction

Which movie we can suggest to Anne?

Floristic Quality Assessment (FQA) Calculator for Colorado User s Guide

Weston Anniversary Fund

Avid Interplay Bundle

1. SWITCHING FUNDAMENTALS

Chapter 3 Classification of FFT Processor Algorithms

Numerical Methods Lecture 6 - Curve Fitting Techniques

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition

Architectural styles for software systems The client-server style

Requirements Analysis

Exercise Set: Implementing an Object-Oriented Design

Firewall and IDS. TELE3119: Week8

Transcription:

CS 111: Program Desig I Lecture 18: Web ad gettig text from it Robert H. Sloa & Richard Warer Uiversity of Illiois at Chicago October 25, 2016

Goals Lear about Iteret ad how to access it directly from a program Ad use this as fial piece to fiish program

Grace Hopper Celebratio of Wome i Computig 2016 Where I was with ~25 UIC wome CS studets last week With 15,000 other people Heard from Lataya Sweeey, Harvard Prof. ad formerly Chief Techologist of Federal Trade Commissio about our kid of issues, ad from Megha Smith, CTO of the US Ad from a former CS 111 studet about crawlig the web

Our aluma She took CS 111 i 2011 or 2012 Now works for Google Chicago She told me a full Google web crawl reuires 500 Petabytes to store Did't cover those at start of class because 1 Petabyte = 1000 Terabytes! (So petabytes do't come up much i cosumer use!) Ad that for jobs i YouTube part of Google key skill i Pytho

Wome i this CS 111 Keep your eyes ope i Sprig for your chace to atted Grace Hopper 2017

Thursday No Class: Atted Distiguished Lecture: Presidet Maria Klawe of Harvey Mudd College, oe of exceedigly few computer scietists who are college/ uiversity Presidets 605 SCE, 11am Gettig more wome ito tech careers ad why it matters

Opeig a URL ad readig it The 60 secod "stadig o oe foot" versio so you ca do the crawler homework import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") cotet = coectio.read() coectio.close() (We'll come back to this i 30-90 miutes)

Networks: 2 or more computers commuicatig Networks formed whe distict computers commuicate via some mechaism Iside oe computer, 0/1 voltages Almost ever o etwork: too hard to make work over distace More commoly freuecy I 2016 usually ot soud freuecy, but modems i the dialup iteret era did ideed use soud

Networks everywhere All o-aciet cars Moder car is a collectio of computers Cotrollig air flow, gas flow, makig airbags work that commuicate, ad happe to have 4 wheels attached. J Likewise plaes (Almost?) all of you have a etwork i your home or dorm

Networks orgaized i layers Recurrig theme throughout Computer Sciece: Recall hierarchical decompositio For etworks, bottom level is physical substrate What are sigals beig passed o? Radio? Wire? Middle levels: How data is ecoded Freuecies of soud waves or radio waves or??? for ecodig 0's ad 1's Trasmit bit at a time? Byte at a time? Larger packets of bytes?

Higher levels of etworks Protocol of commuicatio How do I address a particular computer I wat to talk to? Or may computers? How do I tell a computer that I wat to talk to it? That I'm startig to sed it data? What it's supposed to do with the data I sed it? Whe we're doe?

Iteret: Collectio of etworks Iteret: etwork of etworks Iside your home, probably have local etwork so your computers ca talk to oe aother Likely uses wireless base statio You ca probably reach priters ad/or copy files betwee your computers Ad you coect your home etwork (via a Iteret Service Provider (ISP)) to global iteret So your home etwork part of Iteret

Iteret based o agreemets o ecodigs Iteret built o set of agreemets about How computers addressed Set of four 1-byte umbers (IPv4), or eight 2-byte umbers (IPv6). E.g., 128.248.155.15 Way of associatig these umbers with domai ames, like www.uic.edu (which probably really resolves to a set of two umbers) usig domai ame servers How computers will commuicate That data will be split ito packets with various pieces i them That computers will format their data ad talk to oe aother usig TCP/IP protocol How packets are routed aroud etwork to fid their destiatio

High-level protocols All that just lets us pass data back ad forth What does data say? What does data do? Oe of earliest applicatios (ivolvig 2 highlevel protocols) was email Aother: File trasfer Protocol (FTP) allows computers to move files betwee each other Defies what oe side says to other whe copyig file over (e.g., "STO fileame") ad how file will be ecoded Not widely used i 2016, but big whe I took CS 1

Iteret vs. Web? Which is most correct? A. Iteret ad (World Wide) Web are very roughly syoymous B. The Web is oe service of the Iteret C. The Iteret is oe service of the Web D. There is oly a very loose coectio betwee the Iteret ad the Web

Iteret is ot ew Iteret web! Iteret origially set up for military applicatios Feature: Packets still reach their destiatio eve if part is destroyed, damaged, or cesored Iteret established i 1969 as military research project with four locatios, curret key protocol TCP/IP dates to 1975; declared uiversal stadard for military computig by US DoD i 1982

The Web (Web Iteret!) Prof. Sloa heavy user of Iteret 1985 oward Web, built o top of Iteret, oly came ito beig i December 1990, ad was extremely obscure util 1993 Web is (agai) set of agreemets, started by Sir Tim Berers-Lee O how to refer to everythig o Web: URL (Uiform Resource Locator) O how to serve documets: HTTP (HyperText Trasfer Protocol) How documets formatted: HTML (HyperText Markup Laguage)

Hypertext Noliear text that liks to other text ad graphics via liks You kow it well! Some limited implemetatios, e.g., Apple Hypercard, a little before web, but The read a little, click ad read some more, the click, the. use of hypertext did't exist whe your parets ad I were i school

HyperText Trasfer Protocol (HTTP) HTTP defies very simple protocol for how to exchage iformatio betwee computers (o top of iteret protocols) Defies pieces of the commuicatio Web server is waitig for cliet i listeig state What resource do you wat? Where is it? Okay, here's the type of thig it is (HTML, JPEG, whatever), ad here it is Ad words the computers say to oe aother: Simple, e.g., "GET", "PUT", ad "OK"

Uiform Resource Locators (URLs) URLs allow us to referece ay material aywhere o Iteret Strictly speakig, ay computer providig a protocol accessible via URL Just puttig your computer o Iteret does ot mea that all of your files are accessible to everyoe o Iteret URLs have four parts: 1. Protocol to use to reach this resource, 2. Domai ame of the computer where the resource is, 3. Path o the computer to the resource, 4. Ad the ame of the resource (optioal; otio of default file). 5. (Ad optioally after that there ca be parameters itroduced with a?)

Example URL E.g., http://www.cs.uic.edu/cs477 3 parts: 1. Protocol: http (usually http or https) 2. Host ame: www.cs.uic.edu 3. Path: /CS477 (locatio of files o host) https://www.cs.uic.edu/~sloa/papers/ SloaWhyCS2014.pdf 1. Protocol: https 2. same host ame 3. Path: /~sloa/papers 4. Fileame: SloaWhyCS2014.pdf

Path also optioal Web servers (programs that uderstad HTTP protocol) typically have special directory that they serve from Files i that special directory directly reachable without specifyig path

Web browsers are cliets Your Web browser is called a cliet accessig a Web server Programs like Chrome, Firefox, or Safari uderstad a lot about Iteret protocols They kow how to iterpret HTML ad display it graphically If HTML refereces other resources, like JPEG or PNG pictures, cliet fetches them ad displays them as appropriate Cliet also kows details of HTTP (ad perhaps mailto, ftp, ad some others)

Browser ot oly way to use Iteret Pytho (ad other laguages) have modules that allow you to use these protocols I Pytho, ca read ay URL as if it was a file I Pytho 3, key library is urllib.reuest (Other parts of urllib useful for, e.g., parsig webpage text)

Importig urllib.reuest Advice: If module has dot i its ame like urllib.reuest, matplotlib.pyplot, always import it as somethig. Else Pytho ca get cofused about multiple dots whe you go to use fuctios iside it. Thus: import urllib.reuest as ur

Opeig a URL ad readig it The 60 secod "stadig o oe foot" versio so you ca do the crawler homework import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") cotet = coectio.read() coectio.close() (We've come back!)

Status coectio retured from urllib is a "HTTPRespose" object ad has a status attribute, which should be 200 if all wet well I crawlig, if you ru ito difficulties, try: import urllib.reuest as ur coectio= ur.urlope("https://www.cs.uic.edu") if (coectio.status == 200): cotet = coectio.read() <rest of real work here> coectio.close()

Readig the coectio read() method of coectio works just like read() method of files Almost: this read will give you a seuece of bytes, which ca i may but ot all cases be treated as a strig. Ca force it to be 100% strig istead of merely pretty strig-like by cotet = str(coectio.read()) usig type coversio fuctio str()