Introduction to Web Mining for Social Scientists Lecture 4: Web Scraping Workshop Prof. Dr. Ulrich Matter (University of St. Gallen) 10/10/2018


1 First Steps in R: Part II

In the previous week we looked at the very basics of using R: how to initiate a variable, R as a calculator, data structures, functions, etc. All of this was focused on executing one command after another (or several commands at once) in an interactive R session. Apart from defining a function, we have not really looked at how to program with R. A large part of basic programming has to do with automating the execution of a number of commands conditional on some control statements. That is, we want to tell the computer to do something until a certain goal is reached. In the simplest case, this boils down to a control flow statement that specifies an iteration, a so-called loop.

1.1 Loops

A loop is typically a sequence of statements that is executed a specific number of times. How often the code inside the loop is executed depends on a (hopefully) clearly defined control statement. If we know in advance how often the code inside the loop has to be executed, we typically write a so-called for-loop. If the number of iterations is not clearly known before executing the code, we typically write a so-called while-loop. The following subsections illustrate both of these concepts in R.

1.1.1 For-loops

In simple terms, a for-loop tells the computer to execute a sequence of commands for each case in a set of n cases. For example, a for-loop could be used to sum up the elements of a numeric vector of fixed length (thus the number of iterations is clearly defined). In plain English, the for-loop would state something like: "Start with 0 as the current total value; for each of the elements in the vector, add the value of that element to the current total value." Note how this logically implies that the loop stops once the value of the last element in the vector has been added to the total.
Let's illustrate this in R. Take the numeric vector c(1,2,3,4,5). A for-loop to sum up all elements can be implemented as follows:

```r
# vector to be summed up
numbers <- c(1,2,3,4,5)
# initiate total
total_sum <- 0
# number of iterations
n <- length(numbers)

# start loop
for (i in 1:n) {
  total_sum <- total_sum + numbers[i]
}

# check result
total_sum
```

```
[1] 15
```
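As a small aside (not part of the original lecture code): instead of 1:n, the loop index can be generated with seq_along(). The result is the same here, but seq_along() also behaves correctly for an empty vector, where 1:length(x) would yield c(1, 0) and run the loop body twice.

```r
# same for-loop, using seq_along() instead of 1:n
numbers <- c(1, 2, 3, 4, 5)
total_sum <- 0
for (i in seq_along(numbers)) {
  total_sum <- total_sum + numbers[i]
}
total_sum  # 15

# with an empty vector, seq_along() yields no iterations at all
empty_sum <- 0
for (i in seq_along(numeric(0))) {
  empty_sum <- empty_sum + 1
}
empty_sum  # still 0
```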

```r
# compare with the result of the sum() function
sum(numbers)
```

```
[1] 15
```

In some situations a simple for-loop might not be sufficient. Within one sequence of commands there might be another sequence of commands that also has to be executed a number of times each time the first sequence of commands is executed. In such a case we speak of a nested for-loop. We can illustrate this easily by extending the example of the numeric vector above to a matrix for which we want to sum up the values in each column. Building on the loop implemented above, we would say: for each column j of a given numeric matrix, execute the for-loop defined above.

```r
# matrix to be summed up
numbers_matrix <- matrix(1:20, ncol = 4)
numbers_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
```

```r
# number of iterations for outer loop
m <- ncol(numbers_matrix)
# number of iterations for inner loop
n <- nrow(numbers_matrix)

# start outer loop (loop over columns of matrix)
for (j in 1:m) {
  # initiate total
  total_sum <- 0
  # start inner loop
  for (i in 1:n) {
    total_sum <- total_sum + numbers_matrix[i, j]
  }
  print(total_sum)
}
```

```
[1] 15
[1] 40
[1] 65
[1] 90
```

1.1.2 While-loop

In a situation where a program has to repeatedly run a sequence of commands but we don't know in advance how many iterations are needed to reach the intended goal, a while-loop can help. In simple terms, a while-loop keeps executing a sequence of commands as long as a certain logical statement is true. The flow chart in Figure 1 illustrates this point. For example, a while-loop in plain English could state something like: "Start with 0 as the total; add 1.12 to the total until the total is larger than 20." We can implement this in R as follows.

```r
# initiate starting value
total <- 0

# start loop
```

Figure 1: While-loop illustration. Source: While-loop-diagram.svg.

```r
while (total <= 20) {
  total <- total + 1.12
}

# check the result
total
```

```
[1] 20.16
```

1.2 Loops and Web Scraping

Both types of loops are very helpful in many web scraping tasks. Note how the web scraping example of last week (the "blueprint") is designed to run only for one specific Amazon product review (based on the product id). We can easily imagine extending the scraper to gather more data. For example, we could first collect a set of product ids for which we want to collect all reviews. We could then implement this with a for-loop that iterates through the product ids and stops once all of them have been used. Alternatively, we could imagine an extension of the basic review scraper that first scrapes all the reviews of one product id and then continues to scrape all reviews of all the products that the reviewer of the initial review also reviewed, and so on, until we have collected a certain number of reviews (or reviews of a certain number of reviewers, etc.). The following extended examples show the practical use of loops in different web scraping contexts.

2 Web Scraping in Action

2.1 Extracting Voting Tables from the U.S. Senate

A simple but very practical web scraping task is to extract data from HTML tables on a website. If we have to do this only once, R might not even be necessary: we might get the data simply by marking the table in a web browser and copy-pasting it into a spreadsheet program such as Excel (and saving it as CSV, etc.). However, it is often the case that we have to repeatedly extract various tables from the same website. The following exercise shows how this can be done in the context of data on roll-call voting in the U.S. Senate. The scraper is made to extract all roll-call voting results for a given list of congresses, and combine

them in one table. The data will be extracted automatically from the official website of the U.S. Senate, where the data for the last few congresses are available on one page per session and congress. For example, the URL https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_113_1.htm points to the page providing the data for the first session of the 113th U.S. Congress. First, we inspect the source code with the developer tools and figure out how the URLs are constructed. Based on this, we define the header section of a new R script for this scraper. As we want to extract data on voting results from various congresses and sessions, we define the fixed variables CONGRESS and SESSION as vectors.

```r
# Introduction to Web Data Mining
# Lecture 4: Roll Call Data Scraper (HTML Tables)
#
# This is a basic web scraper to automatically extract data on
# roll call vote outcomes in the U.S. Senate. The data is
# extracted directly from the official government website.
# See www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_113_1.htm
# for an example of the type of page to be scraped.
#
# U. Matter, October 2017

# PREAMBLE -----

# load packages
library(httr)
library(xml2)
library(rvest)

# initiate fix variables
BASE_URL <- "https://www.senate.gov/legislative/LIS/roll_call_lists/"
CONGRESS <- c(110:114)
SESSION <- c(1, 2)
```

Following the blueprint outlined in the previous week, we write the three components of the scraper. However, in this case we aim to place all the components in a for-loop in order to iterate through all the pages from which we want to extract the tables with voting results. The three components of our web scraper will thus form the body of the for-loop. That is, they build the sequence of commands that is executed repeatedly until we have all the data we want to collect. From inspecting the website of the U.S. Senate, we learn that in order to collect all the roll-call data from the 110th to the 114th Congress, we have to iterate not only through each congress but also through each of the two sessions of a congress (each congress consists of two sessions). Thus, for each congress and each session per congress, we want to extract the voting data.
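Before adding the scraping components, it can help to enumerate the congress-session pairs the nested loop will visit, for example to check that the constructed file names look right. A quick sketch (the vote_menu_ naming pattern is taken from the scraper code; the full URLs additionally require the base URL):

```r
CONGRESS <- c(110:114)
SESSION <- c(1, 2)

# enumerate all congress-session pairs the nested loop iterates over
# (session varies fastest, matching the inner loop over sessions)
pairs <- expand.grid(session = SESSION, congress = CONGRESS)
# file names as constructed in component I of the scraper
pages <- paste0("vote_menu_", pairs$congress, "_", pairs$session, ".htm")

length(pages)  # 10 pages: 5 congresses x 2 sessions
head(pages, 2) # "vote_menu_110_1.htm" "vote_menu_110_2.htm"
```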
This implies a nested for-loop: in the outer loop we iterate through the individual congresses; in the inner loop (that is, given a specific congress), we iterate through the sessions. Another key aspect to settle before getting started is what the result of each iteration is and how we collect/merge the individual results. As the overall goal of the scraper is to extract data from HTML tables, a reasonable format in which to store the data of each iteration is a data.frame. Each iteration will thus result in a data.frame, which implies that we have to store each of these data frames while running the loop. We can do this with a list. Before starting the loop, we initiate an empty list: all_tables <- list(NULL). Then, within the loop, we add each of the extracted tables (now objects of class data.frame) as an additional element of that list. The following code chunk contains the blueprint for the loop following this strategy (without the actual loop body, i.e., the three scraper components).

```r
# initiate variables for iteration
n_congr <- length(CONGRESS)
```

```r
n_session <- length(SESSION)
all_tables <- list(NULL)

# start iteration
for (i in 1:n_congr) {
  for (j in 1:n_session) {

    # ADD COMPONENTS I TO III HERE!

    # add resulting table to list
    rc_table_list <- list(rc_table)
    all_tables <- c(all_tables, rc_table_list)
  }
}
```

Note that in order to add an extracted table (here: a data frame called rc_table) to the list, we first have to wrap it in a list, rc_table_list <- list(rc_table), and then concatenate it to the list containing all tables: all_tables <- c(all_tables, rc_table_list). The code above does not do anything on its own yet. We have to fill in the three components containing the actual scraping tasks in the body of the loop. When developing each of the components, it is helpful to write them for one iteration only (ignoring the loop for a moment). This way we can test each component step by step before iterating over it many times. A simple way to do this is to manually assign values to the index variables i and j: i <- 1; j <- 1. The first component (interaction with the server, parsing the response, etc.) is then straightforwardly implemented and tested as follows.

```r
# I) Handle URL, HTTP request and response, parse HTML

# build the URL
page <- paste0("vote_menu_", CONGRESS[i], "_", SESSION[j], ".htm")
rc_url <- paste0(BASE_URL, page)

# request webpage, parse results
rc_resp <- GET(rc_url)
rc_html <- read_html(rc_resp)
```

As usual, we have to figure out (with the help of the developer tools) how to extract the specific part of the HTML document which contains the data of interest. In this particular case, the XPath expression ".//*[@id='secondary_col2']/table" provides the result we are looking for in the second component:

```r
# II) Extract the data of interest

# extract the table
rc_table_node <- html_node(rc_html, xpath = ".//*[@id='secondary_col2']/table")
rc_table <- html_table(rc_table_node)
```

Finally, in the last component, we prepare the extracted data for further processing.
When looking at the result of the previous component (head(rc_table)), we note that the extracted table does not actually contain information about which congress and session it is from. We add this information by adding two new columns.

```r
# III) Format and save data for further processing

# add additional variables
rc_table$congress <- CONGRESS[i]
rc_table$session <- SESSION[j]
```

With this, we have the extracted data from one iteration (one congress-session pair) in the form we want. Once we have tested each of the components and are happy with the overall result for one iteration, we can add them to the body of the loop and put all parts together.
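Component III can also be tried out without a live HTTP request by running it on a small stand-in table. The data frame below is hypothetical and merely mimics the shape of the table returned by html_table():

```r
# stand-in for an extracted table (the real rc_table comes from html_table())
rc_table <- data.frame("Vote (Tally)" = c("442 (93-0)", "441 (76-17)"),
                       "Result" = c("Confirmed", "Agreed to"),
                       check.names = FALSE)
CONGRESS <- c(110:114)
SESSION <- c(1, 2)
i <- 1
j <- 1

# III) add the congress and session columns for this iteration
rc_table$congress <- CONGRESS[i]
rc_table$session <- SESSION[j]

rc_table$congress  # 110 110
rc_table$session   # 1 1
```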

```r
# Introduction to Web Mining
# Lecture 4: Roll Call Data Scraper (HTML Tables)
#
# This is a basic web scraper to automatically extract data on
# roll call vote outcomes in the U.S. Senate. The data is
# extracted directly from the official government website.
# See www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_113_1.htm
# for an example of the type of page to be scraped.
#
# U. Matter, October 2017

# PREAMBLE -----

# load packages
library(httr)
library(xml2)
library(rvest)

# initiate fix variables
BASE_URL <- "https://www.senate.gov/legislative/LIS/roll_call_lists/"
CONGRESS <- c(110:114)
SESSION <- c(1, 2)

# SCRAPER -----

# initiate variables for iteration
n_congr <- length(CONGRESS)
n_session <- length(SESSION)
all_tables <- list(NULL)

# start iteration
for (i in 1:n_congr) {
  for (j in 1:n_session) {

    # I) Handle URL, HTTP request and response, parse HTML

    # build the URL
    page <- paste0("vote_menu_", CONGRESS[i], "_", SESSION[j], ".htm")
    rc_url <- paste0(BASE_URL, page)

    # request webpage, parse results
    rc_resp <- GET(rc_url)
    rc_html <- read_html(rc_resp)

    # II) Extract the data of interest

    # extract the table
    rc_table_node <- html_node(rc_html, xpath = ".//*[@id='secondary_col2']/table")
    # alternatively: html_node(rc_html, css = "table")
    rc_table <- html_table(rc_table_node)

    # III) Format and save data for further processing

    # add additional variables
    rc_table$congress <- CONGRESS[i]
    rc_table$session <- SESSION[j]
```

```r
    # add resulting table to list
    rc_table_list <- list(rc_table)
    all_tables <- c(all_tables, rc_table_list)
  }
}
```

As a last step, once the loop has finished, we can stack the individual data frames together to get one large data frame, which we can then store locally as a CSV file to further work with the collected data.

```r
# combine all tables in one:
big_table <- do.call("rbind", all_tables)

# write result to file
write.csv(x = big_table, file = "data/3_senate_rc.csv", row.names = FALSE)
```

The first rows and columns of the resulting CSV file:

```
Vote (Tally)   Result
442 (93-0)     Confirmed
441 (76-17)    Agreed to
440 (48-46)    Rejected
439 (70-25)    Agreed to
438 (50-45)    Rejected
437 (24-71)    Rejected
```

2.2 A Simple Text Scraper for Wikipedia

In this exercise we write an R script that looks up a number of terms on Wikipedia, parses the search results, extracts the text of the found page, and saves it locally as a text file. As usual, we first inspect the website with the developer tools and have a close look at the part of the website containing the search field. We recognize that the HTML form's action attribute indicates a relative link, /w/index.php. This tells us that once a user hits enter to submit what she entered in the form, the search term will be further processed by a PHP script on Wikipedia's server. From this, however, we do not yet know how the data will be submitted, or in other words, how we have to formulate either a GET or a POST request in order to mimic a user typing queries into the search field. In order to understand how the search function on Wikipedia pages works under the hood, we open the Network panel in the Firefox Developer Tools and switch the HTML filter on (as we are only interested in the traffic related to HTML documents). We then type "Donald Trump" into the search field of the Wikipedia page and hit enter. The first entry of the network panel shows us the first transfer recorded after we hit enter.
It tells us that the search function works by sending a GET request, with a URL pointing to the PHP script discovered above, to the server. We can copy the exact URL of the GET request by right-clicking on it in the network panel and selecting Copy/Copy URL. We can then verify that this is actually how the Wikipedia search function works by pasting the copied URL back into the Firefox address bar and hitting enter. Finally, we can test whether we correctly understand how the URL for a query needs to be constructed by replacing the Donald+Trump part with Barack+Obama and seeing what

we get. Based on our insights about how the search field on Wikipedia works, we can start implementing our scraper. In the documentation of this script it is helpful to point out that there are two important types of URLs to consider here: one as an example of a page to scrape data from, and one pointing to the search function. Since different parts of the URL to Wikipedia's search function will come in handy, we define the parsed URL from our Donald Trump example as a fix variable. The aim of the scraper is to extract the text of the returned search result (the found Wikipedia entry) and store it locally in a text file. Therefore, we already define an output directory (RESULTS_DIR <- "data/wikipedia/") where the results should be stored.

```r
# Introduction to Web Mining
# Lecture 4: Wikipedia Search Form Scraper
#
# This is a basic web scraper to automatically look up search terms
# in Wikipedia and extract the text of the returned page.
# See en.wikipedia.org/wiki/Donald_Trump for an example of the type
# of page to be scraped. See
# en.wikipedia.org/w/index.php?search=Donald+Trump for the type of
# URL used by Wikipedia's search function.
#
# U. Matter, October 2017

# PREAMBLE -----

# load packages
library(httr)
library(xml2)
library(rvest)
library(stringi)

# initiate fix variables
SEARCH_URL <- parse_url("https://en.wikipedia.org/w/index.php?search=Donald+Trump")
SEARCH_TERM <- "Barack Obama"
RESULTS_DIR <- "data/wikipedia/"
```

As we have parsed the rather complex URL used to perform searches on Wikipedia in the example above, we can simply modify the resulting object by replacing the respective query parameter (search), SEARCH_URL$query$search <- SEARCH_TERM, and then use the function build_url() to construct the URL for an individual request. The rest of the first component follows straightforwardly from the blueprint.
```r
# I) URL, HANDLE HTTP REQUEST AND THE RESPONSE ----

# build the URL (update search term)
SEARCH_URL$query$search <- SEARCH_TERM

# fetch the website via an HTTP GET request
URL <- build_url(SEARCH_URL)
search_result <- GET(URL)

# parse the content of the response (the html code)
search_result_html <- read_html(search_result)
# or, alternatively: body <- content(search_result)
```

In the second component, we first identify the part of the parsed HTML document that we want to extract. Given how Wikipedia pages are currently built, a straightforward way to do this is to select all paragraphs (<p>) that are embedded in a <div> tag of class mw-parser-output. The XPath expression ".//*[@class='mw-parser-output']/p" thus captures all the HTML elements with content of

interest. In order to extract the text from those elements, we simply apply the html_text() function.

```r
# II) filter HTML, extract data ----
content_nodes <- html_nodes(search_result_html,
                            xpath = ".//*[@class='mw-parser-output']/p")
content_text <- html_text(content_nodes)
```

Finally, in the last component we define the name of the text file to which we want to save the extracted text.[1]

```r
# III) write text to file ----
filepath <- paste0(RESULTS_DIR,
                   stri_replace_all_fixed(str = SEARCH_TERM, " ", ""),
                   ".txt")
write(content_text, filepath)
```

Putting all parts together, we can start using this script to automate the extraction of text from Wikipedia for any search term. Given the previous exercise, it should be straightforward to tweak this script in order to extract text from various pages based on a number of search terms (via a loop).

```r
# Introduction to Web Mining
# Lecture 4: Wikipedia Search Form Scraper
#
# This is a basic web scraper to automatically look up search terms
# in Wikipedia and extract the text of the returned page.
# See en.wikipedia.org/wiki/Donald_Trump for an example of the type
# of page to be scraped. See
# en.wikipedia.org/w/index.php?search=Donald+Trump for the type of
# URL used by Wikipedia's search function.
#
# U. Matter, October 2017

# PREAMBLE -----

# load packages
library(httr)
library(xml2)
library(rvest)
library(stringi)

# initiate fix variables
SEARCH_URL <- parse_url("https://en.wikipedia.org/w/index.php?search=Donald+Trump")
SEARCH_TERM <- "Barack Obama"
RESULTS_DIR <- "data/wikipedia/"

# I) URL, HANDLE HTTP REQUEST AND THE RESPONSE ----

# build the URL (update search term)
SEARCH_URL$query$search <- SEARCH_TERM

# fetch the website via an HTTP GET request
URL <- build_url(SEARCH_URL)
search_result <- GET(URL)

# parse the content of the response (the html code)
search_result_html <- read_html(search_result)
# or, alternatively: body <- content(search_result)
```

[1] The function stri_replace_all_fixed() is used here to remove all the white space from the search term. Thus, in the case of a search with the term Donald Trump, the extracted data would be stored in a text file with the path data/wikipedia/DonaldTrump.txt.

```r
# II) filter HTML, extract data ----
content_nodes <- html_nodes(search_result_html,
                            xpath = ".//*[@class='mw-parser-output']/p")
content_text <- html_text(content_nodes)

# III) write text to file ----
filepath <- paste0(RESULTS_DIR,
                   stri_replace_all_fixed(str = SEARCH_TERM, " ", ""),
                   ".txt")
write(content_text, filepath)
```

3 References


More information

REST in a Nutshell: A Mini Guide for Python Developers

REST in a Nutshell: A Mini Guide for Python Developers REST in a Nutshell: A Mini Guide for Python Developers REST is essentially a set of useful conventions for structuring a web API. By "web API", I mean an API that you interact with over HTTP - making requests

More information

XML Processing & Web Services. Husni Husni.trunojoyo.ac.id

XML Processing & Web Services. Husni Husni.trunojoyo.ac.id XML Processing & Web Services Husni Husni.trunojoyo.ac.id Based on Randy Connolly and Ricardo Hoar Fundamentals of Web Development, Pearson Education, 2015 Objectives 1 XML Overview 2 XML Processing 3

More information

Tips & Tricks Making Accessible MS Word Documents

Tips & Tricks Making Accessible MS Word Documents Use Headings Why? Screen readers do not read underline and bold as headings. A screen reader user will not know that text is a heading unless you designate it as such. When typing a new section heading,

More information

Web scraping tools, a real life application

Web scraping tools, a real life application Web scraping tools, a real life application ESTP course on Automated collection of online proces: sources, tools and methodological aspects Guido van den Heuvel, Dick Windmeijer, Olav ten Bosch, Statistics

More information

Using Dreamweaver. 5 More Page Editing. Bulleted and Numbered Lists

Using Dreamweaver. 5 More Page Editing. Bulleted and Numbered Lists Using Dreamweaver 5 By now, you should have a functional template, with one simple page based on that template. For the remaining pages, we ll create each page based on the template and then save each

More information

INTRODUCTION (1) Recognize HTML code (2) Understand the minimum requirements inside a HTML page (3) Know what the viewer sees and the system uses

INTRODUCTION (1) Recognize HTML code (2) Understand the minimum requirements inside a HTML page (3) Know what the viewer sees and the system uses Assignment Two: The Basic Web Code INTRODUCTION HTML (Hypertext Markup Language) In the previous assignment you learned that FTP was just another language that computers use to communicate. The same holds

More information

E-Business Systems 1 INTE2047 Lab Exercises. Lab 5 Valid HTML, Home Page & Editor Tables

E-Business Systems 1 INTE2047 Lab Exercises. Lab 5 Valid HTML, Home Page & Editor Tables Lab 5 Valid HTML, Home Page & Editor Tables Navigation Topics Covered Server Side Includes (SSI) PHP Scripts menu.php.htaccess assessment.html labtasks.html Software Used: HTML Editor Background Reading:

More information

get set up for today s workshop

get set up for today s workshop get set up for today s workshop Please open the following in Firefox: 1. Poll: bit.ly/iuwim25 Take a brief poll before we get started 2. Python: www.pythonanywhere.com Create a free account Click on Account

More information

2nd Year PhD Student, CMU. Research: mashups and end-user programming (EUP) Creator of Marmite

2nd Year PhD Student, CMU. Research: mashups and end-user programming (EUP) Creator of Marmite Mashups Jeff Wong Human-Computer Interaction Institute Carnegie Mellon University jeffwong@cmu.edu Who am I? 2nd Year PhD Student, HCII @ CMU Research: mashups and end-user programming (EUP) Creator of

More information

c122sep814.notebook September 08, 2014 All assignments should be sent to Backup please send a cc to this address

c122sep814.notebook September 08, 2014 All assignments should be sent to Backup please send a cc to this address All assignments should be sent to p.grocer@rcn.com Backup please send a cc to this address Note that I record classes and capture Smartboard notes. They are posted under audio and Smartboard under XHTML

More information

Creating Codes with Spreadsheet Upload

Creating Codes with Spreadsheet Upload Creating Codes with Spreadsheet Upload In order to create a code, you must first have a group, prefix and account set up and associated to each other. This document will provide instructions on creating

More information

Group Administrator Guide

Group Administrator Guide Get Started... 4 What a Group Administrator Can Do... 7 About Premier... 10 Use Premier... 11 Use the AT&T IP Flexible Reach Customer Portal... 14 Search and Listing Overview... 17 What s New in the Group

More information

Year 8 Computing Science End of Term 3 Revision Guide

Year 8 Computing Science End of Term 3 Revision Guide Year 8 Computing Science End of Term 3 Revision Guide Student Name: 1 Hardware: any physical component of a computer system. Input Device: a device to send instructions to be processed by the computer

More information

Using Dreamweaver CC. 5 More Page Editing. Bulleted and Numbered Lists

Using Dreamweaver CC. 5 More Page Editing. Bulleted and Numbered Lists Using Dreamweaver CC 5 By now, you should have a functional template, with one simple page based on that template. For the remaining pages, we ll create each page based on the template and then save each

More information

Hyper- Any time any where go to any web pages. Text- Simple Text. Markup- What will you do

Hyper- Any time any where go to any web pages. Text- Simple Text. Markup- What will you do HTML Interview Questions and Answers What is HTML? Answer1: HTML, or HyperText Markup Language, is a Universal language which allows an individual using special code to create web pages to be viewed on

More information

BeautifulSoup: Web Scraping with Python

BeautifulSoup: Web Scraping with Python : Web Scraping with Python Andrew Peterson Apr 9, 2013 files available at: https://github.com/aristotle-tek/_pres Roadmap Uses: data types, examples... Getting Started downloading files with wget : in

More information

HTML and CSS a further introduction

HTML and CSS a further introduction HTML and CSS a further introduction By now you should be familiar with HTML and CSS and what they are, HTML dictates the structure of a page, CSS dictates how it looks. This tutorial will teach you a few

More information

5/10/2009. Introduction. The light-saber is a Jedi s weapon not as clumsy or random as a blaster.

5/10/2009. Introduction. The light-saber is a Jedi s weapon not as clumsy or random as a blaster. The Hacking Protocols and The Hackers Sword The light-saber is a Jedi s weapon not as clumsy or random as a blaster. Obi-Wan Kenobi, Star Wars: Episode IV Slide 2 Introduction Why are firewalls basically

More information

Quick.JS Documentation

Quick.JS Documentation Quick.JS Documentation Release v0.6.1-beta Michael Krause Jul 22, 2017 Contents 1 Installing and Setting Up 1 1.1 Installation................................................ 1 1.2 Setup...................................................

More information

Lab 4: Bash Scripting

Lab 4: Bash Scripting Lab 4: Bash Scripting February 20, 2018 Introduction This lab will give you some experience writing bash scripts. You will need to sign in to https://git-classes. mst.edu and git clone the repository for

More information

Creating an with Constant Contact. A step-by-step guide

Creating an  with Constant Contact. A step-by-step guide Creating an Email with Constant Contact A step-by-step guide About this Manual Once your Constant Contact account is established, use this manual as a guide to help you create your email campaign Here

More information

Aligned Elements Importer V user manual. Aligned AG Tellstrasse Zürich Phone: +41 (0)

Aligned Elements Importer V user manual. Aligned AG Tellstrasse Zürich Phone: +41 (0) Aligned Elements Importer V2.4.211.14302 user manual Aligned AG Tellstrasse 13 8004 Zürich Phone: +41 (0)44 312 50 20 www.aligned.ch info@aligned.ch Table of Contents 1.1 Introduction...3 1.2 Installation...3

More information

Title and Modify Page Properties

Title and Modify Page Properties Dreamweaver After cropping out all of the pieces from Photoshop we are ready to begin putting the pieces back together in Dreamweaver. If we were to layout all of the pieces on a table we would have graphics

More information

CREATING WEBSITES. What you need to build a website Part One The Basics. Chas Large. Welcome one and all

CREATING WEBSITES. What you need to build a website Part One The Basics. Chas Large. Welcome one and all Slide 1 CREATING WEBSITES What you need to build a website Part One The Basics Chas Large Welcome one and all Short intro about Chas large TV engineer, computer geek, self taught, became IT manager in

More information

CREATING ACCESSIBLE SPREADSHEETS IN MICROSOFT EXCEL 2010/13 (WINDOWS) & 2011 (MAC)

CREATING ACCESSIBLE SPREADSHEETS IN MICROSOFT EXCEL 2010/13 (WINDOWS) & 2011 (MAC) CREATING ACCESSIBLE SPREADSHEETS IN MICROSOFT EXCEL 2010/13 (WINDOWS) & 2011 (MAC) Screen readers and Excel Users who are blind rely on software called a screen reader to interact with spreadsheets. Screen

More information

CSSE 460 Computer Networks Group Projects: Implement a Simple HTTP Web Proxy

CSSE 460 Computer Networks Group Projects: Implement a Simple HTTP Web Proxy CSSE 460 Computer Networks Group Projects: Implement a Simple HTTP Web Proxy Project Overview In this project, you will implement a simple web proxy that passes requests and data between a web client and

More information

DATA COLLECTION. Slides by WESLEY WILLETT 13 FEB 2014

DATA COLLECTION. Slides by WESLEY WILLETT 13 FEB 2014 DATA COLLECTION Slides by WESLEY WILLETT INFO VISUAL 340 ANALYTICS D 13 FEB 2014 WHERE DOES DATA COME FROM? We tend to think of data as a thing in a database somewhere WHY DO YOU NEED DATA? (HINT: Usually,

More information

Programming Lab 1 (JS Hwk 3) Due Thursday, April 28

Programming Lab 1 (JS Hwk 3) Due Thursday, April 28 Programming Lab 1 (JS Hwk 3) Due Thursday, April 28 Lab You may work with partners for these problems. Make sure you put BOTH names on the problems. Create a folder named JSLab3, and place all of the web

More information

Eng 110, Spring Week 03 Lab02- Dreamwaver Session

Eng 110, Spring Week 03 Lab02- Dreamwaver Session Eng 110, Spring 2008 Week 03 Lab02- Dreamwaver Session Assignment Recreate the 3-page website you did last week by using Dreamweaver. You should use tables to control your layout. You should modify fonts,

More information

Cascading style sheets

Cascading style sheets Cascading style sheets The best way to create websites is to keep the content separate from the presentation. The best way to create websites is to keep the content separate from the presentation. HTML

More information

XML: some structural principles

XML: some structural principles XML: some structural principles Hayo Thielecke University of Birmingham www.cs.bham.ac.uk/~hxt October 18, 2011 1 / 25 XML in SSC1 versus First year info+web Information and the Web is optional in Year

More information

Zend Studio has the reputation of being one of the most mature and powerful

Zend Studio has the reputation of being one of the most mature and powerful Exploring the developer environment RAPID DEVELOPMENT PHP experts consider Zend Studio the most mature and feature-rich IDE for PHP. The latest version offers enhanced database manipulation and other improvements.

More information

Screen Scraping. Screen Scraping Defintions ( Web Scraping (

Screen Scraping. Screen Scraping Defintions (  Web Scraping ( Screen Scraping Screen Scraping Defintions (http://www.wikipedia.org/) Originally, it referred to the practice of reading text data from a computer display terminal's screen. This was generally done by

More information

Alpha College of Engineering and Technology. Question Bank

Alpha College of Engineering and Technology. Question Bank Alpha College of Engineering and Technology Department of Information Technology and Computer Engineering Chapter 1 WEB Technology (2160708) Question Bank 1. Give the full name of the following acronyms.

More information

Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL

Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL Chapter 9 Your First Multi-Page Scrape Skills you will learn: How to make requests to multiple URLs using For loops and by altering the URL In this tutorial, we will pick up from the detailed example from

More information

CIS 194: Homework 6. Due Friday, October 17, Preface. Setup. Generics. No template file is provided for this homework.

CIS 194: Homework 6. Due Friday, October 17, Preface. Setup. Generics. No template file is provided for this homework. CIS 194: Homework 6 Due Friday, October 17, 2014 No template file is provided for this homework. Download the markets.json file from the website, and make your HW06.hs Haskell file with your name, any

More information

Reading How the Web Works

Reading How the Web Works Reading 1.3 - How the Web Works By Jonathan Lane Introduction Every so often, you get offered a behind-the-scenes look at the cogs and fan belts behind the action. Today is your lucky day. In this article

More information

Software Development & Education Center PHP 5

Software Development & Education Center PHP 5 Software Development & Education Center PHP 5 (CORE) Detailed Curriculum Core PHP Introduction Classes & Objects Object based & Object Oriented Programming Three Tier Architecture HTML & significance of

More information

Web Security. Jace Baker, Nick Ramos, Hugo Espiritu, Andrew Le

Web Security. Jace Baker, Nick Ramos, Hugo Espiritu, Andrew Le Web Security Jace Baker, Nick Ramos, Hugo Espiritu, Andrew Le Topics Web Architecture Parameter Tampering Local File Inclusion SQL Injection XSS Web Architecture Web Request Structure Web Request Structure

More information

Introduction to Computer Science Web Development

Introduction to Computer Science Web Development Introduction to Computer Science Web Development Flavio Esposito http://cs.slu.edu/~esposito/teaching/1080/ Lecture 14 Lecture outline Discuss HW Intro to Responsive Design Media Queries Responsive Layout

More information

CS109 Data Science Data Munging

CS109 Data Science Data Munging CS109 Data Science Data Munging Hanspeter Pfister & Joe Blitzstein pfister@seas.harvard.edu / blitzstein@stat.harvard.edu http://dilbert.com/strips/comic/2008-05-07/ Enrollment Numbers 377 including all

More information

Introduction April 27 th 2016

Introduction April 27 th 2016 Social Web Mining Summer Term 2016 1 Introduction April 27 th 2016 Dr. Darko Obradovic Insiders Technologies GmbH Kaiserslautern d.obradovic@insiders-technologies.de Outline for Today 1.1 1.2 1.3 1.4 1.5

More information

An Online Interactive Database Platform For Career Searching

An Online Interactive Database Platform For Career Searching 22 Int'l Conf. Information and Knowledge Engineering IKE'18 An Online Interactive Database Platform For Career Searching Brandon St. Amour Zizhong John Wang Department of Mathematics and Computer Science

More information

.. Documentation. Release 0.4 beta. Author

.. Documentation. Release 0.4 beta. Author .. Documentation Release 0.4 beta Author May 06, 2015 Contents 1 Browser 3 1.1 Basic usages............................................... 3 1.2 Form manipulation............................................

More information

COS 116 The Computational Universe Laboratory 1: Web 2.0

COS 116 The Computational Universe Laboratory 1: Web 2.0 COS 116 The Computational Universe Laboratory 1: Web 2.0 Must be completed by the noon Tuesday, February 9, 2010. In this week s lab, you ll explore some web sites that encourage collaboration among their

More information

Assignment 0. Nothing here to hand in

Assignment 0. Nothing here to hand in Assignment 0 Nothing here to hand in The questions here have solutions attached. Follow the solutions to see what to do, if you cannot otherwise guess. Though there is nothing here to hand in, it is very

More information

Adding Content to Blackboard

Adding Content to Blackboard Adding Content to Blackboard Objectives... 2 Task Sheet for: Adding Content to Blackboard... 3 What is Content?...4 Presentation Type and File Formats... 5 The Syllabus Example... 6 PowerPoint Example...

More information

Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms

Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms Chapter 9 Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms Skills you will learn: Basic setup of the Selenium library, which allows you to control a web browser from a

More information

HTML and CSS COURSE SYLLABUS

HTML and CSS COURSE SYLLABUS HTML and CSS COURSE SYLLABUS Overview: HTML and CSS go hand in hand for developing flexible, attractively and user friendly websites. HTML (Hyper Text Markup Language) is used to show content on the page

More information

Copyright 2014 Blue Net Corporation. All rights reserved

Copyright 2014 Blue Net Corporation. All rights reserved a) Abstract: REST is a framework built on the principle of today's World Wide Web. Yes it uses the principles of WWW in way it is a challenge to lay down a new architecture that is already widely deployed

More information

This document provides a concise, introductory lesson in HTML formatting.

This document provides a concise, introductory lesson in HTML formatting. Tip Sheet This document provides a concise, introductory lesson in HTML formatting. Introduction to HTML In their simplest form, web pages contain plain text and formatting tags. The formatting tags are

More information

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4 Prof. James She james.she@ust.hk 1 Selected Works of Activity 4 2 Selected Works of Activity 4 3 Last lecture 4 Mid-term

More information

Introduction to Web Development

Introduction to Web Development Introduction to Web Development Lecture 1 CGS 3066 Fall 2016 September 8, 2016 Why learn Web Development? Why learn Web Development? Reach Today, we have around 12.5 billion web enabled devices. Visual

More information

CS Exam 1 Review Suggestions - Spring 2017

CS Exam 1 Review Suggestions - Spring 2017 CS 328 - Exam 1 Review Suggestions p. 1 CS 328 - Exam 1 Review Suggestions - Spring 2017 last modified: 2017-02-16 You are responsible for material covered in class sessions and homeworks; but, here's

More information

Project 2 Implementing a Simple HTTP Web Proxy

Project 2 Implementing a Simple HTTP Web Proxy Project 2 Implementing a Simple HTTP Web Proxy Overview: CPSC 460 students are allowed to form a group of up to 3 students. CPSC 560 students each must take it as an individual project. This project aims

More information

(try adding using css to add some space between the bottom of the art div and the reset button, this can be done using Margins)

(try adding using css to add some space between the bottom of the art div and the reset button, this can be done using Margins) Pixel Art Editor Extra Challenges 1. Adding a Reset button Add a reset button to your HTML, below the #art div. Pixels go here reset The result should look something

More information

HTML4 TUTORIAL PART 2

HTML4 TUTORIAL PART 2 HTML4 TUTORIAL PART 2 STEP 1 - CREATING A WEB DESIGN FOLDER ON YOUR H DRIVE It will be necessary to create a folder in your H Drive to keep all of your web page items for this tutorial. Follow these steps:

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

CSE 143: Computer Programming II Spring 2015 HW2: HTMLManager (due Thursday, April 16, :30pm)

CSE 143: Computer Programming II Spring 2015 HW2: HTMLManager (due Thursday, April 16, :30pm) CSE 143: Computer Programming II Spring 2015 HW2: HTMLManager (due Thursday, April 16, 2015 11:30pm) This assignment focuses on using Stack and Queue collections. Turn in the following files using the

More information

Data Interfaces in R. Tushar B. Kute,

Data Interfaces in R. Tushar B. Kute, Data Interfaces in R Tushar B. Kute, http://tusharkute.com Data Interfaces In R, we can read data from files stored outside the R environment. We can also write data into files which will be stored and

More information