Collecting Data from the Programmable Web
Introduction to Web Mining for Social Scientists
Lecture 7: Collecting Data from the Programmable Web II
Prof. Dr. Ulrich Matter (University of St. Gallen)
14/11/

1 Collecting Data from the Programmable Web

The programmable web, in contrast to the old web, offers web developers new opportunities to integrate and share data across different applications over the web. In recent lectures we have learned about some of the key technological aspects of this programmable web and of dynamic websites:

- Web Application Programming Interfaces (web APIs): a predefined set of HTTP requests/responses for querying data hosted on the server (or for providing data from the client side).
- Extensible Markup Language (XML) and JavaScript Object Notation (JSON): standards/syntaxes for formatting data so that they are both human- and machine-readable, thereby facilitating the exchange of data between different systems/applications over the web.
- JavaScript/AJAX: a programming language and a framework designed to build interactive/dynamic websites. In the AJAX framework, a JavaScript program built into an HTML document/website could, for example, be triggered by the user clicking a button while visiting the website with a browser. This program might then automatically request additional data (in XML format) from the server via an API and dynamically embed the new data in the HTML document on the client side.

In the context of web mining, these technologies mean that automated data collection from the web can become either substantially easier or substantially more difficult than automated data collection from the old web. Which of the two is the case essentially depends on whether a dynamic website relies on an API and, if so, whether this API is publicly accessible (and, ideally, free of charge).
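To make the two data formats concrete, here is one and the same record (a book entry, using the Munzert et al. reference from this course as sample content) represented first in XML and then in JSON:

```xml
<book>
  <title>Automated Data Collection with R</title>
  <year>2014</year>
  <authors>
    <author>Munzert</author>
    <author>Rubba</author>
  </authors>
</book>
```

```json
{
  "title": "Automated Data Collection with R",
  "year": 2014,
  "authors": ["Munzert", "Rubba"]
}
```

Both representations carry the same information; JSON tends to be more compact, while XML marks up each piece of data with explicit opening and closing tags.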
If the latter is the case, our web mining task is reduced to (a) understanding how the specific API works (which is usually very easy, since open APIs tend to come with detailed and user-friendly documentation) and (b) knowing how to extract the data of interest from the returned XML or JSON documents (which is usually substantially easier than scraping data from HTML pages). In addition, there might already be a so-called API client or wrapper implemented in an R package that does all of this for us, such as the twitteR package for collecting data from one of Twitter's APIs. If such a package is available, we only have to learn how to apply it to systematically collect data from the API (as shown in the previous lecture).

In cases where no API is available, the task of automated data collection from a dynamic website/web application can become much more complex, because not all the data is provided when we issue a simple GET request for a URL pointing to a specific webpage. This manuscript covers some of the frequently encountered aspects of, and difficulties in, web mining from such dynamic sites. The important take-away message, however, is that unlike with the old web, there is no single generic approach on which we can build when writing a scraper. The techniques necessary to scrape dynamic websites are much more case-specific and might even require substantial knowledge of JavaScript and other web technologies. Covering all the techniques for automated data collection from dynamic websites would thus go far beyond the scope of this course. However, there is also an alternative approach to dealing with such websites that can be employed rather generally: instead of writing a program that decomposes the website to scrape the data from it, we can rely on a framework that allows us to programmatically control an actual web browser and thus simulate a human using a browser (including scrolling, clicking, etc.). That is, we use so-called web browser automation instead of simple scrapers.
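As a minimal sketch of the API route described above: the endpoint URL and its q parameter below are invented for illustration (real APIs document their own base URLs and parameters), but the httr and jsonlite calls show the generic pattern of steps (a) and (b).

```r
library(httr)
library(jsonlite)

# (a) understand how the API works: here, a hypothetical endpoint
# that takes a search parameter 'q' and returns JSON
resp <- GET("https://api.example.com/v1/books", query = list(q = "economics"))

# raise an R error if the HTTP request was not successful
stop_for_status(resp)

# (b) extract the data of interest from the returned JSON document:
# parse the response body into R objects (lists/data frames)
results <- fromJSON(content(resp, as = "text"))
str(results)
```

With a real open API, the documentation tells us which parameters `query` may contain; everything else in the pattern stays the same.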
2 Scraping Dynamic Websites

The first step in dealing with a dynamic website in a web mining task is to figure out where the data we see in the browser is actually coming from. This is where the Developer Tools provided with modern browsers such as Chrome or Firefox come into play. The Network panel, in combination with the source-code inspector, helps to evaluate which web technologies are used to make the dynamic website work. From there, we can investigate how the data can be accessed programmatically. For example, we might detect that a JavaScript program embedded in the webpage queries additional data from the server whenever we scroll down in the browser, and that all this additional data is transmitted in XML (before being embedded in the HTML). We can then figure out how exactly the data is queried from the server (e.g., how to build a query URL) in order to automate the extraction of the data directly. The question then becomes how we can implement all this in R. In short, the following three steps can get us started:

1. Which web technologies are used?
2. Given a set of web technologies, how can we theoretically access the data?
3. How can we practically collect the data with R?

This section gives insights into some of the web technologies frequently encountered when scraping data from dynamic websites and shows how to deal with them in R. As pointed out above, this is not a complete treatise of all techniques relevant to scraping dynamic websites; the techniques discussed here might thus not be relevant or sufficient in other cases.

2.1 Cookies

HTTP cookies are small pieces of data that help the server recognize a client. Cookies are stored locally on the client side (by the web browser) when the server delivers a website with cookies. During further interaction with the same website, the browser sends the cookie along with its other requests to the server. Figure 1 illustrates this point.

Figure 1: Illustration of HTTP cookie exchange. Source: cookie_exchange.svg.

Dynamic websites typically come with cookies. By identifying the user and her actions with the help of cookies, the server can keep track of what the user is doing and accordingly generate the dynamic parts of the
website. A typical example of this are web shops, where we might navigate through several pages, adding different items to the shopping cart. Once we click on the shopping-cart symbol, a new webpage is created dynamically, showing us the cart's content. Obviously, if another user simultaneously visited the website and added other items to her cart, she would see a different page when clicking on the shopping cart. Similarly, if we visited a web shop with our browser, added some items to the shopping cart, had a look at the cart, and then tried to scrape its content via R by copy/pasting the URL of the cart's webpage, the result would likely be inconsistent with what we see in the browser. The reason is that the usual web-scraping techniques covered in previous weeks do not automatically take cookies into account. That is, if we want to scrape a webpage that is dynamically generated based on cookies, we have to make sure that R sends the cookies along with the URL that points to the server-side script generating the page (as a web browser would do automatically in such a case).

In the following code example we explore how we can work with cookies in R.[1] The code example implements a scraper that selects items (books) in a web shop, adds them to the shopping cart, and scrapes the webpage representing the content of the shopping cart. The example builds on the previously used R packages rvest and httr. From inspecting the website, we note how URLs to search for books are built.[2]
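Before turning to the web-shop example, the cookie round trip from Figure 1 can be sketched directly in R. The sketch below uses httpbin.org, a public request-echo service, as a stand-in server; httr keeps one handle per domain, so a cookie set by the server is resent automatically on subsequent requests to the same domain.

```r
library(httr)

# first request: the server instructs the client to store a cookie
resp <- GET("https://httpbin.org/cookies/set?sessionid=abc123")

# inspect the cookies httr has stored for this domain
cookies(resp)

# a subsequent request to the same domain sends the cookie along
# automatically, because httr reuses the same handle for the domain
resp2 <- GET("https://httpbin.org/cookies")
content(resp2)
```

This automatic reuse within one R session is exactly what the session objects in the following example rely on.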
By inspecting the source code of the website we further learn that the dynamic generation of the webpage presenting the shopping cart's content is triggered by sending a GET request with the cart's URL.

```r
########################################
# Introduction to Web Mining 2017
# 7: Programmable Web II
#
# Book Shopping with R
# U.Matter, November 2017
########################################

# PREAMBLE
# load packages
library(rvest)
library(httr)

# set fixed variables
SEARCH_URL <- "..."  # search-query URL (not shown here)
CART_URL <- "..."    # shopping-cart URL (not shown here)
```

We first initiate a browser session with rvest's html_session() function. The returned R object not only contains the HTML document sent from the server but also information from the HTTP header, including cookies, which we can inspect with cookies().

```r
# INITIATE SESSION
# visit the page (start a session)
shopping_session <- html_session(SEARCH_URL)

# have a look at the cookies
cookies(shopping_session)[, 1:5]
```

```
                 domain flag path secure expiration
1 #HttpOnly_.biblio.com TRUE    /  FALSE     :16:08
2 #HttpOnly_.biblio.com TRUE    /  FALSE     :16:08
```

[1] This example is based on a similar code example in Munzert et al. (2014, 248). The original example code is based on other R packages.
[2] The base URL for search queries is ... with some search parameters and values (e.g., keyisbn=economics).
From inspecting the source code of the webpage we know that items are added to the shopping cart by means of an HTML form. We thus extract the parts of the search results containing these forms.

```r
# look at the HTML forms used to add items to the cart
form_nodes <- html_nodes(shopping_session, xpath = "//form[@class='ob-add-form ']")

# inspect the extracted forms
form_nodes[1:2]
```

```
{xml_nodeset (2)}
[1] <form action="..." method="get" class="ob-add...
[2] <form action="..." method="get" class="...
```

From this we learn that when one of these forms is submitted, it actually submits a book id. Thus, if we want to add an item to the shopping cart via R, we need to submit such a form with a book-id number set as the bid value. Therefore, we (a) store the structure of these forms in an R object (via html_form()) and (b) extract all the book ids from the search results.

```r
# SUBMIT FORMS
# extract one of the forms
form <- html_form(form_nodes[[1]])

# extract the book ids
bid_nodes <- html_nodes(shopping_session, xpath = "//input[@name='bid']/@value")
bids <- html_text(bid_nodes)
```

The form template and the ids are sufficient to programmatically fill the shopping cart. We do this by iterating through all bids, setting the bid value to the respective value (with set_values()), and then submitting the form (via submit_form()). Importantly, we submit these forms within the same session, meaning submit_form() will make sure that the relevant cookies of this session are sent along.

```r
# add the books to the shopping cart
for (i in bids) {
  form_i <- set_values(form, bid = i)
  submit_form(shopping_session, form_i)
}
```

Finally, we scrape the content of the shopping cart. Note that instead of simply requesting the page CART_URL points to, we use jump_to() with the already established shopping_session. This ensures that the GET request is issued with the cookies of this session.[3]
```r
# open the shopping cart
cart <- jump_to(shopping_session, CART_URL)

# parse the content
cart_content <- read_html(cart)

# extract the book titles in the cart
books_in_cart <- html_nodes(cart_content, xpath = "//div[@class='']/h3")
cat(html_text(books_in_cart)[2])
```

```
Managerial Economics & Business Strategy (McGraw-Hill Economics) by Baye, Michael; Prince, Jeff
```

[3] There are several other ways of achieving the same in R (e.g., by manually setting the cookies). However, the functionality provided in the rvest package, as shown here, is more user-friendly.

It is straightforward to show that sending along the right cookies, by using jump_to() within the same session in which we added the items to the cart, is actually crucial. To demonstrate this, we simply start a new session and try the same as above, this time accessing the cart with the new session:
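The "manual" alternative mentioned in footnote 3 could look roughly as follows. This is a sketch under the assumption that copying the session's cookie name/value pairs into a plain httr request is sufficient (i.e., that the server requires no other session state):

```r
library(httr)

# read the name/value pairs of the cookies stored in the session
session_cookies <- cookies(shopping_session)
cookie_values <- setNames(session_cookies$value, session_cookies$name)

# issue a plain GET request, manually attaching those cookies
cart_resp <- GET(CART_URL, do.call(set_cookies, as.list(cookie_values)))
```

The session-based approach shown above achieves the same with less bookkeeping, which is why it is preferred here.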
```r
# initiate a new session
new_shopping_session <- html_session(SEARCH_URL)

# open the shopping cart
cart <- jump_to(new_shopping_session, CART_URL)

# parse the content
cart_content <- read_html(cart)

# extract the book titles in the cart
books_in_cart <- html_nodes(cart_content, xpath = "//div[@class='']/h3")
cat(html_text(books_in_cart))
```

In the new session (with new cookies) the shopping cart is still empty. Note that we use exactly the same URL; the only difference is that we submit the new cookies with the GET request (issued by jump_to()). The server therefore (correctly) recognizes that the session related to these new cookies did not involve any items being added to the shopping cart by the client.

2.2 AJAX and XHR

AJAX (Asynchronous JavaScript And XML) is a set of web technologies often employed to design dynamic webpages. The main purpose of AJAX is to allow the asynchronous ("under the hood") exchange of data between client and server when a webpage is already loaded. This means parts of a webpage can be changed/updated without actually reloading the entire page (as illustrated in Figure 2).

Figure 2: Illustration of AJAX. Source: w3schools.com.

What w3schools.com calls "a developer's dream" is a web scraper's nightmare. The content of a webpage designed with AJAX cannot be downloaded by simply requesting an HTML document via a URL. Additional data will be embedded in the page as the user scrolls through it in the browser. Thus, what we see in the browser is not what we get when simply requesting the same webpage with httr. In order to access these additional bits of data automatically via R, we have to mimic the specific HTTP transactions between the browser and the server related to the loading of the additional data. These transactions (as illustrated in Figure 2) are usually implemented with a so-called XMLHttpRequest (XHR) object.
If we want to take control of the data exchange between client and server in the context of a dynamic website based on AJAX, figuring out how XHR works on this website is a good starting point. The following code example illustrates how control of XHR via R can be implemented in the case of
the investment research website morningstar.com.[4] The goal is to scrape the monthly total returns of a specific exchange-traded fund (ETF).

```r
########################################
# Introduction to Web Mining 2017
# 7: Programmable Web II
#
# Morningstar scraper
# U.Matter, November 2017
########################################

# PREAMBLE
# load packages
library(httr)
library(xml2)
library(rvest)

# 'TRADITIONAL' APPROACH
# fetch the webpage
URL <- "..."  # URL of the ETF's performance page (not shown here)
http_resp <- GET(URL)

# parse HTML
html_doc <- read_html(http_resp)

# extract the respective section according to the XPath expression
# found by inspecting the page in the browser with the Developer Tools
xpath <- "/html/body/div[3]/div[1]/div[2]/div[2]/div[17]/div/table/tbody[1]"
returns_nodes <- html_nodes(html_doc, xpath = xpath)
returns <- html_table(returns_nodes)
```

This approach does not seem to be successful: we don't get what we see in the browser. When tracing the origin of the problem, it becomes apparent that the last div tag, which should contain the table with the returns, is empty:

```r
html_nodes(html_doc, xpath = "/html/body/div[3]/div[1]/div[2]/div[2]/div[17]/div")
```

```
{xml_nodeset (1)}
[1] <div id="div_monthly_returns">\n\t\t\t\t\t\t</div>
```

By inspecting the network traffic with the Developer Tools' Network panel, we notice traffic related to XHR. When having a closer look at these entries (via the Response panel), we identify a GET request with a URL pointing to historical-returns.action?... which returns exactly the data we were looking for in the webpage. A simple way to scrape the data based on this information would be to copy/paste this URL and rewrite the code chunk above accordingly. However, a more elegant way is to specify the GET request based on the information related to the XHR object provided in the Developer Tools (left click on the respective entry -> Copy -> Copy URL Parameters) in the form of query parameters and use this information to define a query as part of the GET request.

[4] The code for this example is partly taken from web-scraping-xhr-dynamic-pages-with-rvest-and-r.
```r
# mimic the XHR GET request implemented in the morningstar.com website
URL <- "..."  # base URL of the historical-returns.action endpoint (not shown here)
http_resp <- GET(url = URL,
                 query = list(t = "arcx:spy",
                              region = "usa",
                              culture = "en-us",
                              ops = "clear",
                              s = "0p00001mk8",
                              y = "5",
                              ndec = "2",
                              ep = "true",
                              freq = "m",
                              annlz = "true",
                              comparisonremove = "false"))

# parse the returned HTML
html_doc <- read_html(http_resp)

# extract the table with the data (no XPath needed, because only the table is returned!)
html_table(html_doc)[[1]][1:5, 1:4]
```

```
         SPY (Price) SPY (NAV) S&P 500 TR USD (Price)
January          ...       ...                    ...
December         ...       ...                    ...
November         ...       ...                    ...
```

3 Browser Automation with RSelenium

As an alternative to analyzing and exploiting the underlying mechanisms that control the exchange and embedding of data in a dynamic website, browser automation tackles web mining tasks at a higher level. Browser automation frameworks allow us to programmatically control a web browser and thereby simulate a user browsing webpages. While most browser automation tools were originally developed to let web developers test the functioning of new web applications (by simulating many different user behaviors), they are naturally also helpful for automated data extraction from web sources, particularly if the content of a website is generated dynamically. A widely used browser automation framework is Selenium. The R package RSelenium is built on top of this framework, which means we can run and control browser automation via Selenium directly from within R. The following code examples give a brief introduction to the basics of using RSelenium for the scraping of dynamic webpages.[5]

3.1 Installation and Setup

For basic usage of Selenium via R, the RSelenium package is all that is needed to get started (when running install.packages("RSelenium"), the necessary dependencies will be installed automatically). For advanced applications based on Selenium, additional packages or a manual download and installation of Selenium might be required. The code examples below focus exclusively on the former, simpler case.

Running RSelenium on your computer means running both a Selenium server and a Selenium client locally.
The server runs the automated browser, and the client (here, R) tells it what to do. Whenever we use RSelenium, we thus first have to start the Selenium server with rsDriver() and assign the returned R object to a variable.[6] This R object represents the Selenium server in the R environment. Once the server runs, we initiate a new Selenium client by assigning it to a new variable: myclient <- rd$client. We then control our robot browser through myclient.

```r
# install.packages("RSelenium")
library(RSelenium)

# start the Selenium server
rd <- rsDriver(verbose = FALSE)

# assign the client to a new variable
myclient <- rd$client
```

Note the browser window opening automatically when the client is initiated (by default Chrome, but other browsers can also be employed for this task). All instructions given to direct the browser from within R are directly observable in the automated browser window (see Figure 3).

Figure 3: Automated browser controlled by Selenium.

3.2 First Steps with RSelenium

All methods (functions associated with an R object) can be called directly on myclient. This includes all kinds of instructions to guide the automated browser, as well as methods for accessing the content of the page that is currently open in the automated browser. For example, we can navigate the browser to a specific webpage with navigate() and then extract the title of this page (i.e., the text between the <title> tags) with getTitle().

```r
# start browsing
myclient$navigate("...")  # URL not shown here
myclient$getTitle()
```

```
[[1]]
[1] "R: The R Project for Statistical Computing"
```

Navigating the automated browser in this manner directly mirrors how we navigate a browser through the usual graphical user interface. Thus, if we want to visit a number of pages, we tell it step by step to navigate from page to page, including going back to a previously visited page (with goBack()).

[5] The code example is partly based on the RSelenium vignette on CRAN. For a detailed introduction and instructions on how to set up Selenium and RSelenium on your machine, see the RSelenium vignette on CRAN.
[6] The option verbose = FALSE simply suppresses any status messages the initiation of the server might issue.
```r
# simulate a user browsing the web
myclient$navigate("...")  # first URL (not shown here)
myclient$navigate("...")  # second URL (not shown here)
myclient$getCurrentUrl()
myclient$goBack()
myclient$getCurrentUrl()
```

Once a webpage is loaded, specific elements of it can be extracted by means of XPath or CSS selectors. With RSelenium, however, the ability to access specific parts of a webpage is used not only to extract data but also to control the dynamic features of a webpage, for example to automatically control Google's search function and extract the respective search results. Such a task is typically rather difficult to implement with the more traditional web mining techniques, because the webpage presenting the search results is generated dynamically and there is thus no unique URL where the page is constantly available. In addition, we would have to figure out how the search queries are actually sent to a Google server. With RSelenium, we can navigate to the search bar of Google's homepage (here by selecting the input tag with XPath), type in a search term, and hit enter to trigger the search.

```r
# automate a google search
# navigate to the search form
webelem <- myclient$findElement('xpath', "//input[@name='q']")

# type something into the search bar
webelem$sendKeysToElement(list("r Cran"))

# type a search term and hit enter
webelem$sendKeysToElement(list("r Cran", key = "enter"))
```

By default, Google opens a newly generated webpage presenting the search results in the same browser window. Thus, the search result is now automatically available in myclient, and we can access the source code with the getPageSource() method. To process the source code, we do not have to rely on RSelenium's internal methods but can also use the already familiar tools in rvest.

```r
# scrape the results
# parse the entire page and take it from there...
# for example, extract all the links
html_doc <- read_html(myclient$getPageSource()[[1]])
link_nodes <- html_nodes(html_doc, xpath = "//a")
html_text(html_nodes(link_nodes, xpath = "@href"))[1]
```

This approach might actually be quite efficient compared to using RSelenium's internal methods. However, we can also use those methods to achieve practically the same.[7]

```r
# or extract specific elements via RSelenium
# for example, extract all the links
links <- myclient$findElements("xpath", "//a")
unlist(sapply(links, function(x){x$getElementAttribute("href")}))[1]
```

[7] Note that RSelenium and rvest rely on different XPath engines, meaning that an XPath expression might work in the functions of one package but not in the other.
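Since dynamic pages often load additional content only on user actions such as scrolling, two further RSelenium facilities are worth knowing: executing JavaScript in the automated browser and letting the browser wait for elements to appear. A brief sketch (the timings below are arbitrary and would need tuning per site):

```r
# let the browser wait up to 5 seconds when locating elements
myclient$setImplicitWaitTimeout(milliseconds = 5000)

# scroll to the bottom of the page to trigger loading of additional content
myclient$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())

# give the embedded JavaScript some time to fetch and embed the new data
Sys.sleep(2)

# then parse the now-complete page as before
html_doc <- read_html(myclient$getPageSource()[[1]])
```

Repeating the scroll-and-wait step in a loop mimics a user paging through an "infinite scroll" site until no new content appears.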
At the end of a data mining task with RSelenium, we stop/close the Selenium client and server as follows.

```r
# close client/server
myclient$close()
# rd$server$stop()
```

In practice, RSelenium can be very helpful when extracting data from dynamic websites, as the procedure guarantees that we get exactly what we would get by manually using a browser to extract the data. We thus do not need to worry about cookies, AJAX, XHR, and the like, as long as the browser we are automating with Selenium deals with these technologies appropriately. On the downside, scraping webpages with RSelenium is usually less efficient and slower than a more direct approach with httr/rvest.[8] Given the example code above, RSelenium can seamlessly be integrated into the generic web-scraper blueprint used in previous lectures: we simply implement the first component (interaction with the web server, parsing of HTML) with RSelenium and the rest of the scraper with rvest et al.

[8] Note that scraping tasks based on Selenium can be sped up by using several clients in parallel. However, the point about computational efficiency still holds.

References

Munzert, S., C. Rubba, P. Meißner, and D. Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, UK: Wiley.
More informationEmbracing HTML5 CSS </> JS javascript AJAX. A Piece of the Document Viewing Puzzle
Embracing HTML5 AJAX CSS JS javascript A Piece of the Document Viewing Puzzle Embracing HTML5: A Piece of the Document Viewing Puzzle For businesses and organizations across the globe, being able to
More informationChrome if I want to. What that should do, is have my specifications run against four different instances of Chrome, in parallel.
Hi. I'm Prateek Baheti. I'm a developer at ThoughtWorks. I'm currently the tech lead on Mingle, which is a project management tool that ThoughtWorks builds. I work in Balor, which is where India's best
More informationElectric Paoge. Browser Scripting with imacros in Illuminate
Electric Paoge Browser Scripting with imacros in Illuminate Browser Scripting with imacros in Illuminate Welcome Find the latest version of this presentation, plus related materials, at https://goo.gl/d72sdv.
More informationHarvesting Data on the Web
Harvesting Data on the Web Using R and Chrome Taekyung Kim Business Department The University of Suwon PhD, Assistant Professor kimtk@suwon.ac.kr 2015 년 R R Project for Statistical Computing General and
More information20480C: Programming in HTML5 with JavaScript and CSS3. Course Code: 20480C; Duration: 5 days; Instructor-led. JavaScript code.
20480C: Programming in HTML5 with JavaScript and CSS3 Course Code: 20480C; Duration: 5 days; Instructor-led WHAT YOU WILL LEARN This course provides an introduction to HTML5, CSS3, and JavaScript. This
More informationA Technical Perspective: Proxy-Based Website Translation. Discover how the proxy approach eliminates complexity and costs for you and your team.
A Technical Perspective: Proxy-Based Website Translation Discover how the proxy approach eliminates complexity and costs for you and your team. Introduction As your company expands into new global markets,
More informationPackage gcite. R topics documented: February 2, Type Package Title Google Citation Parser Version Date Author John Muschelli
Type Package Title Google Citation Parser Version 0.9.2 Date 2018-02-01 Author John Muschelli Package gcite February 2, 2018 Maintainer John Muschelli Scrapes Google Citation pages
More informationHomework 8: Ajax, JSON and Responsive Design Travel and Entertainment Search (Bootstrap/Angular/AJAX/JSON/jQuery /Cloud Exercise)
Homework 8: Ajax, JSON and Responsive Design Travel and Entertainment Search (Bootstrap/Angular/AJAX/JSON/jQuery /Cloud Exercise) 1. Objectives Get familiar with the AJAX and JSON technologies Use a combination
More informationWeb scraping tools, a real life application
Web scraping tools, a real life application ESTP course on Automated collection of online proces: sources, tools and methodological aspects Guido van den Heuvel, Dick Windmeijer, Olav ten Bosch, Statistics
More informationLanguages in WEB. E-Business Technologies. Summer Semester Submitted to. Prof. Dr. Eduard Heindl. Prepared by
Languages in WEB E-Business Technologies Summer Semester 2009 Submitted to Prof. Dr. Eduard Heindl Prepared by Jenisha Kshatriya (Mat no. 232521) Fakultät Wirtschaftsinformatik Hochshule Furtwangen University
More informationUsing Smart Tools to Write Good Code
B Using Smart Tools to Write Good Code All software development methodologies, with no exception, do include at least one stage of testing of the code. This is because the code most programmers write,
More informationWorking with Javascript Building Responsive Library apps
Working with Javascript Building Responsive Library apps Computers in Libraries April 15, 2010 Arlington, VA Jason Clark Head of Digital Access & Web Services Montana State University Libraries Overview
More informationScreen Scraping. Screen Scraping Defintions ( Web Scraping (
Screen Scraping Screen Scraping Defintions (http://www.wikipedia.org/) Originally, it referred to the practice of reading text data from a computer display terminal's screen. This was generally done by
More informationUser Interaction: jquery
User Interaction: jquery Assoc. Professor Donald J. Patterson INF 133 Fall 2012 1 jquery A JavaScript Library Cross-browser Free (beer & speech) It supports manipulating HTML elements (DOM) animations
More informationPackage rvest. R topics documented: February 20, Version Title Easily Harvest (Scrape) Web Pages
Version 0.2.0 Title Easily Harvest (Scrape) Web Pages Package rvest February 20, 2015 Wrappers around the XML and httr packages to make it easy to download, then manipulate, both html and ml. Depends R
More informationNetworking & The Web. HCID 520 User Interface Software & Technology
Networking & The HCID 520 User Interface Software & Technology Uniform Resource Locator (URL) http://info.cern.ch:80/ 1991 HTTP v0.9 Uniform Resource Locator (URL) http://info.cern.ch:80/ Scheme/Protocol
More informationAJAX: Introduction CISC 282 November 27, 2018
AJAX: Introduction CISC 282 November 27, 2018 Synchronous Communication User and server take turns waiting User requests pages while browsing Waits for server to respond Waits for the page to load in the
More informationBiocomputing II Coursework guidance
Biocomputing II Coursework guidance I refer to the database layer as DB, the middle (business logic) layer as BL and the front end graphical interface with CGI scripts as (FE). Standardized file headers
More informationOutline. AJAX for Libraries. Jason A. Clark Head of Digital Access and Web Services Montana State University Libraries
AJAX for Libraries Jason A. Clark Head of Digital Access and Web Services Montana State University Libraries Karen A. Coombs Head of Web Services University of Houston Libraries Outline 1. What you re
More informationDetects Potential Problems. Customizable Data Columns. Support for International Characters
Home Buy Download Support Company Blog Features Home Features HttpWatch Home Overview Features Compare Editions New in Version 9.x Awards and Reviews Download Pricing Our Customers Who is using it? What
More informationAJAX and JSON. Day 8
AJAX and JSON Day 8 Overview HTTP as a data exchange protocol Components of AJAX JSON and XML XMLHttpRequest Object Updating the HTML document References Duckett, chapter 8 http://www.w3schools.com/ajax/default.asp
More informationBasic Internet Skills
The Internet might seem intimidating at first - a vast global communications network with billions of webpages. But in this lesson, we simplify and explain the basics about the Internet using a conversational
More informationScraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms
Chapter 9 Scraping Sites that Don t Want to be Scraped/ Scraping Sites that Use Search Forms Skills you will learn: Basic setup of the Selenium library, which allows you to control a web browser from a
More informationIntroduction. A Brief Description of Our Journey
Introduction If you still write RPG code as you did 20 years ago, or if you have ILE RPG on your resume but don t actually use or understand it, this book is for you. It will help you transition from the
More informationPROJECT REPORT. TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C
PROJECT REPORT TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C00161361 Table of Contents 1. Introduction... 1 1.1. Purpose and Content... 1 1.2. Project Brief... 1 2. Description of Submitted
More informationHTML5 - INTERVIEW QUESTIONS
HTML5 - INTERVIEW QUESTIONS http://www.tutorialspoint.com/html5/html5_interview_questions.htm Copyright tutorialspoint.com Dear readers, these HTML5 Interview Questions have been designed specially to
More information1 Introduction. 2 Web Architecture
1 Introduction This document serves two purposes. The first section provides a high level overview of how the different pieces of technology in web applications relate to each other, and how they relate
More information3. WWW and HTTP. Fig.3.1 Architecture of WWW
3. WWW and HTTP The World Wide Web (WWW) is a repository of information linked together from points all over the world. The WWW has a unique combination of flexibility, portability, and user-friendly features
More informationWeb Scraping and APIs
Web Scraping and APIs http://datascience.tntlab.org Module 11 Today s Agenda A deeper, hands-on look at APIs A sneak-peak at server-side API code How to write API queries How to use R libraries to write
More informationDATABASE SYSTEMS. Database programming in a web environment. Database System Course, 2016
DATABASE SYSTEMS Database programming in a web environment Database System Course, 2016 AGENDA FOR TODAY Advanced Mysql More than just SELECT Creating tables MySQL optimizations: Storage engines, indexing.
More informationSolution of Exercise Sheet 5
Foundations of Cybersecurity (Winter 16/17) Prof. Dr. Michael Backes CISPA / Saarland University saarland university computer science Solution of Exercise Sheet 5 1 SQL Injection Consider a website foo.com
More informationIT for Tourism Managers. Analytics
IT for Tourism Managers. Analytics 1 What We Are Talking About Today 1. Logfiles 2. Web Analytics 3. Ranking 4. Web Reputation 5. Privacy & Security 2 Calendar. December 15, 2015 Tuesday, Dec 9 Digital
More informationAJAX Programming Overview. Introduction. Overview
AJAX Programming Overview Introduction Overview In the world of Web programming, AJAX stands for Asynchronous JavaScript and XML, which is a technique for developing more efficient interactive Web applications.
More informationAJAX: Rich Internet Applications
AJAX: Rich Internet Applications Web Programming Uta Priss ZELL, Ostfalia University 2013 Web Programming AJAX Slide 1/27 Outline Rich Internet Applications AJAX AJAX example Conclusion More AJAX Search
More informationWeb basics: HTTP cookies
Web basics: HTTP cookies Myrto Arapinis School of Informatics University of Edinburgh November 20, 2017 1 / 32 How is state managed in HTTP sessions HTTP is stateless: when a client sends a request, the
More informationBrowser behavior can be quite complex, using more HTTP features than the basic exchange, this trace will show us how much gets transferred.
Lab Exercise HTTP Objective HTTP (HyperText Transfer Protocol) is the main protocol underlying the Web. HTTP functions as a request response protocol in the client server computing model. A web browser,
More informationTIME SCHEDULE MODULE TOPICS PERIODS. HTML Document Object Model (DOM) and javascript Object Notation (JSON)
COURSE TITLE : ADVANCED WEB DESIGN COURSE CODE : 5262 COURSE CATEGORY : A PERIODS/WEEK : 4 PERIODS/SEMESTER : 52 CREDITS : 4 TIME SCHEDULE MODULE TOPICS PERIODS 1 HTML Document Object Model (DOM) and javascript
More informationDevelop Mobile Front Ends Using Mobile Application Framework A - 2
Develop Mobile Front Ends Using Mobile Application Framework A - 2 Develop Mobile Front Ends Using Mobile Application Framework A - 3 Develop Mobile Front Ends Using Mobile Application Framework A - 4
More informationCUSTOMER PORTAL. Custom HTML splashpage Guide
CUSTOMER PORTAL Custom HTML splashpage Guide 1 CUSTOM HTML Custom HTML splash page templates are intended for users who have a good knowledge of HTML, CSS and JavaScript and want to create a splash page
More informationThis document is for informational purposes only. PowerMapper Software makes no warranties, express or implied in this document.
OnDemand User Manual Enterprise User Manual... 1 Overview... 2 Introduction to SortSite... 2 How SortSite Works... 2 Checkpoints... 3 Errors... 3 Spell Checker... 3 Accessibility... 3 Browser Compatibility...
More informationAJAX: The Basics CISC 282 November 22, 2017
AJAX: The Basics CISC 282 November 22, 2017 Synchronous Communication User and server take turns waiting User requests pages while browsing Waits for server to respond Waits for the page to load in the
More informationKonaKart Shopping Widgets. 3rd January DS Data Systems (UK) Ltd., 9 Little Meadow Loughton, Milton Keynes Bucks MK5 8EH UK
KonaKart Shopping Widgets 3rd January 2018 DS Data Systems (UK) Ltd., 9 Little Meadow Loughton, Milton Keynes Bucks MK5 8EH UK Introduction KonaKart ( www.konakart.com ) is a Java based ecommerce platform
More informationSite Audit Boeing
Site Audit 217 Boeing Site Audit: Issues Total Score Crawled Pages 48 % 13533 Healthy (3181) Broken (231) Have issues (9271) Redirected (812) Errors Warnings Notices 15266 41538 38 2k 5k 4 k 11 Jan k 11
More informationAnalytics, Insights, Cookies, and the Disappearing Privacy
Analytics, Insights, Cookies, and the Disappearing Privacy What Are We Talking About Today? 1. Logfiles 2. Analytics 3. Google Analytics 4. Insights 5. Cookies 6. Privacy 7. Security slide 2 Logfiles Every
More informationIntroduction to XML. Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University
Introduction to XML Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University http://gear.kku.ac.th/~krunapon/xmlws 1 Topics p What is XML? p Why XML? p Where does XML
More informationIntroduction to Web Scraping with Python
Introduction to Web Scraping with Python NaLette Brodnax The Institute for Quantitative Social Science Harvard University January 26, 2018 workshop structure 1 2 3 4 intro get the review scrape tools Python
More informationWeb Component JSON Response using AppInventor
Web Component JSON Response using AppInventor App Inventor has a component called Web which gives you the functionality to send and fetch data from a server or a website through GET and POST requests.
More informationEdge Side Includes (ESI) Overview
Edge Side Includes (ESI) Overview Abstract: Edge Side Includes (ESI) accelerates dynamic Web-based applications by defining a simple markup language to describe cacheable and non-cacheable Web page components
More informationAjax Enabled Web Application Model with Comet Programming
International Journal of Engineering and Technology Volume 2. 7, July, 2012 Ajax Enabled Web Application Model with Comet Programming Rajendra Kachhwaha 1, Priyadarshi Patni 2 1 Department of I.T., Faculty
More informationStatic Webpage Development
Dear Student, Based upon your enquiry we are pleased to send you the course curriculum for PHP Given below is the brief description for the course you are looking for: - Static Webpage Development Introduction
More informationA Guide to Liv-ex Software Development Kit (SDK)
A Guide to Liv-ex Software Development Kit (SDK) Document revision: 1.0 Date of Issue: 9 May 2018 Date of revision: Contents 1. Overview... 3 2. What you can do with the Liv-ex SDK... 3 3. The Liv-ex SDK
More informationAJAX: The Basics CISC 282 March 25, 2014
AJAX: The Basics CISC 282 March 25, 2014 Synchronous Communication User and server take turns waiting User requests pages while browsing Waits for server to respond Waits for the page to load in the browser
More informationThe Structure of the Web. Jim and Matthew
The Structure of the Web Jim and Matthew Workshop Structure 1. 2. 3. 4. 5. 6. 7. What is a browser? HTML CSS Javascript LUNCH Clients and Servers (creating a live website) Build your Own Website Workshop
More informationDevelopment of Web Applications
Development of Web Applications Principles and Practice Vincent Simonet, 2013-2014 Université Pierre et Marie Curie, Master Informatique, Spécialité STL 6 Practical Aspects Vincent Simonet, 2013-2014 Université
More informationIntroduction to XML 3/14/12. Introduction to XML
Introduction to XML Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University http://gear.kku.ac.th/~krunapon/xmlws 1 Topics p What is XML? p Why XML? p Where does XML
More informationSTARCOUNTER. Technical Overview
STARCOUNTER Technical Overview Summary 3 Introduction 4 Scope 5 Audience 5 Prerequisite Knowledge 5 Virtual Machine Database Management System 6 Weaver 7 Shared Memory 8 Atomicity 8 Consistency 9 Isolation
More informationGraphiq Reality. Product Requirement Document. By Team Graphiq Content. Vincent Duong Kevin Mai Navdeep Sandhu Vincent Tan Xinglun Xu Jiapei Yao
Graphiq Reality Product Requirement Document By Team Graphiq Content Vincent Duong Kevin Mai Navdeep Sandhu Vincent Tan Xinglun Xu Jiapei Yao Revision History 10/9/2015 Created PRD document and basic information.
More information6 WAYS Google s First Page
6 WAYS TO Google s First Page FREE EBOOK 2 CONTENTS 03 Intro 06 Search Engine Optimization 08 Search Engine Marketing 10 Start a Business Blog 12 Get Listed on Google Maps 15 Create Online Directory Listing
More informationCS 161 Computer Security
Paxson Spring 2017 CS 161 Computer Security Discussion 4 Week of February 13, 2017 Question 1 Clickjacking (5 min) Watch the following video: https://www.youtube.com/watch?v=sw8ch-m3n8m Question 2 Session
More informationBrief Intro to Firebug Sukwon Oh CSC309, Summer 2015
Brief Intro to Firebug Sukwon Oh soh@cs.toronto.edu CSC309, Summer 2015 Firebug at a glance One of the most popular web debugging tool with a colleccon of powerful tools to edit, debug and monitor HTML,
More informationArchitectural Engineering Senior Thesis CPEP Webpage Guidelines and Instructions
Architectural Engineering Senior Thesis CPEP Webpage Guidelines and Instructions Your Thesis Drive (T:\) Each student is allocated space on the Thesis drive. Any files on this drive are accessible from
More informationGroup 1. SAJAX: The Road to Secure and Efficient Applications. - Final Project Report -
Group 1 SAJAX: The Road to Secure and Efficient Applications - Final Project Report - Thu Do, Matt Henry, Peter Knolle, Ahmad Yasin George Mason University, 2006/07/15 SAJAX: The Road to Secure and Efficient
More informationQuick XPath Guide. Introduction. What is XPath? Nodes
Quick XPath Guide Introduction What is XPath? Nodes Expressions How Does XPath Traverse the Tree? Different ways of choosing XPaths Tools for finding XPath Firefox Portable Google Chrome Fire IE Selenium
More informationPrivacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras
Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 08 Tutorial 2, Part 2, Facebook API (Refer Slide Time: 00:12)
More information