Package 'scraep' July 3, 2018


Type Package
Title Scrape European Parliament Careers
Version 1.1
Date 2018-07-01
Author
Maintainer
Description A utility to webscrape the in-house careers of members of the European Parliament, from its website <http://www.europarl.europa.eu>.
License GPL (>= 3)
Imports XML, RCurl, data.table
NeedsCompilation no
Repository CRAN
Date/Publication 2018-07-03 15:30:03 UTC

R topics documented:
  scraep-package
  scraep
  wiki
  xscrape

scraep-package    Scrape the careers of all members of the European Parliament

Description

  Fetch all in-house career information of MEPs from the European Parliament's website, and put it in a data frame.

Details

  Package: scraep
  Type: Package
  Version: 1.1
  Date: 2018-07-01
  License: GPL (>= 3)

  Function scraep downloads and extracts career information from the EP's website. Function xscrape is a general tool to extract information from html pages into data frames, using XPath queries.

scraep    Extract all (in-house) career information of MEPs from the European Parliament's website

Description

  This function downloads all career information from the European Parliament's website <http://www.europarl.europa.eu> or from local html files, and extracts it into a data frame.

Usage

  scraep(local.html = NA, save.html = NA, max.try = 20)

Arguments

  local.html  a character string, indicating the local directory from which the MEPs' html files are imported. If NA (default), the career information is downloaded from the EP's website.

  save.html   a character string, indicating the local directory in which the downloaded html files should be saved (existing files will be overwritten). If NA (default), the downloaded files are not saved to the local filesystem.

  max.try     numeric: maximum number of tries to download html files from the EP's website.

  Downloading all the pages can take a long time; be sure to store the result in an object.

Value

  A data.frame, where each row is a career step recorded on the EP's website.
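A minimal usage sketch, following the arguments above. It is not run here, since it downloads many pages from the live EP website; the directory name "meps_html" is a hypothetical example and must exist beforehand.

```r
## Not run:
## Download all MEP career pages, keeping a local copy of the html files
## ("meps_html" is a hypothetical directory chosen for this example)
meps <- scraep(save.html = "meps_html", max.try = 20)

## Later, re-run the extraction from the saved files, without downloading
meps.local <- scraep(local.html = "meps_html")

## Each row of the result is one career step of one MEP
head(meps)
## End(Not run)
```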

wiki    Wikipedia page for R

Description

  Toy example to extract data using xscrape.

Usage

  data(wiki)

Format

  Object wiki is a character vector. Object wiki is a raw webpage, containing the source of the R (programming language) article on English wikipedia (<https://en.wikipedia.org/wiki/r_(programming_language)>, retrieved on 15/11/2017).

xscrape    Extract information from webpages to a data.frame, using XPath queries

Description

  This function transforms an html page (or list of pages) into a data.frame, extracting nodes specified by their XPath.

Usage

  xscrape(pages, col.xpath, row.xpath = "/html", collapse = " ", encoding = "UTF-8")

Arguments

  pages      an object of class XMLDocument (as returned by function htmlParse), or a list of such objects. Alternatively, a character vector containing the URLs of the webpages to be parsed. These are the webpages that information is to be extracted from.

  col.xpath  a character vector of XPath queries. Each element of this vector will be a column in the resulting data.frame. If the vector is named, these names are given to the result's columns.

  row.xpath  (optional) a character string, containing an XPath query. This functions as an intermediary node: if specified, each result of this XPath query (on each page) becomes a row in the resulting data.frame. If not specified (default), the intermediary nodes are whole html pages, so that each page becomes a row in the result.

  collapse   (optional) a character string, containing the separator that will be used in case a col.xpath query yields multiple results.

  encoding   (optional) a character string, containing the encoding parameter that will be used by htmlParse if pages is a vector of URLs.

Details

  If a col.xpath query designates a full node, only its text is extracted. If it designates an attribute (e.g. ends with /@href for weblinks), only the attribute's value is extracted. If a col.xpath query matches no elements in a page, the returned value is NA. If it matches multiple elements, they are concatenated into a single character string, separated by collapse.

Value

  A data.frame, where each row corresponds to an intermediary node (either a full page or an XML node within a page, as specified by row.xpath), and each column corresponds to the text of a col.xpath query.
Examples

  ## Extract all external links and their titles from a wikipedia page
  data(wiki)
  wiki.parse <- XML::htmlParse(wiki)
  links <- xscrape(wiki.parse,
                   row.xpath = "//a[starts-with(./@href, 'http')]",
                   col.xpath = c(title = ".", link = "./@href"))

  ## Not run:
  ## Convert results from a search for 'R' on duckduckgo.com
  ## First download the search page
  duck <- XML::htmlParse("http://duckduckgo.com/html/?q=R")
  ## Then run xscrape on the downloaded and parsed page

  results <- xscrape(duck,
                     row.xpath = "//div[contains(@class, 'result body')]",
                     col.xpath = c(title = "./h2",
                                   snippet = ".//*[@class='result snippet']",
                                   url = ".//a[@class='result url']/@href"))
  ## End(Not run)

  ## Not run:
  ## Convert results from a search for 'R' and 'Julia' on duckduckgo.com
  ## Directly provide the URLs to xscrape
  results <- xscrape(c("http://duckduckgo.com/html/?q=r",
                       "http://duckduckgo.com/html/?q=julia"),
                     row.xpath = "//div[contains(@class, 'result body')]",
                     col.xpath = c(title = "./h2",
                                   snippet = ".//*[@class='result snippet']",
                                   url = ".//a[@class='result url']/@href"))
  ## End(Not run)

Index

  Topic datasets:
    wiki
  Topic package:
    scraep-package

  scraep
  scraep-package
  wiki
  xscrape