
Package 'robotstxt'                                            November 12, 2017

Type Package
Date 2017-11-12
Title A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker
Version 0.5.2
Description Provides functions to download and parse 'robots.txt' files. Ultimately
     the package makes it easy to check if bots (spiders, crawlers, scrapers, ...)
     are allowed to access specific resources on a domain.
License MIT + file LICENSE
LazyData TRUE
URL https://github.com/ropenscilabs/robotstxt
BugReports https://github.com/ropenscilabs/robotstxt/issues
Depends R (>= 3.0.0)
Imports stringr (>= 1.0.0), httr (>= 1.0.0), spiderbar (>= 0.2.0), future (>= 1.6.2),
     magrittr, utils
Suggests knitr, rmarkdown, dplyr, testthat, covr
VignetteBuilder knitr
RoxygenNote 6.0.1
NeedsCompilation no
Author Peter Meissner [aut, cre], Oliver Keys [ctb], Rich Fitz John [ctb]
Maintainer Peter Meissner <retep.meissner@gmail.com>
Repository CRAN
Date/Publication 2017-11-12 17:45:33 UTC
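
Example (not part of the original manual): a minimal quick-start sketch. It assumes the released package is installed from CRAN and that the machine has network access; the queried domain is purely illustrative.

     # one-time setup: install the released version from CRAN
     install.packages("robotstxt")

     library(robotstxt)

     # ask whether a generic bot ("*") may fetch the root path of a domain
     # (illustrative domain; requires internet access)
     paths_allowed(paths = "/", domain = "example.com", bot = "*")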

R topics documented:

     get_robotstxt
     get_robotstxts
     get_robotstxt_http_get
     guess_domain
     is_valid_robotstxt
     parse_robotstxt
     paths_allowed
     paths_allowed_worker_robotstxt
     paths_allowed_worker_spiderbar
     print.robotstxt
     print.robotstxt_text
     remove_domain
     robotstxt
     rt_cache
     %>%
     Index

get_robotstxt              downloading robots.txt file

Description

     downloading robots.txt file

Usage

     get_robotstxt(domain, warn = TRUE, force = FALSE,
       user_agent = utils::sessionInfo()$R.version$version.string,
       ssl_verifypeer = c(1, 0))

Arguments

     domain           domain from which to download the robots.txt file
     warn             warn about being unable to download domain/robots.txt because
                      of an HTTP response status 404
     force            if TRUE, instead of using possibly cached results the function
                      will re-download the robots.txt file
     user_agent       HTTP user-agent string to be used to retrieve the robots.txt
                      file from the domain
     ssl_verifypeer   analog to the CURL option
                      https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html;
                      might help with robots.txt file retrieval in some cases
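
Example (not part of the original manual): a typical get_robotstxt() call; the domain is illustrative and the call requires internet access.

     library(robotstxt)

     # download the robots.txt file of a domain; the result is the raw file
     # content, which print.robotstxt_text renders when printed
     rt_text <- get_robotstxt(domain = "wikipedia.org", warn = TRUE)
     rt_text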

get_robotstxts             function to get multiple robotstxt files

Description

     function to get multiple robotstxt files

Usage

     get_robotstxts(domain, warn = TRUE, force = FALSE,
       user_agent = utils::sessionInfo()$R.version$version.string,
       ssl_verifypeer = c(1, 0), use_futures = FALSE)

Arguments

     domain           domain from which to download the robots.txt file
     warn             warn about being unable to download domain/robots.txt because
                      of an HTTP response status 404
     force            if TRUE, instead of using possibly cached results the function
                      will re-download the robots.txt file
     user_agent       HTTP user-agent string to be used to retrieve the robots.txt
                      file from the domain
     ssl_verifypeer   analog to the CURL option
                      https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html;
                      might help with robots.txt file retrieval in some cases
     use_futures      Should future::future_lapply be used for possible
                      parallel/asynchronous retrieval or not? Note: check out the
                      help pages and vignettes of the future package on how to set up
                      plans for future execution, because the robotstxt package does
                      not do it on its own.
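
Example (not part of the original manual): fetching robots.txt files for several domains at once. The domains are illustrative, network access is required, and setting up a future plan is assumed to be needed only if parallel retrieval via use_futures is wanted.

     library(robotstxt)

     # fetch robots.txt files for several domains in one call
     domains <- c("wikipedia.org", "r-project.org")
     txts    <- get_robotstxts(domain = domains, warn = TRUE)

     # optional: set up a future plan first, then retrieve in parallel
     # future::plan(future::multisession)
     # txts <- get_robotstxts(domain = domains, use_futures = TRUE)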

get_robotstxt_http_get     get_robotstxt() worker function to execute HTTP request

Description

     get_robotstxt() worker function to execute HTTP request

Usage

     get_robotstxt_http_get(domain,
       user_agent = utils::sessionInfo()$R.version$version.string,
       ssl_verifypeer = 1)

Arguments

     domain           domain from which to download the robots.txt file
     user_agent       HTTP user-agent string to be used to retrieve the robots.txt
                      file from the domain
     ssl_verifypeer   analog to the CURL option
                      https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html;
                      might help with robots.txt file retrieval in some cases

guess_domain               function guessing domain from path

Description

     function guessing domain from path

Usage

     guess_domain(x)

Arguments

     x                path aka URL from which to infer the domain

is_valid_robotstxt         function that checks if file is valid / parsable robots.txt

Description

     function that checks if a file is a valid / parsable robots.txt file

Usage

     is_valid_robotstxt(text)

Arguments

     text             content of a robots.txt file provided as character vector
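
Example (not part of the original manual): the two helpers above applied to an illustrative URL and a hand-written rule set.

     library(robotstxt)

     # infer the domain part of a full URL
     guess_domain("https://example.com/some/deep/path/index.html")

     # check whether a hand-written rule set parses as a valid robots.txt file
     rt_text <- c("User-agent: *", "Disallow: /private/")
     is_valid_robotstxt(rt_text)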

parse_robotstxt            function parsing robots.txt

Description

     function parsing robots.txt

Usage

     parse_robotstxt(txt)

Arguments

     txt              content of the robots.txt file

Value

     a named list with useragents, comments, permissions, sitemap

paths_allowed              check if a bot has permissions to access page(s)

Description

     wrapper to path_allowed

Usage

     paths_allowed(paths = "/", domain = "auto", bot = "*",
       user_agent = utils::sessionInfo()$R.version$version.string,
       check_method = c("robotstxt", "spiderbar"), warn = TRUE, force = FALSE,
       ssl_verifypeer = c(1, 0), use_futures = TRUE, robotstxt_list = NULL)

Arguments

     paths            paths for which to check the bot's permission, defaults to "/"
     domain           Domain for which paths should be checked. Defaults to "auto".
                      If set to "auto", the function will try to guess the domain by
                      parsing the paths argument. Note, however, that these are
                      educated guesses which might utterly fail. To be on the safe
                      side, provide appropriate domains manually.
     bot              name of the bot, defaults to "*"
     user_agent       HTTP user-agent string to be used to retrieve the robots.txt
                      file from the domain
     check_method     which method to use for checking, either "robotstxt" for the
                      package's own method or "spiderbar" for using
                      spiderbar::can_fetch; note that at the current state spiderbar
                      is considered less accurate: the spiderbar algorithm will only
                      take into consideration rules for * or a particular bot but
                      does not merge rules together (see:
                      paste0(system.file("robotstxts", package = "robotstxt"),
                      "/selfhtml_example.txt"))
     warn             warn about being unable to download domain/robots.txt because
                      of an HTTP response status 404
     force            if TRUE, instead of using possibly cached results the function
                      will re-download the robots.txt file
     ssl_verifypeer   analog to the CURL option
                      https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html;
                      might help with robots.txt file retrieval in some cases
     use_futures      Should future::future_lapply be used for possible
                      parallel/asynchronous retrieval or not? Note: check out the
                      help pages and vignettes of the future package on how to set up
                      plans for future execution, because the robotstxt package does
                      not do it on its own.
     robotstxt_list   either NULL (the default) or a list of character vectors with
                      one vector per path to check

See Also

     path_allowed

paths_allowed_worker_robotstxt
                           paths_allowed_worker for robotstxt flavor

Description

     paths_allowed_worker for robotstxt flavor

Usage

     paths_allowed_worker_robotstxt(domain, bot, paths, permissions_list)

Arguments

     domain           Domain for which paths should be checked. Defaults to "auto".
                      If set to "auto", the function will try to guess the domain by
                      parsing the paths argument. Note, however, that these are
                      educated guesses which might utterly fail. To be on the safe
                      side, provide appropriate domains manually.
     bot              name of the bot, defaults to "*"
     paths            paths for which to check the bot's permission, defaults to "/"
     permissions_list list of permissions from robotstxt objects
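
Example (not part of the original manual) for paths_allowed() documented above: the domain and paths are illustrative and the calls require internet access.

     library(robotstxt)

     # check whether a generic bot may access several paths on a domain
     paths_allowed(
       paths  = c("/", "/wiki/", "/w/index.php"),
       domain = "wikipedia.org",
       bot    = "*"
     )

     # the same kind of check using the spiderbar backend instead of the
     # package's own parser
     paths_allowed(
       paths        = c("/", "/wiki/"),
       domain       = "wikipedia.org",
       check_method = "spiderbar"
     )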

paths_allowed_worker_spiderbar
                           paths_allowed_worker spiderbar flavor

Description

     paths_allowed_worker spiderbar flavor

Usage

     paths_allowed_worker_spiderbar(domain, bot, paths, robotstxt_list)

Arguments

     domain           Domain for which paths should be checked. Defaults to "auto".
                      If set to "auto", the function will try to guess the domain by
                      parsing the paths argument. Note, however, that these are
                      educated guesses which might utterly fail. To be on the safe
                      side, provide appropriate domains manually.
     bot              name of the bot, defaults to "*"
     paths            paths for which to check the bot's permission, defaults to "/"
     robotstxt_list   either NULL (the default) or a list of character vectors with
                      one vector per path to check

print.robotstxt            printing robotstxt

Description

     printing robotstxt

Usage

     ## S3 method for class 'robotstxt'
     print(x, ...)

Arguments

     x                robotstxt instance to be printed
     ...              goes down the sink

print.robotstxt_text       printing robotstxt_text

Description

     printing robotstxt_text

Usage

     ## S3 method for class 'robotstxt_text'
     print(x, ...)

Arguments

     x                character vector aka robotstxt$text to be printed
     ...              goes down the sink

remove_domain              function to remove domain from path

Description

     function to remove domain from path

Usage

     remove_domain(x)

Arguments

     x                path aka URL from which to first infer the domain and then
                      remove it
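
Example (not part of the original manual): remove_domain() on an illustrative URL, as the counterpart to guess_domain() shown earlier.

     library(robotstxt)

     # strip the domain part from a full URL, leaving the path behind
     remove_domain("https://example.com/some/deep/path/index.html")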

robotstxt                  Generate a representation of a robots.txt file

Description

     The function generates a list that entails data resulting from parsing a
     robots.txt file as well as a function called check that enables one to ask the
     representation if a bot (or particular bots) is allowed to access a resource on
     the domain.

Usage

     robotstxt(domain = NULL, text = NULL, user_agent = NULL, warn = TRUE,
       force = FALSE)

Arguments

     domain           Domain for which to generate a representation. If text equals
                      NULL, the function will download the file from the server (the
                      default).
     text             If automatic download of the robots.txt is not preferred, the
                      text can be supplied directly.
     user_agent       HTTP user-agent string to be used to retrieve the robots.txt
                      file from the domain
     warn             warn about being unable to download domain/robots.txt because
                      of an HTTP response status 404
     force            if TRUE, instead of using possibly cached results the function
                      will re-download the robots.txt file

Value

     Object (list) of class robotstxt with parsed data from a robots.txt (domain,
     text, bots, permissions, host, sitemap, other) and one function (check()) to
     check resource permissions.

Fields

     domain           character vector holding the domain name for which the
                      robots.txt file is valid; will be set to NA if not supplied on
                      initialization
     text             character vector of the text of the robots.txt file; either
                      supplied on initialization or automatically downloaded from the
                      domain supplied on initialization
     bots             character vector of bot names mentioned in the robots.txt
     permissions      data.frame of bot permissions found in the robots.txt file
     host             data.frame of host fields found in the robots.txt file
     sitemap          data.frame of sitemap fields found in the robots.txt file
     other            data.frame of other (none of the above) fields found in the
                      robots.txt file
     check()          Method to check for bot permissions. Defaults to the domain's
                      root and no bot in particular. check() has two arguments: paths
                      and bot. The first is for supplying the paths for which to
                      check permissions and the latter to put in the name of the bot.

Examples

     ## Not run:
     rt <- robotstxt(domain = "google.com")
     rt$bots
     rt$permissions
     rt$check(paths = c("/", "forbidden"), bot = "*")
     ## End(Not run)

rt_cache                   get_robotstxt() cache

Description

     get_robotstxt() cache

Usage

     rt_cache

Format

     An object of class environment of length 0.

%>%                        re-export magrittr pipe operator

Description

     re-export magrittr pipe operator
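
Example (not part of the original manual): chaining the documented functions with the re-exported pipe. The domain is illustrative, network access is required, and it is assumed that the raw text returned by get_robotstxt() can be passed straight to parse_robotstxt().

     library(robotstxt)

     # download a robots.txt file and parse it into a named list
     "wikipedia.org" %>%
       get_robotstxt() %>%
       parse_robotstxt()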

Index

     * Topic datasets
           rt_cache

     %>%
     get_robotstxt
     get_robotstxt_http_get
     get_robotstxts
     guess_domain
     is_valid_robotstxt
     parse_robotstxt
     path_allowed
     paths_allowed
     paths_allowed_worker_robotstxt
     paths_allowed_worker_spiderbar
     print.robotstxt
     print.robotstxt_text
     remove_domain
     robotstxt
     rt_cache