
Package 'lexRankr'
December 13, 2017

Type Package
Title Extractive Summarization of Text with the LexRank Algorithm
Version 0.5.0
Author Adam Spannbauer [aut, cre], Bryan White [ctb]
Maintainer Adam Spannbauer <spannbaueradam@gmail.com>
Description An R implementation of the LexRank algorithm described by G. Erkan and D. R. Radev (2004) <DOI:10.1613/jair.1523>.
License MIT + file LICENSE
URL https://github.com/adamspannbauer/lexrankr/
BugReports https://github.com/adamspannbauer/lexrankr/issues/
LazyData TRUE
RoxygenNote 6.0.1
Imports SnowballC, igraph, Rcpp
Depends R (>= 2.10)
LinkingTo Rcpp
Suggests covr, testthat, R.rsp
VignetteBuilder R.rsp
NeedsCompilation yes
Repository CRAN
Date/Publication 2017-12-13 14:01:58 UTC

R topics documented:
     bind_lexrank_
     lexRank
     lexRankFromSimil
     sentenceParse
     sentenceSimil
     sentenceTokenParse
     sentence_parser
     smart_stopwords
     tokenize
     unnest_sentences_
     Index
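A minimal quick-start sketch, assuming the package has been installed from CRAN (the sample text and n = 1 choice are illustrative, not taken from the manual's own examples):

     # install.packages("lexRankr")
     library(lexRankr)

     # Score sentences across three documents and return the single top-ranked sentence
     lexRank(c("testing the system. Second sentence for you.",
               "System testing the tidy documents df.",
               "Documents will be parsed and lexranked."),
             n = 1)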

bind_lexrank_            Bind lexrank scores to a dataframe of text

Description

     Bind lexrank scores to a dataframe of sentences or to a dataframe of tokens with sentence ids.

Usage

     bind_lexrank_(tbl, text, doc_id, sent_id = NULL,
       level = c("sentences", "tokens"), threshold = 0.2, usePageRank = TRUE,
       damping = 0.85, continuous = FALSE, ...)

     bind_lexrank(tbl, text, doc_id, sent_id = NULL,
       level = c("sentences", "tokens"), threshold = 0.2, usePageRank = TRUE,
       damping = 0.85, continuous = FALSE, ...)

Arguments

     tbl          dataframe containing a column of sentences to be lexranked
     text         name of the column containing sentences or tokens to be lexranked
     doc_id       name of the column containing document ids corresponding to text
     sent_id      Only needed if level is "tokens". Name of the column containing sentence ids corresponding to text.
     level        the parsed level of the text column to be lexranked, i.e. whether text is a column of "sentences" or "tokens". The "tokens" level is provided to allow users to implement custom tokenization. Note: even if the input level is "tokens", lexrank scores are assigned at the sentence level.
     threshold    The minimum similarity value a sentence pair must have to be represented in the graph where lexrank is calculated.
     usePageRank  TRUE or FALSE indicating whether or not to use the PageRank algorithm for ranking sentences. If FALSE, a sentence's unweighted centrality will be used as the rank. Defaults to TRUE.
     damping      The damping factor to be passed to the PageRank algorithm. Ignored if usePageRank is FALSE.
     continuous   TRUE or FALSE indicating whether or not to use continuous LexRank. Only applies if usePageRank is TRUE. If TRUE, threshold will be ignored and lexrank will be computed using a weighted graph representation of the sentences. Defaults to FALSE.
     ...          tokenizing options to be passed to lexRankr::tokenize. Ignored if level is "sentences".

Value

     A dataframe with an additional column of lexrank scores (the column is given the name lexrank).

Examples

     df <- data.frame(doc_id = 1:3,
                      text = c("testing the system. Second sentence for you.",
                               "System testing the tidy documents df.",
                               "Documents will be parsed and lexranked."),
                      stringsAsFactors = FALSE)

     ## Not run:
     library(magrittr)

     df %>%
       unnest_sentences(sents, text) %>%
       bind_lexrank(sents, doc_id, level = "sentences")

     df %>%
       unnest_sentences(sents, text) %>%
       bind_lexrank_("sents", "doc_id", level = "sentences")

     df <- data.frame(doc_id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
                      sent_id = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                      tokens = c("testing", "the", "system", "second", "sentence",
                                 "for", "you", "system", "testing", "the", "tidy",
                                 "documents", "df", "documents", "will", "be",
                                 "parsed", "and", "lexranked"),
                      stringsAsFactors = FALSE)

     df %>%
       bind_lexrank(tokens, doc_id, sent_id, level = 'tokens')
     ## End(Not run)


lexRank                  Extractive text summarization with LexRank

Description

     Compute LexRanks from a vector of documents using the PageRank algorithm or degree centrality. The methods used to compute lexrank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."

Usage

     lexRank(text, docId = "create", threshold = 0.2, n = 3, returnTies = TRUE,
       usePageRank = TRUE, damping = 0.85, continuous = FALSE,
       sentencesAsDocs = FALSE, removePunc = TRUE, removeNum = TRUE,
       toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE, Verbose = TRUE)

Arguments

     text             A character vector of documents to be cleaned and processed by the LexRank algorithm.
     docId            A vector of document IDs with length equal to the length of text. If docId == "create" then doc IDs will be created as an index from 1 to n, where n is the length of text.
     threshold        The minimum simil value a sentence pair must have to be represented in the graph where lexrank is calculated.
     n                The number of sentences to return as the extractive summary. The function will return the top n lexranked sentences. See returnTies for handling ties in lexrank.
     returnTies       TRUE or FALSE indicating whether or not to return more than n sentence IDs if there is a tie in lexrank. If TRUE, the returned number of sentences will not be limited to n; every sentence tied for a top-n score will be returned. If FALSE, the returned number of sentences will be <= n. Defaults to TRUE.
     usePageRank      TRUE or FALSE indicating whether or not to use the PageRank algorithm for ranking sentences. If FALSE, a sentence's unweighted centrality will be used as the rank. Defaults to TRUE.
     damping          The damping factor to be passed to the PageRank algorithm. Ignored if usePageRank is FALSE.
     continuous       TRUE or FALSE indicating whether or not to use continuous LexRank. Only applies if usePageRank is TRUE. If TRUE, threshold will be ignored and lexrank will be computed using a weighted graph representation of the sentences. Defaults to FALSE.
     sentencesAsDocs  TRUE or FALSE indicating whether or not to treat sentences as documents when calculating tf-idf scores for similarity. If TRUE, inverse document frequency will be calculated as inverse sentence frequency (useful for single-document extractive summarization).
     removePunc       TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.
     removeNum        TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.
     toLower          TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.
     stemWords        TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

     rmStopWords      TRUE, FALSE, or a character vector of stopwords to remove from tokens. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.
     Verbose          TRUE or FALSE indicating whether or not to cat progress messages to the console while running. Defaults to TRUE.

Value

     A 2-column dataframe with columns sentenceId and value. sentenceId contains the ids of the top n sentences in descending order by value. value contains the PageRank score (if usePageRank is TRUE) or degree centrality (if usePageRank is FALSE).

References

     http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Examples

     lexRank(c("this is a test.", "tests are fun.",
               "Do you think the exam will be hard?", "is an exam the same as a test?",
               "How many questions are going to be on the exam?"))


lexRankFromSimil         Compute LexRanks from pairwise sentence similarities

Description

     Compute LexRanks from sentence pair similarities using the PageRank algorithm or degree centrality. The methods used to compute lexrank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."

Usage

     lexRankFromSimil(s1, s2, simil, threshold = 0.2, n = 3, returnTies = TRUE,
       usePageRank = TRUE, damping = 0.85, continuous = FALSE)

Arguments

     s1         A character vector of sentence IDs corresponding to the s2 and simil arguments.
     s2         A character vector of sentence IDs corresponding to the s1 and simil arguments.
     simil      A numeric vector of similarity values representing the similarity between the sentences identified by the IDs in s1 and s2.
     threshold  The minimum simil value a sentence pair must have to be represented in the graph where lexrank is calculated.

     n            The number of sentences to return as the extractive summary. The function will return the top n lexranked sentences. See returnTies for handling ties in lexrank.
     returnTies   TRUE or FALSE indicating whether or not to return more than n sentence IDs if there is a tie in lexrank. If TRUE, the returned number of sentences will not be limited to n; every sentence tied for a top-n score will be returned. If FALSE, the returned number of sentences will be <= n. Defaults to TRUE.
     usePageRank  TRUE or FALSE indicating whether or not to use the PageRank algorithm for ranking sentences. If FALSE, a sentence's unweighted centrality will be used as the rank. Defaults to TRUE.
     damping      The damping factor to be passed to the PageRank algorithm. Ignored if usePageRank is FALSE.
     continuous   TRUE or FALSE indicating whether or not to use continuous LexRank. Only applies if usePageRank is TRUE. If TRUE, threshold will be ignored and lexrank will be computed using a weighted graph representation of the sentences. Defaults to FALSE.

Value

     A 2-column dataframe with columns sentenceId and value. sentenceId contains the ids of the top n sentences in descending order by value. value contains the PageRank score (if usePageRank is TRUE) or degree centrality (if usePageRank is FALSE).

References

     http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Examples

     lexRankFromSimil(s1 = c("d1_1", "d1_1", "d1_2"),
                      s2 = c("d1_2", "d2_1", "d2_1"),
                      simil = c(.01, .03, .5))


sentenceParse            Parse text into sentences

Description

     Parse the elements of a character vector into a dataframe of sentences with additional identifiers.

Usage

     sentenceParse(text, docId = "create")

Arguments

     text   Character vector to be parsed into sentences.
     docId  A vector of document IDs with length equal to the length of text. If docId == "create" then doc IDs will be created as an index from 1 to n, where n is the length of text.

Value

     A data frame with 3 columns and n rows, where n is the number of sentences found by the routine:

     Column 1: docId       document id for the sentence
     Column 2: sentenceId  sentence id for the sentence
     Column 3: sentence    the sentences found by the routine

Examples

     sentenceParse(c("bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."))
     sentenceParse(c("bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId = c("d1", "d2"))


sentenceSimil            Compute distance between sentences

Description

     Compute the distance between sentences using the modified idf cosine distance from "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization". The output can be used as input to lexRankFromSimil.

Usage

     sentenceSimil(sentenceId, token, docId = NULL, sentencesAsDocs = FALSE)

Arguments

     sentenceId       A character vector of sentence IDs corresponding to the docId and token arguments.
     token            A character vector of tokens corresponding to the docId and sentenceId arguments.
     docId            A character vector of document IDs corresponding to the sentenceId and token arguments. Can be NULL if sentencesAsDocs is TRUE.
     sentencesAsDocs  TRUE or FALSE indicating whether or not to treat sentences as documents when calculating tf-idf scores. If TRUE, inverse document frequency will be calculated as inverse sentence frequency (useful for single-document extractive summarization).

Value

     A 3-column dataframe of pairwise distances between sentences. Columns: sent1 (sentence id), sent2 (sentence id), and dist (distance between sent1 and sent2).

References

     http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Examples

     sentenceSimil(docId = c("d1", "d1", "d2", "d2"),
                   sentenceId = c("d1_1", "d1_1", "d2_1", "d2_1"),
                   token = c("i", "ran", "jane", "ran"))


sentenceTokenParse       Parse text into sentences and tokens

Description

     Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes IDs for document and sentence for use in other lexrank functions.

Usage

     sentenceTokenParse(text, docId = "create", removePunc = TRUE,
       removeNum = TRUE, toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE)

Arguments

     text         A character vector of documents to be parsed into sentences and tokenized.
     docId        A character vector of document IDs the same length as text. If docId == "create", document IDs will be created.
     removePunc   TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.
     removeNum    TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.
     toLower      TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.
     stemWords    TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.
     rmStopWords  TRUE, FALSE, or a character vector of stopwords to remove from tokens. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.

Value

     A list of dataframes. The first element of the list is the sentences dataframe, with columns docId, sentenceId, and sentence (the actual text of the sentence). The second element is the tokens dataframe, with columns docId, sentenceId, and token (the actual text of the token).

Examples

     sentenceTokenParse(c("bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                        docId = c("d1", "d2"))


sentence_parser          Utility to parse sentences from text

Description

     Utility to parse sentences from text; created to have a central, shared sentence-parsing function.

Usage

     sentence_parser(text)

Arguments

     text  Character vector to be parsed into sentences.

Value

     A list with length equal to length(text); list elements are character vectors of text parsed with a sentence regex.


smart_stopwords          SMART English Stopwords

Description

     English stopwords from the SMART information retrieval system (as documented in Appendix 11 of http://jmlr.csail.mit.edu/papers/volume5/lewis04a/).

Usage

     smart_stopwords

Format

     A character vector with 571 elements.

Source

     http://jmlr.csail.mit.edu/papers/volume5/lewis04a/
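Neither sentence_parser nor smart_stopwords carries an Examples section above; the following is a minimal illustrative sketch, assuming lexRankr is attached (the sample strings are reused from the bind_lexrank_ example):

     library(lexRankr)

     # Peek at the bundled SMART stopword list (a plain character vector)
     head(smart_stopwords)

     # Split raw text into sentences with the shared parsing utility;
     # the result is a list with one character vector per input element
     sentence_parser(c("testing the system. Second sentence for you.",
                       "Documents will be parsed and lexranked."))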

tokenize                 Tokenize a character vector

Description

     Parse the elements of a character vector into a list of cleaned tokens.

Usage

     tokenize(text, removePunc = TRUE, removeNum = TRUE, toLower = TRUE,
       stemWords = TRUE, rmStopWords = TRUE)

Arguments

     text         The character vector to be tokenized.
     removePunc   TRUE or FALSE indicating whether or not to remove punctuation from text. If TRUE, punctuation will be removed. Defaults to TRUE.
     removeNum    TRUE or FALSE indicating whether or not to remove numbers from text. If TRUE, numbers will be removed. Defaults to TRUE.
     toLower      TRUE or FALSE indicating whether or not to coerce all of text to lowercase. If TRUE, text will be coerced to lowercase. Defaults to TRUE.
     stemWords    TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.
     rmStopWords  TRUE, FALSE, or a character vector of stopwords to remove. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.

Examples

     tokenize("mr. Feeny said the test would be on Sat. At least I'm 99.9% sure that's what he said.")
     tokenize("bill is trying to earn a Ph.D. in his field.", rmStopWords = FALSE)


unnest_sentences_        Split a column of text into sentences

Description

     Split a column of text into sentences.

Usage

     unnest_sentences_(tbl, output, input, doc_id = NULL,
       output_id = "sent_id", drop = TRUE)

     unnest_sentences(tbl, output, input, doc_id = NULL,
       output_id = "sent_id", drop = TRUE)

Arguments

     tbl        dataframe containing a column of text to be split into sentences
     output     name of the column to be created to store the parsed sentences
     input      name of the input column of text to be parsed into sentences
     doc_id     column of document ids; if not provided it will be assumed that each row is a different document
     output_id  name of the column to be created to store sentence ids
     drop       whether the original input column should be dropped

Value

     A data.frame of parsed sentences and sentence ids.

Examples

     df <- data.frame(doc_id = 1:3,
                      text = c("testing the system. Second sentence for you.",
                               "System testing the tidy documents df.",
                               "Documents will be parsed and lexranked."),
                      stringsAsFactors = FALSE)

     unnest_sentences(df, sents, text)
     unnest_sentences_(df, "sents", "text")

     ## Not run:
     library(magrittr)
     df %>%
       unnest_sentences(sents, text)
     ## End(Not run)
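As an illustrative follow-on sketch of how the tidy pieces fit together (it assumes dplyr >= 1.0.0 is available and reuses the df from the example above; this usage is not drawn from the package's own examples), the lexrank column added by bind_lexrank can be used to pull the top-scoring sentence per document:

     ## Not run:
     library(dplyr)
     library(lexRankr)

     df %>%
       unnest_sentences(sents, text) %>%                       # one row per sentence
       bind_lexrank(sents, doc_id, level = "sentences") %>%    # adds a 'lexrank' score column
       group_by(doc_id) %>%
       slice_max(lexrank, n = 1)                               # keep the highest-scoring sentence per document
     ## End(Not run)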

Index

     Topic datasets
          smart_stopwords

     bind_lexrank (bind_lexrank_)
     bind_lexrank_
     lexRank
     lexRankFromSimil
     sentence_parser
     sentenceParse
     sentenceSimil
     sentenceTokenParse
     smart_stopwords
     tokenize
     unnest_sentences (unnest_sentences_)
     unnest_sentences_