Package CorporaCoCo. November 23, 2017

Encoding UTF-8
Type Package
Title Corpora Co-Occurrence Comparison
Version 1.1-0
Date 2017-11-22
Description A set of functions used to compare co-occurrence between two corpora.
URL
License GPL (>= 3)
Depends R (>= 3.2.0)
Imports methods, stats, data.table (>= 1.9.6), RColorBrewer, rlist
Suggests unittest, stringi, R.rsp
VignetteBuilder R.rsp
BugReports https://github.com/birmingham-ccr/corporacoco/issues
LazyData yes
NeedsCompilation no
Author Anthony Hennessey [aut, cre], Viola Wiegand [aut], Michaela Mahlberg [aut], Christopher R. Tench [aut], Jamie Lentin [aut]
Maintainer Anthony Hennessey <anthony.hennessey@nottingham.ac.uk>
Repository CRAN
Date/Publication 2017-11-23 10:54:35 UTC

R topics documented:

    CorporaCoCo-package
    coco
    coco-class
    surface
    surface_coco

CorporaCoCo-package    Comparing Co-occurrence between corpora.

Description

Implements the method described in Hennessey and Wiegand et al. (2017).

A good place to start is the Proof of Concept vignette. There is also a FAQ vignette. You can see a list of package vignettes with vignette(package = "CorporaCoCo") and you can see a particular vignette with something like vignette("faq", package = "CorporaCoCo"). For a list of all documentation use library(help = "CorporaCoCo").

Author(s)

Maintainer: Anthony Hennessey <anthony.hennessey@nottingham.ac.uk>

References

A. Hennessey, V. Wiegand, C. R. Tench and M. Mahlberg (2017). Comparing co-occurrences between corpora. In preparation.


coco    Co-occurrence comparison

Description

Calculates statistically significant differences in co-occurrence counts.

Usage

    coco(A, B, nodes, fdr = 0.01, collocates = NULL)

Arguments

    A           A data.frame of co-occurrence counts. See details.
    B           A data.frame of co-occurrence counts. See details.
    nodes       A character vector of nodes or character string representing a single node.
    fdr         The desired level at which to control the False Discovery Rate. Default value is 0.01.

    collocates  A character vector of collocates or character string representing a single collocate. The collocates essentially act as a filter on the y column of the returned data structure. collocates should be used to target the testing; reducing the number of tests will reduce the loss of power from the multiple test correction.

Details

This function implements the method described in Hennessey and Wiegand (2017).

A and B are data.frames of the form

    Classes 'data.frame': ...
     $ x: chr
     $ y: chr
     $ H: int
     $ M: int

The data.frames encapsulate the co-occurrence counts for the (x, y) term pairs within a corpus. For a description of the columns see the details section of the surface function.

The nodes essentially act as a filter on the A$x and B$x columns. For a description of the use of nodes see Hennessey and Wiegand (2017).

fdr indicates the level at which the False Discovery Rate will be controlled. For a description of the form of FDR used see Benjamini and Hochberg (1995). For a description of the use of FDR in this context see Hennessey and Wiegand (2017). For a description of the p_adjusted column in the returned structure see p.adjust.

The returned data structure is a data.table. A data.table is also a data.frame and will behave exactly as such if the data.table library is not loaded.

The returned data.table contains details of all the co-occurrences for which there is evidence of a difference in co-occurrence between the two supplied data sets. The effect size is calculated as the log base 2 of the odds ratio. The effect size and its confidence interval are captured in the effect_size, CI_lower and CI_upper columns. The p_value column contains the non-adjusted p-value from Fisher's Exact Test. For more details see Hennessey and Wiegand (2017).

For an example of usage see the Proof of Concept vignette.
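The per-pair calculation described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the counts are made up, the orientation of the odds ratio (A relative to B) and the threshold rule are assumed, and fisher.test() reports the conditional maximum likelihood estimate of the odds ratio, which may differ from the estimate coco uses.

```r
# Hypothetical hit/miss counts for one (x, y) pair in corpora A and B.
H_A <- 30L; M_A <- 70L
H_B <- 10L; M_B <- 90L

# 2 x 2 contingency table: rows are hits/misses, columns are corpora.
counts <- matrix(c(H_A, M_A, H_B, M_B), nrow = 2,
                 dimnames = list(c("H", "M"), c("A", "B")))

# Fisher's Exact Test gives the unadjusted p-value and an odds ratio
# (conditional MLE) with a 95% confidence interval.
ft <- fisher.test(counts)

# Effect size as log base 2 of the odds ratio (orientation assumed: A vs B).
effect_size <- log2(unname(ft$estimate))
ci <- log2(ft$conf.int)

# Benjamini-Hochberg adjustment across all tested pairs; the two extra
# p-values are made up to stand in for other (x, y) pairs.
p_values <- c(ft$p.value, 0.20, 0.50)
p_adjusted <- p.adjust(p_values, method = "BH")

# Assumed decision rule: keep pairs whose adjusted p-value is within fdr.
fdr <- 0.01
significant <- p_adjusted <= fdr
```

Because the adjustment depends on how many pairs are tested, restricting the tests with nodes and collocates leaves each remaining pair with a smaller correction.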
Value

A data.table of the form

    Classes 'data.table' and 'data.frame': 11 variables:
     $ x          : chr
     $ y          : chr
     $ H_A        : int
     $ M_A        : int
     $ H_B        : int
     $ M_B        : int
     $ effect_size: num
     $ CI_lower   : num
     $ CI_upper   : num
     $ p_value    : num
     $ p_adjusted : num
     - attr(*, "sorted")= chr "x" "y"
     - attr(*, ".internal.selfref")=<externalptr>
     - attr(*, "coco_metadata")=list of 4
      ..$ nodes          : chr
      ..$ fdr            : num
      ..$ PACKAGE_VERSION:Classes 'package_version', 'numeric_version'
      .. ..$ : int
      ..$ date           : Date, format: "2016-11-01"

References

Y. Benjamini and Y. Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289-300.

A. Hennessey, V. Wiegand, C. R. Tench and M. Mahlberg (2017). Comparing co-occurrences between corpora. In preparation.


coco-class    coco class

Description

Object of class coco.

Usage

    ## S3 method for class 'coco'
    plot(x, as_matrix = FALSE, nodes = NULL, forest_plot_args = NULL, ...)

Arguments

    x                 A coco object.
    as_matrix         If as_matrix is set to TRUE a matrix plot rather than a forest plot is produced.
    nodes             If a vector of nodes is supplied this will be used to filter the set of results that are plotted. If nodes are supplied the plot will use the nodes order.
    forest_plot_args  This is a list of arguments that is passed to the plot.default that produces the foundation of the forest plot. The list may contain a subset or all of the following documented arguments; any arguments that are not documented here will be ignored. A description of each argument can be found in the help for the plot.default function. Available arguments are

                      xlim               Default: Calculated from the ranges of the confidence intervals.
                      xlab               Default: Effect Size

                      main               Default: NULL
                      sub                Default: NULL
                      asp                Default: NA
                      pch                Default: 15
                      cex.pch            Default: 1
                      lwd.xaxt           Default: 1
                      col.xaxt           Default: black
                      col.whisker        Default: black
                      col.zero           Default: darkgray
                      length.wisker_end  Default: 0.05

                      For example usage see the plot section in the FAQ vignette.
    ...               Other arguments will be ignored.

Details

An object of class coco is returned by coco() and surface_coco(). No constructor is exported.

Note

For example usage see the FAQ vignette.


surface    Calculate Surface Co-occurrence Counts

Description

Calculates co-occurrence counts for the supplied vector. For each co-occurrence the maximum possible number of co-occurrences is also calculated.

Usage

    surface(x, span, nodes = NULL, collocates = NULL)

Arguments

    x           A vector. This is the subject of the co-occurrence counting. See details.
    span        A character string defining the co-occurrence span. See details.
    nodes       A character vector of nodes or character string representing a single node. The nodes essentially act as a filter on the x column of the returned data structure. Use of nodes will significantly reduce memory usage.
    collocates  A character vector of collocates or character string representing a single collocate. The collocates essentially act as a filter on the y column of the returned data structure.

Details

x is assumed to be an ordered vector of tokenized text. No processing will be applied to x prior to the co-occurrence count calculations.

Surface co-occurrence is easiest to describe with an example. The following is a span of '2LR', that is 2 to the left and 2 to the right.

    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")

In this example the term plan would co-occur once each with the collocates man and cat, and twice with the collocate a.

Other examples of span:

    span = '1L2R'
    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")

    span = '2R'
    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")

NAs can be used to implement co-occurrence barriers, e.g. if two NA values are inserted into x at each sentence boundary then with a span of 2 co-occurrences will not happen across sentences. See Evert (2008) for a detailed description of co-occurrence barriers.

For a detailed description of surface co-occurrence and the other types of co-occurrence see Evert (2008).

Value

Returns a data.table containing counts for all co-occurrences in x. Note that a data.table is also a data.frame, so if the data.table library is not loaded the returned object will behave exactly as a data.frame; however, for large data sets there will be a significant performance enhancement from exploiting data.table functionality.

The returned object is of the form:

    Classes 'data.table' and 'data.frame': ...
     $ x: chr
     $ y: chr
     $ H: int
     $ M: int
     - attr(*, "sorted")= chr "x" "y"
     - attr(*, ".internal.selfref")=<externalptr>

where H is the number of times x co-occurs with y (think Hits), and M is the number of times x fails to co-occur with y when it could have (think Misses); hence H + M is the maximum number of times that x could have co-occurred with y.
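The H and M counts described above can be reproduced for right-hand spans with a few lines of base R. This is a sketch inferred from the documented '2R' example output, not the package's implementation: count_right_span is a hypothetical helper, it only handles spans of the form '<n>R', and it assumes that NA positions neither act as nodes nor count as possible co-occurrence slots.

```r
# Hypothetical helper: surface co-occurrence counts for a span of n tokens
# to the right of each node. Returns a plain data.frame with x, y, H, M.
count_right_span <- function(x, n) {
  node <- character(0)
  coll <- character(0)
  for (i in seq_along(x)) {
    if (is.na(x[i])) next                 # NA is a barrier, never a node
    idx <- i + seq_len(n)                 # positions inside the span
    idx <- idx[idx <= length(x)]          # truncate at the end of the text
    w <- x[idx]
    w <- w[!is.na(w)]                     # barrier slots are not counted
    node <- c(node, rep(x[i], length(w)))
    coll <- c(coll, w)
  }
  # H: how often each (node, collocate) pair was observed.
  hits <- as.data.frame(table(x = node, y = coll), stringsAsFactors = FALSE)
  hits <- hits[hits$Freq > 0, ]
  # H + M: total number of span slots available to each node.
  slots <- table(node)
  res <- data.frame(x = hits$x, y = hits$y, H = hits$Freq,
                    M = as.integer(slots[hits$x]) - hits$Freq,
                    stringsAsFactors = FALSE)
  res <- res[order(res$x, res$y), ]
  rownames(res) <- NULL
  res
}

x <- c("a", "man", "a", "plan", "a", "canal", "panama")
res <- count_right_span(x, 2)
```

Here the node a has three occurrences with two slots each, so H + M is 6 for every row with x equal to a, which matches the counts listed in the package's own example for this vector.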

References

S. Evert (2008). Corpora and collocations. Corpus Linguistics: An International Handbook, 1212-1248.

Examples

    # =====================
    # surface co-occurrence
    # =====================

    x <- c("a", "man", "a", "plan", "a", "canal", "panama")
    surface(x, span = '2R')
    ##         x      y H M
    ##  1:     a      a 2 4
    ##  2:     a  canal 1 5
    ##  3:     a    man 1 5
    ##  4:     a panama 1 5
    ##  5:     a   plan 1 5
    ##  6: canal panama 1 0
    ##  7:   man      a 1 1
    ##  8:   man   plan 1 1
    ##  9:  plan      a 1 1
    ## 10:  plan  canal 1 1

    # filter on nodes
    surface(x, span = '2R', nodes = c("canal", "man", "plan"))
    ##        x      y H M
    ## 1: canal panama 1 0
    ## 2:   man      a 1 1
    ## 3:   man   plan 1 1
    ## 4:  plan      a 1 1
    ## 5:  plan  canal 1 1

    # filter on nodes and collocates
    surface(x, span = '2R', nodes = c("canal", "man", "plan"),
            collocates = c("panama", "a"))
    ##        x      y H M
    ## 1: canal panama 1 0
    ## 2:   man      a 1 1
    ## 3:  plan      a 1 1

    # co-occurrence barrier
    x <- c("a", "man", "a", "plan", NA, NA, "a", "canal", "panama")
    surface(x, span = '2R')
    ##        x      y H M
    ## 1:     a      a 1 4
    ## 2:     a  canal 1 4
    ## 3:     a    man 1 4
    ## 4:     a panama 1 4
    ## 5:     a   plan 1 4
    ## 6: canal panama 1 0
    ## 7:   man      a 1 1
    ## 8:   man   plan 1 1


surface_coco    Surface co-occurrence comparison

Description

Convenience function that combines the functionality of the surface and coco functions.

Usage

    surface_coco(a, b, span, nodes, fdr = 0.01, collocates = NULL)

Arguments

    a           A character vector.
    b           A character vector.
    span        A character string defining the co-occurrence span. See surface function for details.
    nodes       A character vector of nodes or character string representing a single node.
    fdr         The desired level at which to control the False Discovery Rate.
    collocates  A character vector of collocates or character string representing a single collocate.

Details

See surface and coco.

For an example of usage see the Proof of Concept vignette.

Value

A data.table of the form returned by the coco function.
