Package chunked. July 2, 2017

Type: Package
Title: Chunkwise Text-File Processing for 'dplyr'
Version: 0.4
Description: Text data can be processed chunkwise using 'dplyr' commands. These are recorded and executed per data chunk, so large files can be processed with limited memory using the 'LaF' package.
License: GPL-2
LazyData: TRUE
BugReports: https://github.com/edwindj/chunked/issues
URL: https://github.com/edwindj/chunked
Depends: dplyr (>= 0.7)
Imports: LaF, utils, lazyeval, DBI
Suggests: testthat, RSQLite, dbplyr
RoxygenNote: 6.0.1
NeedsCompilation: no
Author: Edwin de Jonge [aut, cre]
Maintainer: Edwin de Jonge <edwindjonge@gmail.com>
Repository: CRAN
Date/Publication: 2017-07-01 23:55:46 UTC

R topics documented:

    chunked-package
    insert_chunkwise_into
    read_chunkwise
    read_csv_chunkwise
    write_chunkwise
    write_csv_chunkwise
    Index

chunked-package
Chunked

Description

R is a great tool, but processing large text files with data is cumbersome. chunked helps you to process large text files with dplyr while loading only a part of the data in memory. It builds on the excellent R package LaF. Processing commands are written in dplyr syntax, and chunked (using LaF) takes care that the file is processed chunk by chunk, taking far less memory than otherwise. chunked is useful for selecting columns, mutating columns and filtering rows. It can be used in data pre-processing.

Implemented dplyr verbs

    filter
    select
    rename
    mutate
    transmute
    do
    left_join
    inner_join
    anti_join
    semi_join
    tbl_vars
    collect

Not implemented

The following operators are not implemented, because data in chunked is processed chunkwise:

    full_join
    right_join
    group_by
    arrange
    tail
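The idea of chunk-by-chunk processing described above can be pictured with a small base-R sketch. It is an illustration only and not how chunked or LaF are implemented: the helper process_chunkwise and its arguments are hypothetical names, and the sketch assumes a plain comma-separated file with a header line and no quoted fields.

```r
# Base-R sketch of chunk-by-chunk processing (not the chunked/LaF implementation).
# process_chunkwise() is a hypothetical helper written for this illustration.
process_chunkwise <- function(in_file, out_file, chunk_size, fun) {
  con <- file(in_file, open = "r")
  on.exit(close(con))
  # The first line holds the column names.
  col_names <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  first <- TRUE
  repeat {
    # Only chunk_size lines are held in memory at any time.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    result <- fun(chunk)
    # Write the header once, then append the following chunks.
    write.table(result, out_file, sep = ",", row.names = FALSE,
                col.names = first, append = !first, quote = FALSE)
    first <- FALSE
  }
  invisible(out_file)
}

in_file <- file.path(tempdir(), "women.csv")
out_file <- file.path(tempdir(), "women_ratio.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

# Mutate, filter and select on each chunk of four rows.
process_chunkwise(in_file, out_file, chunk_size = 4L, fun = function(chunk) {
  chunk$ratio <- chunk$weight / chunk$height
  chunk[chunk$ratio > 2, c("height", "ratio")]
})

processed <- read.csv(out_file)
```

Row-wise operations such as mutating, filtering and selecting give the same result whether they are applied per chunk or to the whole file, which is why they fit this style of processing.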

insert_chunkwise_into
Insert data in chunks into a database

Description

insert_chunkwise_into can be used to insert chunks of data into a database. Typically chunked can be used for preprocessing data before adding it to a database.

Usage

    insert_chunkwise_into(x, dest, table, temporary = FALSE, analyze = FALSE)

Arguments

    x          tbl_chunk object
    dest       database destination, e.g. src_sqlite()
    table      name of table
    temporary  Should the table be removed when the database connection is closed?
    analyze    Should the table be analyzed after import?

Value

a tbl object pointing to the table in database dest.

read_chunkwise
Read chunkwise from a data source

Description

Read chunkwise from a data source

Usage

    read_chunkwise(src, chunk_size = 10000L, ...)

    ## S3 method for class 'character'
    read_chunkwise(src, chunk_size = 10000L, format = c("csv", "csv2", "table"), ...)

    ## S3 method for class 'laf'
    read_chunkwise(src, chunk_size = 10000L, ...)

    ## S3 method for class 'tbl_sql'
    read_chunkwise(src, chunk_size = 10000L, ...)

Arguments

    src         source to read from
    chunk_size  size of the chunks
    ...         parameters used by specific classes
    format      used for specifying type of text file

Value

an object of type tbl_chunk

read_csv_chunkwise
Read chunkwise data from text files

Description

read_csv_chunkwise will open a connection to a text file. Subsequent dplyr verbs and commands are recorded until collect or write_csv_chunkwise is called. In that case the recorded commands will be executed chunk by chunk.

Usage

    read_csv_chunkwise(file, chunk_size = 10000L, header = TRUE, sep = ",", dec = ".", ...)

    read_csv2_chunkwise(file, chunk_size = 10000L, header = TRUE, sep = ";", dec = ",", ...)

    read_table_chunkwise(file, chunk_size = 10000L, header = TRUE, sep = "\t", dec = ".", ...)

    read_laf_chunkwise(laf, chunk_size = 10000L)

Arguments

    file        path of text file
    chunk_size  size of the chunks to be read
    header      Does the csv file have a header with column names?
    sep         field separator to be used
    dec         decimal separator to be used
    ...         not used
    laf         laf object created using LaF

read_laf_chunkwise reads chunkwise from a LaF object created with laf_open. It offers more control over data specification.
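The recording of commands until collect or a write function is called can be pictured with a toy example in base R. This is a sketch of the general idea of deferred execution, not the package's internals: lazy_source, record and run are hypothetical helpers, and for brevity the data sits in a data frame that is split into chunks rather than being read from a file.

```r
# Hypothetical helpers, for illustration only; they are not part of chunked.
lazy_source <- function(data, chunk_size) {
  list(data = data, chunk_size = chunk_size, steps = list())
}

# Record a processing step; nothing is computed at this point.
record <- function(src, step) {
  src$steps <- c(src$steps, list(step))
  src
}

# Replay all recorded steps on every chunk and bind the results.
run <- function(src) {
  chunk_id <- ceiling(seq_len(nrow(src$data)) / src$chunk_size)
  results <- lapply(split(src$data, chunk_id), function(chunk) {
    for (step in src$steps) chunk <- step(chunk)
    chunk
  })
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

pipeline <- lazy_source(women, chunk_size = 5L)
pipeline <- record(pipeline, function(d) transform(d, ratio = weight / height))
pipeline <- record(pipeline, function(d) d[d$ratio > 2, c("height", "ratio")])

# Two steps are recorded, but no data has been touched yet.
n_steps <- length(pipeline$steps)

# Execution happens only here, one chunk at a time.
result <- run(pipeline)
```

Deferring the work this way means the full sequence of steps is known before the first chunk is read, so each chunk can be read, transformed and discarded in a single pass.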

Details

read_csv_chunkwise can be best combined with write_csv_chunkwise or insert_chunkwise_into (see example).

Examples

    library(chunked)  # also attaches dplyr, which chunked depends on

    # create csv file for demo purpose
    in_file <- file.path(tempdir(), "in.csv")
    write.csv(women, in_file, row.names = FALSE, quote = FALSE)

    women_chunked <-
      read_chunkwise(in_file) %>%  # open chunkwise connection
      mutate(ratio = weight/height) %>%
      filter(ratio > 2) %>%
      select(height, ratio) %>%
      inner_join(data.frame(height=63:66)) # you can join with data.frames!

    # no processing is done until the result is written
    out_file <- file.path(tempdir(), "processed.csv")
    women_chunked %>%
      write_chunkwise(file=out_file)

    head(women_chunked) # works (without processing all data...)

    iris_file <- file.path(tempdir(), "iris.csv")
    write.csv(iris, iris_file, row.names = FALSE, quote = FALSE)

    # column names as they appear in the iris data set
    iris_chunked <-
      read_chunkwise(iris_file, chunk_size = 49) %>% # 49 for demo purpose
      group_by(Species) %>%
      summarise(sepal_length = mean(Sepal.Length), n=n()) # note that mean is per chunk

write_chunkwise
Generic function to write chunk by chunk

Description

Generic function to write chunk by chunk

Usage

    write_chunkwise(x, dest, ...)

    ## S3 method for class 'chunkwise'
    write_chunkwise(x, dest, table, file = dest, format = c("csv", "csv2", "table"), ...)

Arguments

    x       chunked input, e.g. created with read_chunkwise, or it can be a tbl_sql object.
    dest    where the data should be written. May be a character or a src_sql.
    ...     parameters that will be passed to the specific implementations.
    table   table to write to. Only used when dest is a database (src_sql).
    file    file to write to
    format  Specifies the text format written to disk. Only used if x is a character.

write_csv_chunkwise
Write chunks to a csv file

Description

Writes data to a csv file chunk by chunk. This function must be used in conjunction with read_csv_chunkwise. Chunks of data will be read, processed and written when this function is called. For writing to a database use insert_chunkwise_into.

Usage

    write_csv_chunkwise(x, file = "", sep = ",", dec = ".", col.names = TRUE, row.names = FALSE, ...)

    write_csv2_chunkwise(x, file = "", sep = ";", dec = ",", col.names = TRUE, row.names = FALSE, ...)

    write_table_chunkwise(x, file = "", sep = "\t", dec = ".", col.names = TRUE, row.names = TRUE, ...)

Arguments

    x          chunkwise object pointing to a text file
    file       file character or connection where the csv file should be written
    sep        field separator
    dec        decimal separator
    col.names  should column names be written?
    row.names  should row names be written?
    ...        passed through to read.table

Value

chunkwise object (chunkwise), when writing to a file it refers to the newly created file, otherwise to x.

Examples

    library(chunked)  # also attaches dplyr, which chunked depends on

    # create csv file for demo purpose
    in_file <- file.path(tempdir(), "in.csv")
    write.csv(women, in_file, row.names = FALSE, quote = FALSE)

    women_chunked <-
      read_chunkwise(in_file) %>%  # open chunkwise connection
      mutate(ratio = weight/height) %>%
      filter(ratio > 2) %>%
      select(height, ratio) %>%
      inner_join(data.frame(height=63:66)) # you can join with data.frames!

    # no processing is done until the result is written
    out_file <- file.path(tempdir(), "processed.csv")
    women_chunked %>%
      write_chunkwise(file=out_file)

    head(women_chunked) # works (without processing all data...)

    iris_file <- file.path(tempdir(), "iris.csv")
    write.csv(iris, iris_file, row.names = FALSE, quote = FALSE)

    # column names as they appear in the iris data set
    iris_chunked <-
      read_chunkwise(iris_file, chunk_size = 49) %>% # 49 for demo purpose
      group_by(Species) %>%
      summarise(sepal_length = mean(Sepal.Length), n=n()) # note that mean is per chunk
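The remark in the example that the mean is computed per chunk matters whenever a group is spread over more than one chunk: each chunk then contributes its own row for that group. One common way around this, sketched here in base R without using chunked itself, is to keep sums and counts per chunk and to combine them afterwards. The variable names are chosen for this illustration only.

```r
# Base-R illustration with the built-in iris data; chunk size 49 as in the example.
chunk_size <- 49L
chunk_id <- ceiling(seq_len(nrow(iris)) / chunk_size)
chunks <- split(iris, chunk_id)

# Per chunk: keep the sum and the row count per species, not the mean.
partial <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk <- droplevels(chunk)  # keep only the species present in this chunk
  data.frame(
    Species = levels(chunk$Species),
    total = as.vector(tapply(chunk$Sepal.Length, chunk$Species, sum)),
    n = as.vector(table(chunk$Species))
  )
}))

# Combine the partial results: add up sums and counts, then divide.
combined <- aggregate(cbind(total, n) ~ Species, data = partial, FUN = sum)
combined$sepal_length <- combined$total / combined$n

# Reference value computed on the complete data set.
full_mean <- as.vector(tapply(iris$Sepal.Length, iris$Species, mean))
```

Here partial has more rows than there are species, because some species are split across two chunks, while combined has one row per species and agrees with the mean computed on the complete data.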

Index

    chunked-package
    dplyr-verbs (chunked-package)
    filter (chunked-package)
    insert_chunkwise_into
    mutate (chunked-package)
    read.table
    read_chunkwise
    read_csv2_chunkwise (read_csv_chunkwise)
    read_csv_chunkwise
    read_laf_chunkwise (read_csv_chunkwise)
    read_table_chunkwise (read_csv_chunkwise)
    select (chunked-package)
    tbl
    write_chunkwise
    write_csv2_chunkwise (write_csv_chunkwise)
    write_csv_chunkwise
    write_table_chunkwise (write_csv_chunkwise)