Package fst. June 7, 2018

Similar documents
Package fst. December 18, 2017

Intel Stress Bitstreams and Encoder (Intel SBE) 2017 AVS2 Release Notes (Version 2.3)

HALCoGen TMS570LS31x Help: example_sci_uart_9600.c

Open Source Used In TSP

ProgressBar Abstract

Ecma International Policy on Submission, Inclusion and Licensing of Software

IETF TRUST. Legal Provisions Relating to IETF Documents. Approved November 6, Effective Date: November 10, 2008

Copyright PFU LIMITED 2016

ColdFusion Builder 3.2 Third Party Software Notices and/or Additional Terms and Conditions

Table of Contents Overview...2 Selecting Post-Processing: ColorMap...3 Overview of Options Copyright, license, warranty/disclaimer...

Copyright PFU LIMITED

IETF TRUST. Legal Provisions Relating to IETF Documents. February 12, Effective Date: February 15, 2009

iwrite technical manual iwrite authors and contributors Revision: 0.00 (Draft/WIP)

Data Deduplication Metadata Extension

Ecma International Policy on Submission, Inclusion and Licensing of Software

Preface. Audience. Cisco IOS Software Documentation. Organization

Open Source Used In Cisco Configuration Professional for Catalyst 1.0

MUMPS IO Documentation

Open Source Used In c1101 and c1109 Cisco IOS XE Fuji

An Easy Way to Split a SAS Data Set into Unique and Non-Unique Row Subsets Thomas E. Billings, MUFG Union Bank, N.A., San Francisco, California

Open Source and Standards: A Proposal for Collaboration

pyserial-asyncio Documentation

System Log NextAge Consulting Pete Halsted

User Manual. Date Aug 30, Enertrax DAS Download Client

NemHandel Referenceklient 2.3.1

Package cattonum. R topics documented: May 2, Type Package Version Title Encode Categorical Features

Fujitsu ScandAll PRO V2.1.5 README

DAP Controller FCO

ANZ TRANSACTIVE MOBILE for ipad

calio / form-input-nginx-module

Grouper UI csrf xsrf prevention

openresty / array-var-nginx-module

Small Logger File System

Package fastdummies. January 8, 2018

AccuTerm 7 Internet Edition Connection Designer Help. Copyright Schellenbach & Assoc., Inc.

NemHandel Referenceklient 2.3.0

FLAMEBOSS 300 MANUAL

HYDROOBJECTS VERSION 1.1

Package validara. October 19, 2017

Static analysis for quality mobile applications

TWAIN driver User s Guide

Definiens. Image Miner bit and 64-bit Edition. Release Notes


Documentation Roadmap for Cisco Prime LAN Management Solution 4.2

FLAME BOSS 200V2 & 300 MANUAL. Version 2.6 Download latest at FlameBoss.com/manuals

MagicInfo Express Content Creator

iphone/ipad Connection Manual

Definiens. Image Miner bit and 64-bit Editions. Release Notes

User Manual for Video Codec

About This Guide. and with the Cisco Nexus 1010 Virtual Services Appliance: N1K-C1010

SMS2CMDB Project Summary v1.6

This file includes important notes on this product and also the additional information not included in the manuals.

SORRENTO MANUAL. by Aziz Gulbeden

LabVIEW Driver. User guide Version

Package bigreadr. R topics documented: August 13, Version Date Title Read Large Text Files

Explaining & Accessing the SPDX License List

Business Rules NextAge Consulting Pete Halsted

Distinction Import Module User Guide. DISTINCTION.CO.UK

Denkh XML Reporter. Web Based Report Generation Software. Written By Scott Auge Amduus Information Works, Inc.

HYDRODESKTOP VERSION 1.4 QUICK START GUIDE

The Cron service allows you to register STAF commands that will be executed at a specified time interval(s).

This file includes important notes on this product and also the additional information not included in the manuals.

PageScope Box Operator Ver. 3.2 User s Guide

HIS document 2 Loading Observations Data with the ODDataLoader (version 1.0)

Hyperscaler Storage. September 12, 2016

Internet Connection Guide

User Guide. Calibrated Software, Inc.

License, Rules, and Application Form

DAP Controller FCO

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

cs50.c /**************************************************************************** * CS50 Library 6 *

HYDRODESKTOP VERSION 1.1 BETA QUICK START GUIDE

Scott Auge

PTZ Control Center Operations Manual

Encrypted Object Extension

Installation. List Wrangler - Mailing List Manager for GTK+ Part I. 1 Requirements. By Frank Cox. September 3,

Spout to NDI. Convert between Spout and Newtek NDI sources. Using the Newtek NDI SDK. Version 2.

Definiens. Tissue Studio Release Notes

Packet Trace Guide. Packet Trace Guide. Technical Note

@list = bsd_glob('*.[ch]'); $homedir = bsd_glob('~gnat', GLOB_TILDE GLOB_ERR);

Navigator Documentation

AT03262: SAM D/R/L/C System Pin Multiplexer (SYSTEM PINMUX) Driver. Introduction. SMART ARM-based Microcontrollers APPLICATION NOTE

SW MAPS TEMPLATE BUILDER. User s Manual

APPLICATION NOTE. Atmel AT03261: SAM D20 System Interrupt Driver (SYSTEM INTERRUPT) SAM D20 System Interrupt Driver (SYSTEM INTERRUPT)

Simba Cassandra ODBC Driver with SQL Connector

Package clipr. June 23, 2018

PRODUCT SPECIFIC LICENSE TERMS Sybase Enterprise Portal Version 5 Application Edition ( Program )

CSCE Inspection Activity Name(s):

DoJSON Documentation. Release Invenio collaboration

Flask-Sitemap Documentation

ODM 1.1. An application Hydrologic. June Prepared by: Jeffery S. Horsburgh Justin Berger Utah Water

Enterprise Payment Solutions. Scanner Installation April EPS Scanner Installation: Quick Start for Remote Deposit Complete TM

Flask Gravatar. Release 0.5.0

Watch 4 Size v1.0 User Guide By LeeLu Soft 2013

Package geojsonsf. R topics documented: January 11, Type Package Title GeoJSON to Simple Feature Converter Version 1.3.

Conettix Universal Dual Path Communicator B465

AT11512: SAM L Brown Out Detector (BOD) Driver. Introduction. SMART ARM-based Microcontrollers APPLICATION NOTE

Anybus Wireless Bridge Ethernet Bluetooth Access Point Product Guide

<!--Released: February > <!--====================================================================--> <!--March > <!--

QuarkXPress Server Manager 8.0 ReadMe

Transcription:

Type Package Package fst June 7, 2018 Title Lightning Fast Serialization of Data Frames for R Multithreaded serialization of compressed data frames using the 'fst' format. The 'fst' format allows for random access of stored data and compression with the LZ4 and ZSTD compressors created by Yann Collet. The ZSTD compression library is owned by Facebook Inc. Version 0.8.8 Date 2018-06-06 LazyData true Depends R (>= 3.0.0) Imports Rcpp LinkingTo Rcpp SystemRequirements little-endian platform RoxygenNote 6.0.1 Suggests testthat, bit64, data.table, lintr, nanotime, crayon License AGPL-3 file LICENSE Copyright This package includes sources from the LZ4 library written by Yann Collet, sources of the ZSTD library owned by Facebook, Inc. and sources of the fstlib library owned by Mark Klik URL https://fstpackage.github.io BugReports https://github.com/fstpackage/fst/issues NeedsCompilation yes Author Mark Klik [aut, cre, cph], Yann Collet [ctb, cph] (Yann Collet is author of the bundled LZ4 and ZSTD code and copyright holder of LZ4), Facebook, Inc. [cph] (Bundled ZSTD code) Maintainer Mark Klik <markklik@gmail.com> Repository CRAN Date/Publication 2018-06-07 13:50:42 UTC 1

2 fst-package R topics documented: fst-package......................................... 2 compress_fst........................................ 3 decompress_fst....................................... 4 fst.............................................. 4 hash_fst........................................... 5 metadata_fst......................................... 6 threads_fst.......................................... 7 write_fst........................................... 8 Index 10 fst-package Lightning Fast Serialization of Data Frames for R. Details Multithreaded serialization of compressed data frames using the fst format. The fst format allows for random access of stored data which can be compressed with the LZ4 and ZSTD compressors. The fst package is based on three C++ libraries: fstlib: library containing code to write, read and compute on files stored in the fst format. Written and owned by Mark Klik. LZ4: library containing code to compress data with the LZ4 compressor. Written and owned by Yann Collet. ZSTD: library containing code to compress data with the ZSTD compressor. Written by Yann Collet and owned by Facebook, Inc. The following copyright notice, list of conditions and disclaimer apply to the use of the ZSTD library in the fst package: BSD License For Zstandard software Copyright (c) 2016-present, Facebook, Inc. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name Facebook nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

compress_fst 3 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIM- ITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIM- ITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THE- ORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUD- ING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. The following copyright notice, list of conditions and disclaimer apply to the use of the LZ4 library in the fst package: LZ4 Library Copyright (c) 2011-2016, Yann Collet All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIM- ITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIM- ITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THE- ORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUD- ING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. compress_fst Compress a raw vector using the LZ4 or ZSTD compressor. Compress a raw vector using the LZ4 or ZSTD compressor. compress_fst(x, compressor = "ZSTD", compression = 0, hash = FALSE)

4 fst x compressor raw vector. compressor to use for compressing x. Valid options are "LZ4" and "ZSTD" (default). compression compression factor used. Must be in the range 0 (lowest compression) to 100 (maximum compression). hash Compute hash of compressed data. This hash is stored in the resulting raw vector and can be used during decompression to check the validity of the compressed vector. Hash computation is done with the very fast xxhash algorithm and implemented as a parallel operation, so the performance hit will be very small. decompress_fst Decompress a raw vector with compressed data. Decompress a raw vector with compressed data. decompress_fst(x) x raw vector with data previously compressed with compress_fst. a raw vector with previously compressed data. fst Access a fst file like a regular data frame Create a fst_table object that can be accessed like a regular data frame. This object is just a reference to the actual data and requires only a small amount of memory. When data is accessed, only a subset is read from file, depending on the minimum and maximum requested row number. This is possible because the fst file format allows full random access (in columns and rows) to the stored dataset. fst(path, old_format = FALSE)

hash_fst 5 path old_format path to fst file use TRUE to read fst files generated with a fst package version lower than v0.8.0 An object of class fst_table Examples ## Not run: # generate a sample fst file path <- paste0(tempfile(), ".fst") write_fst(iris, path) # create a fst_table object that can be used as a data frame ft <- fst(path) # print head and tail print(ft) # select columns and rows x <- ft[10:14, c("petal.width", "Species")] # use the common list interface ft[true] ft[c(true, FALSE)] ft[["sepal.length"]] ft$petal.length # use data frame generics nrow(ft) ncol(ft) dim(ft) dimnames(ft) colnames(ft) rownames(ft) names(ft) ## End(Not run) hash_fst Parallel calculation of the hash of a raw vector Parallel calculation of the hash of a raw vector

6 metadata_fst hash_fst(x, seed = NULL, block_hash = TRUE) x seed block_hash raw vector that you want to hash The seed value for the hashing algorithm. If NULL, a default seed will be used. If TRUE, a multi-threaded implementation of the 64-bit xxhash algorithm will be used. Note that block_hash = TRUE or block_hash = FALSE will result in different hash values. hash value metadata_fst Read metadata from a fst file Method for checking basic properties of the dataset stored in path. metadata_fst(path, old_format = FALSE) fst.metadata(path, old_format = FALSE) path old_format path to fst file use TRUE to read fst files generated with a fst package version lower than v0.8.0 Returns a list with meta information on the stored dataset in path. Has class fstmetadata. Examples # Sample dataset x <- data.frame( First = 1:10, Second = sample(c(true, FALSE, NA), 10, replace = TRUE), Last = sample(letters, 10)) # Write to fst file write_fst(x, "dataset.fst")

threads_fst 7 # Display meta information metadata_fst("dataset.fst") threads_fst Get or set the number of threads used in parallel operations For parallel operations, the performance is determined to a great extend by the number of threads used. More threads will allow the CPU to perform more computational intensive tasks simultaneously, speeding up the operation. Using more threads also introduces some overhead that will scale with the number of threads used. Therefore, using the maximum number of available threads is not always the fastest solution. With threads_fst the number of threads can be adjusted to the users specific requirements. As a default, fst uses a number of threads equal to the number of logical cores in the system. threads_fst(nr_of_threads = NULL, reset_after_fork = NULL) nr_of_threads number of threads to use or NULL to get the current number of threads used in multithreaded operations. reset_after_fork when fst is running in a forked process, the usage of OpenMP can create problems. To prevent these, fst switches back to single core usage when it detects a fork. After the fork, the number of threads is reset to it s initial setting. However, on some compilers (e.g. Intel), switching back to multi-threaded mode can lead to issues. When reset_after_fork is set to FALSE, fst is left in single-threaded mode after the fork ends. After the fork, multithreading can be activated again manually by calling threads_fst with an appropriate value for nr_of_threads. The default (reset_after_fork = NULL) leaves the fork behavior unchanged. Details The number of threads can also be set with option(fst_threads = N). NOTE: This option is only read when the package s namespace is first loaded, with commands like library, require, or ::. If you have already used one of these, you must use threads_fst to set the number of threads. the number of threads (previously) used

8 write_fst write_fst Read and write fst files. Read and write data frames from and to a fast-storage (fst) file. Allows for compression and (file level) random access of stored data, even for compressed datasets. Multiple threads are used to obtain high (de-)serialization speeds but all background threads are re-joined before write_fst and read_fst return (reads and writes are stable). When using a data.table object for x, the key (if any) is preserved, allowing storage of sorted data. Methods read_fst and write_fst are equivalent to read.fst and write.fst (but the former syntax is preferred). write_fst(x, path, compress = 50, uniform_encoding = TRUE) read_fst(path, columns = NULL, from = 1, to = NULL, as.data.table = FALSE, old_format = FALSE) write.fst(x, path, compress = 50, uniform_encoding = TRUE) read.fst(path, columns = NULL, from = 1, to = NULL, as.data.table = FALSE, old_format = FALSE) x path a data frame to write to disk path to fst file compress value in the range 0 to 100, indicating the amount of compression to use. Lower values mean larger file sizes. The default compression is set to 50. uniform_encoding If TRUE, all character vectors will be assumed to have elements with equal encoding. The encoding (latin1, UTF8 or native) of the first non-na element will used as encoding for the whole column. This will be a correct assumption for most use cases. If uniform.encoding is set to FALSE, no such assumption will be made and all elements will be converted to the same encoding. The latter is a relatively expensive operation and will reduce write performance for character columns. columns from to as.data.table old_format Column names to read. The default is to read all all columns. Read data starting from this row number. Read data up until this row number. The default is to read to the last row of the stored dataset. If TRUE, the result will be returned as a data.table object. Any keys set on dataset x before writing will be retained. This allows for storage of sorted datasets. use TRUE to read fst files generated with a fst package version lower than v0.8.0

write_fst 9 read_fst returns a data frame with the selected columns and rows. read_fst invisibly returns x (so you can use this function in a pipeline). Examples # Sample dataset x <- data.frame(a = 1:10000, B = sample(c(true, FALSE, NA), 10000, replace = TRUE)) # Default compression write_fst(x, "dataset.fst") # filesize: 17 KB y <- read_fst("dataset.fst") # read fst file # Maximum compression write_fst(x, "dataset.fst", 100) # filesize: 4 KB y <- read_fst("dataset.fst") # read fst file # Random access y <- read_fst("dataset.fst", "B") # read selection of columns y <- read_fst("dataset.fst", "A", 100, 200) # read selection of columns and rows

Index compress_fst, 3 decompress_fst, 4 fst, 4 fst-package, 2 fst.metadata (metadata_fst), 6 hash_fst, 5 metadata_fst, 6 read.fst (write_fst), 8 read_fst (write_fst), 8 threads_fst, 7 write.fst (write_fst), 8 write_fst, 8 10