Package fastrtext. December 10, 2017


Type Package
Title 'fasttext' Wrapper for Text Classification and Word Representation
Version 0.2.4
Date 2017-12-09
Maintainer Michaël Benesty <michael@benesty.fr>
Description Learning text representations and training text classifiers may rely on the same simple and efficient approach. 'fasttext' is an open-source, free, lightweight library that allows users to perform both tasks. It transforms text into continuous vectors that can later be used in many language-related tasks. It works on standard, generic hardware (no 'GPU' required). It also includes a model size reduction feature. The original 'fasttext' source code is available at <https://github.com/facebookresearch/fasttext>.
URL https://github.com/pommedeterresautee/fastrtext, https://pommedeterresautee.github.io/fastrtext/
BugReports https://github.com/pommedeterresautee/fastrtext/issues
License MIT + file LICENSE
Depends R (>= 3.3)
Imports methods, Rcpp (>= 0.12.12), assertthat
Suggests knitr, testthat
LinkingTo Rcpp, RcppThread (>= 0.1.2)
LazyData true
VignetteBuilder knitr
RoxygenNote 6.0.1
Encoding UTF-8
NeedsCompilation yes
Author Michaël Benesty [aut, cre, cph], Facebook, Inc [cph]

Repository CRAN
Date/Publication 2017-12-09 23:39:46 UTC

R topics documented:

add_tags  2
execute  3
fastrtext  4
get_analogies  5
get_dictionary  5
get_hamming_loss  6
get_labels  7
get_nn  7
get_parameters  8
get_sentence_representation  9
get_word_distance  9
get_word_vectors  10
load_model  11
predict.rcpp_fastrtext  11
print_help  12
Rcpp_fastrtext-class  13
stop_words_sentences  13
test_sentences  14
train_sentences  15
Index  16

add_tags                Add tags to documents

Description:
Add tags in the fasttext format. This format is required for the training step.

Usage:
add_tags(documents, tags, prefix = "__label__")

Arguments:
documents   texts to learn
tags        labels provided as a list or a vector. There can be 1 or more per document.
prefix      character to add in front of each tag (fasttext format)

Value:
character ready to be written in a file

Examples:
tags <- list(c(1, 5), 0)
documents <- c("this is a text", "this is another document")
add_tags(documents = documents, tags = tags)

execute                 Execute commands on a fasttext model (including training)

Description:
Use the same commands as the ones used on the command line.

Usage:
execute(commands)

Arguments:
commands   character vector of commands

Examples:
## Not run:
# Supervised learning example
data("train_sentences")
data("test_sentences")

# prepare data
tmp_file_model <- tempfile()
train_labels <- paste0("__label__", train_sentences[, "class.text"])
train_texts <- tolower(train_sentences[, "text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[, "class.text"])
test_texts <- tolower(test_sentences[, "text"])
test_to_write <- paste(test_labels, test_texts)

# learn model
execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 2, "-verbose", 1))

model <- load_model(tmp_file_model)
predict(model, sentences = test_sentences[1, "text"])

# Unsupervised learning example
data("train_sentences")
data("test_sentences")

texts <- tolower(train_sentences[, "text"])
tmp_file_txt <- tempfile()
tmp_file_model <- tempfile()
writeLines(text = texts, con = tmp_file_txt)
execute(commands = c("skipgram", "-input", tmp_file_txt,
                     "-output", tmp_file_model, "-verbose", 1))

model <- load_model(tmp_file_model)
dict <- get_dictionary(model)
get_word_vectors(model, head(dict, 5))

## End(Not run)

fastrtext               fastrtext: 'fasttext' Wrapper for Text Classification and Word Representation

Description:
Learning text representations and training text classifiers may rely on the same simple and efficient approach. 'fasttext' is an open-source, free, lightweight library that allows users to perform both tasks. It transforms text into continuous vectors that can later be used in many language-related tasks. It works on standard, generic hardware (no 'GPU' required). It also includes a model size reduction feature. The original 'fasttext' source code is available at <https://github.com/facebookresearch/fasttext>.

Author(s):
Maintainer: Michaël Benesty <michael@benesty.fr> [copyright holder]
Other contributors: Facebook, Inc <bojanowski@fb.com> [copyright holder]

See Also:
Useful links:
https://github.com/pommedeterresautee/fastrtext
https://pommedeterresautee.github.io/fastrtext/
Report bugs at https://github.com/pommedeterresautee/fastrtext/issues

get_analogies           Get analogy

Description:
From the Mikolov paper. Based on vector arithmetic: king is to queen what man is to ??? The analogy is computed as w1 - w2 + w3.

Usage:
get_analogies(model, w1, w2, w3, k = 1)

Arguments:
model   trained fasttext model. NULL if training a new model.
w1      1st word, basis
w2      2nd word, move
w3      3rd word, new basis
k       number of words to return

Value:
a numeric with distances, whose names are words

Examples:
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_analogies(model, "experience", "experiences", "result")

get_dictionary          Get list of known words

Description:
Get a character vector containing each word seen during training.

Usage:
get_dictionary(model)

Arguments:
model   trained fasttext model

Value:
character containing each word

Examples:
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
print(head(get_dictionary(model), 5))

get_hamming_loss        Hamming loss

Description:
Compute the Hamming loss. When there is only one category, this measures the accuracy.

Usage:
get_hamming_loss(labels, predictions)

Arguments:
labels        list of labels
predictions   list returned by the predict command (including both the probabilities and the categories)

Value:
a scalar with the loss

Examples:
data("test_sentences")
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
sentences <- test_sentences[, "text"]
test_labels <- test_sentences[, "class.text"]
predictions <- predict(model, sentences)
get_hamming_loss(as.list(test_labels), predictions)
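For intuition, the metric itself can be sketched by hand. The helper below is a hypothetical illustration of the Hamming loss on single-label data, not the package's internal implementation: it counts the fraction of label slots predicted incorrectly, which with one label per document equals 1 - accuracy.

```r
# Hypothetical sketch of the Hamming loss (not the package's implementation).
# For each document, compare true and predicted label sets slot by slot,
# then average the per-document error rates.
hamming_loss_sketch <- function(truth, predicted) {
  stopifnot(length(truth) == length(predicted))
  mean(mapply(function(t, p) mean(t != p), truth, predicted))
}

truth     <- list("ARXIV", "JDM", "PLOS")
predicted <- list("ARXIV", "PLOS", "PLOS")
hamming_loss_sketch(truth, predicted)  # one of three labels is wrong, about 0.33
```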

get_labels              Get list of labels (supervised model)

Description:
Get a character vector containing each label seen during training.

Usage:
get_labels(model)

Arguments:
model   trained fasttext model

Value:
character containing each label

Examples:
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
print(head(get_labels(model), 5))

get_nn                  Get nearest neighbour vectors

Description:
Find the k words with the smallest distance. The first execution can be slow because of precomputation. The search is done linearly; if your model is big you may want to use an approximate nearest neighbour algorithm from another R package (like RcppAnnoy).

Usage:
get_nn(model, word, k)

Arguments:
model   trained fasttext model. NULL if training a new model.
word    reference word
k       integer defining the number of results to return

Value:
numeric with distances, with names as words

Examples:
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_nn(model, "time", 10)

get_parameters          Export hyperparameters

Description:
Retrieve the hyperparameters used to train the model.

Usage:
get_parameters(model)

Arguments:
model   trained fasttext model

Value:
list containing each parameter

Examples:
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
print(head(get_parameters(model), 5))

get_sentence_representation   Get sentence embedding

Description:
The sentence is split into words (using space separation), and the word embeddings are averaged.

Usage:
get_sentence_representation(model, sentences)

Arguments:
model       fasttext model
sentences   character containing the sentences

Examples:
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
m <- get_sentence_representation(model, "this is a test")
print(m)

get_word_distance       Distance between two words

Description:
The distance is equal to 1 - cosine similarity.

Usage:
get_word_distance(model, w1, w2)

Arguments:
model   trained fasttext model. NULL if training a new model.
w1      first word to compare
w2      second word to compare

Value:
a scalar with the distance

Examples:
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_word_distance(model, "time", "timing")

get_word_vectors        Get word embeddings

Description:
Return the vector representation of the provided words (unsupervised training) or of the provided labels (supervised training).

Usage:
get_word_vectors(model, words = get_dictionary(model))

Arguments:
model   trained fasttext model
words   character of words. Default: return every word from the dictionary.

Value:
matrix containing each word embedding as a row, with rownames populated with the word strings.

Examples:
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_word_vectors(model, c("introduction", "we"))
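The relation between get_word_distance and the raw embeddings can be sketched by recomputing 1 - cosine similarity from the matrix returned by get_word_vectors. This is a minimal illustration under the documented definition, not the package's internal code, and cosine_distance_sketch is a hypothetical helper name:

```r
# Hypothetical sketch: recompute 1 - cosine similarity from raw embeddings.
cosine_distance_sketch <- function(model, w1, w2) {
  m <- get_word_vectors(model, c(w1, w2))
  v1 <- m[1, ]
  v2 <- m[2, ]
  1 - sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
# Should be close to get_word_distance(model, "time", "timing")
cosine_distance_sketch(model, "time", "timing")
```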

load_model              Load an existing fasttext trained model

Description:
Load and return a pointer to an existing model, which will be used in the other functions of this package.

Usage:
load_model(path)

Arguments:
path   path to the existing model

Examples:
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)

predict.rcpp_fastrtext  Get predictions (for a supervised model)

Description:
Apply the trained model to new sentences. Word embeddings are averaged and the most similar label vectors are searched.

Usage:
## S3 method for class 'Rcpp_fastrtext'
predict(object, sentences, k = 1, simplify = FALSE,
        unlock_empty_predictions = FALSE, ...)

Arguments:
object      trained fasttext model
sentences   character containing the sentences
k           will return the k most probable labels (default = 1)
simplify    when TRUE and k = 1, the function returns a (flat) numeric instead of a list

unlock_empty_predictions   logical to avoid a crash when predictions are missing for some sentences because none of their words were seen during training. This parameter should only be set to TRUE for debugging.
...         not used

Value:
list containing, for each sentence, the probabilities associated with the k labels.

Examples:
data("test_sentences")
model_test_path <- system.file("extdata", "model_classification_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
sentence <- test_sentences[1, "text"]
print(predict(model, sentence))

print_help              Print help

Description:
Print command information, mainly to be used with the execute() function.

Usage:
print_help()

Examples:
## Not run:
print_help()
## End(Not run)

Rcpp_fastrtext-class    Rcpp_fastrtext class

Description:
Models are S4 objects with several slots (methods) which can be called this way: model$slot_name()

Slots:
load             Load a model
predict          Make a prediction
execute          Execute commands
get_vectors      Get vectors related to provided words
get_parameters   Get parameters used to train the model
get_dictionary   List all words learned
get_labels       List all labels learned

stop_words_sentences    Stop words list

Description:
List of words that can be safely removed from sentences.

Usage:
stop_words_sentences

Format:
Character vector of stop words

Source:
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numatt=&numins=&type=text&sort=nameup&view=table
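A quick look at the bundled stop word list, as a minimal usage sketch; the exact words it contains (and therefore which tokens are removed below) depend on the dataset shipped with the installed package:

```r
library(fastrtext)

# Load the bundled stop word dataset and inspect it.
data("stop_words_sentences")
str(stop_words_sentences)   # documented as a character vector
head(stop_words_sentences)

# Example: drop stop words from a tokenized sentence.
tokens <- strsplit("this is a small test sentence", " ")[[1]]
tokens[!tokens %in% stop_words_sentences]
```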

test_sentences          Sentence corpus - test part

Description:
This corpus contains sentences from the abstract and introduction of 30 scientific articles that have been annotated (i.e. labeled or tagged) according to a modified version of the Argumentative Zones annotation scheme.

Usage:
test_sentences

Format:
A data frame with 3117 rows and 2 variables:
text        the sentences as a character vector
class.text  the category of the sentence

Details:
These 30 scientific articles come from three different domains:
1. PLoS Computational Biology (PLOS)
2. The machine learning repository on arXiv (ARXIV)
3. The psychology journal Judgment and Decision Making (JDM)
There are 10 articles from each domain. In addition to the labeled data, this corpus also contains a corresponding set of unlabeled articles. These unlabeled articles also come from PLOS, ARXIV, and JDM. There are 300 unlabeled articles from each domain (again, only the sentences from the abstract and introduction). These unlabeled articles can be used for unsupervised or semi-supervised approaches to sentence classification which rely on a small set of labeled data and a larger set of unlabeled data.

References:
S. Teufel and M. Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409-445, 2002.
S. Teufel. Argumentative zoning: information extraction from scientific text. PhD thesis, School of Informatics, University of Edinburgh, 1999.

Source:
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numatt=&numins=&type=text&sort=nameup&view=table
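Loading and inspecting the corpus can be sketched as follows; the column names come from the Format section above, and the exact class counts depend on the data shipped with the installed package:

```r
library(fastrtext)

# Load the test split and inspect its structure.
data("test_sentences")
str(test_sentences)                    # data frame with columns text and class.text
head(test_sentences[, "text"], 3)      # first few sentences
table(test_sentences[, "class.text"])  # distribution of sentence categories
```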

train_sentences         Sentence corpus - train part

Description:
This corpus contains sentences from the abstract and introduction of 30 scientific articles that have been annotated (i.e. labeled or tagged) according to a modified version of the Argumentative Zones annotation scheme.

Usage:
train_sentences

Format:
A data frame with 3117 rows and 2 variables:
text        the sentences as a character vector
class.text  the category of the sentence

Details:
These 30 scientific articles come from three different domains:
1. PLoS Computational Biology (PLOS)
2. The machine learning repository on arXiv (ARXIV)
3. The psychology journal Judgment and Decision Making (JDM)
There are 10 articles from each domain. In addition to the labeled data, this corpus also contains a corresponding set of unlabeled articles. These unlabeled articles also come from PLOS, ARXIV, and JDM. There are 300 unlabeled articles from each domain (again, only the sentences from the abstract and introduction). These unlabeled articles can be used for unsupervised or semi-supervised approaches to sentence classification which rely on a small set of labeled data and a larger set of unlabeled data.

References:
S. Teufel and M. Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409-445, 2002.
S. Teufel. Argumentative zoning: information extraction from scientific text. PhD thesis, School of Informatics, University of Edinburgh, 1999.

Source:
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numatt=&numins=&type=text&sort=nameup&view=table

Index

Topic datasets
  stop_words_sentences, 13
  test_sentences, 14
  train_sentences, 15

add_tags, 2
character, 2, 3, 5-7, 9-11
execute, 3
execute(), 12
fastrtext, 4
fastrtext-package (fastrtext), 4
get_analogies, 5
get_dictionary, 5
get_hamming_loss, 6
get_labels, 7
get_nn, 7
get_parameters, 8
get_sentence_representation, 9
get_word_distance, 9
get_word_vectors, 10
integer, 7
list, 2, 8, 11, 12
load_model, 11
logical, 12
matrix, 10
names, 5, 8
NULL, 5
numeric, 5, 8, 11
predict.rcpp_fastrtext, 11
print_help, 12
Rcpp_fastrtext-class, 13
S4, 13
stop_words_sentences, 13
test_sentences, 14
train_sentences, 15
TRUE, 11, 12
vector, 2