[R-pkg-team] Bug#988829: ITP: r-cran-tokenizers -- GNU R fast, consistent tokenization of natural language text
Andreas Tille
tille at debian.org
Thu May 20 08:53:48 BST 2021
Package: wnpp
Owner: Andreas Tille <tille at debian.org>
Severity: wishlist
* Package name    : r-cran-tokenizers
  Version         : 0.2.1
  Upstream Author : Lincoln Mullen
* URL             : https://cran.r-project.org/package=tokenizers
* License         : MIT
  Programming Lang: GNU R
  Description     : GNU R fast, consistent tokenization of natural language text
Convert natural language text into tokens. Includes tokenizers for
shingled n-grams, skip n-grams, words, word stems, sentences,
paragraphs, characters, shingled characters, lines, tweets, Penn
Treebank, regular expressions, as well as functions for counting
characters, words, and sentences, and a function for splitting longer
texts into separate documents, each with the same number of words.
The tokenizers have a consistent interface, and the package is built
on the 'stringi' and 'Rcpp' packages for fast yet correct
tokenization in 'UTF-8'.
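For reference, a minimal usage sketch of the consistent interface described
above, assuming the tokenizer and counting function names documented upstream
for 'tokenizers' 0.2.1 (tokenize_words, tokenize_sentences, tokenize_ngrams,
count_words); exact arguments should be checked against the package manual:

  library(tokenizers)

  text <- "Convert natural language text into tokens. The tokenizers share a consistent interface."

  # Word and sentence tokenizers: a list with one character vector per input document.
  tokenize_words(text)
  tokenize_sentences(text)

  # Shingled word n-grams (here: bigrams up to trigrams).
  tokenize_ngrams(text, n = 3, n_min = 2)

  # Counting helper.
  count_words(text)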
Remark: This package is maintained by the Debian R Packages Maintainers team at
https://salsa.debian.org/r-pkg-team/r-cran-tokenizers