[med-svn] [r-cran-stringr] 01/05: New upstream version 1.2.0
Andreas Tille
tille at debian.org
Fri Sep 29 21:28:25 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-cran-stringr.
commit 5243e8c1e84acfca5ea72cd314282e719f056a69
Author: Andreas Tille <tille at debian.org>
Date: Fri Sep 29 23:17:44 2017 +0200
New upstream version 1.2.0
---
DESCRIPTION | 14 +-
LICENSE | 339 ++++++++++++++++++++++++++++
MD5 | 101 +++++----
NAMESPACE | 1 +
NEWS.md | 33 +++
R/case.R | 11 +-
R/detect.r | 4 +-
R/match.r | 6 +-
R/modifiers.r | 4 +-
R/replace.r | 81 ++++++-
R/sort.R | 23 +-
R/stringr.R | 2 +
R/subset.R | 19 +-
README.md | 162 +++++++++++---
build/vignette.rds | Bin 210 -> 241 bytes
inst/doc/regular-expressions.R | 152 +++++++++++++
inst/doc/regular-expressions.Rmd | 415 +++++++++++++++++++++++++++++++++++
inst/doc/regular-expressions.html | 398 +++++++++++++++++++++++++++++++++
inst/doc/stringr.R | 121 ++++++++--
inst/doc/stringr.Rmd | 263 +++++++++++++++-------
inst/doc/stringr.html | 251 +++++++++++++--------
man/case.Rd | 14 +-
man/invert_match.Rd | 1 -
man/modifier-deprecated.Rd | 3 +-
man/modifiers.Rd | 13 +-
man/pipe.Rd | 1 -
man/str_c.Rd | 1 -
man/str_conv.Rd | 1 -
man/str_count.Rd | 1 -
man/str_detect.Rd | 5 +-
man/str_dup.Rd | 1 -
man/str_extract.Rd | 1 -
man/str_interp.Rd | 1 -
man/str_length.Rd | 1 -
man/str_locate.Rd | 1 -
man/str_match.Rd | 1 -
man/str_order.Rd | 21 +-
man/str_pad.Rd | 1 -
man/str_replace.Rd | 62 ++++--
man/str_replace_na.Rd | 10 +-
man/str_split.Rd | 1 -
man/str_sub.Rd | 1 -
man/str_subset.Rd | 18 +-
man/str_trim.Rd | 1 -
man/str_trunc.Rd | 1 -
man/str_view.Rd | 1 -
man/str_wrap.Rd | 1 -
man/stringr-data.Rd | 5 +-
man/stringr-package.Rd | 33 +++
man/word.Rd | 1 -
tests/testthat/test-match.r | 4 +
tests/testthat/test-replace.r | 21 ++
vignettes/regular-expressions.Rmd | 415 +++++++++++++++++++++++++++++++++++
vignettes/releases/stringr-1.0.0.Rmd | 76 +++++++
vignettes/releases/stringr-1.1.0.Rmd | 33 +++
vignettes/stringr.Rmd | 263 +++++++++++++++-------
56 files changed, 2957 insertions(+), 458 deletions(-)
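Reader's note (not part of the upstream commit): the NEWS.md hunk further down lists the user-visible changes in 1.2.0. The following is a minimal sketch of how those changes might look from the R prompt, assuming stringr >= 1.2.0 is installed. The input vectors and variable names are made up for illustration, and the behaviour shown is inferred from the NEWS entries rather than from the package sources.

```r
# Illustrative sketch only; assumes stringr >= 1.2.0 is available.
library(stringr)

# Hypothetical sample data (spelling follows the R/detect.r example).
fruits <- c("apple", "banana", "pear", "pinapple")

# New str_which(): indices of the elements that match, like grep().
idx <- str_which(fruits, "^p")

# `replacement` may be a function; it is called on the match and its
# return value is substituted back into the string.
first_vowel_upper <- str_replace("apple pie", "[aeiou]", toupper)

# str_match_all(): an optional group that does not match is reported
# as NA instead of "".
m <- str_match_all("a", "a(b)?")[[1]]

# Locale-aware functions now default to "en"; other locales must be
# requested explicitly.
upper_en <- str_to_upper("i")
upper_tr <- str_to_upper("i", "tr")

# str_sort() gains `numeric` for sorting embedded numbers by value.
sorted <- str_sort(c("a10", "a2", "a1"), numeric = TRUE)
```

Pinning the locale default to "en" means the results above should not depend on the platform's locale settings, which is the motivation given in the NEWS entry.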
diff --git a/DESCRIPTION b/DESCRIPTION
index a6f51dd..8d0fb37 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,5 +1,5 @@
Package: stringr
-Version: 1.1.0
+Version: 1.2.0
Title: Simple, Consistent Wrappers for Common String Operations
Description: A consistent, simple and easy to use set of wrappers around the
fantastic 'stringi' package. All function and argument names (and positions)
@@ -10,19 +10,19 @@ Authors@R: c(
person("Hadley", "Wickham", , "hadley at rstudio.com", c("aut", "cre", "cph")),
person("RStudio", role = "cph")
)
-License: GPL-2
+License: GPL-2 | file LICENSE
Depends: R (>= 2.14)
Imports: stringi (>= 0.4.1), magrittr
Suggests: testthat, knitr, htmltools, htmlwidgets, rmarkdown, covr
VignetteBuilder: knitr
-URL: https://github.com/hadley/stringr
-BugReports: https://github.com/hadley/stringr/issues
-RoxygenNote: 5.0.1
+URL: http://stringr.tidyverse.org, https://github.com/tidyverse/stringr
+BugReports: https://github.com/tidyverse/stringr/issues
+RoxygenNote: 6.0.1
LazyData: true
NeedsCompilation: no
-Packaged: 2016-08-19 14:42:23 UTC; hadley
+Packaged: 2017-02-17 15:23:03 UTC; hadley
Author: Hadley Wickham [aut, cre, cph],
RStudio [cph]
Maintainer: Hadley Wickham <hadley at rstudio.com>
Repository: CRAN
-Date/Publication: 2016-08-19 21:02:58
+Date/Publication: 2017-02-18 21:23:06
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..d159169
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,339 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License along
+ with this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) year name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
diff --git a/MD5 b/MD5
index 46f15b8..8927c36 100644
--- a/MD5
+++ b/MD5
@@ -1,68 +1,74 @@
-660e33eaeb0b2d065987c16301ceddf4 *DESCRIPTION
-418f3881a82e6c06bc2fd35b75912326 *NAMESPACE
-6b7bb4664312869a77e785ec86054244 *NEWS.md
+390e18788a2c4fb30a3b8313d077b27d *DESCRIPTION
+b234ee4d69f5fce4486a80fdaf4a4263 *LICENSE
+c27a838bf6d4d63390de7b5eee7c92f6 *NAMESPACE
+54581cfc188392e3a4725988e372a26d *NEWS.md
602ca1848b4280919b889dcce1985932 *R/c.r
-9c5e91f93a404215e8c4946c5a3ac2b7 *R/case.R
+2691657636967683798e114f578a442f *R/case.R
bc5a2a73f2842baf45454f53f324f274 *R/conv.R
9bcbcd4184483b252ce0ae00ae7cc80a *R/count.r
425b4de1b9cfd96ec12ffa98352cf187 *R/data.R
-3243b04ceb3ef212f41996563c2b3868 *R/detect.r
+9c01de0681437cc3c56a18d3d0738bfa *R/detect.r
017248629447b588fd6e607c6a5b420e *R/dup.r
69c7bf3df301e410b60791b277ce3d67 *R/extract.r
16c94e98a606be41be05f299437c2e9e *R/interp.R
8ef6b7317657989078722686317f810b *R/length.r
44a0e610f88dfe850562cb3f579d7f8c *R/locate.r
-733d4de0d77693b47fe2b24b1ca117dc *R/match.r
-4bd875dbb7d64543cfdb7a2008d3f0a2 *R/modifiers.r
+12184f91aa33d6d958d4209708b08319 *R/match.r
+4e38d5ecad6a6f7bfa678d0cba1af24b *R/modifiers.r
423e3766d98b030f26aef7496986b86a *R/pad-trim.r
-3e83d8d1a990c3104b365e7eaec124d1 *R/replace.r
-932f28001598d05ca5f977721e9bb131 *R/sort.R
+8764fb252e34fc2494322195dba9da7e *R/replace.r
+325be9d9a224126eda73ddd0083eb749 *R/sort.R
19f7b487c142462e73542a1045a95e82 *R/split.r
+b4fe12876c37f8a22cde06e9701eecf0 *R/stringr.R
657a92dfd6074f24e81c49279bd54297 *R/sub.r
-ef300583eb4793a4f0159f5c53e75565 *R/subset.R
+771d6511cdf9ca27379d4371539af82b *R/subset.R
f583f5b5856f7cb5f2c5fbb04f39f8a8 *R/utils.R
e2abdcd205330bc1aa537fed20a8760a *R/view.R
a28d1e2666b667c3800983b1e1b8f6e8 *R/word.r
200bb24c414024721759d59d2907eadd *R/wrap.r
-e8b5b22a21cdea65d5119fba8ecabc15 *README.md
-849843ab5b82d66bfdac5b4f3343d405 *build/vignette.rds
+d61ee81ed384f3894e2df6d4257ed231 *README.md
+54f55a7db353200f3927e5e7b728e7c0 *build/vignette.rds
89f0d280160eb4419b23251639a728c2 *data/fruit.rda
7ad07be2e18f2b3459b55adc0c03c498 *data/sentences.rda
c99f00d311e24c76bbeabfc8a58b4b50 *data/words.rda
-8ffc45088b1068264eba4514b264d53a *inst/doc/stringr.R
-413e34561c01308503e6bc783c359011 *inst/doc/stringr.Rmd
-dc047c46fce1c7d904228f04f294cae7 *inst/doc/stringr.html
+baeeda52353d93ad1530a71401b5b323 *inst/doc/regular-expressions.R
+140a0c05326e805652da6b816420f1c3 *inst/doc/regular-expressions.Rmd
+78f12ec798e31738f61e726141358a1d *inst/doc/regular-expressions.html
+3180c9c46e5c7bba790e79ceb4807c90 *inst/doc/stringr.R
+2d7c83d55dfc0acb32be74615430984b *inst/doc/stringr.Rmd
+b16744e619eecd63e08d758d13151ff2 *inst/doc/stringr.html
0cce813b2f19d701b1f00d51d42902c1 *inst/htmlwidgets/lib/str_view.css
e7c37a495d4ae965400eeb1000dee672 *inst/htmlwidgets/str_view.js
1763429826b7f9745d2e590e4ca4c119 *inst/htmlwidgets/str_view.yaml
-a8cf2920d15e44ce35fdb434174fe86c *man/case.Rd
-6bd36c43504097d68c16f275f10cc870 *man/invert_match.Rd
-867ece9169d129df0198d7a088587c36 *man/modifier-deprecated.Rd
-ec6cb5312ac29edf3ebf1fbe80add745 *man/modifiers.Rd
-7f15d9fd60f36bb4743cfb0fafeea78e *man/pipe.Rd
-e2fca59f2742bfc3b3a6d13f0a381663 *man/str_c.Rd
-c1df5ae44a9c4d4e94526962f4b7f965 *man/str_conv.Rd
-32e0f85fcf94197bd2563f63b7e43b1a *man/str_count.Rd
-40bd26da1ace34057846c683997f5cc4 *man/str_detect.Rd
-652ee618a77d49f1dfcc1cd0de3e50f4 *man/str_dup.Rd
-fff77c9ac7ddda0c6534fb4f41c9cacc *man/str_extract.Rd
-177ab86b4c63f2c639e434e5c5ff9b5e *man/str_interp.Rd
-0ef5f5a05809492af06b6d8f619a28fc *man/str_length.Rd
-d25faba0376340f4fe0627bc0ce2dc5f *man/str_locate.Rd
-c49cf17e667f9e24a6b07510b324f461 *man/str_match.Rd
-ce37ca980b4fcbc539705c717a020bfa *man/str_order.Rd
-0247696a8fcf533b84374ee688091c47 *man/str_pad.Rd
-9f15367dbd187a6503b81aad6b7fb73e *man/str_replace.Rd
-c7bce9f4d5ba40cbeab1c673ca637d3f *man/str_replace_na.Rd
-27f4e80bc8db2ceb3ed2cbfe02bef5a2 *man/str_split.Rd
-ff6271866c0d0d187459f2215d876c60 *man/str_sub.Rd
-5b032797228eb70a9ad2f517b833f6fb *man/str_subset.Rd
-6dcb096c7fd141773de59aba3a979e5b *man/str_trim.Rd
-054a22ebda3d0eaed1e5696ecd3253fc *man/str_trunc.Rd
-bf5d56ed8975ba6f36daa4df746be9f6 *man/str_view.Rd
-3a31054aa873254306a4b109f33da177 *man/str_wrap.Rd
-9ccdac653cd80024e514feb7ebe510dc *man/stringr-data.Rd
-c915980d93d7cdf412b19e80e85dd554 *man/word.Rd
+e16c7ce7c100ec5760a8a656e25e55ce *man/case.Rd
+8b3c03869016124a8bbbfe6bf3b3c15d *man/invert_match.Rd
+abc5c331994245ad531745f57e38ede4 *man/modifier-deprecated.Rd
+248c2e54019399c529094caf86b942a1 *man/modifiers.Rd
+a64a7ea44fcaa33c2d3ad0f7909cbc3e *man/pipe.Rd
+e4300672e5c45af45fa76f1b164ce8ad *man/str_c.Rd
+778952b0ae9fb1c13e133e94a49a1e1e *man/str_conv.Rd
+80b756ae26ea742a5386d0070acda5fc *man/str_count.Rd
+9a11ea3a92fdade8e844fd1bebe80482 *man/str_detect.Rd
+ff1b4f8ff391243b73b7c7d55fdc4570 *man/str_dup.Rd
+8c2a2fae95ee423a6e8814f15368e350 *man/str_extract.Rd
+47b67bfcb344168750570a984ba132d2 *man/str_interp.Rd
+54e885eda93d56226c7bdc0ea5385225 *man/str_length.Rd
+dfede4d61d516bde3c0de4fae9e7376c *man/str_locate.Rd
+0c84ec635421d0eb4bc727e485ebc024 *man/str_match.Rd
+41896cd4305a9f208ab35df01f9a618d *man/str_order.Rd
+0b8d04cecb1e3b39bc7142607ddd1685 *man/str_pad.Rd
+5c847830b77d99cda9f8a593638b6670 *man/str_replace.Rd
+336d96a35dfaf71efed27c5d0e67e28d *man/str_replace_na.Rd
+6bf0a4ec7eef56ca5aa74bc112e226e6 *man/str_split.Rd
+9fdd712e9aa48936f03a52b7b51fb655 *man/str_sub.Rd
+bfbc45dc52a89172f023a216512eedeb *man/str_subset.Rd
+65e2727df5334214f50b8d2aaf9a41d8 *man/str_trim.Rd
+9d6bab45aa95f96dce3e99d8a913e301 *man/str_trunc.Rd
+2ae83edaeca4861b91e7a077886d2264 *man/str_view.Rd
+97161406f461a457f34973310b48497b *man/str_wrap.Rd
+47b1ce113c0ee06888c55adead1a314e *man/stringr-data.Rd
+f4b4600fe3382977fceb1fe8d7f3f816 *man/stringr-package.Rd
+60b2089312906075dc36493b122da772 *man/word.Rd
4ee9d05bd4688270eca8d85299cedcd1 *tests/testthat.R
61f9d77768cf9ff813d382f9337178fb *tests/testthat/test-count.r
0b1b63d62bb48585837b0b7031328dc4 *tests/testthat/test-detect.r
@@ -72,13 +78,16 @@ a246742977453883ed04e738dd16c975 *tests/testthat/test-extract.r
6fe2c7933ec6c863e808320c60567a5f *tests/testthat/test-join.r
6f525891e80befb684ff295d1b714f71 *tests/testthat/test-length.r
b9c2324c1d46d0efdb1b5d448db09556 *tests/testthat/test-locate.r
-3e1efea0b4bd71682bb048d58ce5e7dd *tests/testthat/test-match.r
+9f414d5899d6ccae808e793225fcce18 *tests/testthat/test-match.r
9d4f02d9e2458e9ad849a218a3bd9f5c *tests/testthat/test-pad.r
-c15b2037c04e206c4cca61523dbda15a *tests/testthat/test-replace.r
+97d73d3e119bbbd24830dee6fed65761 *tests/testthat/test-replace.r
ed9fce46356a66a829054fd312dcec0d *tests/testthat/test-split.r
e3a9e72abe44dec62d792922696d6d43 *tests/testthat/test-sub.r
7e0381051472d2bf07693409413df193 *tests/testthat/test-subset.r
286e2327b2be8839d9ef4cb95c31ef70 *tests/testthat/test-trim.r
bf9e1f3e3b9adfb5157cf231126b8c47 *tests/testthat/test-word.r
d64e6159b4a9792a22a0eded7d586dac *tests/testthat/test-wrap.r
-413e34561c01308503e6bc783c359011 *vignettes/stringr.Rmd
+140a0c05326e805652da6b816420f1c3 *vignettes/regular-expressions.Rmd
+142c395fbc380d3be202f5ac3585090e *vignettes/releases/stringr-1.0.0.Rmd
+325fb2151ff3c864302ddc6564ec0278 *vignettes/releases/stringr-1.1.0.Rmd
+2d7c83d55dfc0acb32be74615430984b *vignettes/stringr.Rmd
diff --git a/NAMESPACE b/NAMESPACE
index 78779ac..026354f 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -40,6 +40,7 @@ export(str_trim)
export(str_trunc)
export(str_view)
export(str_view_all)
+export(str_which)
export(str_wrap)
export(word)
import(stringi)
diff --git a/NEWS.md b/NEWS.md
index fa8c990..10ea2dc 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,36 @@
+# stringr 1.2.0
+
+## API changes
+
+* `str_match_all()` now returns NA if an optional group doesn't match
+ (previously it returned ""). This is more consistent with `str_match()`
+ and other match failures (#134).
+
+## New features
+
+* In `str_replace()`, `replacement` can now be a function that is called once
+ for each match and whose return value is used to replace the match.
+
+* New `str_which()` mimics `grep()` (#129).
+
+* A new vignette (`vignette("regular-expressions")`) describes the
+ details of the regular expressions supported by stringr.
+ The main vignette (`vignette("stringr")`) has been updated to
+ give a high-level overview of the package.
+
+## Minor improvements and bug fixes
+
+* `str_order()` and `str_sort()` gain explicit `numeric` argument for sorting
+ mixed numbers and strings.
+
+* `str_replace_all()` now throws an error if `replacement` is not a character
+ vector. If `replacement` is `NA_character_`, it replaces the complete string
+ with `NA` (#124).
+
+* All functions that take a locale (e.g. `str_to_lower()` and `str_sort()`)
+ default to "en" (English) to ensure that the default is consistent across
+ platforms.
+
# stringr 1.1.0
* Add sample datasets: `fruit`, `words` and `sentences`.
diff --git a/R/case.R b/R/case.R
index 33b785b..4b0b84c 100644
--- a/R/case.R
+++ b/R/case.R
@@ -1,7 +1,8 @@
#' Convert case of a string.
#'
#' @param string String to modify
-#' @param locale Locale to use for translations.
+#' @param locale Locale to use for translations. Defaults to "en" (English)
+#' to ensure consistent default ordering across platforms.
#' @examples
#' dog <- "The quick brown dog"
#' str_to_upper(dog)
@@ -9,23 +10,23 @@
#' str_to_title(dog)
#'
#' # Locale matters!
-#' str_to_upper("i", "en") # English
+#' str_to_upper("i") # English
#' str_to_upper("i", "tr") # Turkish
#' @name case
NULL
#' @export
#' @rdname case
-str_to_upper <- function(string, locale = "") {
+str_to_upper <- function(string, locale = "en") {
stri_trans_toupper(string, locale = locale)
}
#' @export
#' @rdname case
-str_to_lower <- function(string, locale = "") {
+str_to_lower <- function(string, locale = "en") {
stri_trans_tolower(string, locale = locale)
}
#' @export
#' @rdname case
-str_to_title <- function(string, locale = "") {
+str_to_title <- function(string, locale = "en") {
stri_trans_totitle(string, opts_brkiter = stri_opts_brkiter(locale = locale))
}
diff --git a/R/detect.r b/R/detect.r
index 4047bf2..2863060 100644
--- a/R/detect.r
+++ b/R/detect.r
@@ -19,7 +19,9 @@
#' \code{\link{boundary}()}. An empty pattern, "", is equivalent to
#' \code{boundary("character")}.
#' @return A logical vector.
-#' @seealso \code{\link[stringi]{stri_detect}} which this function wraps
+#' @seealso \code{\link[stringi]{stri_detect}} which this function wraps,
+#' \code{\link{str_subset}} for a convenient wrapper around
+#' \code{x[str_detect(x, pattern)]}
#' @export
#' @examples
#' fruit <- c("apple", "banana", "pear", "pinapple")
diff --git a/R/match.r b/R/match.r
index df6bce8..607d526 100644
--- a/R/match.r
+++ b/R/match.r
@@ -38,7 +38,10 @@ str_match <- function(string, pattern) {
stop("Can only match regular expressions", call. = FALSE)
}
- stri_match_first_regex(string, pattern, opts_regex = opts(pattern))
+ stri_match_first_regex(string,
+ pattern,
+ opts_regex = opts(pattern)
+ )
}
#' @rdname str_match
@@ -50,7 +53,6 @@ str_match_all <- function(string, pattern) {
stri_match_all_regex(string,
pattern,
- cg_missing = "",
omit_no_match = TRUE,
opts_regex = opts(pattern)
)
diff --git a/R/modifiers.r b/R/modifiers.r
index 4d327aa..2d1b210 100644
--- a/R/modifiers.r
+++ b/R/modifiers.r
@@ -62,11 +62,13 @@ fixed <- function(pattern, ignore_case = FALSE) {
#' @rdname modifiers
#' @param locale Locale to use for comparisons. See
#' \code{\link[stringi]{stri_locale_list}()} for all possible options.
+#' Defaults to "en" (English) to ensure that the default collation is
+#' consistent across platforms.
#' @param ... Other less frequently used arguments passed on to
#' \code{\link[stringi]{stri_opts_collator}},
#' \code{\link[stringi]{stri_opts_regex}}, or
#' \code{\link[stringi]{stri_opts_brkiter}}
-coll <- function(pattern, ignore_case = FALSE, locale = NULL, ...) {
+coll <- function(pattern, ignore_case = FALSE, locale = "en", ...) {
if (!is_bare_character(pattern)) {
stop("Can only modify plain character vectors.", call. = FALSE)
}
diff --git a/R/replace.r b/R/replace.r
index 2d8c107..07da119 100644
--- a/R/replace.r
+++ b/R/replace.r
@@ -3,22 +3,29 @@
#' Vectorised over \code{string}, \code{pattern} and \code{replacement}.
#'
#' @inheritParams str_detect
-#' @param pattern,replacement Supply separate pattern and replacement strings
-#' to vectorise over the patterns. References of the form \code{\1},
-#' \code{\2} will be replaced with the contents of the respective matched
-#' group (created by \code{()}) within the pattern.
+#' @param replacement A character vector of replacements. Should be either
+#' length one, or the same length as \code{string} or \code{pattern}.
+#' References of the form \code{\1}, \code{\2}, etc will be replaced with
+#' the contents of the respective matched group (created by \code{()}).
#'
-#' For \code{str_replace_all} only, you can perform multiple patterns and
-#' replacements to each string, by passing a named character to
-#' \code{pattern}.
+#' To perform multiple replacements in each element of \code{string},
+#' pass a named vector (\code{c(pattern1 = replacement1)}) to
+#' \code{str_replace_all}. Alternatively, pass a function to
+#' \code{replacement}: it will be called once for each match and its
+#' return value will be used to replace the match.
+#'
+#' To replace the complete string with \code{NA}, use
+#' \code{replacement = NA_character_}.
#' @return A character vector.
-#' @seealso \code{str_replace_na} to turn missing values into "NA";
+#' @seealso \code{\link{str_replace_na}} to turn missing values into "NA";
#' \code{\link{stri_replace}} for the underlying implementation.
#' @export
#' @examples
#' fruits <- c("one apple", "two pears", "three bananas")
#' str_replace(fruits, "[aeiou]", "-")
#' str_replace_all(fruits, "[aeiou]", "-")
+#' str_replace_all(fruits, "[aeiou]", toupper)
+#' str_replace_all(fruits, "b", NA_character_)
#'
#' str_replace(fruits, "([aeiou])", "")
#' str_replace(fruits, "([aeiou])", "\\1\\1")
@@ -35,10 +42,29 @@
#' str_replace_all(fruits, c("a", "e", "i"), "-")
#'
#' # If you want to apply multiple patterns and replacements to the same
-#' # string, pass a named version to pattern.
-#' str_replace_all(str_c(fruits, collapse = "---"),
-#' c("one" = 1, "two" = 2, "three" = 3))
+#' # string, pass a named vector to pattern.
+#' fruits %>%
+#' str_c(collapse = "---") %>%
+#' str_replace_all(c("one" = "1", "two" = "2", "three" = "3"))
+#'
+#' # Use a function for more sophisticated replacement. This example
+#' # replaces colour names with their hex values.
+#' colours <- str_c("\\b", colors(), "\\b", collapse="|")
+#' col2hex <- function(col) {
+#' rgb <- col2rgb(col)
+#' rgb(rgb["red", ], rgb["green", ], rgb["blue", ], max = 255)
+#' }
+#'
+#' x <- c(
+#' "Roses are red, violets are blue",
+#' "My favourite colour is green"
+#' )
+#' str_replace_all(x, colours, col2hex)
str_replace <- function(string, pattern, replacement) {
+ if (!missing(replacement) && is.function(replacement)) {
+ return(str_transform(string, pattern, replacement))
+ }
+
switch(type(pattern),
empty = ,
bound = stop("Not implemented", call. = FALSE),
@@ -54,6 +80,11 @@ str_replace <- function(string, pattern, replacement) {
#' @export
#' @rdname str_replace
str_replace_all <- function(string, pattern, replacement) {
+ if (!missing(replacement) && is.function(replacement)) {
+ return(str_transform_all(string, pattern, replacement))
+ }
+
+
if (!is.null(names(pattern))) {
vec <- FALSE
replacement <- unname(pattern)
@@ -75,11 +106,17 @@ str_replace_all <- function(string, pattern, replacement) {
}
fix_replacement <- function(x) {
+ if (!is.character(x)) {
+ stop("`replacement` must be a character vector", call. = FALSE)
+ }
+
vapply(x, fix_replacement_one, character(1), USE.NAMES = FALSE)
}
fix_replacement_one <- function(x) {
- escape_dollar <- function(x) if (x == "$") "\\$" else x
+ if (is.na(x)) {
+ return(x)
+ }
chars <- str_split(x, "")[[1]]
out <- character(length(chars))
@@ -120,9 +157,29 @@ fix_replacement_one <- function(x) {
#' Turn NA into "NA"
#'
#' @inheritParams str_replace
+#' @param replacement A single string.
#' @export
#' @examples
#' str_replace_na(c(NA, "abc", "def"))
str_replace_na <- function(string, replacement = "NA") {
stri_replace_na(string, replacement)
}
+
+
+str_transform <- function(string, pattern, replacement) {
+ loc <- str_locate(string, pattern)
+ str_sub(string, loc) <- replacement(str_sub(string, loc))
+ string
+}
+str_transform_all <- function(string, pattern, replacement) {
+ locs <- str_locate_all(string, pattern)
+
+ for (i in seq_along(string)) {
+ for (j in rev(seq_len(nrow(locs[[i]])))) {
+ loc <- locs[[i]]
+ str_sub(string[[i]], loc[j, 1], loc[j, 2]) <- replacement(str_sub(string[[i]], loc[j, 1], loc[j, 2]))
+ }
+ }
+
+ string
+}
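The three replacement behaviours touched by this file can be exercised together. This is a sketch assuming stringr 1.2.0 or later; the result variable names are made up for illustration.

```r
library(stringr)

fruits <- c("one apple", "two pears", "three bananas")

# A function replacement is applied to the matched text;
# here every vowel is upper-cased.
upper_vowels <- str_replace_all(fruits, "[aeiou]", toupper)

# NA_character_ replaces the complete string wherever the pattern matches.
na_where_b <- str_replace_all(fruits, "b", NA_character_)

# A non-character, non-function replacement is rejected with an error.
numeric_result <- tryCatch(
  str_replace_all(fruits, "[aeiou]", 1),
  error = function(e) "error"
)

print(upper_vowels)
print(na_where_b)
print(numeric_result)
```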
diff --git a/R/sort.R b/R/sort.R
index 5958464..1520833 100644
--- a/R/sort.R
+++ b/R/sort.R
@@ -6,25 +6,34 @@
#' @param na_last Where should \code{NA} go? \code{TRUE} at the end,
#' \code{FALSE} at the beginning, \code{NA} dropped.
#' @param locale In which locale should the sorting occur? Defaults to
-#' the current locale.
+#' English. This ensures that code behaves the same way across
+#' platforms.
+#' @param numeric If \code{TRUE}, will sort digits numerically, instead
+#' of as strings.
#' @param ... Other options used to control sorting order. Passed on to
#' \code{\link[stringi]{stri_opts_collator}}.
#' @seealso \code{\link[stringi]{stri_order}} for the underlying implementation.
#' @export
#' @examples
-#' str_order(letters, locale = "en")
-#' str_sort(letters, locale = "en")
+#' str_order(letters)
+#' str_sort(letters)
#'
#' str_order(letters, locale = "haw")
#' str_sort(letters, locale = "haw")
-str_order <- function(x, decreasing = FALSE, na_last = TRUE, locale = "", ...) {
+#'
+#' x <- c("100a10", "100a5", "2b", "2a")
+#' str_sort(x)
+#' str_sort(x, numeric = TRUE)
+str_order <- function(x, decreasing = FALSE, na_last = TRUE,
+ locale = "en", numeric = FALSE, ...) {
stri_order(x, decreasing = decreasing, na_last = na_last,
- opts_collator = stri_opts_collator(locale, ...))
+ opts_collator = stri_opts_collator(locale, numeric = numeric, ...))
}
#' @export
#' @rdname str_order
-str_sort <- function(x, decreasing = FALSE, na_last = TRUE, locale = "", ...) {
+str_sort <- function(x, decreasing = FALSE, na_last = TRUE,
+ locale = "en", numeric = FALSE, ...) {
stri_sort(x, decreasing = decreasing, na_last = na_last,
- opts_collator = stri_opts_collator(locale, ...))
+ opts_collator = stri_opts_collator(locale, numeric = numeric, ...))
}
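To illustrate the new `numeric` argument, here is a short sketch using the same vector as the roxygen example above, assuming stringr 1.2.0 or later. The result variable names are hypothetical.

```r
library(stringr)

x <- c("100a10", "100a5", "2b", "2a")

# Default: digits are compared character by character, as strings.
sorted_default <- str_sort(x)

# numeric = TRUE: runs of digits are compared by their numeric value.
sorted_numeric <- str_sort(x, numeric = TRUE)

# str_order() returns the permutation instead of the sorted values.
order_numeric <- str_order(x, numeric = TRUE)

print(sorted_default)
print(sorted_numeric)
print(x[order_numeric])
```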
diff --git a/R/stringr.R b/R/stringr.R
new file mode 100644
index 0000000..0cc9deb
--- /dev/null
+++ b/R/stringr.R
@@ -0,0 +1,2 @@
+#' @keywords internal
+"_PACKAGE"
diff --git a/R/subset.R b/R/subset.R
index 7b7f36a..21a7c28 100644
--- a/R/subset.R
+++ b/R/subset.R
@@ -1,6 +1,10 @@
-#' Keep strings matching a pattern.
+#' Keep strings matching a pattern, or find positions.
+#'
+#' \code{str_subset()} is a wrapper around \code{x[str_detect(x, pattern)]},
+#' and is equivalent to \code{grep(pattern, x, value = TRUE)}.
+#' \code{str_which()} is a wrapper around \code{which(str_detect(x, pattern))},
+#' and is equivalent to \code{grep(pattern, x)}.
#'
-#' This is a convenient wrapper around \code{x[str_detect(x, pattern)]}.
#' Vectorised over \code{string} and \code{pattern}
#'
#' @inheritParams str_detect
@@ -11,13 +15,16 @@
#' @examples
#' fruit <- c("apple", "banana", "pear", "pinapple")
#' str_subset(fruit, "a")
+#' str_which(fruit, "a")
+#'
#' str_subset(fruit, "^a")
#' str_subset(fruit, "a$")
#' str_subset(fruit, "b")
#' str_subset(fruit, "[aeiou]")
#'
-#' # Missings are silently dropped
+#' # Missings never match
#' str_subset(c("a", NA, "b"), ".")
+#' str_which(c("a", NA, "b"), ".")
str_subset <- function(string, pattern) {
switch(type(pattern),
empty = ,
@@ -27,3 +34,9 @@ str_subset <- function(string, pattern) {
regex = stri_subset_regex(string, pattern, omit_na = TRUE, opts_regex = opts(pattern))
)
}
+
+#' @export
+#' @rdname str_subset
+str_which <- function(string, pattern) {
+ which(str_detect(string, pattern))
+}
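The equivalences stated in the new documentation can be checked directly. This is a small sketch assuming stringr 1.2.0 or later; it only compares against base R's `grep()` for a simple regular expression.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pinapple")

# str_subset() returns the matching values, like grep(value = TRUE).
subset_values <- str_subset(fruit, "^a")
grep_values <- grep("^a", fruit, value = TRUE)

# str_which() returns the matching positions, like grep().
which_positions <- str_which(fruit, "^a")
grep_positions <- grep("^a", fruit)

# Missing values never match, so position 2 is skipped.
which_with_na <- str_which(c("a", NA, "b"), ".")

print(subset_values)
print(which_positions)
print(which_with_na)
```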
diff --git a/README.md b/README.md
index 834db5f..12577ef 100644
--- a/README.md
+++ b/README.md
@@ -1,47 +1,149 @@
-# stringr
-[![Travis-CI Build Status](https://travis-ci.org/hadley/stringr.svg?branch=master)](https://travis-ci.org/hadley/stringr)
-[![Coverage Status](https://img.shields.io/codecov/c/github/hadley/stringr/master.svg)](https://codecov.io/github/hadley/stringr?branch=master)
-[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/stringr)](http://cran.r-project.org/package=stringr)
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+stringr <img src="logo.png" align="right" />
+============================================
-Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.
+[![Build Status](https://travis-ci.org/tidyverse/stringr.svg?branch=master)](https://travis-ci.org/tidyverse/stringr) [![Coverage Status](https://img.shields.io/codecov/c/github/tidyverse/stringr/master.svg)](https://codecov.io/github/tidyverse/stringr?branch=master) [![CRAN Status](http://www.r-pkg.org/badges/version/stringr)](https://cran.r-project.org/package=stringr)
-The __stringr__ package aims to remedy these problems by providing a clean, modern interface to common string operations. More concretely, stringr:
+Overview
+--------
-* Uses consistent functions and argument names.
+Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you're not familiar with strings, the best place to start is the [chapter on strings](http://r4ds.had.co.nz/strings.html) in R for Data Science.
-* Simplifies string operations by eliminating options that you don't need
- 95% of the time.
+stringr is built on top of [stringi](https://github.com/gagolews/stringi), which uses the [ICU](http://site.icu-project.org) C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi. Both packages share simi [...]
-* Produces outputs than can easily be used as inputs. This includes ensuring
- that missing inputs result in missing outputs, and zero length inputs
- result in zero length outputs.
+Installation
+------------
-* Is built on top of [stringi](https://github.com/Rexamine/stringi/) which
- uses the [ICU](http://site.icu-project.org) library to provide fast, correct
- implementations of common string manipulations
+``` r
+# Install the released version from CRAN:
+install.packages("stringr")
-## Installation
+# Install the cutting edge development version from GitHub:
+# install.packages("devtools")
+devtools::install_github("tidyverse/stringr")
+```
-To get the current released version from CRAN:
+Usage
+-----
-```R
-install.packages("stringr")
+All functions in stringr start with `str_` and take a vector of strings as the first argument.
+
+``` r
+x <- c("why", "video", "cross", "extra", "deal", "authority")
+str_length(x)
+#> [1] 3 5 5 5 4 9
+str_c(x, collapse = ", ")
+#> [1] "why, video, cross, extra, deal, authority"
+str_sub(x, 1, 2)
+#> [1] "wh" "vi" "cr" "ex" "de" "au"
```
-To get the current development version from github:
+Most string functions work with regular expressions, a concise language for describing patterns of text. For example, the regular expression `"[aeiou]"` matches any single character that is a vowel:
-```R
-# install.packages("devtools")
-devtools::install_github("hadley/stringr")
+``` r
+str_subset(x, "[aeiou]")
+#> [1] "video" "cross" "extra" "deal" "authority"
+str_count(x, "[aeiou]")
+#> [1] 0 3 1 2 2 4
```
-## Piping
+There are eight main verbs that work with patterns:
-stringr provides the pipe, `%>%`, from magrittr to make it easy to string together sequences of string operations:
+- `str_detect(x, pattern)` tells you if there's any match to the pattern.
-```R
-letters %>%
- str_pad(5, "right") %>%
- str_c(letters)
-```
+ ``` r
+ str_detect(x, "[aeiou]")
+ #> [1] FALSE TRUE TRUE TRUE TRUE TRUE
+ ```
+
+- `str_count(x, pattern)` counts the number of matches.
+
+ ``` r
+ str_count(x, "[aeiou]")
+ #> [1] 0 3 1 2 2 4
+ ```
+
+- `str_subset(x, pattern)` extracts the matching components.
+
+ ``` r
+ str_subset(x, "[aeiou]")
+ #> [1] "video" "cross" "extra" "deal" "authority"
+ ```
+
+- `str_locate(x, pattern)` gives the position of the match.
+
+ ``` r
+ str_locate(x, "[aeiou]")
+ #> start end
+ #> [1,] NA NA
+ #> [2,] 2 2
+ #> [3,] 3 3
+ #> [4,] 1 1
+ #> [5,] 2 2
+ #> [6,] 1 1
+ ```
+
+- `str_extract(x, pattern)` extracts the text of the match.
+
+ ``` r
+ str_extract(x, "[aeiou]")
+ #> [1] NA "i" "o" "e" "e" "a"
+ ```
+
+- `str_match(x, pattern)` extracts parts of the match defined by parentheses.
+
+ ``` r
+ # extract the characters on either side of the vowel
+ str_match(x, "(.)[aeiou](.)")
+ #> [,1] [,2] [,3]
+ #> [1,] NA NA NA
+ #> [2,] "vid" "v" "d"
+ #> [3,] "ros" "r" "s"
+ #> [4,] NA NA NA
+ #> [5,] "dea" "d" "a"
+ #> [6,] "aut" "a" "t"
+ ```
+
+- `str_replace(x, pattern, replacement)` replaces the matches with new text.
+
+ ``` r
+ str_replace(x, "[aeiou]", "?")
+ #> [1] "why" "v?deo" "cr?ss" "?xtra" "d?al" "?uthority"
+ ```
+
+- `str_split(x, pattern)` splits up a string into multiple pieces.
+
+ ``` r
+ str_split(c("a,b", "c,d,e"), ",")
+ #> [[1]]
+ #> [1] "a" "b"
+ #>
+ #> [[2]]
+ #> [1] "c" "d" "e"
+ ```
+
+As well as regular expressions (the default), there are three other pattern matching engines:
+
+- `fixed()`: match exact bytes
+- `coll()`: match human letters
+- `boundary()`: match boundaries
+
+Compared to base R
+------------------
+
+R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.
+
+- Uses consistent function and argument names. The first argument is always the vector of strings to modify, which makes stringr work particularly well in conjunction with the pipe:
+
+ ``` r
+ letters %>%
+ .[1:10] %>%
+ str_pad(3, "right") %>%
+ str_c(letters[2:11])
+ #> [1] "a b" "b c" "c d" "d e" "e f" "f g" "g h" "h i" "i j" "j k"
+ ```
+
+- Simplifies string operations by eliminating options that you don't need 95% of the time.
+
+- Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs.
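The last README point, that missing inputs give missing outputs and zero-length inputs give zero-length outputs, can be sketched as follows. This assumes a recent stringr; the variable names are illustrative.

```r
library(stringr)

# Missing input gives missing output.
len_na <- str_length(NA_character_)
joined_na <- str_c("a", NA_character_)

# Zero-length input gives zero-length output.
len_empty <- str_length(character(0))
detect_empty <- str_detect(character(0), "a")

print(len_na)
print(joined_na)
print(length(len_empty))
print(length(detect_empty))
```

This is what lets stringr calls be chained without special-casing `NA` or empty vectors between steps.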
diff --git a/build/vignette.rds b/build/vignette.rds
index a644427..841698f 100644
Binary files a/build/vignette.rds and b/build/vignette.rds differ
diff --git a/inst/doc/regular-expressions.R b/inst/doc/regular-expressions.R
new file mode 100644
index 0000000..62af3e0
--- /dev/null
+++ b/inst/doc/regular-expressions.R
@@ -0,0 +1,152 @@
+## ----setup, include = FALSE----------------------------------------------
+knitr::opts_chunk$set(
+ collapse = TRUE,
+ comment = "#>"
+)
+library(stringr)
+
+## ---- eval = FALSE-------------------------------------------------------
+# # The regular call:
+# str_extract(fruit, "nana")
+# # Is shorthand for
+# str_extract(fruit, regex("nana"))
+
+## ------------------------------------------------------------------------
+x <- c("apple", "banana", "pear")
+str_extract(x, "an")
+
+## ------------------------------------------------------------------------
+bananas <- c("banana", "Banana", "BANANA")
+str_detect(bananas, "banana")
+str_detect(bananas, regex("banana", ignore_case = TRUE))
+
+## ------------------------------------------------------------------------
+str_extract(x, ".a.")
+
+## ------------------------------------------------------------------------
+str_detect("\nX\n", ".X.")
+str_detect("\nX\n", regex(".X.", dotall = TRUE))
+
+## ------------------------------------------------------------------------
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+writeLines(dot)
+
+# And this tells R to look for an explicit .
+str_extract(c("abc", "a.c", "bef"), "a\\.c")
+
+## ------------------------------------------------------------------------
+x <- "a\\b"
+writeLines(x)
+
+str_extract(x, "\\\\")
+
+## ------------------------------------------------------------------------
+x <- c("a.b.c.d", "aeb")
+starts_with <- "a.b"
+
+str_detect(x, paste0("^", starts_with))
+str_detect(x, paste0("^\\Q", starts_with, "\\E"))
+
+## ------------------------------------------------------------------------
+x <- "a\u0301"
+str_extract(x, ".")
+str_extract(x, "\\X")
+
+## ------------------------------------------------------------------------
+str_extract_all("1 + 2 = 3", "\\d+")[[1]]
+
+## ------------------------------------------------------------------------
+# Some Khmer numbers
+str_detect("១២៣", "\\d")
+
+## ------------------------------------------------------------------------
+(text <- "Some \t badly\n\t\tspaced \f text")
+str_replace_all(text, "\\s+", " ")
+
+## ------------------------------------------------------------------------
+(text <- c('"Double quotes"', "«Guillemet»", "“Fancy quotes”"))
+str_replace_all(text, "\\p{quotation mark}", "'")
+
+## ------------------------------------------------------------------------
+str_extract_all("Don't eat that!", "\\w+")[[1]]
+str_split("Don't eat that!", "\\W")[[1]]
+
+## ------------------------------------------------------------------------
+str_replace_all("The quick brown fox", "\\b", "_")
+str_replace_all("The quick brown fox", "\\B", "_")
+
+## ------------------------------------------------------------------------
+str_detect(c("abc", "def", "ghi"), "abc|def")
+
+## ------------------------------------------------------------------------
+str_extract(c("grey", "gray"), "gre|ay")
+str_extract(c("grey", "gray"), "gr(e|a)y")
+
+## ------------------------------------------------------------------------
+pattern <- "(..)\\1"
+fruit %>%
+ str_subset(pattern)
+
+fruit %>%
+ str_subset(pattern) %>%
+ str_match(pattern)
+
+## ------------------------------------------------------------------------
+str_match(c("grey", "gray"), "gr(e|a)y")
+str_match(c("grey", "gray"), "gr(?:e|a)y")
+
+## ------------------------------------------------------------------------
+x <- c("apple", "banana", "pear")
+str_extract(x, "^a")
+str_extract(x, "a$")
+
+## ------------------------------------------------------------------------
+x <- "Line 1\nLine 2\nLine 3\n"
+str_extract_all(x, "^Line..")[[1]]
+str_extract_all(x, regex("^Line..", multiline = TRUE))[[1]]
+str_extract_all(x, regex("\\ALine..", multiline = TRUE))[[1]]
+
+## ------------------------------------------------------------------------
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_extract(x, "CC?")
+str_extract(x, "CC+")
+str_extract(x, 'C[LX]+')
+
+## ------------------------------------------------------------------------
+str_extract(x, "C{2}")
+str_extract(x, "C{2,}")
+str_extract(x, "C{2,3}")
+
+## ------------------------------------------------------------------------
+str_extract(x, c("C{2,3}", "C{2,3}?"))
+str_extract(x, c("C[LX]+", "C[LX]+?"))
+
+## ------------------------------------------------------------------------
+str_detect("ABC", "(?>A|.B)C")
+str_detect("ABC", "(?:A|.B)C")
+
+## ------------------------------------------------------------------------
+x <- c("1 piece", "2 pieces", "3")
+str_extract(x, "\\d+(?= pieces?)")
+
+y <- c("100", "$400")
+str_extract(y, "(?<=\\$)\\d+")
+
+## ------------------------------------------------------------------------
+str_detect("xyz", "x(?#this is a comment)")
+
+## ------------------------------------------------------------------------
+phone <- regex("
+ \\(? # optional opening parens
+ (\\d{3}) # area code
+ [)- ]? # optional closing parens, dash, or space
+ (\\d{3}) # another three numbers
+ [ -]? # optional space or dash
+ (\\d{3}) # three more numbers
+ ", comments = TRUE)
+
+str_match("514-791-8141", phone)
+
diff --git a/inst/doc/regular-expressions.Rmd b/inst/doc/regular-expressions.Rmd
new file mode 100644
index 0000000..9bb6287
--- /dev/null
+++ b/inst/doc/regular-expressions.Rmd
@@ -0,0 +1,415 @@
+---
+title: "Regular expressions"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Regular expressions}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+ collapse = TRUE,
+ comment = "#>"
+)
+library(stringr)
+```
+
+Regular expressions are a concise and flexible tool for describing patterns in strings. This vignette describes the key features of stringr's regular expressions, as implemented by [stringi](https://github.com/gagolews/stringi). It is not a tutorial, so if you're unfamiliar with regular expressions, I'd recommend starting at <http://r4ds.had.co.nz/strings.html>. If you want to master the details, I'd recommend reading the classic [_Mastering Regular Expressions_](https://amzn.com/0596528124) [...]
+
+Regular expressions are the default pattern engine in stringr. That means when you use a pattern matching function with a bare string, it's equivalent to wrapping it in a call to `regex()`:
+
+```{r, eval = FALSE}
+# The regular call:
+str_extract(fruit, "nana")
+# Is shorthand for
+str_extract(fruit, regex("nana"))
+```
+
+You will need to use `regex()` explicitly if you want to override the default options, as you'll see in examples below.
+
+## Basic matches
+
+The simplest patterns match exact strings:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_extract(x, "an")
+```
+
+You can perform a case-insensitive match using `ignore_case = TRUE`:
+
+```{r}
+bananas <- c("banana", "Banana", "BANANA")
+str_detect(bananas, "banana")
+str_detect(bananas, regex("banana", ignore_case = TRUE))
+```
+
+The next step up in complexity is `.`, which matches any character except a newline:
+
+```{r}
+str_extract(x, ".a.")
+```
+
+You can allow `.` to match everything, including `\n`, by setting `dotall = TRUE`:
+
+```{r}
+str_detect("\nX\n", ".X.")
+str_detect("\nX\n", regex(".X.", dotall = TRUE))
+```
+
+## Escaping
+
+If "`.`" matches any character, how do you match a literal "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we n [...]
+
+```{r}
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+writeLines(dot)
+
+# And this tells R to look for an explicit .
+str_extract(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
+
+```{r}
+x <- "a\\b"
+writeLines(x)
+
+str_extract(x, "\\\\")
+```
+
+In this vignette, I use `\.` to denote the regular expression, and `"\\."` to denote the string that represents the regular expression.
+
+An alternative quoting mechanism is `\Q...\E`: all the characters in `...` are treated as exact matches. This is useful if you want to exactly match user input as part of a regular expression.
+
+```{r}
+x <- c("a.b.c.d", "aeb")
+starts_with <- "a.b"
+
+str_detect(x, paste0("^", starts_with))
+str_detect(x, paste0("^\\Q", starts_with, "\\E"))
+```
+
+## Special characters
+
+Escapes also allow you to specify individual characters that are otherwise hard to type. You can specify individual unicode characters in five ways, either as a variable number of hex digits (four is most common), or by name:
+
+* `\xhh`: 2 hex digits.
+
+* `\x{hhhh}`: 1-6 hex digits.
+
+* `\uhhhh`: 4 hex digits.
+
+* `\Uhhhhhhhh`: 8 hex digits.
+
+* `\N{name}`, e.g. `\N{grinning face}` matches the basic smiling emoji.
+
+Similarly, you can specify many common control characters:
+
+* `\a`: bell.
+
+* `\cX`: match a control-X character.
+
+* `\e`: escape (`\u001B`).
+
+* `\f`: form feed (`\u000C`).
+
+* `\n`: line feed (`\u000A`).
+
+* `\r`: carriage return (`\u000D`).
+
+* `\t`: horizontal tabulation (`\u0009`).
+
+* `\0ooo`: matches an octal character. 'ooo' is from one to three octal digits,
+ from 000 to 0377. The leading zero is required.
+
+(Many of these are only of historical interest and are only included here for the sake of completeness.)
+
+## Matching multiple characters
+
+There are a number of patterns that match more than one character. You've already seen `.`, which matches any character (except a newline). A closely related operator is `\X`, which matches a __grapheme cluster__, a set of individual elements that form a single symbol. For example, one way of representing "á" is as the letter "a" plus an accent: `.` will match the component "a", while `\X` will match the complete symbol:
+
+```{r}
+x <- "a\u0301"
+str_extract(x, ".")
+str_extract(x, "\\X")
+```
+
+There are five other escaped pairs that match narrower classes of characters:
+
+* `\d`: matches any digit. The complement, `\D`, matches any character that
+ is not a decimal digit.
+
+ ```{r}
+ str_extract_all("1 + 2 = 3", "\\d+")[[1]]
+ ```
+
+ Technically, `\d` includes any character in the Unicode Category of Nd
+ ("Number, Decimal Digit"), which also includes numeric symbols from other
+ languages:
+
+ ```{r}
+ # Some Khmer numbers
+ str_detect("១២៣", "\\d")
+ ```
+
+* `\s`: matches any whitespace. This includes tabs, newlines, form feeds,
+ and any character in the Unicode Z Category (which includes a variety of
+ space characters and other separators). The complement, `\S`, matches any
+ non-whitespace character.
+
+ ```{r}
+ (text <- "Some \t badly\n\t\tspaced \f text")
+ str_replace_all(text, "\\s+", " ")
+ ```
+
+* `\p{property name}` matches any character with a specific Unicode property,
+ like `\p{Uppercase}` or `\p{Diacritic}`. The complement,
+ `\P{property name}`, matches all characters without the property.
+ A complete list of unicode properties can be found at
+ <http://www.unicode.org/reports/tr44/#Property_Index>.
+
+ ```{r}
+ (text <- c('"Double quotes"', "«Guillemet»", "“Fancy quotes”"))
+ str_replace_all(text, "\\p{quotation mark}", "'")
+ ```
+
+* `\w` matches any "word" character, which includes alphabetic characters,
+ marks and decimal numbers. The complement, `\W`, matches any non-word
+ character.
+
+ ```{r}
+ str_extract_all("Don't eat that!", "\\w+")[[1]]
+ str_split("Don't eat that!", "\\W")[[1]]
+ ```
+
+ Technically, `\w` also matches connector punctuation, `\u200c` (zero width
+ non-joiner), and `\u200d` (zero width joiner), but these are rarely seen in
+ the wild.
+
+* `\b` matches word boundaries, the transition between word and non-word
+ characters. `\B` matches the opposite: boundaries that have either both
+ word or non-word characters on either side.
+
+ ```{r}
+ str_replace_all("The quick brown fox", "\\b", "_")
+ str_replace_all("The quick brown fox", "\\B", "_")
+ ```
+
+You can also create your own __character classes__ using `[]`:
+
+* `[abc]`: matches a, b, or c.
+* `[a-z]`: matches every character between a and z
+ (in Unicode code point order).
+* `[^abc]`: matches anything except a, b, or c.
+* `[\^\-]`: matches `^` or `-`.
+
+There are a number of pre-built classes that you can use inside `[]`:
+
+* `[:punct:]`: punctuation.
+* `[:alpha:]`: letters.
+* `[:lower:]`: lowercase letters.
+* `[:upper:]`: uppercase letters.
+* `[:digit:]`: digits.
+* `[:xdigit:]`: hex digits.
+* `[:alnum:]`: letters and numbers.
+* `[:cntrl:]`: control characters.
+* `[:graph:]`: letters, numbers, and punctuation.
+* `[:print:]`: letters, numbers, punctuation, and whitespace.
+* `[:space:]`: space characters (basically equivalent to `\s`).
+* `[:blank:]`: space and tab.
+
+These all go inside the `[]` for character classes, i.e. `[[:digit:]AX]` matches all digits, A, and X.
+
+You can also use Unicode properties, like `[\p{Letter}]`, and various set operations, like `[\p{Letter}--\p{script=latin}]`. See `?"stringi-search-charclass"` for details.
+
+## Alternation
+
+`|` is the __alternation__ operator, which will pick between one or more possible matches. For example, `abc|def` will match `abc` or `def`.
+
+```{r}
+str_detect(c("abc", "def", "ghi"), "abc|def")
+```
+
+Note that the precedence for `|` is low, so that `abc|def` matches `abc` or `def`, not `abcef` or `abdef` (which `ab(c|d)ef` would match).
+
+## Grouping
+
+You can use parentheses to override the default precedence rules:
+
+```{r}
+str_extract(c("grey", "gray"), "gre|ay")
+str_extract(c("grey", "gray"), "gr(e|a)y")
+```
+
+Parentheses also define "groups" that you can refer to with __backreferences__, like `\1`, `\2` etc, and can be extracted with `str_match()`. For example, the following regular expression finds all fruits that have a repeated pair of letters:
+
+```{r}
+pattern <- "(..)\\1"
+fruit %>%
+ str_subset(pattern)
+
+fruit %>%
+ str_subset(pattern) %>%
+ str_match(pattern)
+```
+
+You can use `(?:...)`, the non-grouping parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.
+
+```{r}
+str_match(c("grey", "gray"), "gr(e|a)y")
+str_match(c("grey", "gray"), "gr(?:e|a)y")
+```
+
+This is most useful for more complex cases where you need to capture matches and control precedence independently.
+
+## Anchors
+
+By default, regular expressions will match any part of a string. It's often useful to __anchor__ the regular expression so that it matches from the start or end of the string:
+
+* `^` matches the start of the string.
+* `$` matches the end of the string.
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_extract(x, "^a")
+str_extract(x, "a$")
+```
+
+To match a literal "$" or "^", you need to escape them, `\$`, and `\^`.
+
+For multiline strings, you can use `regex(multiline = TRUE)`. This changes the behaviour of `^` and `$`, and introduces three new operators:
+
+* `^` now matches the start of each line.
+
+* `$` now matches the end of each line.
+
+* `\A` matches the start of the input.
+
+* `\z` matches the end of the input.
+
+* `\Z` matches the end of the input, but before the final line terminator,
+ if it exists.
+
+```{r}
+x <- "Line 1\nLine 2\nLine 3\n"
+str_extract_all(x, "^Line..")[[1]]
+str_extract_all(x, regex("^Line..", multiline = TRUE))[[1]]
+str_extract_all(x, regex("\\ALine..", multiline = TRUE))[[1]]
+```
+
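+The difference between `\z` and `\Z` only shows up when the input ends with a line terminator. A minimal sketch, reusing the same string (the result names are just for illustration):
+
+```{r}
+library(stringr)
+
+x <- "Line 1\nLine 2\nLine 3\n"
+# \z is the very end of the input, which here comes after the final "\n"
+strict_end <- str_detect(x, "3\\z")
+strict_end
+
+# \Z also matches just before a final line terminator
+loose_end <- str_detect(x, "3\\Z")
+loose_end
+```
+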
+## Repetition
+
+You can control how many times a pattern matches with the repetition operators:
+
+* `?`: 0 or 1.
+* `+`: 1 or more.
+* `*`: 0 or more.
+
+```{r}
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_extract(x, "CC?")
+str_extract(x, "CC+")
+str_extract(x, 'C[LX]+')
+```
+
+Note that the precedence of these operators is high, so you can write `colou?r` to match either American or British spellings. That means most uses that repeat more than one character will need parentheses, like `bana(na)+`.
+
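+A short sketch of what that precedence means in practice (the inputs are invented for illustration): without parentheses the operator applies only to the single character before it.
+
+```{r}
+library(stringr)
+
+spelling <- str_extract(c("color", "colour"), "colou?r")
+spelling
+
+x <- c("banana", "bananana", "bananaaa")
+# (na)+ repeats the two-character group
+with_group <- str_extract(x, "bana(na)+")
+with_group
+
+# a+ repeats only the final "a"
+without_group <- str_extract(x, "banana+")
+without_group
+```
+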
+You can also specify the number of matches precisely:
+
+* `{n}`: exactly n
+* `{n,}`: n or more
+* `{n,m}`: between n and m
+
+```{r}
+str_extract(x, "C{2}")
+str_extract(x, "C{2,}")
+str_extract(x, "C{2,3}")
+```
+
+By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them:
+
+* `??`: 0 or 1, prefer 0.
+* `+?`: 1 or more, match as few times as possible.
+* `*?`: 0 or more, match as few times as possible.
+* `{n,}?`: n or more, match as few times as possible.
+* `{n,m}?`: between n and m, match as few times as possible, but at least n.
+
+```{r}
+str_extract(x, c("C{2,3}", "C{2,3}?"))
+str_extract(x, c("C[LX]+", "C[LX]+?"))
+```
+
+You can also make the matches possessive, which means that if later parts of the match fail, the repetition will not be re-tried with a smaller number of characters. This is an advanced feature used to improve performance in worst-case scenarios (called "catastrophic backtracking").
+
+* `?+`: 0 or 1, possessive.
+* `++`: 1 or more, possessive.
+* `*+`: 0 or more, possessive.
+* `{n}+`: exactly n, possessive.
+* `{n,}+`: n or more, possessive.
+* `{n,m}+`: between n and m, possessive.
+
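+As an illustrative sketch (the input string is arbitrary), compare a greedy `+` with its possessive form, which is written with a trailing `+`. The greedy version gives back one character so the final `a` can match; the possessive version keeps everything it consumed, so the match fails:
+
+```{r}
+library(stringr)
+
+# Greedy: a+ first takes "aaa", then backtracks to leave one "a"
+greedy <- str_detect("aaa", "a+a")
+greedy
+
+# Possessive: a++ takes "aaa" and never gives any of it back
+possessive <- str_detect("aaa", "a++a")
+possessive
+```
+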
+A related concept is the __atomic-match__ parenthesis, `(?>...)`. If a later match fails and the engine needs to back-track, an atomic match is kept as is: it succeeds or fails as a whole. Compare the following two regular expressions:
+
+```{r}
+str_detect("ABC", "(?>A|.B)C")
+str_detect("ABC", "(?:A|.B)C")
+```
+
+The atomic match fails because it matches A, and then the next character is B rather than C; since the group is atomic, the engine cannot go back and try `.B`. The regular match succeeds because it matches A, then C fails to match, so it back-tracks and tries `.B` instead.
+
+## Look arounds
+
+These assertions look ahead or behind the current match without "consuming" any characters (i.e. changing the input position).
+
+* `(?=...)`: positive look-ahead assertion. Matches if `...` matches at the
+ current input.
+
+* `(?!...)`: negative look-ahead assertion. Matches if `...` __does not__
+ match at the current input.
+
+* `(?<=...)`: positive look-behind assertion. Matches if `...` matches text
+ preceding the current position, with the last character of the match
+ being the character just before the current position. Length must be bounded
+ (i.e. no `*` or `+`).
+
+* `(?<!...)`: negative look-behind assertion. Matches if `...` __does not__
+ match text preceding the current position. Length must be bounded
+ (i.e. no `*` or `+`).
+
+These are useful when you want to check that a pattern exists, but you don't want to include it in the result:
+
+```{r}
+x <- c("1 piece", "2 pieces", "3")
+str_extract(x, "\\d+(?= pieces?)")
+
+y <- c("100", "$400")
+str_extract(y, "(?<=\\$)\\d+")
+```
+
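+The negative forms work the same way. A minimal sketch, reusing `x` and `y` from above (the `\b` in the second pattern is my addition, so that the match cannot simply start one digit later):
+
+```{r}
+library(stringr)
+
+x <- c("1 piece", "2 pieces", "3")
+# Numbers that are NOT followed by " piece"
+not_pieces <- str_extract(x, "\\d+(?! piece)")
+not_pieces
+
+y <- c("100", "$400")
+# Whole numbers that are NOT preceded by a dollar sign
+not_dollars <- str_extract(y, "(?<!\\$)\\b\\d+")
+not_dollars
+```
+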
+## Comments
+
+There are two ways to include comments in a regular expression. The first is with `(?#...)`:
+
+```{r}
+str_detect("xyz", "x(?#this is a comment)")
+```
+
+The second is to use `regex(comments = TRUE)`. This form ignores spaces and newlines, and everything after `#`. To match a literal space, you'll need to escape it: `"\\ "`. This is a useful way of describing complex regular expressions:
+
+```{r}
+phone <- regex("
+ \\(? # optional opening parens
+ (\\d{3}) # area code
+ [)- ]? # optional closing parens, dash, or space
+ (\\d{3}) # another three numbers
+ [ -]? # optional space or dash
+ (\\d{3}) # three more numbers
+ ", comments = TRUE)
+
+str_match("514-791-8141", phone)
+```
diff --git a/inst/doc/regular-expressions.html b/inst/doc/regular-expressions.html
new file mode 100644
index 0000000..feac32d
--- /dev/null
+++ b/inst/doc/regular-expressions.html
@@ -0,0 +1,398 @@
+<!DOCTYPE html>
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+
+<head>
+
+<meta charset="utf-8">
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<meta name="generator" content="pandoc" />
+
+<meta name="viewport" content="width=device-width, initial-scale=1">
+
+
+
+<title>Regular expressions</title>
+
+
+
+<style type="text/css">code{white-space: pre;}</style>
+<style type="text/css">
+div.sourceCode { overflow-x: auto; }
+table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
+ margin: 0; padding: 0; vertical-align: baseline; border: none; }
+table.sourceCode { width: 100%; line-height: 100%; }
+td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
+td.sourceCode { padding-left: 5px; }
+code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
+code > span.dt { color: #902000; } /* DataType */
+code > span.dv { color: #40a070; } /* DecVal */
+code > span.bn { color: #40a070; } /* BaseN */
+code > span.fl { color: #40a070; } /* Float */
+code > span.ch { color: #4070a0; } /* Char */
+code > span.st { color: #4070a0; } /* String */
+code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
+code > span.ot { color: #007020; } /* Other */
+code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
+code > span.fu { color: #06287e; } /* Function */
+code > span.er { color: #ff0000; font-weight: bold; } /* Error */
+code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
+code > span.cn { color: #880000; } /* Constant */
+code > span.sc { color: #4070a0; } /* SpecialChar */
+code > span.vs { color: #4070a0; } /* VerbatimString */
+code > span.ss { color: #bb6688; } /* SpecialString */
+code > span.im { } /* Import */
+code > span.va { color: #19177c; } /* Variable */
+code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
+code > span.op { color: #666666; } /* Operator */
+code > span.bu { } /* BuiltIn */
+code > span.ex { } /* Extension */
+code > span.pp { color: #bc7a00; } /* Preprocessor */
+code > span.at { color: #7d9029; } /* Attribute */
+code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
+code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
+code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
+code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
+</style>
+
+
+
+<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+
+</head>
+
+<body>
+
+
+
+
+<h1 class="title toc-ignore">Regular expressions</h1>
+
+
+
+<p>Regular expressions are a concise and flexible tool for describing patterns in strings. This vignette describes the key features of stringr’s regular expressions, as implemented by <a href="https://github.com/gagolews/stringi">stringi</a>. It is not a tutorial, so if you’re unfamiliar with regular expressions, I’d recommend starting at <a href="http://r4ds.had.co.nz/strings.html" class="uri">http://r4ds.had.co.nz/strings.html</a>. If you want to master the details, I’d recommend reading th [...]
+<p>Regular expressions are the default pattern engine in stringr. That means when you use a pattern matching function with a bare string, it’s equivalent to wrapping it in a call to <code>regex()</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># The regular call:</span>
+<span class="kw">str_extract</span>(fruit, <span class="st">"nana"</span>)
+<span class="co"># Is shorthand for</span>
+<span class="kw">str_extract</span>(fruit, <span class="kw">regex</span>(<span class="st">"nana"</span>))</code></pre></div>
+<p>You will need to use <code>regex()</code> explicitly if you want to override the default options, as you’ll see in examples below.</p>
+<div id="basic-matches" class="section level2">
+<h2>Basic matches</h2>
+<p>The simplest patterns match exact strings:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"apple"</span>, <span class="st">"banana"</span>, <span class="st">"pear"</span>)
+<span class="kw">str_extract</span>(x, <span class="st">"an"</span>)
+<span class="co">#> [1] NA "an" NA</span></code></pre></div>
+<p>You can perform a case-insensitive match using <code>ignore_case = TRUE</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">bananas <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"banana"</span>, <span class="st">"Banana"</span>, <span class="st">"BANANA"</span>)
+<span class="kw">str_detect</span>(bananas, <span class="st">"banana"</span>)
+<span class="co">#> [1] TRUE FALSE FALSE</span>
+<span class="kw">str_detect</span>(bananas, <span class="kw">regex</span>(<span class="st">"banana"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>))
+<span class="co">#> [1] TRUE TRUE TRUE</span></code></pre></div>
+<p>The next step up in complexity is <code>.</code>, which matches any character except a newline:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract</span>(x, <span class="st">".a."</span>)
+<span class="co">#> [1] NA "ban" "ear"</span></code></pre></div>
+<p>You can allow <code>.</code> to match everything, including <code>\n</code>, by setting <code>dotall = TRUE</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_detect</span>(<span class="st">"</span><span class="ch">\n</span><span class="st">X</span><span class="ch">\n</span><span class="st">"</span>, <span class="st">".X."</span>)
+<span class="co">#> [1] FALSE</span>
+<span class="kw">str_detect</span>(<span class="st">"</span><span class="ch">\n</span><span class="st">X</span><span class="ch">\n</span><span class="st">"</span>, <span class="kw">regex</span>(<span class="st">".X."</span>, <span class="dt">dotall =</span> <span class="ot">TRUE</span>))
+<span class="co">#> [1] TRUE</span></code></pre></div>
+</div>
+<div id="escaping" class="section level2">
+<h2>Escaping</h2>
+<p>If “<code>.</code>” matches any character, how do you match a literal “<code>.</code>”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, <code>\</code>, to escape special behaviour. So to match an <code>.</code>, you need the regexp <code>\.</code>. Unfortunately this creates a problem. We use strings to represent regular expressions, and <code>\</code> is also used as an es [...]
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># To create the regular expression, we need \\</span>
+dot <-<span class="st"> "</span><span class="ch">\\</span><span class="st">."</span>
+
+<span class="co"># But the expression itself only contains one:</span>
+<span class="kw">writeLines</span>(dot)
+<span class="co">#> \.</span>
+
+<span class="co"># And this tells R to look for an explicit .</span>
+<span class="kw">str_extract</span>(<span class="kw">c</span>(<span class="st">"abc"</span>, <span class="st">"a.c"</span>, <span class="st">"bef"</span>), <span class="st">"a</span><span class="ch">\\</span><span class="st">.c"</span>)
+<span class="co">#> [1] NA "a.c" NA</span></code></pre></div>
+<p>If <code>\</code> is used as an escape character in regular expressions, how do you match a literal <code>\</code>? Well you need to escape it, creating the regular expression <code>\\</code>. To create that regular expression, you need to use a string, which also needs to escape <code>\</code>. That means to match a literal <code>\</code> you need to write <code>"\\\\"</code> — you need four backslashes to match one!</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "a</span><span class="ch">\\</span><span class="st">b"</span>
+<span class="kw">writeLines</span>(x)
+<span class="co">#> a\b</span>
+
+<span class="kw">str_extract</span>(x, <span class="st">"</span><span class="ch">\\\\</span><span class="st">"</span>)
+<span class="co">#> [1] "\\"</span></code></pre></div>
+<p>In this vignette, I use <code>\.</code> to denote the regular expression, and <code>"\\."</code> to denote the string that represents the regular expression.</p>
+<p>An alternative quoting mechanism is <code>\Q...\E</code>: all the characters in <code>...</code> are treated as exact matches. This is useful if you want to exactly match user input as part of a regular expression.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"a.b.c.d"</span>, <span class="st">"aeb"</span>)
+starts_with <-<span class="st"> "a.b"</span>
+
+<span class="kw">str_detect</span>(x, <span class="kw">paste0</span>(<span class="st">"^"</span>, starts_with))
+<span class="co">#> [1] TRUE TRUE</span>
+<span class="kw">str_detect</span>(x, <span class="kw">paste0</span>(<span class="st">"^</span><span class="ch">\\</span><span class="st">Q"</span>, starts_with, <span class="st">"</span><span class="ch">\\</span><span class="st">E"</span>))
+<span class="co">#> [1] TRUE FALSE</span></code></pre></div>
+</div>
+<div id="special-characters" class="section level2">
+<h2>Special characters</h2>
+<p>Escapes also allow you to specify individual characters that are otherwise hard to type. You can specify individual unicode characters in five ways, either as a variable number of hex digits (four is most common), or by name:</p>
+<ul>
+<li><p><code>\xhh</code>: 2 hex digits.</p></li>
+<li><p><code>\x{hhhh}</code>: 1-6 hex digits.</p></li>
+<li><p><code>\uhhhh</code>: 4 hex digits.</p></li>
+<li><p><code>\Uhhhhhhhh</code>: 8 hex digits.</p></li>
+<li><p><code>\N{name}</code>, e.g. <code>\N{grinning face}</code> matches the basic smiling emoji.</p></li>
+</ul>
+<p>Similarly, you can specify many common control characters:</p>
+<ul>
+<li><p><code>\a</code>: bell.</p></li>
+<li><p><code>\cX</code>: match a control-X character.</p></li>
+<li><p><code>\e</code>: escape (<code>\u001B</code>).</p></li>
+<li><p><code>\f</code>: form feed (<code>\u000C</code>).</p></li>
+<li><p><code>\n</code>: line feed (<code>\u000A</code>).</p></li>
+<li><p><code>\r</code>: carriage return (<code>\u000D</code>).</p></li>
+<li><p><code>\t</code>: horizontal tabulation (<code>\u0009</code>).</p></li>
+<li><p><code>\0ooo</code> match an octal character. ‘ooo’ is from one to three octal digits, from 000 to 0377. The leading zero is required.</p></li>
+</ul>
+<p>(Many of these are only of historical interest and are only included here for the sake of completeness.)</p>
+</div>
+<div id="matching-multiple-characters" class="section level2">
+<h2>Matching multiple characters</h2>
+<p>There are a number of patterns that match more than one character. You’ve already seen <code>.</code>, which matches any character (except a newline). A closely related operator is <code>\X</code>, which matches a <strong>grapheme cluster</strong>, a set of individual elements that form a single symbol. For example, one way of representing “á” is as the letter “a” plus an accent: <code>.</code> will match the component “a”, while <code>\X</code> will match the complete symbol:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "a\u0301"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"."</span>)
+<span class="co">#> [1] "a"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"</span><span class="ch">\\</span><span class="st">X"</span>)
+<span class="co">#> [1] "á"</span></code></pre></div>
+<p>There are five other escaped pairs that match narrower classes of characters:</p>
+<ul>
+<li><p><code>\d</code>: matches any digit. The complement, <code>\D</code>, matches any character that is not a decimal digit.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract_all</span>(<span class="st">"1 + 2 = 3"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">d+"</span>)[[<span class="dv">1</span>]]
+<span class="co">#> [1] "1" "2" "3"</span></code></pre></div>
+<p>Technically, <code>\d</code> includes any character in the Unicode Category of Nd (“Number, Decimal Digit”), which also includes numeric symbols from other languages:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Some Khmer numerals</span>
+<span class="kw">str_detect</span>(<span class="st">"១២៣"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">d"</span>)
+<span class="co">#> [1] TRUE</span></code></pre></div></li>
+<li><p><code>\s</code>: matches any whitespace. This includes tabs, newlines, form feeds, and any character in the Unicode Z Category (which includes a variety of space characters and other separators). The complement, <code>\S</code>, matches any non-whitespace character.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">(text <-<span class="st"> "Some </span><span class="ch">\t</span><span class="st"> badly</span><span class="ch">\n\t\t</span><span class="st">spaced </span><span class="ch">\f</span><span class="st"> text"</span>)
+<span class="co">#> [1] "Some \t badly\n\t\tspaced \f text"</span>
+<span class="kw">str_replace_all</span>(text, <span class="st">"</span><span class="ch">\\</span><span class="st">s+"</span>, <span class="st">" "</span>)
+<span class="co">#> [1] "Some badly spaced text"</span></code></pre></div></li>
+<li><p><code>\p{property name}</code> matches any character with a specific Unicode property, like <code>\p{Uppercase}</code> or <code>\p{Diacritic}</code>. The complement, <code>\P{property name}</code>, matches all characters without the property. A complete list of Unicode properties can be found at <a href="http://www.unicode.org/reports/tr44/#Property_Index" class="uri">http://www.unicode.org/reports/tr44/#Property_Index</a>.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">(text <-<span class="st"> </span><span class="kw">c</span>(<span class="st">'"Double quotes"'</span>, <span class="st">"«Guillemet»"</span>, <span class="st">"“Fancy quotes”"</span>))
+<span class="co">#> [1] "\"Double quotes\"" "«Guillemet»" "“Fancy quotes”"</span>
+<span class="kw">str_replace_all</span>(text, <span class="st">"</span><span class="ch">\\</span><span class="st">p{quotation mark}"</span>, <span class="st">"'"</span>)
+<span class="co">#> [1] "'Double quotes'" "'Guillemet'" "'Fancy quotes'"</span></code></pre></div></li>
+<li><p><code>\w</code> matches any “word” character, which includes alphabetic characters, marks and decimal numbers. The complement, <code>\W</code>, matches any non-word character.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract_all</span>(<span class="st">"Don't eat that!"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">w+"</span>)[[<span class="dv">1</span>]]
+<span class="co">#> [1] "Don" "t" "eat" "that"</span>
+<span class="kw">str_split</span>(<span class="st">"Don't eat that!"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">W"</span>)[[<span class="dv">1</span>]]
+<span class="co">#> [1] "Don" "t" "eat" "that" ""</span></code></pre></div>
+<p>Technically, <code>\w</code> also matches connector punctuation, <code>\u200c</code> (zero width non-joiner), and <code>\u200d</code> (zero width joiner), but these are rarely seen in the wild.</p></li>
+<li><p><code>\b</code> matches word boundaries, the transition between word and non-word characters. <code>\B</code> matches the opposite: boundaries that have either both word or non-word characters on either side.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_replace_all</span>(<span class="st">"The quick brown fox"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">b"</span>, <span class="st">"_"</span>)
+<span class="co">#> [1] "_The_ _quick_ _brown_ _fox_"</span>
+<span class="kw">str_replace_all</span>(<span class="st">"The quick brown fox"</span>, <span class="st">"</span><span class="ch">\\</span><span class="st">B"</span>, <span class="st">"_"</span>)
+<span class="co">#> [1] "T_h_e q_u_i_c_k b_r_o_w_n f_o_x"</span></code></pre></div></li>
+</ul>
+<p>You can also create your own <strong>character classes</strong> using <code>[]</code>:</p>
+<ul>
+<li><code>[abc]</code>: matches a, b, or c.</li>
+<li><code>[a-z]</code>: matches every character between a and z (in Unicode code point order).</li>
+<li><code>[^abc]</code>: matches anything except a, b, or c.</li>
+<li><code>[\^\-]</code>: matches <code>^</code> or <code>-</code>.</li>
+</ul>
+<p>There are a number of pre-built classes that you can use inside <code>[]</code>:</p>
+<ul>
+<li><code>[:punct:]</code>: punctuation.</li>
+<li><code>[:alpha:]</code>: letters.</li>
+<li><code>[:lower:]</code>: lowercase letters.</li>
+<li><code>[:upper:]</code>: uppercase letters.</li>
+<li><code>[:digit:]</code>: digits.</li>
+<li><code>[:xdigit:]</code>: hex digits.</li>
+<li><code>[:alnum:]</code>: letters and numbers.</li>
+<li><code>[:cntrl:]</code>: control characters.</li>
+<li><code>[:graph:]</code>: letters, numbers, and punctuation.</li>
+<li><code>[:print:]</code>: letters, numbers, punctuation, and whitespace.</li>
+<li><code>[:space:]</code>: space characters (basically equivalent to <code>\s</code>).</li>
+<li><code>[:blank:]</code>: space and tab.</li>
+</ul>
+<p>These all go inside the <code>[]</code> for character classes, i.e. <code>[[:digit:]AX]</code> matches all digits, A, and X.</p>
+<p>You can also use Unicode properties, like <code>[\p{Letter}]</code>, and various set operations, like <code>[\p{Letter}--\p{script=latin}]</code>. See <code>?"stringi-search-charclass"</code> for details.</p>
+</div>
+<div id="alternation" class="section level2">
+<h2>Alternation</h2>
+<p><code>|</code> is the <strong>alternation</strong> operator, which will pick between one or more possible matches. For example, <code>abc|def</code> will match <code>abc</code> or <code>def</code>.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_detect</span>(<span class="kw">c</span>(<span class="st">"abc"</span>, <span class="st">"def"</span>, <span class="st">"ghi"</span>), <span class="st">"abc|def"</span>)
+<span class="co">#> [1] TRUE TRUE FALSE</span></code></pre></div>
+<p>Note that the precedence for <code>|</code> is low, so that <code>abc|def</code> matches <code>abc</code> or <code>def</code>, not <code>abcef</code> or <code>abdef</code>.</p>
+</div>
+<div id="grouping" class="section level2">
+<h2>Grouping</h2>
+<p>You can use parentheses to override the default precedence rules:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract</span>(<span class="kw">c</span>(<span class="st">"grey"</span>, <span class="st">"gray"</span>), <span class="st">"gre|ay"</span>)
+<span class="co">#> [1] "gre" "ay"</span>
+<span class="kw">str_extract</span>(<span class="kw">c</span>(<span class="st">"grey"</span>, <span class="st">"gray"</span>), <span class="st">"gr(e|a)y"</span>)
+<span class="co">#> [1] "grey" "gray"</span></code></pre></div>
+<p>Parentheses also define “groups” that you can refer to with <strong>backreferences</strong>, like <code>\1</code>, <code>\2</code> etc, and can be extracted with <code>str_match()</code>. For example, the following regular expression finds all fruits that have a repeated pair of letters:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pattern <-<span class="st"> "(..)</span><span class="ch">\\</span><span class="st">1"</span>
+fruit %>%<span class="st"> </span>
+<span class="st"> </span><span class="kw">str_subset</span>(pattern)
+<span class="co">#> [1] "banana" "coconut" "cucumber" "jujube" "papaya" </span>
+<span class="co">#> [6] "salal berry"</span>
+
+fruit %>%<span class="st"> </span>
+<span class="st"> </span><span class="kw">str_subset</span>(pattern) %>%<span class="st"> </span>
+<span class="st"> </span><span class="kw">str_match</span>(pattern)
+<span class="co">#> [,1] [,2]</span>
+<span class="co">#> [1,] "anan" "an"</span>
+<span class="co">#> [2,] "coco" "co"</span>
+<span class="co">#> [3,] "cucu" "cu"</span>
+<span class="co">#> [4,] "juju" "ju"</span>
+<span class="co">#> [5,] "papa" "pa"</span>
+<span class="co">#> [6,] "alal" "al"</span></code></pre></div>
+<p>You can use <code>(?:...)</code>, the non-capturing parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_match</span>(<span class="kw">c</span>(<span class="st">"grey"</span>, <span class="st">"gray"</span>), <span class="st">"gr(e|a)y"</span>)
+<span class="co">#> [,1] [,2]</span>
+<span class="co">#> [1,] "grey" "e" </span>
+<span class="co">#> [2,] "gray" "a"</span>
+<span class="kw">str_match</span>(<span class="kw">c</span>(<span class="st">"grey"</span>, <span class="st">"gray"</span>), <span class="st">"gr(?:e|a)y"</span>)
+<span class="co">#> [,1] </span>
+<span class="co">#> [1,] "grey"</span>
+<span class="co">#> [2,] "gray"</span></code></pre></div>
+<p>This is most useful for more complex cases where you need to capture matches and control precedence independently.</p>
+</div>
+<div id="anchors" class="section level2">
+<h2>Anchors</h2>
+<p>By default, regular expressions will match any part of a string. It’s often useful to <strong>anchor</strong> the regular expression so that it matches from the start or end of the string:</p>
+<ul>
+<li><code>^</code> matches the start of string.</li>
+<li><code>$</code> matches the end of the string.</li>
+</ul>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"apple"</span>, <span class="st">"banana"</span>, <span class="st">"pear"</span>)
+<span class="kw">str_extract</span>(x, <span class="st">"^a"</span>)
+<span class="co">#> [1] "a" NA NA</span>
+<span class="kw">str_extract</span>(x, <span class="st">"a$"</span>)
+<span class="co">#> [1] NA "a" NA</span></code></pre></div>
+<p>To match a literal “$” or “^”, you need to escape them, <code>\$</code>, and <code>\^</code>.</p>
+<p>For multiline strings, you can use <code>regex(multiline = TRUE)</code>. This changes the behaviour of <code>^</code> and <code>$</code>, and introduces three new operators:</p>
+<ul>
+<li><p><code>^</code> now matches the start of each line.</p></li>
+<li><p><code>$</code> now matches the end of each line.</p></li>
+<li><p><code>\A</code> matches the start of the input.</p></li>
+<li><p><code>\z</code> matches the end of the input.</p></li>
+<li><p><code>\Z</code> matches the end of the input, but before the final line terminator, if it exists.</p></li>
+</ul>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "Line 1</span><span class="ch">\n</span><span class="st">Line 2</span><span class="ch">\n</span><span class="st">Line 3</span><span class="ch">\n</span><span class="st">"</span>
+<span class="kw">str_extract_all</span>(x, <span class="st">"^Line.."</span>)[[<span class="dv">1</span>]]
+<span class="co">#> [1] "Line 1"</span>
+<span class="kw">str_extract_all</span>(x, <span class="kw">regex</span>(<span class="st">"^Line.."</span>, <span class="dt">multiline =</span> <span class="ot">TRUE</span>))[[<span class="dv">1</span>]]
+<span class="co">#> [1] "Line 1" "Line 2" "Line 3"</span>
+<span class="kw">str_extract_all</span>(x, <span class="kw">regex</span>(<span class="st">"</span><span class="ch">\\</span><span class="st">ALine.."</span>, <span class="dt">multiline =</span> <span class="ot">TRUE</span>))[[<span class="dv">1</span>]]
+<span class="co">#> [1] "Line 1"</span></code></pre></div>
+</div>
+<div id="repetition" class="section level2">
+<h2>Repetition</h2>
+<p>You can control how many times a pattern matches with the repetition operators:</p>
+<ul>
+<li><code>?</code>: 0 or 1.</li>
+<li><code>+</code>: 1 or more.</li>
+<li><code>*</code>: 0 or more.</li>
+</ul>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"CC?"</span>)
+<span class="co">#> [1] "CC"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"CC+"</span>)
+<span class="co">#> [1] "CCC"</span>
+<span class="kw">str_extract</span>(x, <span class="st">'C[LX]+'</span>)
+<span class="co">#> [1] "CLXXX"</span></code></pre></div>
+<p>Note that the precedence of these operators is high, so you can write: <code>colou?r</code> to match either American or British spellings. That means most uses will need parentheses, like <code>bana(na)+</code>.</p>
+<p>You can also specify the number of matches precisely:</p>
+<ul>
+<li><code>{n}</code>: exactly n</li>
+<li><code>{n,}</code>: n or more</li>
+<li><code>{n,m}</code>: between n and m</li>
+</ul>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract</span>(x, <span class="st">"C{2}"</span>)
+<span class="co">#> [1] "CC"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"C{2,}"</span>)
+<span class="co">#> [1] "CCC"</span>
+<span class="kw">str_extract</span>(x, <span class="st">"C{2,3}"</span>)
+<span class="co">#> [1] "CCC"</span></code></pre></div>
+<p>By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a <code>?</code> after them:</p>
+<ul>
+<li><code>??</code>: 0 or 1, prefer 0.</li>
+<li><code>+?</code>: 1 or more, match as few times as possible.</li>
+<li><code>*?</code>: 0 or more, match as few times as possible.</li>
+<li><code>{n,}?</code>: n or more, match as few times as possible.</li>
+<li><code>{n,m}?</code>: between n and m, match as few times as possible, but at least n.</li>
+</ul>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_extract</span>(x, <span class="kw">c</span>(<span class="st">"C{2,3}"</span>, <span class="st">"C{2,3}?"</span>))
+<span class="co">#> [1] "CCC" "CC"</span>
+<span class="kw">str_extract</span>(x, <span class="kw">c</span>(<span class="st">"C[LX]+"</span>, <span class="st">"C[LX]+?"</span>))
+<span class="co">#> [1] "CLXXX" "CL"</span></code></pre></div>
+<p>You can also make the matches possessive, which means that if later parts of the match fail, the repetition will not be re-tried with a smaller number of characters. This is an advanced feature used to improve performance in worst-case scenarios (called “catastrophic backtracking”).</p>
+<ul>
+<li><code>?+</code>: 0 or 1, possessive.</li>
+<li><code>++</code>: 1 or more, possessive.</li>
+<li><code>*+</code>: 0 or more, possessive.</li>
+<li><code>{n}+</code>: exactly n, possessive.</li>
+<li><code>{n,}+</code>: n or more, possessive.</li>
+<li><code>{n,m}+</code>: between n and m, possessive.</li>
+</ul>
+<p>A related concept is the <strong>atomic-match</strong> parenthesis, <code>(?>...)</code>. If a later match fails and the engine needs to back-track, an atomic match is kept as is: it succeeds or fails as a whole. Compare the following two regular expressions:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_detect</span>(<span class="st">"ABC"</span>, <span class="st">"(?>A|.B)C"</span>)
+<span class="co">#> [1] FALSE</span>
+<span class="kw">str_detect</span>(<span class="st">"ABC"</span>, <span class="st">"(?:A|.B)C"</span>)
+<span class="co">#> [1] TRUE</span></code></pre></div>
+<p>The atomic match fails because the group matches A, the next character is B rather than C, and the engine cannot back-track into the group to try <code>.B</code>. The regular match succeeds because it matches A, finds that C doesn’t match, and then back-tracks and tries <code>.B</code> instead.</p>
+</div>
+<div id="look-arounds" class="section level2">
+<h2>Look arounds</h2>
+<p>These assertions look ahead or behind the current match without “consuming” any characters (i.e. changing the input position).</p>
+<ul>
+<li><p><code>(?=...)</code>: positive look-ahead assertion. Matches if <code>...</code> matches at the current input.</p></li>
+<li><p><code>(?!...)</code>: negative look-ahead assertion. Matches if <code>...</code> <strong>does not</strong> match at the current input.</p></li>
+<li><p><code>(?<=...)</code>: positive look-behind assertion. Matches if <code>...</code> matches text preceding the current position, with the last character of the match being the character just before the current position. Length must be bounded<br />
+(i.e. no <code>*</code> or <code>+</code>).</p></li>
+<li><p><code>(?<!...)</code>: negative look-behind assertion. Matches if <code>...</code> <strong>does not</strong> match text preceding the current position. Length must be bounded<br />
+(i.e. no <code>*</code> or <code>+</code>).</p></li>
+</ul>
+<p>These are useful when you want to check that a pattern exists, but you don’t want to include it in the result:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"1 piece"</span>, <span class="st">"2 pieces"</span>, <span class="st">"3"</span>)
+<span class="kw">str_extract</span>(x, <span class="st">"</span><span class="ch">\\</span><span class="st">d+(?= pieces?)"</span>)
+<span class="co">#> [1] "1" "2" NA</span>
+
+y <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"100"</span>, <span class="st">"$400"</span>)
+<span class="kw">str_extract</span>(y, <span class="st">"(?<=</span><span class="ch">\\</span><span class="st">$)</span><span class="ch">\\</span><span class="st">d+"</span>)
+<span class="co">#> [1] NA "400"</span></code></pre></div>
+</div>
+<div id="comments" class="section level2">
+<h2>Comments</h2>
+<p>There are two ways to include comments in a regular expression. The first is with <code>(?#...)</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_detect</span>(<span class="st">"xyz"</span>, <span class="st">"x(?#this is a comment)"</span>)
+<span class="co">#> [1] TRUE</span></code></pre></div>
+<p>The second is to use <code>regex(comments = TRUE)</code>. This form ignores spaces and newlines, and everything after <code>#</code>. To match a literal space, you’ll need to escape it: <code>"\\ "</code>. This is a useful way of describing complex regular expressions:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">phone <-<span class="st"> </span><span class="kw">regex</span>(<span class="st">"</span>
+<span class="st"> </span><span class="ch">\\</span><span class="st">(? # optional opening parens</span>
+<span class="st"> (</span><span class="ch">\\</span><span class="st">d{3}) # area code</span>
+<span class="st"> [)- ]? # optional closing parens, dash, or space</span>
+<span class="st"> (</span><span class="ch">\\</span><span class="st">d{3}) # another three numbers</span>
+<span class="st"> [ -]? # optional space or dash</span>
+<span class="st"> (</span><span class="ch">\\</span><span class="st">d{3}) # three more numbers</span>
+<span class="st"> "</span>, <span class="dt">comments =</span> <span class="ot">TRUE</span>)
+
+<span class="kw">str_match</span>(<span class="st">"514-791-8141"</span>, phone)
+<span class="co">#> [,1] [,2] [,3] [,4] </span>
+<span class="co">#> [1,] "514-791-814" "514" "791" "814"</span></code></pre></div>
+</div>
+
+
+
+<!-- dynamically load mathjax for compatibility with self-contained -->
+<script>
+ (function () {
+ var script = document.createElement("script");
+ script.type = "text/javascript";
+ script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
+ document.getElementsByTagName("head")[0].appendChild(script);
+ })();
+</script>
+
+</body>
+</html>
diff --git a/inst/doc/stringr.R b/inst/doc/stringr.R
index d738444..2be90dc 100644
--- a/inst/doc/stringr.R
+++ b/inst/doc/stringr.R
@@ -1,6 +1,75 @@
-## ---- echo=FALSE---------------------------------------------------------
-library("stringr")
-knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
+## ---- include = FALSE----------------------------------------------------
+library(stringr)
+knitr::opts_chunk$set(
+ comment = "#>",
+ collapse = TRUE
+)
+
+## ------------------------------------------------------------------------
+str_length("abc")
+
+## ------------------------------------------------------------------------
+x <- c("abcdef", "ghifjk")
+
+# The 3rd letter
+str_sub(x, 3, 3)
+
+# The 2nd to 2nd-to-last character
+str_sub(x, 2, -2)
+
+
+## ------------------------------------------------------------------------
+str_sub(x, 3, 3) <- "X"
+x
+
+## ------------------------------------------------------------------------
+str_dup(x, c(2, 3))
+
+## ------------------------------------------------------------------------
+x <- c("abc", "defghi")
+str_pad(x, 10)
+str_pad(x, 10, "both")
+
+## ------------------------------------------------------------------------
+str_pad(x, 4)
+
+## ------------------------------------------------------------------------
+x <- c("Short", "This is a long string")
+
+x %>%
+ str_trunc(10) %>%
+ str_pad(10, "right")
+
+## ------------------------------------------------------------------------
+x <- c(" a ", "b ", " c")
+str_trim(x)
+str_trim(x, "left")
+
+## ------------------------------------------------------------------------
+jabberwocky <- str_c(
+ "`Twas brillig, and the slithy toves ",
+ "did gyre and gimble in the wabe: ",
+ "All mimsy were the borogoves, ",
+ "and the mome raths outgrabe. "
+)
+cat(str_wrap(jabberwocky, width = 40))
+
+## ------------------------------------------------------------------------
+x <- "I like horses."
+str_to_upper(x)
+str_to_title(x)
+
+str_to_lower(x)
+# Turkish has two sorts of i: with and without the dot
+str_to_lower(x, "tr")
+
+## ------------------------------------------------------------------------
+x <- c("y", "i", "k")
+str_order(x)
+
+str_sort(x)
+# In Lithuanian, y comes between i and k
+str_sort(x, locale = "lt")
## ------------------------------------------------------------------------
strings <- c(
@@ -17,6 +86,10 @@ str_detect(strings, phone)
str_subset(strings, phone)
## ------------------------------------------------------------------------
+# How many phone numbers in each string?
+str_count(strings, phone)
+
+## ------------------------------------------------------------------------
# Where in the string is the phone number located?
(loc <- str_locate(strings, phone))
str_locate_all(strings, phone)
@@ -37,29 +110,33 @@ str_replace(strings, phone, "XXX-XXX-XXXX")
str_replace_all(strings, phone, "XXX-XXX-XXXX")
## ------------------------------------------------------------------------
-col2hex <- function(col) {
- rgb <- col2rgb(col)
- rgb(rgb["red", ], rgb["green", ], rgb["blue", ], max = 255)
-}
+str_split("a-b-c", "-")
+str_split_fixed("a-b-c", "-", n = 2)
-# Goal replace colour names in a string with their hex equivalent
-strings <- c("Roses are red, violets are blue", "My favourite colour is green")
+## ------------------------------------------------------------------------
+a1 <- "\u00e1"
+a2 <- "a\u0301"
+c(a1, a2)
+a1 == a2
-colours <- str_c("\\b", colors(), "\\b", collapse="|")
-# This gets us the colours, but we have no way of replacing them
-str_extract_all(strings, colours)
+## ------------------------------------------------------------------------
+str_detect(a1, fixed(a2))
+str_detect(a1, coll(a2))
+
+## ------------------------------------------------------------------------
+i <- c("I", "İ", "i", "ı")
+i
-# Instead, let's work with locations
-locs <- str_locate_all(strings, colours)
-Map(function(string, loc) {
- hex <- col2hex(str_sub(string, loc))
- str_sub(string, loc) <- hex
- string
-}, strings, locs)
+str_subset(i, coll("i", ignore_case = TRUE))
+str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
## ------------------------------------------------------------------------
-matches <- col2hex(colors())
-names(matches) <- str_c("\\b", colors(), "\\b")
+x <- "This is a sentence."
+str_split(x, boundary("word"))
+str_count(x, boundary("word"))
+str_extract_all(x, boundary("word"))
-str_replace_all(strings, matches)
+## ------------------------------------------------------------------------
+str_split(x, "")
+str_count(x, "")
diff --git a/inst/doc/stringr.Rmd b/inst/doc/stringr.Rmd
index e20db2e..b8b7927 100644
--- a/inst/doc/stringr.Rmd
+++ b/inst/doc/stringr.Rmd
@@ -1,78 +1,159 @@
---
title: "Introduction to stringr"
-date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to stringr}
%\VignetteEngine{knitr::rmarkdown}
- \usepackage[utf8]{inputenc}
+ %\VignetteEncoding{UTF-8}
---
-```{r, echo=FALSE}
-library("stringr")
-knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
+```{r, include = FALSE}
+library(stringr)
+knitr::opts_chunk$set(
+ comment = "#>",
+ collapse = TRUE
+)
```
-Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The __stringr__ package aim [...]
+There are four main families of functions in stringr:
+
+1. Character manipulation: these functions allow you to manipulate the
+ individual characters inside the strings inside character vectors.
+
+1. Whitespace tools to add, remove, and manipulate whitespace.
+
+1. Locale-sensitive operations whose behaviour varies from locale
+ to locale.
+
+1. Pattern matching functions. These recognise four engines of
+ pattern description. The most common is regular expressions, but there
+ are three other tools.
-More concretely, stringr:
+## Getting and setting individual characters
-- Simplifies string operations by eliminating options that you don't need
- 95% of the time (the other 5% of the time you can functions from base R or
- [stringi](https://github.com/Rexamine/stringi/)).
+You can get the length of the string with `str_length()`:
-- Uses consistent function names and arguments.
+```{r}
+str_length("abc")
+```
-- Produces outputs than can easily be used as inputs. This includes ensuring
- that missing inputs result in missing outputs, and zero length inputs result
- in zero length outputs. It also processes factors and character vectors in
- the same way.
+This is now equivalent to the base R function `nchar()`. Previously it was needed to work around issues with `nchar()` such as the fact that it returned 2 for `nchar(NA)`. This has been fixed as of R 3.3.0, so it is no longer so important.
-- Completes R's string handling functions with useful functions from other
- programming languages.
+You can access individual characters using `str_sub()`. It takes three arguments: a character vector, a starting position and an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.
-To meet these goals, stringr provides two basic families of functions:
+```{r}
+x <- c("abcdef", "ghifjk")
-- basic string operations, and
+# The 3rd letter
+str_sub(x, 3, 3)
-- pattern matching functions which use regular expressions to detect, locate,
- match, replace, extract, and split strings.
+# The 2nd to 2nd-to-last character
+str_sub(x, 2, -2)
-As of version 1.0, stringr is a thin wrapper around [stringi](https://github.com/Rexamine/stringi/), which implements all the functions in stringr with efficient C code based on the [ICU library](http://site.icu-project.org). Compared to stringi, stringr is considerably simpler: it provides fewer options and fewer functions. This is great when you're getting started learning string functions, and if you do need more of stringi's power, you should find the interface similar.
+```
-These are described in more detail in the following sections.
+You can also use `str_sub()` to modify strings:
-## Basic string operations
+```{r}
+str_sub(x, 3, 3) <- "X"
+x
+```
-There are three string functions that are closely related to their base R equivalents, but with a few enhancements:
+To duplicate individual strings, you can use `str_dup()`:
-- `str_c()` is equivalent to `paste()`, but it uses the empty string ("") as
- the default separator and silently removes `NULL` inputs.
+```{r}
+str_dup(x, c(2, 3))
+```
-- `str_length()` is equivalent to `nchar()`, but it preserves NA's (rather than
- giving them length 2) and converts factors to characters (not integers).
+## Whitespace
-- `str_sub()` is equivalent to `substr()` but it returns a zero length vector
- if any of its inputs are zero length, and otherwise expands each argument to
- match the longest. It also accepts negative positions, which are calculated
- from the left of the last character. The end position defaults to `-1`,
- which corresponds to the last character.
+Three functions add, remove, or modify whitespace:
-- `str_sub<-` is equivalent to `substr<-`, but like `str_sub` it understands
- negative indices, and replacement strings not do need to be the same length
- as the string they are replacing.
+1. `str_pad()` pads a string to a fixed length by adding extra whitespace on
+ the left, right, or both sides.
+
+ ```{r}
+ x <- c("abc", "defghi")
+ str_pad(x, 10)
+ str_pad(x, 10, "both")
+ ```
+
+ (You can pad with other characters by using the `pad` argument.)
+
+ `str_pad()` will never make a string shorter:
+
+ ```{r}
+ str_pad(x, 4)
+ ```
+
+ So if you want to ensure that all strings are the same length (often useful
+ for print methods), combine `str_pad()` and `str_trunc()`:
+
+ ```{r}
+ x <- c("Short", "This is a long string")
+
+ x %>%
+ str_trunc(10) %>%
+ str_pad(10, "right")
+ ```
-Three functions add new functionality:
+1. The opposite of `str_pad()` is `str_trim()`, which removes leading and
+ trailing whitespace:
+
+ ```{r}
+ x <- c(" a ", "b ", " c")
+ str_trim(x)
+ str_trim(x, "left")
+ ```
-- `str_dup()` to duplicate the characters within a string.
+1. You can use `str_wrap()` to modify existing whitespace in order to wrap
+ a paragraph of text so that the length of each line is as similar as
+ possible.
+
+ ```{r}
+ jabberwocky <- str_c(
+ "`Twas brillig, and the slithy toves ",
+ "did gyre and gimble in the wabe: ",
+ "All mimsy were the borogoves, ",
+ "and the mome raths outgrabe. "
+ )
+ cat(str_wrap(jabberwocky, width = 40))
+ ```
-- `str_trim()` to remove leading and trailing whitespace.
+## Locale sensitive
-- `str_pad()` to pad a string with extra whitespace on the left, right, or both sides.
+A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These are the case transformation functions:
+
+```{r}
+x <- "I like horses."
+str_to_upper(x)
+str_to_title(x)
+
+str_to_lower(x)
+# Turkish has two sorts of i: with and without the dot
+str_to_lower(x, "tr")
+```
+
+And string ordering and sorting:
+
+```{r}
+x <- c("y", "i", "k")
+str_order(x)
+
+str_sort(x)
+# In Lithuanian, y comes between i and k
+str_sort(x, locale = "lt")
+```
+
+The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two letter ISO-639-1 language code (like "en" for English or "zh" for Chinese), and optionally an ISO-3166 country code (like "en_GB" vs "en_US"). You can see a complete list of available locales by running `stringi::stri_locale_list()`.
## Pattern matching
-stringr provides pattern matching functions to **detect**, **locate**, **extract**, **match**, **replace**, and **split** strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
+The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.
+
+### Tasks
+
+Each pattern matching function has the same first two arguments, a character vector of `string`s to process and a single `pattern` to match. stringr provides pattern matching functions to **detect**, **locate**, **extract**, **match**, **replace**, and **split** strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
```{r}
strings <- c(
@@ -95,6 +176,13 @@ phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_subset(strings, phone)
```
+- `str_count()` counts the number of matches:
+
+ ```{r}
+ # How many phone numbers in each string?
+ str_count(strings, phone)
+ ```
+
- `str_locate()` locates the first position of a pattern and returns a numeric
matrix with columns start and end. `str_locate_all()` locates all matches,
returning a list of numeric matrices. Similar to `regexpr()` and `gregexpr()`.
@@ -140,62 +228,73 @@ phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
- `str_split_fixed()` splits the string into a fixed number of pieces based
on a pattern and returns a character matrix. `str_split()` splits a string
into a variable number of pieces and returns a list of character vectors.
+
+ ```{r}
+ str_split("a-b-c", "-")
+ str_split_fixed("a-b-c", "-", n = 2)
+ ```
-### Arguments
-
-Each pattern matching function has the same first two arguments, a character vector of `string`s to process and a single `pattern` (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.
+### Engines
-Unlike base string functions, stringr offers control over matching not through arguments, but through modifier functions, `regex()`, `coll()` and `fixed()`. This is a deliberate choice made to simplify these functions. For example, while `grepl` has six arguments, `str_detect()` only has two.
+There are four main engines that stringr can use to describe patterns:
-### Regular expressions
+* Regular expressions, the default, as shown above, and described in
+ `vignette("regular-expressions")`.
+
+* Fixed bytewise matching, with `fixed()`.
-To be able to use these functions effectively, you'll need a good knowledge of regular expressions, which this vignette is not going to teach you. Some useful tools to get you started:
+* Locale-sensitive character matching, with `coll()`.
-- A good [reference sheet](http://www.regular-expressions.info/reference.html).
+* Text boundary analysis with `boundary()`.
-- A tool that allows you to [interactively test](http://gskinner.com/RegExr/)
- what a regular expression will match.
+#### Fixed matches
-- A tool to [build a regular expression](http://www.txt2re.com) from an
- input string.
+`fixed(x)` only matches the exact sequence of bytes specified by `x`. This is a very limited "pattern", but the restriction can make matching much faster. Beware using `fixed()` with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent:
-When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn't match) test cases to ensure that you are matching the correct components.
+```{r}
+a1 <- "\u00e1"
+a2 <- "a\u0301"
+c(a1, a2)
+a1 == a2
+```
-### Functions that return lists
+They render identically, but because they're defined differently,
+`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
+next, to respect human character comparison rules:
-Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use `Map()` to iterate through the vectors simultaneously. The second strategy is illustrated below:
+```{r}
+str_detect(a1, fixed(a2))
+str_detect(a1, coll(a2))
+```
+
+#### Collation search
+
+`coll(x)` looks for a match to `x` using human-language **coll**ation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you'll also need to supply a `locale` parameter.
```{r}
-col2hex <- function(col) {
- rgb <- col2rgb(col)
- rgb(rgb["red", ], rgb["green", ], rgb["blue", ], max = 255)
-}
-
-# Goal replace colour names in a string with their hex equivalent
-strings <- c("Roses are red, violets are blue", "My favourite colour is green")
-
-colours <- str_c("\\b", colors(), "\\b", collapse="|")
-# This gets us the colours, but we have no way of replacing them
-str_extract_all(strings, colours)
-
-# Instead, let's work with locations
-locs <- str_locate_all(strings, colours)
-Map(function(string, loc) {
- hex <- col2hex(str_sub(string, loc))
- str_sub(string, loc) <- hex
- string
-}, strings, locs)
+i <- c("I", "İ", "i", "ı")
+i
+
+str_subset(i, coll("i", ignore_case = TRUE))
+str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```
-Another approach is to use the second form of `str_replace_all()`: if you give it a named vector, it applies each `pattern = replacement` in turn:
+The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`. Note that while both `fixed()` and `regex()` have `ignore_case` arguments, they perform a much simpler comparison than `coll()`.
-```{r}
-matches <- col2hex(colors())
-names(matches) <- str_c("\\b", colors(), "\\b")
+#### Boundary
+
+`boundary()` matches boundaries between characters, lines, sentences or words. It's most useful with `str_split()`, but can be used with all pattern matching functions:
-str_replace_all(strings, matches)
+```{r}
+x <- "This is a sentence."
+str_split(x, boundary("word"))
+str_count(x, boundary("word"))
+str_extract_all(x, boundary("word"))
```
-## Conclusion
+By convention, `""` is treated as `boundary("character")`:
-stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug free.
+```{r}
+str_split(x, "")
+str_count(x, "")
+```
diff --git a/inst/doc/stringr.html b/inst/doc/stringr.html
index f44fa90..1e2f836 100644
--- a/inst/doc/stringr.html
+++ b/inst/doc/stringr.html
@@ -11,7 +11,6 @@
<meta name="viewport" content="width=device-width, initial-scale=1">
-<meta name="date" content="2016-08-19" />
<title>Introduction to stringr</title>
@@ -68,44 +67,113 @@ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Inf
<h1 class="title toc-ignore">Introduction to stringr</h1>
-<h4 class="date"><em>2016-08-19</em></h4>
-<p>Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The <strong>stringr</str [...]
-<p>More concretely, stringr:</p>
-<ul>
-<li><p>Simplifies string operations by eliminating options that you don’t need 95% of the time (the other 5% of the time you can functions from base R or <a href="https://github.com/Rexamine/stringi/">stringi</a>).</p></li>
-<li><p>Uses consistent function names and arguments.</p></li>
-<li><p>Produces outputs than can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs. It also processes factors and character vectors in the same way.</p></li>
-<li><p>Completes R’s string handling functions with useful functions from other programming languages.</p></li>
-</ul>
-<p>To meet these goals, stringr provides two basic families of functions:</p>
-<ul>
-<li><p>basic string operations, and</p></li>
-<li><p>pattern matching functions which use regular expressions to detect, locate, match, replace, extract, and split strings.</p></li>
-</ul>
-<p>As of version 1.0, stringr is a thin wrapper around <a href="https://github.com/Rexamine/stringi/">stringi</a>, which implements all the functions in stringr with efficient C code based on the <a href="http://site.icu-project.org">ICU library</a>. Compared to stringi, stringr is considerably simpler: it provides fewer options and fewer functions. This is great when you’re getting started learning string functions, and if you do need more of stringi’s power, you should find the interfa [...]
-<p>These are described in more detail in the following sections.</p>
-<div id="basic-string-operations" class="section level2">
-<h2>Basic string operations</h2>
-<p>There are three string functions that are closely related to their base R equivalents, but with a few enhancements:</p>
-<ul>
-<li><p><code>str_c()</code> is equivalent to <code>paste()</code>, but it uses the empty string (“”) as the default separator and silently removes <code>NULL</code> inputs.</p></li>
-<li><p><code>str_length()</code> is equivalent to <code>nchar()</code>, but it preserves NA’s (rather than giving them length 2) and converts factors to characters (not integers).</p></li>
-<li><p><code>str_sub()</code> is equivalent to <code>substr()</code> but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are calculated from the left of the last character. The end position defaults to <code>-1</code>, which corresponds to the last character.</p></li>
-<li><p><code>str_sub<-</code> is equivalent to <code>substr<-</code>, but like <code>str_sub</code> it understands negative indices, and replacement strings not do need to be the same length as the string they are replacing.</p></li>
-</ul>
-<p>Three functions add new functionality:</p>
-<ul>
-<li><p><code>str_dup()</code> to duplicate the characters within a string.</p></li>
-<li><p><code>str_trim()</code> to remove leading and trailing whitespace.</p></li>
-<li><p><code>str_pad()</code> to pad a string with extra whitespace on the left, right, or both sides.</p></li>
-</ul>
+<p>There are four main families of functions in stringr:</p>
+<ol style="list-style-type: decimal">
+<li><p>Character manipulation: these functions allow you to manipulate the individual characters inside the strings inside character vectors.</p></li>
+<li><p>Whitespace tools to add, remove, and manipulate whitespace.</p></li>
+<li><p>Locale-sensitive operations whose behaviour varies from locale to locale.</p></li>
+<li><p>Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.</p></li>
+</ol>
+<div id="getting-and-setting-individual-characters" class="section level2">
+<h2>Getting and setting individual characters</h2>
+<p>You can get the length of the string with <code>str_length()</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_length</span>(<span class="st">"abc"</span>)
+<span class="co">#> [1] 3</span></code></pre></div>
+<p>This is now equivalent to the base R function <code>nchar()</code>. Previously it was needed to work around issues with <code>nchar()</code> such as the fact that it returned 2 for <code>nchar(NA)</code>. This has been fixed as of R 3.3.0, so it is no longer so important.</p>
+<p>You can access individual characters using <code>str_sub()</code>. It takes three arguments: a character vector, a starting position and an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"abcdef"</span>, <span class="st">"ghifjk"</span>)
+
+<span class="co"># The 3rd letter</span>
+<span class="kw">str_sub</span>(x, <span class="dv">3</span>, <span class="dv">3</span>)
+<span class="co">#> [1] "c" "i"</span>
+
+<span class="co"># The 2nd to 2nd-to-last character</span>
+<span class="kw">str_sub</span>(x, <span class="dv">2</span>, -<span class="dv">2</span>)
+<span class="co">#> [1] "bcde" "hifj"</span></code></pre></div>
+<p>You can also use <code>str_sub()</code> to modify strings:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_sub</span>(x, <span class="dv">3</span>, <span class="dv">3</span>) <-<span class="st"> "X"</span>
+x
+<span class="co">#> [1] "abXdef" "ghXfjk"</span></code></pre></div>
+<p>To duplicate individual strings, you can use <code>str_dup()</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_dup</span>(x, <span class="kw">c</span>(<span class="dv">2</span>, <span class="dv">3</span>))
+<span class="co">#> [1] "abXdefabXdef" "ghXfjkghXfjkghXfjk"</span></code></pre></div>
+</div>
+<div id="whitespace" class="section level2">
+<h2>Whitespace</h2>
+<p>Three functions add, remove, or modify whitespace:</p>
+<ol style="list-style-type: decimal">
+<li><p><code>str_pad()</code> pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"abc"</span>, <span class="st">"defghi"</span>)
+<span class="kw">str_pad</span>(x, <span class="dv">10</span>)
+<span class="co">#> [1] "       abc" "    defghi"</span>
+<span class="kw">str_pad</span>(x, <span class="dv">10</span>, <span class="st">"both"</span>)
+<span class="co">#> [1] "   abc    " "  defghi  "</span></code></pre></div>
+<p>(You can pad with other characters by using the <code>pad</code> argument.)</p>
+<p><code>str_pad()</code> will never make a string shorter:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_pad</span>(x, <span class="dv">4</span>)
+<span class="co">#> [1] " abc" "defghi"</span></code></pre></div>
+<p>So if you want to ensure that all strings are the same length (often useful for print methods), combine <code>str_pad()</code> and <code>str_trunc()</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Short"</span>, <span class="st">"This is a long string"</span>)
+
+x %>%<span class="st"> </span>
+<span class="st"> </span><span class="kw">str_trunc</span>(<span class="dv">10</span>) %>%<span class="st"> </span>
+<span class="st"> </span><span class="kw">str_pad</span>(<span class="dv">10</span>, <span class="st">"right"</span>)
+<span class="co">#> [1] "Short     " "This is..."</span></code></pre></div></li>
+<li><p>The opposite of <code>str_pad()</code> is <code>str_trim()</code>, which removes leading and trailing whitespace:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">" a "</span>, <span class="st">"b "</span>, <span class="st">" c"</span>)
+<span class="kw">str_trim</span>(x)
+<span class="co">#> [1] "a" "b" "c"</span>
+<span class="kw">str_trim</span>(x, <span class="st">"left"</span>)
+<span class="co">#> [1] "a " "b " "c"</span></code></pre></div></li>
+<li><p>You can use <code>str_wrap()</code> to modify existing whitespace in order to wrap a paragraph of text so that the length of each line is as similar as possible.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">jabberwocky <-<span class="st"> </span><span class="kw">str_c</span>(
+ <span class="st">"`Twas brillig, and the slithy toves "</span>,
+ <span class="st">"did gyre and gimble in the wabe: "</span>,
+ <span class="st">"All mimsy were the borogoves, "</span>,
+ <span class="st">"and the mome raths outgrabe. "</span>
+)
+<span class="kw">cat</span>(<span class="kw">str_wrap</span>(jabberwocky, <span class="dt">width =</span> <span class="dv">40</span>))
+<span class="co">#> `Twas brillig, and the slithy toves did</span>
+<span class="co">#> gyre and gimble in the wabe: All mimsy</span>
+<span class="co">#> were the borogoves, and the mome raths</span>
+<span class="co">#> outgrabe.</span></code></pre></div></li>
+</ol>
+</div>
+<div id="locale-sensitive" class="section level2">
+<h2>Locale sensitive</h2>
+<p>A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These functions are the case transformation functions:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "I like horses."</span>
+<span class="kw">str_to_upper</span>(x)
+<span class="co">#> [1] "I LIKE HORSES."</span>
+<span class="kw">str_to_title</span>(x)
+<span class="co">#> [1] "I Like Horses."</span>
+
+<span class="kw">str_to_lower</span>(x)
+<span class="co">#> [1] "i like horses."</span>
+<span class="co"># Turkish has two sorts of i: with and without the dot</span>
+<span class="kw">str_to_lower</span>(x, <span class="st">"tr"</span>)
+<span class="co">#> [1] "ı like horses."</span></code></pre></div>
+<p>And string ordering and sorting:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"y"</span>, <span class="st">"i"</span>, <span class="st">"k"</span>)
+<span class="kw">str_order</span>(x)
+<span class="co">#> [1] 2 3 1</span>
+
+<span class="kw">str_sort</span>(x)
+<span class="co">#> [1] "i" "k" "y"</span>
+<span class="co"># In Lithuanian, y comes between i and k</span>
+<span class="kw">str_sort</span>(x, <span class="dt">locale =</span> <span class="st">"lt"</span>)
+<span class="co">#> [1] "i" "y" "k"</span></code></pre></div>
+<p>The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two letter ISO-639-1 language code (like “en” for English or “zh” for Chinese), and optionally an ISO-3166 country code (like “en_GB” vs “en_US”). You can see a complete list of available locales by running <code>stringi::stri_locale_list()</code>.</p>
</div>
<div id="pattern-matching" class="section level2">
<h2>Pattern matching</h2>
-<p>stringr provides pattern matching functions to <strong>detect</strong>, <strong>locate</strong>, <strong>extract</strong>, <strong>match</strong>, <strong>replace</strong>, and <strong>split</strong> strings. I’ll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:</p>
+<p>The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.</p>
+<div id="tasks" class="section level3">
+<h3>Tasks</h3>
+<p>Each pattern matching function has the same first two arguments, a character vector of <code>string</code>s to process and a single <code>pattern</code> to match. stringr provides pattern matching functions to <strong>detect</strong>, <strong>locate</strong>, <strong>extract</strong>, <strong>match</strong>, <strong>replace</strong>, and <strong>split</strong> strings. I’ll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">strings <-<span class="st"> </span><span class="kw">c</span>(
<span class="st">"apple"</span>,
<span class="st">"219 733 8965"</span>,
@@ -122,6 +190,10 @@ phone <-<span class="st"> "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})
<span class="co">#> [1] "219 733 8965" </span>
<span class="co">#> [2] "329-293-8753" </span>
<span class="co">#> [3] "Work: 579-499-7527; Home: 543.355.3679"</span></code></pre></div></li>
+<li><p><code>str_count()</code> counts the number of matches:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># How many phone numbers in each string?</span>
+<span class="kw">str_count</span>(strings, phone)
+<span class="co">#> [1] 0 1 1 2</span></code></pre></div></li>
<li><p><code>str_locate()</code> locates the first position of a pattern and returns a numeric matrix with columns start and end. <code>str_locate_all()</code> locates all matches, returning a list of numeric matrices. Similar to <code>regexpr()</code> and <code>gregexpr()</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Where in the string is the phone number located?</span>
(loc <-<span class="st"> </span><span class="kw">str_locate</span>(strings, phone))
@@ -203,68 +275,73 @@ phone <-<span class="st"> "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})
<span class="co">#> [2] "XXX-XXX-XXXX" </span>
<span class="co">#> [3] "XXX-XXX-XXXX" </span>
<span class="co">#> [4] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"</span></code></pre></div></li>
-<li><p><code>str_split_fixed()</code> splits the string into a fixed number of pieces based on a pattern and returns a character matrix. <code>str_split()</code> splits a string into a variable number of pieces and returns a list of character vectors.</p></li>
+<li><p><code>str_split_fixed()</code> splits the string into a fixed number of pieces based on a pattern and returns a character matrix. <code>str_split()</code> splits a string into a variable number of pieces and returns a list of character vectors.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_split</span>(<span class="st">"a-b-c"</span>, <span class="st">"-"</span>)
+<span class="co">#> [[1]]</span>
+<span class="co">#> [1] "a" "b" "c"</span>
+<span class="kw">str_split_fixed</span>(<span class="st">"a-b-c"</span>, <span class="st">"-"</span>, <span class="dt">n =</span> <span class="dv">2</span>)
+<span class="co">#> [,1] [,2] </span>
+<span class="co">#> [1,] "a" "b-c"</span></code></pre></div></li>
</ul>
-<div id="arguments" class="section level3">
-<h3>Arguments</h3>
-<p>Each pattern matching function has the same first two arguments, a character vector of <code>string</code>s to process and a single <code>pattern</code> (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.</p>
-<p>Unlike base string functions, stringr offers control over matching not through arguments, but through modifier functions, <code>regex()</code>, <code>coll()</code> and <code>fixed()</code>. This is a deliberate choice made to simplify these functions. For example, while <code>grepl</code> has six arguments, <code>str_detect()</code> only has two.</p>
</div>
-<div id="regular-expressions" class="section level3">
-<h3>Regular expressions</h3>
-<p>To be able to use these functions effectively, you’ll need a good knowledge of regular expressions, which this vignette is not going to teach you. Some useful tools to get you started:</p>
+<div id="engines" class="section level3">
+<h3>Engines</h3>
+<p>There are four main engines that stringr can use to describe patterns:</p>
<ul>
-<li><p>A good <a href="http://www.regular-expressions.info/reference.html">reference sheet</a>.</p></li>
-<li><p>A tool that allows you to <a href="http://gskinner.com/RegExr/">interactively test</a> what a regular expression will match.</p></li>
-<li><p>A tool to <a href="http://www.txt2re.com">build a regular expression</a> from an input string.</p></li>
+<li><p>Regular expressions, the default, as shown above, and described in <code>vignette("regular-expressions")</code>.</p></li>
+<li><p>Fixed bytewise matching, with <code>fixed()</code>.</p></li>
+<li><p>Locale-sensitive character matching, with <code>coll()</code>.</p></li>
+<li><p>Text boundary analysis with <code>boundary()</code>.</p></li>
</ul>
-<p>When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn’t match) test cases to ensure that you are matching the correct components.</p>
+<div id="fixed-matches" class="section level4">
+<h4>Fixed matches</h4>
+<p><code>fixed(x)</code> only matches the exact sequence of bytes specified by <code>x</code>. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using <code>fixed()</code> with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">a1 <-<span class="st"> "\u00e1"</span>
+a2 <-<span class="st"> "a\u0301"</span>
+<span class="kw">c</span>(a1, a2)
+<span class="co">#> [1] "á" "á"</span>
+a1 ==<span class="st"> </span>a2
+<span class="co">#> [1] FALSE</span></code></pre></div>
+<p>They render identically, but because they’re defined differently, <code>fixed()</code> doesn’t find a match. Instead, you can use <code>coll()</code>, defined next, to respect human character comparison rules:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_detect</span>(a1, <span class="kw">fixed</span>(a2))
+<span class="co">#> [1] FALSE</span>
+<span class="kw">str_detect</span>(a1, <span class="kw">coll</span>(a2))
+<span class="co">#> [1] TRUE</span></code></pre></div>
+</div>
+<div id="collation-search" class="section level4">
+<h4>Collation search</h4>
+<p><code>coll(x)</code> looks for a match to <code>x</code> using human-language <strong>coll</strong>ation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a <code>locale</code> parameter.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">i <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"I"</span>, <span class="st">"İ"</span>, <span class="st">"i"</span>, <span class="st">"ı"</span>)
+i
+<span class="co">#> [1] "I" "İ" "i" "ı"</span>
+
+<span class="kw">str_subset</span>(i, <span class="kw">coll</span>(<span class="st">"i"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>))
+<span class="co">#> [1] "I" "i"</span>
+<span class="kw">str_subset</span>(i, <span class="kw">coll</span>(<span class="st">"i"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>, <span class="dt">locale =</span> <span class="st">"tr"</span>))
+<span class="co">#> [1] "İ" "i"</span></code></pre></div>
+<p>The downside of <code>coll()</code> is speed; because the rules for recognising which characters are the same are complicated, <code>coll()</code> is relatively slow compared to <code>regex()</code> and <code>fixed()</code>. Note that while both <code>fixed()</code> and <code>regex()</code> have <code>ignore_case</code> arguments, they perform a much simpler comparison than <code>coll()</code>.</p>
</div>
-<div id="functions-that-return-lists" class="section level3">
-<h3>Functions that return lists</h3>
-<p>Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use <code>Map()</code> to iterate through the vectors simultaneously. The second strategy is illustrated below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">col2hex <-<span class="st"> </span>function(col) {
- rgb <-<span class="st"> </span><span class="kw">col2rgb</span>(col)
- <span class="kw">rgb</span>(rgb[<span class="st">"red"</span>, ], rgb[<span class="st">"green"</span>, ], rgb[<span class="st">"blue"</span>, ], <span class="dt">max =</span> <span class="dv">255</span>)
-}
-
-<span class="co"># Goal replace colour names in a string with their hex equivalent</span>
-strings <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Roses are red, violets are blue"</span>, <span class="st">"My favourite colour is green"</span>)
-
-colours <-<span class="st"> </span><span class="kw">str_c</span>(<span class="st">"</span><span class="ch">\\</span><span class="st">b"</span>, <span class="kw">colors</span>(), <span class="st">"</span><span class="ch">\\</span><span class="st">b"</span>, <span class="dt">collapse=</span><span class="st">"|"</span>)
-<span class="co"># This gets us the colours, but we have no way of replacing them</span>
-<span class="kw">str_extract_all</span>(strings, colours)
+<div id="boundary" class="section level4">
+<h4>Boundary</h4>
+<p><code>boundary()</code> matches boundaries between characters, lines, sentences or words. It’s most useful with <code>str_split()</code>, but can be used with all pattern matching functions:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> "This is a sentence."</span>
+<span class="kw">str_split</span>(x, <span class="kw">boundary</span>(<span class="st">"word"</span>))
<span class="co">#> [[1]]</span>
-<span class="co">#> [1] "red" "blue"</span>
-<span class="co">#> </span>
-<span class="co">#> [[2]]</span>
-<span class="co">#> [1] "green"</span>
-
-<span class="co"># Instead, let's work with locations</span>
-locs <-<span class="st"> </span><span class="kw">str_locate_all</span>(strings, colours)
-<span class="kw">Map</span>(function(string, loc) {
- hex <-<span class="st"> </span><span class="kw">col2hex</span>(<span class="kw">str_sub</span>(string, loc))
- <span class="kw">str_sub</span>(string, loc) <-<span class="st"> </span>hex
- string
-}, strings, locs)
-<span class="co">#> $`Roses are red, violets are blue`</span>
-<span class="co">#> [1] "Roses are #FF0000, violets are blue"</span>
-<span class="co">#> [2] "Roses are red, violets are #0000FF" </span>
-<span class="co">#> </span>
-<span class="co">#> $`My favourite colour is green`</span>
-<span class="co">#> [1] "My favourite colour is #00FF00"</span></code></pre></div>
-<p>Another approach is to use the second form of <code>str_replace_all()</code>: if you give it a named vector, it applies each <code>pattern = replacement</code> in turn:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">matches <-<span class="st"> </span><span class="kw">col2hex</span>(<span class="kw">colors</span>())
-<span class="kw">names</span>(matches) <-<span class="st"> </span><span class="kw">str_c</span>(<span class="st">"</span><span class="ch">\\</span><span class="st">b"</span>, <span class="kw">colors</span>(), <span class="st">"</span><span class="ch">\\</span><span class="st">b"</span>)
-
-<span class="kw">str_replace_all</span>(strings, matches)
-<span class="co">#> [1] "Roses are #FF0000, violets are #0000FF"</span>
-<span class="co">#> [2] "My favourite colour is #00FF00"</span></code></pre></div>
+<span class="co">#> [1] "This" "is" "a" "sentence"</span>
+<span class="kw">str_count</span>(x, <span class="kw">boundary</span>(<span class="st">"word"</span>))
+<span class="co">#> [1] 4</span>
+<span class="kw">str_extract_all</span>(x, <span class="kw">boundary</span>(<span class="st">"word"</span>))
+<span class="co">#> [[1]]</span>
+<span class="co">#> [1] "This" "is" "a" "sentence"</span></code></pre></div>
+<p>By convention, <code>""</code> is treated as <code>boundary("character")</code>:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str_split</span>(x, <span class="st">""</span>)
+<span class="co">#> [[1]]</span>
+<span class="co">#> [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c"</span>
+<span class="co">#> [18] "e" "."</span>
+<span class="kw">str_count</span>(x, <span class="st">""</span>)
+<span class="co">#> [1] 19</span></code></pre></div>
</div>
</div>
-<div id="conclusion" class="section level2">
-<h2>Conclusion</h2>
-<p>stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug free.</p>
</div>
diff --git a/man/case.Rd b/man/case.Rd
index d32a9d0..4fb1ff1 100644
--- a/man/case.Rd
+++ b/man/case.Rd
@@ -2,21 +2,22 @@
% Please edit documentation in R/case.R
\name{case}
\alias{case}
+\alias{str_to_upper}
\alias{str_to_lower}
\alias{str_to_title}
-\alias{str_to_upper}
\title{Convert case of a string.}
\usage{
-str_to_upper(string, locale = "")
+str_to_upper(string, locale = "en")
-str_to_lower(string, locale = "")
+str_to_lower(string, locale = "en")
-str_to_title(string, locale = "")
+str_to_title(string, locale = "en")
}
\arguments{
\item{string}{String to modify}
-\item{locale}{Locale to use for translations.}
+\item{locale}{Locale to use for translations. Defaults to "en" (English)
+to ensure consistent default ordering across platforms.}
}
\description{
Convert case of a string.
@@ -28,7 +29,6 @@ str_to_lower(dog)
str_to_title(dog)
# Locale matters!
-str_to_upper("i", "en") # English
+str_to_upper("i") # English
str_to_upper("i", "tr") # Turkish
}
-
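The effect of the new `locale = "en"` default can be checked with a minimal sketch like the one below. It assumes stringr 1.2.0 or later is installed; the variable names `en` and `tr` are illustrative only, and the expected values follow from Unicode case mapping rather than from anything stated in this commit.

```r
library(stringr)

# With the new default (locale = "en"), "i" upper-cases to a plain "I"
en <- str_to_upper("i")

# In Turkish, "i" upper-cases to the dotted capital I (U+0130)
tr <- str_to_upper("i", locale = "tr")

en == "I"
tr == "\u0130"
```

Because the default no longer depends on the system locale, the first call should give the same result on every platform.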
diff --git a/man/invert_match.Rd b/man/invert_match.Rd
index e2ec24a..3cdd8cb 100644
--- a/man/invert_match.Rd
+++ b/man/invert_match.Rd
@@ -24,4 +24,3 @@ str_sub(numbers, num_loc[, "start"], num_loc[, "end"])
text_loc <- invert_match(num_loc)
str_sub(numbers, text_loc[, "start"], text_loc[, "end"])
}
-
diff --git a/man/modifier-deprecated.Rd b/man/modifier-deprecated.Rd
index 0201ccd..2256be3 100644
--- a/man/modifier-deprecated.Rd
+++ b/man/modifier-deprecated.Rd
@@ -1,8 +1,8 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/modifiers.r
\name{modifier-deprecated}
-\alias{ignore.case}
\alias{modifier-deprecated}
+\alias{ignore.case}
\alias{perl}
\title{Deprecated modifier functions.}
\usage{
@@ -14,4 +14,3 @@ perl(pattern)
Please use \code{\link{regex}} and \code{\link{coll}} instead.
}
\keyword{internal}
-
diff --git a/man/modifiers.Rd b/man/modifiers.Rd
index 732165f..df25163 100644
--- a/man/modifiers.Rd
+++ b/man/modifiers.Rd
@@ -1,16 +1,16 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/modifiers.r
\name{modifiers}
-\alias{boundary}
-\alias{coll}
-\alias{fixed}
\alias{modifiers}
+\alias{fixed}
+\alias{coll}
\alias{regex}
+\alias{boundary}
\title{Control matching behaviour with modifier functions.}
\usage{
fixed(pattern, ignore_case = FALSE)
-coll(pattern, ignore_case = FALSE, locale = NULL, ...)
+coll(pattern, ignore_case = FALSE, locale = "en", ...)
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
dotall = FALSE, ...)
@@ -24,7 +24,9 @@ boundary(type = c("character", "line_break", "sentence", "word"),
\item{ignore_case}{Should case differences be ignored in the match?}
\item{locale}{Locale to use for comparisons. See
-\code{\link[stringi]{stri_locale_list}()} for all possible options.}
+\code{\link[stringi]{stri_locale_list}()} for all possible options.
+Defaults to "en" (English) to ensure that the default collation is
+consistent across platforms.}
\item{...}{Other less frequently used arguments passed on to
\code{\link[stringi]{stri_opts_collator}},
@@ -85,4 +87,3 @@ str_extract_all("a\\nb\\nc", regex("^.", multiline = TRUE))
str_extract_all("a\\nb\\nc", "a.")
str_extract_all("a\\nb\\nc", regex("a.", dotall = TRUE))
}
-
diff --git a/man/pipe.Rd b/man/pipe.Rd
index e0bc900..f9ca34a 100644
--- a/man/pipe.Rd
+++ b/man/pipe.Rd
@@ -10,4 +10,3 @@ lhs \%>\% rhs
Pipe operator
}
\keyword{internal}
-
diff --git a/man/str_c.Rd b/man/str_c.Rd
index a59757c..14a1781 100644
--- a/man/str_c.Rd
+++ b/man/str_c.Rd
@@ -54,4 +54,3 @@ str_c(str_replace_na(c("a", NA, "b")), "-d")
\code{\link{paste}} for equivalent base R functionality, and
\code{\link[stringi]{stri_join}} which this function wraps
}
-
diff --git a/man/str_conv.Rd b/man/str_conv.Rd
index 5233e50..7cb35d2 100644
--- a/man/str_conv.Rd
+++ b/man/str_conv.Rd
@@ -22,4 +22,3 @@ x
str_conv(x, "ISO-8859-2") # Polish "a with ogonek"
str_conv(x, "ISO-8859-1") # Plus-minus
}
-
diff --git a/man/str_count.Rd b/man/str_count.Rd
index 6d8f574..440be40 100644
--- a/man/str_count.Rd
+++ b/man/str_count.Rd
@@ -47,4 +47,3 @@ str_count(c("a.", "...", ".a.a"), fixed("."))
\code{\link{str_locate}}/\code{\link{str_locate_all}} to locate position
of matches
}
-
diff --git a/man/str_detect.Rd b/man/str_detect.Rd
index fb9dfe2..c0e1f02 100644
--- a/man/str_detect.Rd
+++ b/man/str_detect.Rd
@@ -43,6 +43,7 @@ str_detect(fruit, "[aeiou]")
str_detect("aecfg", letters)
}
\seealso{
-\code{\link[stringi]{stri_detect}} which this function wraps
+\code{\link[stringi]{stri_detect}} which this function wraps,
+ \code{\link{str_subset}} for a convenient wrapper around
+ \code{x[str_detect(x, pattern)]}
}
-
diff --git a/man/str_dup.Rd b/man/str_dup.Rd
index 011824c..36047fe 100644
--- a/man/str_dup.Rd
+++ b/man/str_dup.Rd
@@ -23,4 +23,3 @@ str_dup(fruit, 2)
str_dup(fruit, 1:3)
str_c("ba", str_dup("na", 0:5))
}
-
diff --git a/man/str_extract.Rd b/man/str_extract.Rd
index b3e1a09..d88f914 100644
--- a/man/str_extract.Rd
+++ b/man/str_extract.Rd
@@ -60,4 +60,3 @@ str_extract_all("This is, suprisingly, a sentence.", boundary("word"))
\code{\link{str_match}} to extract matched groups;
\code{\link[stringi]{stri_extract}} for the underlying implementation.
}
-
diff --git a/man/str_interp.Rd b/man/str_interp.Rd
index b70bef1..fb405e4 100644
--- a/man/str_interp.Rd
+++ b/man/str_interp.Rd
@@ -59,4 +59,3 @@ str_interp(c(
\author{
Stefan Milton Bache
}
-
diff --git a/man/str_length.Rd b/man/str_length.Rd
index 9df8a16..e47e3f3 100644
--- a/man/str_length.Rd
+++ b/man/str_length.Rd
@@ -42,4 +42,3 @@ str_count(u2)
\seealso{
\code{\link[stringi]{stri_length}} which this function wraps.
}
-
diff --git a/man/str_locate.Rd b/man/str_locate.Rd
index b6fc070..50556c6 100644
--- a/man/str_locate.Rd
+++ b/man/str_locate.Rd
@@ -56,4 +56,3 @@ str_locate_all(fruit, "")
\code{\link{str_extract}} for a convenient way of extracting matches,
\code{\link[stringi]{stri_locate}} for the underlying implementation.
}
-
diff --git a/man/str_match.Rd b/man/str_match.Rd
index 56ac7d2..ea60f31 100644
--- a/man/str_match.Rd
+++ b/man/str_match.Rd
@@ -50,4 +50,3 @@ str_extract_all(x, "<.*?>")
\code{\link[stringi]{stri_match}} for the underlying
implementation.
}
-
diff --git a/man/str_order.Rd b/man/str_order.Rd
index 991be0f..5a9cb0d 100644
--- a/man/str_order.Rd
+++ b/man/str_order.Rd
@@ -5,9 +5,11 @@
\alias{str_sort}
\title{Order or sort a character vector.}
\usage{
-str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
+str_order(x, decreasing = FALSE, na_last = TRUE, locale = "en",
+ numeric = FALSE, ...)
-str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
+str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en",
+ numeric = FALSE, ...)
}
\arguments{
\item{x}{A character vector to sort.}
@@ -19,7 +21,11 @@ lowest to highest; if \code{TRUE} sorts from highest to lowest.}
\code{FALSE} at the beginning, \code{NA} dropped.}
\item{locale}{In which locale should the sorting occur? Defaults to
-the current locale.}
+English. This ensures that code behaves the same way across
+platforms.}
+
+\item{numeric}{If \code{TRUE}, will sort digits numerically, instead
+of as strings.}
\item{...}{Other options used to control sorting order. Passed on to
\code{\link[stringi]{stri_opts_collator}}.}
@@ -28,13 +34,16 @@ the current locale.}
Order or sort a character vector.
}
\examples{
-str_order(letters, locale = "en")
-str_sort(letters, locale = "en")
+str_order(letters)
+str_sort(letters)
str_order(letters, locale = "haw")
str_sort(letters, locale = "haw")
+
+x <- c("100a10", "100a5", "2b", "2a")
+str_sort(x)
+str_sort(x, numeric = TRUE)
}
\seealso{
\code{\link[stringi]{stri_order}} for the underlying implementation.
}
-
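The new `numeric` argument documented above can be illustrated with the example vector from the Rd file. This is a sketch assuming stringr 1.2.0 or later; the names `sorted_text` and `sorted_numeric` are illustrative only.

```r
library(stringr)

x <- c("100a10", "100a5", "2b", "2a")

# Default: digits are compared as text, character by character
sorted_text <- str_sort(x)
sorted_text
#> [1] "100a10" "100a5"  "2a"     "2b"

# numeric = TRUE: runs of digits are compared as numbers
sorted_numeric <- str_sort(x, numeric = TRUE)
sorted_numeric
#> [1] "2a"     "2b"     "100a5"  "100a10"
```

Numeric sorting is mainly useful for identifiers such as file names or version-like labels, where "2" is expected to come before "100".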
diff --git a/man/str_pad.Rd b/man/str_pad.Rd
index 9dc833a..2d03e0c 100644
--- a/man/str_pad.Rd
+++ b/man/str_pad.Rd
@@ -40,4 +40,3 @@ str_pad("hadley", 3)
\code{\link{str_trim}} to remove whitespace;
\code{\link{str_trunc}} to decrease the maximum width of a string.
}
-
diff --git a/man/str_replace.Rd b/man/str_replace.Rd
index 674d812..3133714 100644
--- a/man/str_replace.Rd
+++ b/man/str_replace.Rd
@@ -13,14 +13,34 @@ str_replace_all(string, pattern, replacement)
\item{string}{Input vector. Either a character vector, or something
coercible to one.}
-\item{pattern, replacement}{Supply separate pattern and replacement strings
- to vectorise over the patterns. References of the form \code{\1},
- \code{\2} will be replaced with the contents of the respective matched
- group (created by \code{()}) within the pattern.
-
- For \code{str_replace_all} only, you can perform multiple patterns and
- replacements to each string, by passing a named character to
- \code{pattern}.}
+\item{pattern}{Pattern to look for.
+
+ The default interpretation is a regular expression, as described
+ in \link[stringi]{stringi-search-regex}. Control options with
+ \code{\link{regex}()}.
+
+ Match a fixed string (i.e. by comparing only bytes), using
+ \code{\link{fixed}(x)}. This is fast, but approximate. Generally,
+ for matching human text, you'll want \code{\link{coll}(x)} which
+ respects character matching rules for the specified locale.
+
+ Match character, word, line and sentence boundaries with
+ \code{\link{boundary}()}. An empty pattern, "", is equivalent to
+ \code{boundary("character")}.}
+
+\item{replacement}{A character vector of replacements. Should be either
+ length one, or the same length as \code{string} or \code{pattern}.
+ References of the form \code{\1}, \code{\2}, etc will be replaced with
+ the contents of the respective matched group (created by \code{()}).
+
+ To perform multiple replacements in each element of \code{string},
+ pass a named vector (\code{c(pattern1 = replacement1)}) to
+ \code{str_replace_all}. Alternatively, pass a function to
+ \code{replacement}: it will be called once for each match and its
+ return value will be used to replace the match.
+
+ To replace the complete string with \code{NA}, use
+ \code{replacement = NA_character_}.}
}
\value{
A character vector.
@@ -32,6 +52,8 @@ Vectorised over \code{string}, \code{pattern} and \code{replacement}.
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
str_replace_all(fruits, "[aeiou]", "-")
+str_replace_all(fruits, "[aeiou]", toupper)
+str_replace_all(fruits, "b", NA_character_)
str_replace(fruits, "([aeiou])", "")
str_replace(fruits, "([aeiou])", "\\\\1\\\\1")
@@ -48,12 +70,26 @@ str_replace_all(fruits, "[aeiou]", c("1", "2", "3"))
str_replace_all(fruits, c("a", "e", "i"), "-")
# If you want to apply multiple patterns and replacements to the same
-# string, pass a named version to pattern.
-str_replace_all(str_c(fruits, collapse = "---"),
- c("one" = 1, "two" = 2, "three" = 3))
+# string, pass a named vector to pattern.
+fruits \%>\%
+ str_c(collapse = "---") \%>\%
+ str_replace_all(c("one" = "1", "two" = "2", "three" = "3"))
+
+# Use a function for more sophisticated replacement. This example
+# replaces colour names with their hex values.
+colours <- str_c("\\\\b", colors(), "\\\\b", collapse="|")
+col2hex <- function(col) {
+ rgb <- col2rgb(col)
+ rgb(rgb["red", ], rgb["green", ], rgb["blue", ], max = 255)
+}
+
+x <- c(
+ "Roses are red, violets are blue",
+ "My favourite colour is green"
+)
+str_replace_all(x, colours, col2hex)
}
\seealso{
-\code{str_replace_na} to turn missing values into "NA";
+\code{\link{str_replace_na}} to turn missing values into "NA";
\code{\link{stri_replace}} for the underlying implementation.
}
-
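The two new forms of `replacement` described in this Rd file, a function and `NA_character_`, can be sketched as follows. The example assumes stringr 1.2.0 or later and reuses the `fruits` vector from the documented examples; `upper_vowels` and `with_na` are illustrative names.

```r
library(stringr)

fruits <- c("one apple", "two pears", "three bananas")

# A function receives the matched text and returns its replacement
upper_vowels <- str_replace_all(fruits, "[aeiou]", toupper)
upper_vowels
#> [1] "OnE ApplE"     "twO pEArs"     "thrEE bAnAnAs"

# NA_character_ replaces the complete string wherever the pattern matches
with_na <- str_replace_all(fruits, "b", NA_character_)
with_na
#> [1] "one apple" "two pears" NA
```

Passing a function avoids the locate-then-assign pattern that the older vignette used for computed replacements.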
diff --git a/man/str_replace_na.Rd b/man/str_replace_na.Rd
index a4372fe..77219c2 100644
--- a/man/str_replace_na.Rd
+++ b/man/str_replace_na.Rd
@@ -10,14 +10,7 @@ str_replace_na(string, replacement = "NA")
\item{string}{Input vector. Either a character vector, or something
coercible to one.}
-\item{replacement}{Supply separate pattern and replacement strings
- to vectorise over the patterns. References of the form \code{\1},
- \code{\2} will be replaced with the contents of the respective matched
- group (created by \code{()}) within the pattern.
-
- For \code{str_replace_all} only, you can perform multiple patterns and
- replacements to each string, by passing a named character to
- \code{pattern}.}
+\item{replacement}{A single string.}
}
\description{
Turn NA into "NA"
@@ -25,4 +18,3 @@ Turn NA into "NA"
\examples{
str_replace_na(c(NA, "abc", "def"))
}
-
diff --git a/man/str_split.Rd b/man/str_split.Rd
index 9a866d7..2aeae8a 100644
--- a/man/str_split.Rd
+++ b/man/str_split.Rd
@@ -66,4 +66,3 @@ str_split_fixed(fruits, " and ", 4)
\seealso{
\code{\link{stri_split}} for the underlying implementation.
}
-
diff --git a/man/str_sub.Rd b/man/str_sub.Rd
index 5a4852b..ef48830 100644
--- a/man/str_sub.Rd
+++ b/man/str_sub.Rd
@@ -69,4 +69,3 @@ str_sub(x, 2, -2) <- ""; x
\seealso{
The underlying implementation in \code{\link[stringi]{stri_sub}}
}
-
diff --git a/man/str_subset.Rd b/man/str_subset.Rd
index f0a255b..58732b1 100644
--- a/man/str_subset.Rd
+++ b/man/str_subset.Rd
@@ -2,9 +2,12 @@
% Please edit documentation in R/subset.R
\name{str_subset}
\alias{str_subset}
-\title{Keep strings matching a pattern.}
+\alias{str_which}
+\title{Keep strings matching a pattern, or find positions.}
\usage{
str_subset(string, pattern)
+
+str_which(string, pattern)
}
\arguments{
\item{string}{Input vector. Either a character vector, or something
@@ -29,22 +32,29 @@ coercible to one.}
A character vector.
}
\description{
-This is a convenient wrapper around \code{x[str_detect(x, pattern)]}.
+\code{str_subset()} is a wrapper around \code{x[str_detect(x, pattern)]},
+and is equivalent to \code{grep(pattern, x, value = TRUE)}.
+\code{str_which()} is a wrapper around \code{which(str_detect(x, pattern))},
+and is equivalent to \code{grep(pattern, x)}.
+}
+\details{
Vectorised over \code{string} and \code{pattern}
}
\examples{
fruit <- c("apple", "banana", "pear", "pinapple")
str_subset(fruit, "a")
+str_which(fruit, "a")
+
str_subset(fruit, "^a")
str_subset(fruit, "a$")
str_subset(fruit, "b")
str_subset(fruit, "[aeiou]")
-# Missings are silently dropped
+# Missings never match
str_subset(c("a", NA, "b"), ".")
+str_which(c("a", NA, "b"), ".")
}
\seealso{
\code{\link{grep}} with argument \code{value = TRUE},
\code{\link[stringi]{stri_subset}} for the underlying implementation.
}
-
diff --git a/man/str_trim.Rd b/man/str_trim.Rd
index ddd1eb8..f4dcaf3 100644
--- a/man/str_trim.Rd
+++ b/man/str_trim.Rd
@@ -24,4 +24,3 @@ str_trim("\\n\\nString with trailing and leading white space\\n\\n")
\seealso{
\code{\link{str_pad}} to add whitespace
}
-
diff --git a/man/str_trunc.Rd b/man/str_trunc.Rd
index ea2c72f..c7d98b4 100644
--- a/man/str_trunc.Rd
+++ b/man/str_trunc.Rd
@@ -30,4 +30,3 @@ rbind(
\seealso{
\code{\link{str_pad}} to increase the minimum width of a string.
}
-
diff --git a/man/str_view.Rd b/man/str_view.Rd
index dc642de..2870033 100644
--- a/man/str_view.Rd
+++ b/man/str_view.Rd
@@ -49,4 +49,3 @@ str_view(c("abc", "def", "fgh"), "d|e")
str_view(c("abc", "def", "fgh"), "d|e", match = TRUE)
str_view(c("abc", "def", "fgh"), "d|e", match = FALSE)
}
-
diff --git a/man/str_wrap.Rd b/man/str_wrap.Rd
index 345aeed..033bd82 100644
--- a/man/str_wrap.Rd
+++ b/man/str_wrap.Rd
@@ -35,4 +35,3 @@ cat(str_wrap(thanks, width = 60, indent = 2), "\\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\\n")
cat(str_wrap(thanks, width = 0, exdent = 2), "\\n")
}
-
diff --git a/man/stringr-data.Rd b/man/stringr-data.Rd
index 95a5e83..a7fb1fd 100644
--- a/man/stringr-data.Rd
+++ b/man/stringr-data.Rd
@@ -2,9 +2,9 @@
% Please edit documentation in R/data.R
\docType{data}
\name{stringr-data}
-\alias{fruit}
-\alias{sentences}
\alias{stringr-data}
+\alias{sentences}
+\alias{fruit}
\alias{words}
\title{Sample character vectors for practicing string manipulations.}
\format{A character vector.}
@@ -33,4 +33,3 @@ length(words)
words[1:5]
}
\keyword{datasets}
-
diff --git a/man/stringr-package.Rd b/man/stringr-package.Rd
new file mode 100644
index 0000000..7c65496
--- /dev/null
+++ b/man/stringr-package.Rd
@@ -0,0 +1,33 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/stringr.R
+\docType{package}
+\name{stringr-package}
+\alias{stringr}
+\alias{stringr-package}
+\title{stringr: Simple, Consistent Wrappers for Common String Operations}
+\description{
+A consistent, simple and easy to use set of wrappers around the
+fantastic 'stringi' package. All function and argument names (and positions)
+are consistent, all functions deal with "NA"'s and zero length vectors
+in the same way, and the output from one function is easy to feed into
+the input of another.
+}
+\seealso{
+Useful links:
+\itemize{
+ \item \url{http://stringr.tidyverse.org}
+ \item \url{https://github.com/tidyverse/stringr}
+ \item Report bugs at \url{https://github.com/tidyverse/stringr/issues}
+}
+
+}
+\author{
+\strong{Maintainer}: Hadley Wickham \email{hadley at rstudio.com} [copyright holder]
+
+Other contributors:
+\itemize{
+ \item RStudio [copyright holder]
+}
+
+}
+\keyword{internal}
diff --git a/man/word.Rd b/man/word.Rd
index 3e56105..a94e8a9 100644
--- a/man/word.Rd
+++ b/man/word.Rd
@@ -42,4 +42,3 @@ str <- 'abc.def..123.4568.999'
word(str, 1, sep = fixed('..'))
word(str, 2, sep = fixed('..'))
}
-
diff --git a/tests/testthat/test-match.r b/tests/testthat/test-match.r
index bbf43df..1c85a22 100644
--- a/tests/testthat/test-match.r
+++ b/tests/testthat/test-match.r
@@ -50,6 +50,10 @@ test_that("match returns NA when optional group doesn't match", {
expect_equal(str_match(c("ab", "a"), "(a)(b)?")[,3], c("b", NA))
})
+test_that("match_all returns NA when optional group doesn't match",{
+ expect_equal(str_match_all("a", "(a)(b)?")[[1]][1, ], c("a", "a", NA))
+})
+
test_that("multiple match works", {
phones_one <- str_c(phones, collapse = " ")
multi_match <- str_match_all(phones_one,
diff --git a/tests/testthat/test-replace.r b/tests/testthat/test-replace.r
index 438d8ba..077c140 100644
--- a/tests/testthat/test-replace.r
+++ b/tests/testthat/test-replace.r
@@ -40,6 +40,27 @@ test_that("can replace multiple matches", {
expect_equal(y, c("11", "22"))
})
+test_that("replacement must be a string", {
+ expect_error(str_replace("x", "x", 1), "must be a character vector")
+})
+
+test_that("replacement must be a string", {
+ expect_equal(str_replace("xyz", "x", NA_character_), NA_character_)
+})
+
+
+# functions ---------------------------------------------------------------
+
+test_that("can supply replacement function", {
+ expect_equal(str_replace("abc", "a|c", toupper), "Abc")
+ expect_equal(str_replace_all("abc", "a|c", toupper), "AbC")
+})
+
+test_that("replacement can be different length", {
+ double <- function(x) str_dup(x, 2)
+ expect_equal(str_replace_all("abc", "a|c", double), "aabcc")
+})
+
# fix_replacement ---------------------------------------------------------
test_that("$ are escaped", {
diff --git a/vignettes/regular-expressions.Rmd b/vignettes/regular-expressions.Rmd
new file mode 100644
index 0000000..9bb6287
--- /dev/null
+++ b/vignettes/regular-expressions.Rmd
@@ -0,0 +1,415 @@
+---
+title: "Regular expressions"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Regular expressions}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+ collapse = TRUE,
+ comment = "#>"
+)
+library(stringr)
+```
+
+Regular expressions are a concise and flexible tool for describing patterns in strings. This vignette describes the key features of stringr's regular expressions, as implemented by [stringi](https://github.com/gagolews/stringi). It is not a tutorial, so if you're unfamiliar with regular expressions, I'd recommend starting at <http://r4ds.had.co.nz/strings.html>. If you want to master the details, I'd recommend reading the classic [_Mastering Regular Expressions_](https://amzn.com/0596528124) [...]
+
+Regular expressions are the default pattern engine in stringr. That means when you use a pattern matching function with a bare string, it's equivalent to wrapping it in a call to `regex()`:
+
+```{r, eval = FALSE}
+# The regular call:
+str_extract(fruit, "nana")
+# Is shorthand for
+str_extract(fruit, regex("nana"))
+```
+
+You will need to use `regex()` explicitly if you want to override the default options, as you'll see in examples below.
+
+## Basic matches
+
+The simplest patterns match exact strings:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_extract(x, "an")
+```
+
+You can perform a case-insensitive match using `ignore_case = TRUE`:
+
+```{r}
+bananas <- c("banana", "Banana", "BANANA")
+str_detect(bananas, "banana")
+str_detect(bananas, regex("banana", ignore_case = TRUE))
+```
+
+The next step up in complexity is `.`, which matches any character except a newline:
+
+```{r}
+str_extract(x, ".a.")
+```
+
+You can allow `.` to match everything, including `\n`, by setting `dotall = TRUE`:
+
+```{r}
+str_detect("\nX\n", ".X.")
+str_detect("\nX\n", regex(".X.", dotall = TRUE))
+```
+
+## Escaping
+
+If "`.`" matches any character, how do you match a literal "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we n [...]
+
+```{r}
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+writeLines(dot)
+
+# And this tells R to look for an explicit .
+str_extract(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
+
+```{r}
+x <- "a\\b"
+writeLines(x)
+
+str_extract(x, "\\\\")
+```
+
+In this vignette, I use `\.` to denote the regular expression, and `"\\."` to denote the string that represents the regular expression.
+
+An alternative quoting mechanism is `\Q...\E`: all the characters in `...` are treated as exact matches. This is useful if you want to exactly match user input as part of a regular expression.
+
+```{r}
+x <- c("a.b.c.d", "aeb")
+starts_with <- "a.b"
+
+str_detect(x, paste0("^", starts_with))
+str_detect(x, paste0("^\\Q", starts_with, "\\E"))
+```
+
+## Special characters
+
+Escapes also allow you to specify individual characters that are otherwise hard to type. You can specify individual unicode characters in five ways, either as a variable number of hex digits (four is most common), or by name:
+
+* `\xhh`: 2 hex digits.
+
+* `\x{hhhh}`: 1-6 hex digits.
+
+* `\uhhhh`: 4 hex digits.
+
+* `\Uhhhhhhhh`: 8 hex digits.
+
+* `\N{name}`, e.g. `\N{grinning face}` matches the basic smiling emoji.
+
+Similarly, you can specify many common control characters:
+
+* `\a`: bell.
+
+* `\cX`: match a control-X character.
+
+* `\e`: escape (`\u001B`).
+
+* `\f`: form feed (`\u000C`).
+
+* `\n`: line feed (`\u000A`).
+
+* `\r`: carriage return (`\u000D`).
+
+* `\t`: horizontal tabulation (`\u0009`).
+
+* `\0ooo`: matches an octal character. 'ooo' is from one to three octal digits,
+ from 000 to 0377. The leading zero is required.
+
+(Many of these are only of historical interest and are only included here for the sake of completeness.)
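+
+As a rough illustration (the input string is made up for this example), the escapes above can be used directly in a pattern. This is only a sketch of a few of the forms listed, not an exhaustive check:
+
+```{r}
+library(stringr)
+
+# Hypothetical input: "é" stored as the single code point U+00E9, then a tab
+x <- "caf\u00e9\tau lait"
+
+# Two equivalent ways of naming the same code point in the regexp
+str_detect(x, "\\u00e9")
+str_detect(x, "\\x{e9}")
+
+# \t matches the tab character
+str_replace(x, "\\t", " ")
+```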
+
+## Matching multiple characters
+
+There are a number of patterns that match more than one character. You've already seen `.`, which matches any character (except a newline). A closely related operator is `\X`, which matches a __grapheme cluster__, a set of individual elements that form a single symbol. For example, one way of representing "á" is as the letter "a" plus an accent: `.` will match the component "a", while `\X` will match the complete symbol:
+
+```{r}
+x <- "a\u0301"
+str_extract(x, ".")
+str_extract(x, "\\X")
+```
+
+There are five other escaped pairs that match narrower classes of characters:
+
+* `\d`: matches any digit. The complement, `\D`, matches any character that
+ is not a decimal digit.
+
+ ```{r}
+ str_extract_all("1 + 2 = 3", "\\d+")[[1]]
+ ```
+
+ Technically, `\d` includes any character in the Unicode Category of Nd
+ ("Number, Decimal Digit"), which also includes numeric symbols from other
+ languages:
+
+ ```{r}
+ # Some Khmer numbers
+ str_detect("១២៣", "\\d")
+ ```
+
+* `\s`: matches any whitespace. This includes tabs, newlines, form feeds,
+ and any character in the Unicode Z Category (which includes a variety of
+ space characters and other separators). The complement, `\S`, matches any
+ non-whitespace character.
+
+ ```{r}
+ (text <- "Some \t badly\n\t\tspaced \f text")
+ str_replace_all(text, "\\s+", " ")
+ ```
+
+* `\p{property name}` matches any character with a specific Unicode property,
+ like `\p{Uppercase}` or `\p{Diacritic}`. The complement,
+ `\P{property name}`, matches all characters without the property.
+ A complete list of Unicode properties can be found at
+ <http://www.unicode.org/reports/tr44/#Property_Index>.
+
+ ```{r}
+ (text <- c('"Double quotes"', "«Guillemet»", "“Fancy quotes”"))
+ str_replace_all(text, "\\p{quotation mark}", "'")
+ ```
+
+* `\w` matches any "word" character, which includes alphabetic characters,
+ marks and decimal numbers. The complement, `\W`, matches any non-word
+ character.
+
+ ```{r}
+ str_extract_all("Don't eat that!", "\\w+")[[1]]
+ str_split("Don't eat that!", "\\W")[[1]]
+ ```
+
+ Technically, `\w` also matches connector punctuation, `\u200c` (zero width
+ connector), and `\u200d` (zero width joiner), but these are rarely seen in
+ the wild.
+
+* `\b` matches word boundaries, the transition between word and non-word
+ characters. `\B` matches the opposite: positions that have either word
+ characters on both sides or non-word characters on both sides.
+
+ ```{r}
+ str_replace_all("The quick brown fox", "\\b", "_")
+ str_replace_all("The quick brown fox", "\\B", "_")
+ ```
+
+You can also create your own __character classes__ using `[]`:
+
+* `[abc]`: matches a, b, or c.
+* `[a-z]`: matches every character between a and z
+ (in Unicode code point order).
+* `[^abc]`: matches anything except a, b, or c.
+* `[\^\-]`: matches `^` or `-`.
+
+There are a number of pre-built classes that you can use inside `[]`:
+
+* `[:punct:]`: punctuation.
+* `[:alpha:]`: letters.
+* `[:lower:]`: lowercase letters.
+* `[:upper:]`: uppercase letters.
+* `[:digit:]`: digits.
+* `[:xdigit:]`: hex digits.
+* `[:alnum:]`: letters and numbers.
+* `[:cntrl:]`: control characters.
+* `[:graph:]`: letters, numbers, and punctuation.
+* `[:print:]`: letters, numbers, punctuation, and whitespace.
+* `[:space:]`: space characters (basically equivalent to `\s`).
+* `[:blank:]`: space and tab.
+
+These all go inside the `[]` for character classes, i.e. `[[:digit:]AX]` matches all digits, A, and X.
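+
+A minimal sketch of combining a pre-built class with literal characters, and of negating several classes at once. The input string is invented for the example:
+
+```{r}
+library(stringr)
+
+# Hypothetical input string
+x <- "Order #A12, ref X9!"
+
+# All digits, plus the literal letters A and X
+str_extract_all(x, "[[:digit:]AX]")[[1]]
+
+# Drop everything that is neither alphanumeric nor whitespace
+str_replace_all(x, "[^[:alnum:][:space:]]", "")
+```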
+
+You can also use Unicode properties, like `[\p{Letter}]`, and various set operations, like `[\p{Letter}--\p{script=latin}]`. See `?"stringi-search-charclass"` for details.
+
+## Alternation
+
+`|` is the __alternation__ operator, which will pick between one or more possible matches. For example, `abc|def` will match `abc` or `def`.
+
+```{r}
+str_detect(c("abc", "def", "ghi"), "abc|def")
+```
+
+Note that the precedence for `|` is low, so that `abc|def` matches `abc` or `def`, not `abcef` or `abdef`.
+
+## Grouping
+
+You can use parentheses to override the default precedence rules:
+
+```{r}
+str_extract(c("grey", "gray"), "gre|ay")
+str_extract(c("grey", "gray"), "gr(e|a)y")
+```
+
+Parentheses also define "groups" that you can refer to with __backreferences__, like `\1`, `\2` etc, and can be extracted with `str_match()`. For example, the following regular expression finds all fruits that have a repeated pair of letters:
+
+```{r}
+pattern <- "(..)\\1"
+fruit %>%
+ str_subset(pattern)
+
+fruit %>%
+ str_subset(pattern) %>%
+ str_match(pattern)
+```
+
+You can use `(?:...)`, the non-capturing parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.
+
+```{r}
+str_match(c("grey", "gray"), "gr(e|a)y")
+str_match(c("grey", "gray"), "gr(?:e|a)y")
+```
+
+This is most useful for more complex cases where you need to capture matches and control precedence independently.
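+
+For instance, here is a small sketch (with made-up strings) where the alternation needs parentheses for precedence, but only the word that follows should be captured:
+
+```{r}
+library(stringr)
+
+# Hypothetical input strings
+x <- c("grey cat", "gray dog")
+
+# (?:...) groups the alternation without creating a capture group,
+# so the only captured column is the word after the colour
+str_match(x, "(?:grey|gray) (\\w+)")
+```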
+
+## Anchors
+
+By default, regular expressions will match any part of a string. It's often useful to __anchor__ the regular expression so that it matches from the start or end of the string:
+
+* `^` matches the start of string.
+* `$` matches the end of the string.
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_extract(x, "^a")
+str_extract(x, "a$")
+```
+
+To match a literal "$" or "^", you need to escape them, `\$`, and `\^`.
+
+For multiline strings, you can use `regex(multiline = TRUE)`. This changes the behaviour of `^` and `$`, and introduces three new operators:
+
+* `^` now matches the start of each line.
+
+* `$` now matches the end of each line.
+
+* `\A` matches the start of the input.
+
+* `\z` matches the end of the input.
+
+* `\Z` matches the end of the input, but before the final line terminator,
+ if it exists.
+
+```{r}
+x <- "Line 1\nLine 2\nLine 3\n"
+str_extract_all(x, "^Line..")[[1]]
+str_extract_all(x, regex("^Line..", multiline = TRUE))[[1]]
+str_extract_all(x, regex("\\ALine..", multiline = TRUE))[[1]]
+```
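+
+The difference between `\z` and `\Z` only shows up when the input ends with a line terminator. A minimal sketch, using a made-up two-line input:
+
+```{r}
+library(stringr)
+
+# Hypothetical input that ends with a newline
+x <- "Line 1\nLine 2\n"
+
+# \Z is allowed to match just before the final line terminator
+str_detect(x, "Line 2\\Z")
+
+# \z only matches at the very end of the input, after the final "\n"
+str_detect(x, "Line 2\\z")
+str_detect(x, "Line 2\\n\\z")
+```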
+
+## Repetition
+
+You can control how many times a pattern matches with the repetition operators:
+
+* `?`: 0 or 1.
+* `+`: 1 or more.
+* `*`: 0 or more.
+
+```{r}
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_extract(x, "CC?")
+str_extract(x, "CC+")
+str_extract(x, 'C[LX]+')
+```
+
+Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+`.
+
+You can also specify the number of matches precisely:
+
+* `{n}`: exactly n
+* `{n,}`: n or more
+* `{n,m}`: between n and m
+
+```{r}
+str_extract(x, "C{2}")
+str_extract(x, "C{2,}")
+str_extract(x, "C{2,3}")
+```
+
+By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them:
+
+* `??`: 0 or 1, prefer 0.
+* `+?`: 1 or more, match as few times as possible.
+* `*?`: 0 or more, match as few times as possible.
+* `{n,}?`: n or more, match as few times as possible.
+* `{n,m}?`: between n and m, match as few times as possible, but at least n.
+
+```{r}
+str_extract(x, c("C{2,3}", "C{2,3}?"))
+str_extract(x, c("C[LX]+", "C[LX]+?"))
+```
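+
+A common place where the greedy/lazy distinction matters is delimited text. The following sketch uses an invented HTML-like string:
+
+```{r}
+library(stringr)
+
+# Hypothetical input
+x <- "<b>bold</b>"
+
+# Greedy: .+ runs to the last ">" in the string
+str_extract(x, "<.+>")
+
+# Lazy: .+? stops at the first ">" that lets the match succeed
+str_extract(x, "<.+?>")
+```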
+
+You can also make the matches possessive, which means that if later parts of the match fail, the repetition will not be re-tried with a smaller number of characters. This is an advanced feature used to improve performance in worst-case scenarios (called "catastrophic backtracking").
+
+* `?+`: 0 or 1, possessive.
+* `++`: 1 or more, possessive.
+* `*+`: 0 or more, possessive.
+* `{n}+`: exactly n, possessive.
+* `{n,}+`: n or more, possessive.
+* `{n,m}+`: between n and m, possessive.
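+
+ICU, the engine behind stringi, writes possessive quantifiers with a trailing `+` (for example `a++`). A minimal sketch of the effect, using a made-up input:
+
+```{r}
+library(stringr)
+
+# Greedy: a+ first takes all three a's, then gives one back so that
+# the final "a" in the pattern can match
+str_detect("aaa", "a+a")
+
+# Possessive: a++ takes all three a's and never gives any back,
+# so nothing is left for the final "a"
+str_detect("aaa", "a++a")
+```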
+
+A related concept is the __atomic-match__ parenthesis, `(?>...)`. If a later match fails and the engine needs to back-track, an atomic match is kept as is: it succeeds or fails as a whole. Compare the following two regular expressions:
+
+```{r}
+str_detect("ABC", "(?>A|.B)C")
+str_detect("ABC", "(?:A|.B)C")
+```
+
+The atomic match fails because it matches A, and then the next character is B rather than C; since the atomic group is kept as is, the other alternative is never tried. The regular match succeeds because it matches A, but then C doesn't match, so it back-tracks and tries `.B` instead.
+
+## Look arounds
+
+These assertions look ahead or behind the current match without "consuming" any characters (i.e. changing the input position).
+
+* `(?=...)`: positive look-ahead assertion. Matches if `...` matches at the
+ current input.
+
+* `(?!...)`: negative look-ahead assertion. Matches if `...` __does not__
+ match at the current input.
+
+* `(?<=...)`: positive look-behind assertion. Matches if `...` matches text
+ preceding the current position, with the last character of the match
+ being the character just before the current position. Length must be bounded
+ (i.e. no `*` or `+`).
+
+* `(?<!...)`: negative look-behind assertion. Matches if `...` __does not__
+ match text preceding the current position. Length must be bounded
+ (i.e. no `*` or `+`).
+
+These are useful when you want to check that a pattern exists, but you don't want to include it in the result:
+
+```{r}
+x <- c("1 piece", "2 pieces", "3")
+str_extract(x, "\\d+(?= pieces?)")
+
+y <- c("100", "$400")
+str_extract(y, "(?<=\\$)\\d+")
+```
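+
+The negative forms work the same way. As a sketch (the file names are hypothetical, and `y` repeats the values used above):
+
+```{r}
+library(stringr)
+
+# Hypothetical file names
+files <- c("notes.txt", "data.csv")
+
+# Keep names whose dot is NOT followed by "txt"
+str_subset(files, "\\.(?!txt)")
+
+# Extract a whole number only if it is NOT preceded by "$"
+y <- c("100", "$400")
+str_extract(y, "(?<!\\$)\\b\\d+")
+```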
+
+## Comments
+
+There are two ways to include comments in a regular expression. The first is with `(?#...)`:
+
+```{r}
+str_detect("xyz", "x(?#this is a comment)")
+```
+
+The second is to use `regex(comments = TRUE)`. This form ignores spaces and newlines, and everything after `#`. To match a literal space, you'll need to escape it: `"\\ "`. This is a useful way of describing complex regular expressions:
+
+```{r}
+phone <- regex("
+ \\(? # optional opening parens
+ (\\d{3}) # area code
+ (?:\\)|-|\\ )? # optional closing parens, dash, or (escaped) space
+ (\\d{3}) # another three numbers
+ (?:-|\\ )? # optional (escaped) space or dash
+ (\\d{3}) # three more numbers
+ ", comments = TRUE)
+
+str_match("514-791-8141", phone)
+```
diff --git a/vignettes/releases/stringr-1.0.0.Rmd b/vignettes/releases/stringr-1.0.0.Rmd
new file mode 100644
index 0000000..e969341
--- /dev/null
+++ b/vignettes/releases/stringr-1.0.0.Rmd
@@ -0,0 +1,76 @@
+---
+title: "stringr 1.0.0"
+date: "2015-05-05"
+---
+
+```{r, echo = FALSE}
+knitr::opts_chunk$set(comment = "#>", collapse = T)
+```
+
+I'm very excited to announce the 1.0.0 release of the stringr package. If you haven't heard of stringr before, it makes string manipulation easier by:
+
+* Using consistent function and argument names: all functions start with `str_`,
+ and the first argument is always the input string. This makes stringr easier
+ to learn and easy to use with [the pipe](http://github.com/smbache/magrittr/).
+
+* Eliminating options that you don't need 95% of the time.
+
+To get started with stringr, check out the [new vignette](http://cran.r-project.org/web/packages/stringr/vignettes/stringr.html).
+
+## What's new?
+
+The biggest change in this release is that stringr is now powered by the [stringi](https://github.com/Rexamine/stringi) package instead of base R. This has two big benefits: stringr is now much faster, and has much better unicode support.
+
+If you've used stringi before, you might wonder why stringr is still necessary: stringi does everything that stringr does, and much much more. There are two reasons that I think stringr is still important:
+
+1. Lots of people use it already, so this update will give many people a
+ performance boost for free.
+
+1. The smaller API of stringr makes it a little easier to learn.
+
+That said, once you've learned stringr, using stringi should be easy, so it's a great place to start if you need a tool that doesn't exist in stringr.
+
+## New features and functions
+
+* `str_replace_all()` gains a convenient syntax for applying multiple pairs of
+ pattern and replacement to the same vector:
+
+ ```{r}
+ x <- c("abc", "def")
+ str_replace_all(x, c("[ad]" = "!", "[cf]" = "?"))
+ ```
+
+* `str_subset()` keeps values that match a pattern:
+
+ ```{r}
+ x <- c("abc", "def", "jhi", "klm", "nop")
+ str_subset(x, "[aeiou]")
+ ```
+
+* `str_order()` and `str_sort()` sort and order strings in a specified locale.
+ `str_conv()` converts strings from a specified encoding to UTF-8.
+
+ ```{r}
+ # The vowels come before the consonants in Hawaiian
+ str_sort(letters[1:10], locale = "haw")
+ ```
+
+* New modifier `boundary()` allows you to count, locate and split by
+ character, word, line and sentence boundaries.
+
+ ```{r}
+ words <- c("These are some words. Some more words.")
+ str_count(words, boundary("word"))
+ str_split(words, boundary("word"))
+ ```
+
+There were two minor changes to make stringr a little more consistent:
+
+* `str_c()` now returns a zero length vector if any of its inputs are
+ zero length vectors. This is consistent with all other functions, and
+ standard R recycling rules. Similarly, using `str_c("x", NA)` now
+ yields `NA`. If you want `"xNA"`, use `str_replace_na()` on the inputs.
+
+* `str_match()` now returns NA if an optional group doesn't match
+ (previously it returned ""). This is more consistent with `str_extract()`
+ and other match failures.
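+
+A minimal sketch of the two consistency changes described in the bullets above, using small made-up inputs:
+
+```{r}
+library(stringr)
+
+# Missing values now propagate through str_c()
+str_c("x", NA)
+
+# Convert NA to the string "NA" first if you want the old behaviour
+str_c("x", str_replace_na(NA))
+
+# An optional group that doesn't take part in the match is reported as NA
+str_match(c("ab", "a"), "(a)(b)?")
+```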
diff --git a/vignettes/releases/stringr-1.1.0.Rmd b/vignettes/releases/stringr-1.1.0.Rmd
new file mode 100644
index 0000000..9adf58c
--- /dev/null
+++ b/vignettes/releases/stringr-1.1.0.Rmd
@@ -0,0 +1,33 @@
+---
+title: "stringr 1.1.0"
+date: "2016-08-24"
+---
+
+```{r, echo = FALSE}
+knitr::opts_chunk$set(comment = "#>", collapse = T)
+```
+
+This release is mostly bug fixes, but there are a couple of new features you might care about.
+
+* There are three new datasets, `fruit`, `words` and `sentences`, to
+ help you practice your regular expression skills:
+
+ ```{r}
+ str_subset(fruit, "(..)\\1")
+ head(words)
+ sentences[1]
+ ```
+
+* More functions work with `boundary()`: `str_detect()` and `str_subset()`
+ can detect boundaries, and `str_extract()` and `str_extract_all()` pull out
+ the components between boundaries. This is particularly useful if
+ you want to extract logical constructs like words or sentences.
+
+ ```{r}
+ x <- "This is harder than you might expect, e.g. punctuation!"
+ x %>% str_extract_all(boundary("word")) %>% .[[1]]
+ x %>% str_extract(boundary("sentence"))
+ ```
+
+* `str_view()` and `str_view_all()` create HTML widgets that display regular
+ expression matches. This is particularly useful for teaching.
diff --git a/vignettes/stringr.Rmd b/vignettes/stringr.Rmd
index e20db2e..b8b7927 100644
--- a/vignettes/stringr.Rmd
+++ b/vignettes/stringr.Rmd
@@ -1,78 +1,159 @@
---
title: "Introduction to stringr"
-date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to stringr}
%\VignetteEngine{knitr::rmarkdown}
- \usepackage[utf8]{inputenc}
+ %\VignetteEncoding{UTF-8}
---
-```{r, echo=FALSE}
-library("stringr")
-knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
+```{r, include = FALSE}
+library(stringr)
+knitr::opts_chunk$set(
+ comment = "#>",
+ collapse = TRUE
+)
```
-Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The __stringr__ package aim [...]
+There are four main families of functions in stringr:
+
+1. Character manipulation: these functions allow you to manipulate the
+ individual characters inside the strings inside character vectors.
+
+1. Whitespace tools to add, remove, and manipulate whitespace.
+
+1. Locale sensitive operations whose behaviour will vary from locale
+ to locale.
+
+1. Pattern matching functions. These recognise four engines of
+ pattern description. The most common is regular expressions, but there
+ are three other tools.
-More concretely, stringr:
+## Getting and setting individual characters
-- Simplifies string operations by eliminating options that you don't need
- 95% of the time (the other 5% of the time you can functions from base R or
- [stringi](https://github.com/Rexamine/stringi/)).
+You can get the length of the string with `str_length()`:
-- Uses consistent function names and arguments.
+```{r}
+str_length("abc")
+```
-- Produces outputs than can easily be used as inputs. This includes ensuring
- that missing inputs result in missing outputs, and zero length inputs result
- in zero length outputs. It also processes factors and character vectors in
- the same way.
+This is now equivalent to the base R function `nchar()`. Previously it was needed to work around issues with `nchar()` such as the fact that it returned 2 for `nchar(NA)`. This has been fixed as of R 3.3.0, so it is no longer so important.
-- Completes R's string handling functions with useful functions from other
- programming languages.
+You can access individual characters using `str_sub()`. It takes three arguments: a character vector, a starting position and an end position. Either position can be a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.
-To meet these goals, stringr provides two basic families of functions:
+```{r}
+x <- c("abcdef", "ghifjk")
-- basic string operations, and
+# The 3rd letter
+str_sub(x, 3, 3)
-- pattern matching functions which use regular expressions to detect, locate,
- match, replace, extract, and split strings.
+# The 2nd to 2nd-to-last character
+str_sub(x, 2, -2)
-As of version 1.0, stringr is a thin wrapper around [stringi](https://github.com/Rexamine/stringi/), which implements all the functions in stringr with efficient C code based on the [ICU library](http://site.icu-project.org). Compared to stringi, stringr is considerably simpler: it provides fewer options and fewer functions. This is great when you're getting started learning string functions, and if you do need more of stringi's power, you should find the interface similar.
+```
-These are described in more detail in the following sections.
+You can also use `str_sub()` to modify strings:
-## Basic string operations
+```{r}
+str_sub(x, 3, 3) <- "X"
+x
+```
-There are three string functions that are closely related to their base R equivalents, but with a few enhancements:
+To duplicate individual strings, you can use `str_dup()`:
-- `str_c()` is equivalent to `paste()`, but it uses the empty string ("") as
- the default separator and silently removes `NULL` inputs.
+```{r}
+str_dup(x, c(2, 3))
+```
-- `str_length()` is equivalent to `nchar()`, but it preserves NA's (rather than
- giving them length 2) and converts factors to characters (not integers).
+## Whitespace
-- `str_sub()` is equivalent to `substr()` but it returns a zero length vector
- if any of its inputs are zero length, and otherwise expands each argument to
- match the longest. It also accepts negative positions, which are calculated
- from the left of the last character. The end position defaults to `-1`,
- which corresponds to the last character.
+Three functions add, remove, or modify whitespace:
-- `str_sub<-` is equivalent to `substr<-`, but like `str_sub` it understands
- negative indices, and replacement strings not do need to be the same length
- as the string they are replacing.
+1. `str_pad()` pads a string to a fixed length by adding extra whitespace on
+ the left, right, or both sides.
+
+ ```{r}
+ x <- c("abc", "defghi")
+ str_pad(x, 10)
+ str_pad(x, 10, "both")
+ ```
+
+ (You can pad with other characters by using the `pad` argument.)
+
+ `str_pad()` will never make a string shorter:
+
+ ```{r}
+ str_pad(x, 4)
+ ```
+
+ So if you want to ensure that all strings are the same length (often useful
+ for print methods), combine `str_pad()` and `str_trunc()`:
+
+ ```{r}
+ x <- c("Short", "This is a long string")
+
+ x %>%
+ str_trunc(10) %>%
+ str_pad(10, "right")
+ ```
-Three functions add new functionality:
+1. The opposite of `str_pad()` is `str_trim()`, which removes leading and
+ trailing whitespace:
+
+ ```{r}
+ x <- c(" a ", "b ", " c")
+ str_trim(x)
+ str_trim(x, "left")
+ ```
-- `str_dup()` to duplicate the characters within a string.
+1. You can use `str_wrap()` to modify existing whitespace in order to wrap
+ a paragraph of text so that the length of each line is as similar as
+ possible.
+
+ ```{r}
+ jabberwocky <- str_c(
+ "`Twas brillig, and the slithy toves ",
+ "did gyre and gimble in the wabe: ",
+ "All mimsy were the borogoves, ",
+ "and the mome raths outgrabe. "
+ )
+ cat(str_wrap(jabberwocky, width = 40))
+ ```
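The padding and truncation behaviour described above can be sketched as follows. The `ids` and `labels` vectors are made-up inputs for illustration; the calls themselves use only the `pad` and `side` arguments of `str_pad()` and the width argument of `str_trunc()`.

```r
library(stringr)

# Hypothetical identifiers of varying length
ids <- c("7", "42", "123")

# Left-pad with zeros instead of spaces via the `pad` argument
padded <- str_pad(ids, width = 5, side = "left", pad = "0")
padded

# Truncate first, then pad, so every label is exactly 10 characters wide
labels <- c("Short", "This is a long string")
fixed_width <- str_pad(str_trunc(labels, 10), 10, "right")
str_length(fixed_width)
```

Truncating before padding matters: `str_pad()` alone would leave the long string untouched.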
-- `str_trim()` to remove leading and trailing whitespace.
+## Locale sensitive
-- `str_pad()` to pad a string with extra whitespace on the left, right, or both sides.
+A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These are the case transformation functions:
+
+```{r}
+x <- "I like horses."
+str_to_upper(x)
+str_to_title(x)
+
+str_to_lower(x)
+# Turkish has two sorts of i: with and without the dot
+str_to_lower(x, "tr")
+```
+
+And string ordering and sorting:
+
+```{r}
+x <- c("y", "i", "k")
+str_order(x)
+
+str_sort(x)
+# In Lithuanian, y comes between i and k
+str_sort(x, locale = "lt")
+```
+
+The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two letter ISO-639-1 language code (like "en" for English or "zh" for Chinese), and optionally an ISO-3166 country code (like "en_GB" vs "en_US"). You can see a complete list of available locales by running `stringi::stri_locale_list()`.
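A minimal sketch of the effect of the `locale` argument, using the Turkish dotted and dotless "i" mentioned earlier. The variable names are illustrative, and the list of available locales depends on the ICU build on your system, so no output is shown for it.

```r
library(stringr)

# Default (English) rules: "i" upper-cases to "I"
upper_en <- str_to_upper("i")

# Turkish rules: "i" upper-cases to the dotted capital I (U+0130)
upper_tr <- str_to_upper("i", locale = "tr")

c(upper_en, upper_tr)

# Locale identifiers known to the installed ICU; contents vary by system
head(stringi::stri_locale_list())
```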
## Pattern matching
-stringr provides pattern matching functions to **detect**, **locate**, **extract**, **match**, **replace**, and **split** strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
+The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.
+
+### Tasks
+
+Each pattern matching function has the same first two arguments, a character vector of `string`s to process and a single `pattern` to match. stringr provides pattern matching functions to **detect**, **locate**, **extract**, **match**, **replace**, and **split** strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
```{r}
strings <- c(
@@ -95,6 +176,13 @@ phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_subset(strings, phone)
```
+- `str_count()` counts the number of matches:
+
+ ```{r}
+ # How many phone numbers in each string?
+ str_count(strings, phone)
+ ```
+
- `str_locate()` locates the first position of a pattern and returns a numeric
matrix with columns start and end. `str_locate_all()` locates all matches,
returning a list of numeric matrices. Similar to `regexpr()` and `gregexpr()`.
@@ -140,62 +228,73 @@ phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
- `str_split_fixed()` splits the string into a fixed number of pieces based
on a pattern and returns a character matrix. `str_split()` splits a string
into a variable number of pieces and returns a list of character vectors.
+
+ ```{r}
+ str_split("a-b-c", "-")
+ str_split_fixed("a-b-c", "-", n = 2)
+ ```
-### Arguments
-
-Each pattern matching function has the same first two arguments, a character vector of `string`s to process and a single `pattern` (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.
+### Engines
-Unlike base string functions, stringr offers control over matching not through arguments, but through modifier functions, `regex()`, `coll()` and `fixed()`. This is a deliberate choice made to simplify these functions. For example, while `grepl` has six arguments, `str_detect()` only has two.
+There are four main engines that stringr can use to describe patterns:
-### Regular expressions
+* Regular expressions, the default, as shown above, and described in
+ `vignette("regular-expressions")`.
+
+* Fixed bytewise matching, with `fixed()`.
-To be able to use these functions effectively, you'll need a good knowledge of regular expressions, which this vignette is not going to teach you. Some useful tools to get you started:
+* Locale-sensitive character matching, with `coll()`.
-- A good [reference sheet](http://www.regular-expressions.info/reference.html).
+* Text boundary analysis with `boundary()`.
-- A tool that allows you to [interactively test](http://gskinner.com/RegExr/)
- what a regular expression will match.
+#### Fixed matches
-- A tool to [build a regular expression](http://www.txt2re.com) from an
- input string.
+`fixed(x)` only matches the exact sequence of bytes specified by `x`. This is a very limited "pattern", but the restriction can make matching much faster. Beware using `fixed()` with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent:
-When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn't match) test cases to ensure that you are matching the correct components.
+```{r}
+a1 <- "\u00e1"
+a2 <- "a\u0301"
+c(a1, a2)
+a1 == a2
+```
-### Functions that return lists
+They render identically, but because they're defined differently,
+`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
+next, to respect human character comparison rules:
-Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use `Map()` to iterate through the vectors simultaneously. The second strategy is illustrated below:
+```{r}
+str_detect(a1, fixed(a2))
+str_detect(a1, coll(a2))
+```
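If you still want the speed of `fixed()`, one possible workaround (not part of the text above, and offered only as a sketch) is to normalise both strings to the same Unicode form before matching. This assumes stringi's `stri_trans_nfc()` is acceptable for your data; the `_nfc` variable names are illustrative.

```r
library(stringr)

a1 <- "\u00e1"   # precomposed "a" with acute accent
a2 <- "a\u0301"  # "a" followed by a combining acute accent

# Normalise both strings to NFC so they share one byte representation
a1_nfc <- stringi::stri_trans_nfc(a1)
a2_nfc <- stringi::stri_trans_nfc(a2)

# After normalisation the bytewise comparison can succeed
str_detect(a1_nfc, fixed(a2_nfc))
```

Normalisation only addresses equivalent encodings of the same character; it does not apply locale-specific comparison rules, which remain the job of `coll()`.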
+
+#### Collation search
+
+`coll(x)` looks for a match to `x` using human-language **coll**ation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you'll also need to supply a `locale` parameter.
```{r}
-col2hex <- function(col) {
- rgb <- col2rgb(col)
- rgb(rgb["red", ], rgb["green", ], rgb["blue", ], max = 255)
-}
-
-# Goal replace colour names in a string with their hex equivalent
-strings <- c("Roses are red, violets are blue", "My favourite colour is green")
-
-colours <- str_c("\\b", colors(), "\\b", collapse="|")
-# This gets us the colours, but we have no way of replacing them
-str_extract_all(strings, colours)
-
-# Instead, let's work with locations
-locs <- str_locate_all(strings, colours)
-Map(function(string, loc) {
- hex <- col2hex(str_sub(string, loc))
- str_sub(string, loc) <- hex
- string
-}, strings, locs)
+i <- c("I", "İ", "i", "ı")
+i
+
+str_subset(i, coll("i", ignore_case = TRUE))
+str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```
-Another approach is to use the second form of `str_replace_all()`: if you give it a named vector, it applies each `pattern = replacement` in turn:
+The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`. Note that while both `fixed()` and `regex()` have `ignore_case` arguments, they perform a much simpler comparison than `coll()`.
-```{r}
-matches <- col2hex(colors())
-names(matches) <- str_c("\\b", colors(), "\\b")
+#### Boundary
+
+`boundary()` matches boundaries between characters, lines, sentences or words. It's most useful with `str_split()`, but can be used with all pattern matching functions.
-str_replace_all(strings, matches)
+```{r}
+x <- "This is a sentence."
+str_split(x, boundary("word"))
+str_count(x, boundary("word"))
+str_extract_all(x, boundary("word"))
```
-## Conclusion
+By convention, `""` is treated as `boundary("character")`:
-stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug free.
+```{r}
+str_split(x, "")
+str_count(x, "")
+```
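A short sketch of why character boundaries differ from a raw length count: a single visible character built from a base letter plus a combining accent spans two code points but only one character boundary. The variable names are illustrative.

```r
library(stringr)

accented <- "a\u0301"  # one visible character built from two code points

str_length(accented)     # counts code points
str_count(accented, "")  # counts character boundaries

# Word boundaries skip the spaces and punctuation between words
words <- "This is a sentence."
str_count(words, boundary("word"))
```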
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-stringr.git