[Pkg-haskell-maintainers] Bug#702617: Bug#702617: . fails to match certain characters
Joachim Breitner
nomeata at debian.org
Sat Mar 9 08:50:11 UTC 2013
Hi Joey,
On Friday, 08.03.2013, at 21:43 -0400, Joey Hess wrote:
> Package: libghc-regex-compat-dev
> Version: 0.95.1-2+b1
> Severity: normal
>
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") "o"
> Just []
> Prelude Text.Regex> let s = "ò"
> Prelude Text.Regex> s
> "\242"
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") s
> Nothing
> Prelude Text.Regex> matchRegex (mkRegex $ ".") s
> Nothing
>
> I mentioned this to upstream and he said:
>
> | That looks like it is pushing the unicode text to your system C library for
> | matching. This translation is probably making a multibyte C-string and then
> | running a non-Unicode aware C-library call.
> |
> | You will need to check your setup.
> |
> | It is true there are some bugs in this, but they are in the translating of
> | indices, which does not apply here.
regex-compat is but a thin layer around regex-posix, which states:
    Note that the posix library works with single byte characters,
    and does not understand Unicode. If you need Unicode support you
    will have to use a different backend.¹
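To illustrate upstream's remark about a multibyte C string, here is a minimal sketch that encodes single characters as UTF-8 using only base. It assumes the string is marshalled to C as UTF-8, which nothing in this report confirms, and encodeUtf8Char is a hypothetical helper name, not part of regex-posix or regex-compat.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
-- Assumption: the C library receives UTF-8; the real marshalling
-- depends on the locale and is not shown in this bug report.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: marker 10xxxxxx plus six payload bits.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

main :: IO ()
main = print (map encodeUtf8Char "o\242")
-- prints [[111],[195,178]]
```

Under that assumption, "o" is one byte while "\242" becomes two bytes, so a matcher that works on single-byte characters no longer sees one character per Char.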
It also makes suggestions for alternative regex libraries:
    Benchmarking shows the default regex library on many platforms
    is very inefficient. You might increase performance by an order
    of magnitude by obtaining libpcre and regex-pcre or libtre and
    regex-tre. If you do not need the captured substrings then you
    can also get great performance from regex-dfa. If you do need
    the capture substrings then you may be able to use regex-parsec
    to improve performance.
For arbtt, where speed is crucial, I made sure that all my strings are
UTF-8-encoded ByteStrings and got good results with pcre-light in
UTF-8 mode, but this required some manual plumbing.
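The kind of plumbing meant here could look roughly like the sketch below. It is not arbtt's actual code: matchesWith is a hypothetical helper, and the sketch assumes the pcre-light and text packages (plus a libpcre built with UTF-8 support) are installed. Both the pattern and the subject are encoded to UTF-8 ByteStrings before they reach PCRE.

```haskell
import Data.Maybe (isJust)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Text.Regex.PCRE.Light (PCREOption, compile, match, utf8)

-- Hypothetical helper: encode pattern and input as UTF-8 and ask
-- PCRE whether the pattern matches, using the given compile options.
matchesWith :: [PCREOption] -> String -> String -> Bool
matchesWith opts pat input =
  isJust (match regex (toBytes input) [])
  where
    toBytes = encodeUtf8 . T.pack
    regex   = compile (toBytes pat) opts

main :: IO ()
main = do
  -- With the utf8 option, "\242" is treated as a single character.
  print (matchesWith [utf8] "^.$" "\242")
  -- Without it, the same input is seen as two separate bytes.
  print (matchesWith [] "^.$" "\242")
```

The second call is there only for contrast: the pattern ^.$ asks for exactly one character, so the result depends on whether PCRE counts bytes or code points.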
It seems that only regex-tdfa supports Unicode natively:
    Depending on the text being searched this package supports
    Unicode. The [Char] and (Seq Char) text types support Unicode.
    The ByteString and ByteString.Lazy text types only support
    ASCII. It is possible to support utf8 encoded ByteString.Lazy by
    using regex-tdfa and regex-tdfa-utf8 packages together (required
    the utf8-string package).²
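As a rough sketch of what switching backends could look like for the strings from the bug report, the following uses the (=~) operator from Text.Regex.TDFA on plain String values. It assumes the regex-tdfa package is installed; matches is a hypothetical wrapper introduced only for this example.

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical wrapper: True when the pattern matches the input.
-- Both arguments are plain String ([Char]), the text type that the
-- regex-tdfa documentation quoted above describes as Unicode-capable.
matches :: String -> String -> Bool
matches pat input = input =~ pat

main :: IO ()
main = do
  print (matches "^.*$" "o")
  print (matches "^.*$" "\242")
  print (matches "." "\242")
```

The patterns and inputs are the ones from the original report, so the output can be compared directly against the regex-compat session quoted there.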
I don’t know how its speed compares to the others, but it is likely
faster than regex-posix.
If you agree with this analysis, please close the bug.
Greetings,
Joachim
¹ http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/Text-Regex-Posix.html
² http://hackage.haskell.org/packages/archive/regex-tdfa/1.1.8/doc/html/Text-Regex-TDFA.html
--
Joachim "nomeata" Breitner
Debian Developer
nomeata at debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nomeata at joachim-breitner.de | http://people.debian.org/~nomeata