[Pkg-haskell-maintainers] Bug#702617: . fails to match certain characters

Sat Mar 9 08:50:11 UTC 2013

Hi Joey,

On Friday, 08.03.2013, at 21:43 -0400, Joey Hess wrote:
> Package: libghc-regex-compat-dev
> Version: 0.95.1-2+b1
> Severity: normal
> 
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") "o"
> Just []
> Prelude Text.Regex> let s = "ò"
> Prelude Text.Regex> s
> "\242"
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") s
> Nothing
> Prelude Text.Regex> matchRegex (mkRegex $ ".") s
> Nothing
> 
> I mentioned this to upstream and he said:
> 
> | That looks like it is pushing the unicode text to your system C library for
> | matching.  This translation is probably making a multibyte C-string and then
> | running a non-Unicode aware C-library call.
> | 
> | You will need to check your setup.
> | 
> | It is true there are some bugs in this, but they are in the translating of
> | indices, which does not apply here.

regex-compat is just a thin layer around regex-posix, whose documentation states:
        Note that the posix library works with single byte characters,
        and does not understand Unicode. If you need Unicode support you
        will have to use a different backend.¹
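
To illustrate the "multibyte C-string" that upstream mentions, here is a
minimal sketch using only base. It is not taken from regex-posix, and it
does not claim that regex-posix encodes strings exactly this way; the
helper name utf8Bytes is made up for this example. It only shows that "ò"
is a single Char in Haskell but two bytes once UTF-8 encoded:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: UTF-8 encode a single character by hand.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [ 0xC0 .|. fromIntegral (n `shiftR` 6)
                  , cont n ]
  | n < 0x10000 = [ 0xE0 .|. fromIntegral (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  | otherwise   = [ 0xF0 .|. fromIntegral (n `shiftR` 18)
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont :: Int -> Word8
    cont x = 0x80 .|. fromIntegral (x .&. 0x3F)

main :: IO ()
main = do
  print (utf8Bytes 'o')     -- [111]
  print (utf8Bytes '\242')  -- [195,178]
```

A matcher that works byte by byte sees two units for "\242" where Haskell
sees one character.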

The regex-posix documentation also suggests alternative regex libraries:
        Benchmarking shows the default regex library on many platforms
        is very inefficient. You might increase performance by an order
        of magnitude by obtaining libpcre and regex-pcre or libtre and
        regex-tre. If you do not need the captured substrings then you
        can also get great performance from regex-dfa. If you do need
        the capture substrings then you may be able to use regex-parsec
        to improve performance. 

For arbtt, where speed is crucial, I made sure that all my strings are
UTF-8-encoded ByteStrings and got good results with pcre-light in
utf8 mode, but this required some manual plumbing.
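
A rough sketch of what such plumbing can look like follows. This is not
the actual arbtt code; it assumes the pcre-light, text and bytestring
packages are installed, and toUtf8 and matchesOneChar are hypothetical
helper names:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Maybe (isJust)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Text.Regex.PCRE.Light (compile, match, utf8)

-- Hypothetical helper: encode a Haskell String as a UTF-8 ByteString.
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- Hypothetical helper: does the subject match ^.$ when PCRE is compiled
-- with the utf8 option, so that "." consumes a whole code point?
matchesOneChar :: String -> Bool
matchesOneChar s = isJust (match regex (toUtf8 s) [])
  where
    regex = compile (BC.pack "^.$") [utf8]

main :: IO ()
main = do
  print (matchesOneChar "o")
  print (matchesOneChar "\242")
```

The manual part is that every subject string has to be encoded before
matching, and any captured substrings come back as ByteStrings that need
decoding again.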

It seems that only regex-tdfa supports Unicode natively:
        Depending on the text being searched this package supports
        Unicode. The [Char] and (Seq Char) text types support Unicode.
        The ByteString and ByteString.Lazy text types only support
        ASCII. It is possible to support utf8 encoded ByteString.Lazy by
        using regex-tdfa and regex-tdfa-utf8 packages together (required
        the utf8-string package). ²
I don’t know how its speed compares to the others, but it is likely
faster than regex-posix.
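
For comparison, the two patterns from the report can be tried against
regex-tdfa on plain Strings. This is a small sketch assuming the
regex-tdfa package is installed; matchesWholeLine and matchesAnyChar are
names invented here:

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: the first pattern from the bug report.
matchesWholeLine :: String -> Bool
matchesWholeLine s = s =~ "^.*$"

-- Hypothetical helper: the second pattern from the bug report.
matchesAnyChar :: String -> Bool
matchesAnyChar s = s =~ "."

main :: IO ()
main = do
  print (matchesWholeLine "o")
  print (matchesWholeLine "\242")
  print (matchesAnyChar "\242")
```

Since the subject is a [Char], no encoding step is involved, which is the
case the regex-tdfa documentation describes as supporting Unicode.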

If you agree with this analysis, please close the bug.

Greetings,
Joachim

¹ http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/Text-Regex-Posix.html
² http://hackage.haskell.org/packages/archive/regex-tdfa/1.1.8/doc/html/Text-Regex-TDFA.html

-- 
Joachim "nomeata" Breitner
Debian Developer
  nomeata at debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
  JID: nomeata at joachim-breitner.de | http://people.debian.org/~nomeata
