[Pkg-haskell-maintainers] Bug#702617: Bug#702617: . fails to match certian characters
Joachim Breitner
nomeata at debian.org
Mon Mar 11 14:52:32 UTC 2013
Hi,
On Saturday, 09.03.2013, at 11:52 -0400, Joey Hess wrote:
> Joachim Breitner wrote:
> > regex-compat is but a thin layer around regex-posix, which states
> > Note that the posix library works with single byte characters,
> > and does not understand Unicode. If you need Unicode support you
> > will have to use a different backend.¹
>
> Right. However, ò is not actually unicode, I think it's ISO8859-15.
>
> Also I'm not trying to do anything that requires knowledge of unicode.
> Even if the library sees [byte, byte], "^.*$" should still match all
> the bytes.
The library should actually see "\242\0", and gdb verifies that this is
in the CString. Nevertheless, I cannot reproduce this behavior in C:
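#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t r;
    regcomp(&r, ".", 0);                 /* compile the basic regex "." */
    const char *s = "\242";              /* octal escape in C: the single byte 0xA2 */
    int i = regexec(&r, s, 0, NULL, 0);  /* 0 on match, REG_NOMATCH otherwise */
    printf("%d\n", i);
    regfree(&r);
    return 0;
}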
prints 0, i.e. match succeeded.
But on the lowest layer above the FFI, the strange behaviour already
occurs:
Prelude Foreign.C.String Text.Regex.Posix.Wrap> cs <- newCAString "\242"
Prelude Foreign.C.String Text.Regex.Posix.Wrap> cp <- newCAString "."
Prelude Foreign.C.String Text.Regex.Posix.Wrap>
Prelude Foreign.C.String Text.Regex.Posix.Wrap> Right r2 <- wrapCompile 0 0 cp
Prelude Foreign.C.String Text.Regex.Posix.Wrap>
Prelude Foreign.C.String Text.Regex.Posix.Wrap> wrapTest r2 cs
Right False
The result is False for "\128" and True for "\127".
The code in question is here:
http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/src/Text-Regex-Posix-Wrap.html
Changing the regex to "^$" or "^.*$" does not make a difference, i.e.
the string is not just turned to the empty string. Clearly something is
broken here.
I can reproduce it from within ghc’s address space using gdb:
(gdb) call malloc(32)
$7 = 64943120
(gdb) call regcomp(64943120, ".", 0)
$8 = 0
(gdb) call regexec(64943120,"\242",0,0,0)
$9 = 1
(gdb) call regexec(64943120,"only_ascii",0,0,0)
$10 = 0
And even from gdb while debugging “sleep”. So the behaviour is already
there in regexec, but for some reason it is not triggered from C code,
but only via some variants of FFI (GHC’s or gdb’s).
I’ll leave it at that, as this is not really related to GHC or Haskell
any more.
Greetings,
Joachim
--
Joachim "nomeata" Breitner
Debian Developer
nomeata at debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nomeata at joachim-breitner.de | http://people.debian.org/~nomeata