Bug#743174: #743174: paragraph parsing truncates on comments
Stuart Prescott
stuart at debian.org
Sun Jun 8 03:19:23 UTC 2014
[As a data point towards fixing this in the future]
In #750247, David Kalnischkies pointed out that the use of python-apt's
TagFile (which is a wrapper around libapt-pkg) would behave badly on comments.
iter_paragraphs() uses python-apt if it can to accelerate processing of deb822
files.
Comparing iter_paragraphs() with python-apt's TagFile (use_apt_pkg=True) and
without it (use_apt_pkg=False):
$ echo -e 'Build-Depends: foo,\n#comment\n bar' | python -c 'import sys, debian.deb822; l = [p for p in debian.deb822.Deb822.iter_paragraphs(sys.stdin, use_apt_pkg=True)]; print(l)'
[{'Build-Depends': u'foo,'}]
$ echo -e 'Build-Depends: foo,\n#comment\n bar' | python -c 'import sys, debian.deb822; l = [p for p in debian.deb822.Deb822.iter_paragraphs(sys.stdin, use_apt_pkg=False)]; print(l)'
[{'Build-Depends': u'foo,\n bar'}]
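For illustration, here is a minimal pure-Python sketch of the tolerant behaviour seen in the second run: comment lines are skipped without ending the field being built, so the continuation line still attaches to Build-Depends. This is a toy, not python-debian's actual implementation; parse_paragraph is a hypothetical name.

```python
def parse_paragraph(lines):
    """Parse one deb822-style paragraph, skipping '#' comment lines.

    Toy illustration only; not python-debian's implementation.
    """
    fields = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # Comment line: ignore it without ending the current field.
            continue
        if not line.strip():
            # Blank (or whitespace-only) line ends the paragraph.
            break
        if line[0] in " \t" and current is not None:
            # Continuation line: append it to the field being built.
            fields[current] += "\n" + line
        else:
            # New "Name: value" field.
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


text = "Build-Depends: foo,\n#comment\n bar\n"
result = parse_paragraph(text.splitlines())
print(result)
```

With the same input as the shell examples, this keeps both 'foo,' and ' bar' in the Build-Depends value.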
When parsing a Packages or Sources file, it is probably acceptable to use the
strict parser in TagFile by default because the generators of those files are
strict in what they produce. For other sources of deb822 files, which are not
necessarily strictly produced or which use a syntax that extends the
Packages and Sources file format, defaulting to TagFile is probably
a bad thing. We've got quite a few bugs related to debian/control files -- they
are particularly problematic because they are human-generated and the build
tools are forgiving in what they accept. Using a strict parser like TagFile to
consume these files is probably the wrong thing to do; whatever performance
boost there may be on a short file with only a few paragraphs isn't worth the
pain this is causing.
I'm starting to wonder if we should invert the current API logic:
* iter_paragraphs should *not* try to use TagFile unless the caller explicitly
requests it and thereby makes the promise that there will be no comments in
the data and that paragraphs will be separated by blank lines rather than
other whitespace etc.
* Packages and Sources files could continue to use TagFile opportunistically
since they should be OK with that restriction anyway and the performance boost
of using TagFile is actually worth it for them.
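
A rough sketch of what the inverted logic might look like, assuming a tolerant pure-Python path is the default and the strict fast path is taken only on explicit request. The names iter_paragraphs_sketch, _tolerant_paragraphs and fast_parser are hypothetical stand-ins, not the existing python-debian API; fast_parser represents whatever TagFile-backed implementation is available.

```python
def _tolerant_paragraphs(lines):
    """Pure-Python path: skip '#' comments, split on blank or whitespace-only lines."""
    para = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # Comments never end a paragraph or a field.
            continue
        if line.strip():
            para.append(line)
        elif para:
            # Separator line: emit the paragraph collected so far.
            yield para
            para = []
    if para:
        yield para


def iter_paragraphs_sketch(lines, use_apt_pkg=False, fast_parser=None):
    """Hypothetical inverted default: the strict fast path is opt-in."""
    if use_apt_pkg and fast_parser is not None:
        # The caller promises: no comments, paragraphs separated by blank lines.
        return fast_parser(lines)
    return _tolerant_paragraphs(lines)


control = ["Source: demo", "#comment", "Build-Depends: foo,", " bar", " ", "Package: demo"]
paras = list(iter_paragraphs_sketch(control))
print(paras)
```

Callers reading Packages or Sources files would pass use_apt_pkg=True themselves; everything else, debian/control in particular, would get the tolerant path without having to ask for it.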
We'd be able to solve at least #750247 and #743174 by doing this.
Comments welcome!
cheers
Stuart
--
Stuart Prescott http://www.nanonanonano.net/ stuart at nanonanonano.net
Debian Developer http://www.debian.org/ stuart at debian.org
GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7