Bug#743174: #743174: paragraph parsing truncates on comments

Stuart Prescott stuart at debian.org
Sun Jun 8 03:19:23 UTC 2014


[As a data point towards fixing this in the future]

In #750247, David Kalnischkies pointed out that the use of python-apt's 
TagFile (which is a wrapper around libapt-pkg) would behave badly on comments. 
iter_paragraphs() uses python-apt if it can to accelerate processing of deb822 
files.

Compare iter_paragraphs() with python-apt's TagFile enabled and then disabled:

$ echo -e 'Build-Depends: foo,\n#comment\n bar' | python -c 'import sys, debian.deb822; l = [p for p in debian.deb822.Deb822.iter_paragraphs(sys.stdin, use_apt_pkg=True)]; print(l)'
[{'Build-Depends': u'foo,'}]

$ echo -e 'Build-Depends: foo,\n#comment\n bar' | python -c 'import sys, debian.deb822; l = [p for p in debian.deb822.Deb822.iter_paragraphs(sys.stdin, use_apt_pkg=False)]; print(l)'
[{'Build-Depends': u'foo,\n bar'}]
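
To make the tolerant behaviour concrete, here is a minimal pure-Python sketch of a parser that skips comment lines without ending the current field. It is only an illustration: iter_paragraphs_lenient is a made-up name, and this is not the code python-debian actually uses.

```python
def iter_paragraphs_lenient(lines):
    """Yield one dict per paragraph, skipping '#' comment lines.

    Hypothetical helper for illustration; not python-debian's parser.
    """
    paragraph = {}
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # Comments are dropped and do not terminate the current field.
            continue
        if not line.strip():
            # Blank or whitespace-only lines separate paragraphs.
            if paragraph:
                yield paragraph
            paragraph, last_key = {}, None
            continue
        if line[0] in " \t" and last_key is not None:
            # Continuation line: append it to the previous field's value.
            paragraph[last_key] += "\n" + line
            continue
        key, _, value = line.partition(":")
        last_key = key.strip()
        paragraph[last_key] = value.strip()
    if paragraph:
        yield paragraph


data = "Build-Depends: foo,\n#comment\n bar\n"
print(list(iter_paragraphs_lenient(data.splitlines(True))))
# prints [{'Build-Depends': 'foo,\n bar'}]
```

The continuation line survives because the comment is discarded before the parser decides whether the field has ended.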

When parsing a Packages or Sources file, it is probably acceptable to use the 
strict parser in TagFile by default because the generators of those files are 
strict in what they produce. For other random sources of deb822 files which are 
not necessarily strictly produced or use a syntax that is an extension to the 
Packages and Sources file format, defaulting to the use of TagFile is probably 
a bad thing. We've got quite a few bugs related to debian/control files -- they 
are particularly problematic because they are human-generated and the build 
tools are forgiving in what they accept. Using a strict parser like TagFile to 
consume these files is probably the wrong thing to do; whatever performance 
boost there may be on a short file with only a few paragraphs isn't worth the 
pain this is causing.

I'm starting to wonder if we should invert the current API logic:

* iter_paragraphs should *not* try to use TagFile unless the caller explicitly 
requests it and thereby makes the promise that there will be no comments in 
the data and that paragraphs will be separated by blank lines rather than 
other whitespace etc.

* Packages and Sources files could continue to opportunistically use TagFile 
since they should be OK with that restriction anyway and the performance boost 
of using TagFile is actually worth it for them.
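
A rough sketch of how the inverted defaults in the two points above might look. The class names, the USE_APT_PKG_DEFAULT attribute and choose_parser() are all hypothetical and exist only to show the shape of the idea; this is not python-debian's current API.

```python
class Deb822Sketch:
    # Proposed default: do not use TagFile unless the caller asks for it.
    USE_APT_PKG_DEFAULT = False

    @classmethod
    def choose_parser(cls, use_apt_pkg=None, apt_available=True):
        """Return which parser would be used: 'tagfile' or 'python'."""
        if use_apt_pkg is None:
            use_apt_pkg = cls.USE_APT_PKG_DEFAULT
        # "Opportunistic": TagFile is only used if python-apt is importable.
        if use_apt_pkg and apt_available:
            return "tagfile"
        return "python"


class PackagesSketch(Deb822Sketch):
    # Machine-generated files are strict enough for TagFile by default.
    USE_APT_PKG_DEFAULT = True


print(Deb822Sketch.choose_parser())                     # prints python
print(Deb822Sketch.choose_parser(use_apt_pkg=True))     # prints tagfile
print(PackagesSketch.choose_parser())                   # prints tagfile
print(PackagesSketch.choose_parser(apt_available=False))  # prints python
```

Callers that pass use_apt_pkg=True would be the ones promising comment-free, blank-line-separated input.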

We'd be able to solve at least #750247 and #743174 by doing this.

Comments welcome!

cheers
Stuart

-- 
Stuart Prescott    http://www.nanonanonano.net/   stuart at nanonanonano.net
Debian Developer   http://www.debian.org/         stuart at debian.org
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7


