Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

John Wright jsw at debian.org
Sat Jul 25 17:38:57 UTC 2009


On Sat, Jul 25, 2009 at 05:02:16PM +0200, John Wright wrote:
> On Sat, Jul 25, 2009 at 02:05:26PM +0200, sean finney wrote:
> > severity 538376 normal
> > thanks
> > 
> > okay i take back what i said about this being a regression, it seems that
> > in previous versions (< 0.1.10) it was treated as a plain text field (i.e. 
> > no dictionary at all), which meant that the fields were probably ignored
> > entirely in the patch-tracker but are now partially showing up.
> > 
> > still a bug though afaict :)
> 
> Gah!  This is apt_pkg's fault: it strips off the leading '\n'.  This
> means our goal of having the output match the input will in general be
> broken when you use apt_pkg.  (Right now, the only place that's used by
> default is when you use iter_paragraphs.)
> 
> A temporary workaround is to pass use_apt_pkg=False to the
> iter_paragraphs method.  On large files, you'll probably notice a bit of
> a performance hit.  I'll see what we can do to preserve this information
> with apt_pkg.

I'm actually inclined to turn off using apt_pkg by default.  It's
definitely faster, by a factor of roughly 1.7 to 2.6 in the timings
below, but we keep running into weird corner cases with the way apt_pkg
parses things.
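
To make the corner case from the quoted message concrete, here is a toy
model of it.  This is only a sketch: parse_paragraph and dump_paragraph
are hypothetical helpers written for this illustration, not the code in
deb822 or apt_pkg, and the strip_leading_newline flag merely imitates the
stripping behaviour described above.

```python
def parse_paragraph(text, strip_leading_newline=False):
    """Parse one deb822-style paragraph into a dict of raw field values.

    Toy parser for illustration only (hypothetical, not python-debian's).
    """
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and key is not None:
            # Continuation line: keep it verbatim, newline included.
            fields[key] += "\n" + line
        else:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    if strip_leading_newline:
        # Imitate a parser that drops the leading '\n' (and the
        # whitespace after it) from multiline values.
        fields = dict((k, v.lstrip()) for k, v in fields.items())
    return fields


def dump_paragraph(fields):
    """Write the fields back out in the same key/value layout."""
    lines = []
    for key, value in fields.items():
        if value.startswith("\n") or not value:
            lines.append(key + ":" + value)
        else:
            lines.append(key + ": " + value)
    return "\n".join(lines) + "\n"


SAMPLE = "Package: foo\nFiles:\n abc 123 foo.dsc\n def 456 foo.tar.gz\n"

faithful = dump_paragraph(parse_paragraph(SAMPLE))
stripped = dump_paragraph(parse_paragraph(SAMPLE, strip_leading_newline=True))
print(faithful == SAMPLE)   # round trip preserved
print(stripped == SAMPLE)   # round trip broken
```

Once the leading newline is gone, the dumper can no longer tell that the
first entry of Files belonged on its own line, so the output differs
from the input.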

Using sid's Sources and amd64 Packages files, calling the respective
class's iter_paragraphs method with the specified kwargs and throwing
away the results, like

    for d in cls.iter_paragraphs(f, **kwargs): pass

I get the following run times:


Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': True}
0: 0:00:08.664978
1: 0:00:07.747378
2: 0:00:07.743156
3: 0:00:07.961919
4: 0:00:07.758220
Average: 0:00:07.975130

Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': False}
0: 0:00:18.505047
1: 0:00:18.179216
2: 0:00:18.179558
3: 0:00:18.415705
4: 0:00:18.182857
Average: 0:00:18.292476

Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': True}
0: 0:00:07.865666
1: 0:00:07.864537
2: 0:00:07.861713
3: 0:00:07.873949
4: 0:00:07.858093
Average: 0:00:07.864791

Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': False}
0: 0:00:13.710405
1: 0:00:13.262080
2: 0:00:13.260217
3: 0:00:13.245185
4: 0:00:13.251963
Average: 0:00:13.345970

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True}
0: 0:00:06.283796
1: 0:00:06.414739
2: 0:00:06.323466
3: 0:00:06.320447
4: 0:00:06.264290
Average: 0:00:06.321347

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': False}
0: 0:00:16.653596
1: 0:00:16.637927
2: 0:00:16.805496
3: 0:00:16.631162
4: 0:00:16.614459
Average: 0:00:16.668528

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True, 'shared_storage': True}
0: 0:00:02.513985
1: 0:00:02.516300
2: 0:00:02.521496
3: 0:00:02.514919
4: 0:00:02.516548
Average: 0:00:02.516649
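
For anyone who wants to reproduce numbers like these, a harness along
the following lines should do.  It is a sketch, not the script I
actually ran: time_runs, report and consume are names made up for this
example, and the workload in the __main__ block is a stand-in so the
snippet runs without any Debian files at hand.

```python
import datetime


def consume(paragraphs):
    """Iterate over everything and throw the results away."""
    for _ in paragraphs:
        pass


def time_runs(run_once, runs=5):
    """Call run_once() `runs` times; return the elapsed timedeltas."""
    elapsed = []
    for _ in range(runs):
        start = datetime.datetime.now()
        run_once()
        elapsed.append(datetime.datetime.now() - start)
    return elapsed


def report(label, elapsed):
    """Format the per-run times and their average as text."""
    lines = [label]
    for i, delta in enumerate(elapsed):
        lines.append("%d: %s" % (i, delta))
    average = sum(elapsed, datetime.timedelta()) / len(elapsed)
    lines.append("Average: %s" % average)
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in workload; replace with a real iter_paragraphs call.
    elapsed = time_runs(lambda: consume(iter(range(100000))))
    print(report("stand-in workload", elapsed))
```

For a real measurement, run_once would reopen the file on every run and
call something like consume(cls.iter_paragraphs(f, **kwargs)), so that
each run parses the whole file from the start.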


Clearly, using shared storage (which basically just means using
apt_pkg's parser.Section directly) is blazingly fast compared to going
without it.  But this is already not the default, since it has the
confusing side effect of making the objects returned by each iteration
(each of which has a different id) actually share the same data.
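
The side effect is easy to trip over, so here is a toy model of it.
Again this is only a sketch under my own assumptions: SharedView,
iter_shared and iter_copied are hypothetical names, and a plain dict
stands in for the parser's section buffer.

```python
class SharedView(object):
    """A distinct object per iteration that reads from one shared buffer."""

    def __init__(self, storage):
        self._storage = storage

    def __getitem__(self, key):
        return self._storage[key]


def iter_shared(records):
    """Yield a new wrapper each time, all backed by the same storage."""
    storage = {}
    for record in records:
        storage.clear()
        storage.update(record)
        yield SharedView(storage)


def iter_copied(records):
    """Yield an independent copy of the data for each record."""
    for record in records:
        yield dict(record)


RECORDS = [{"Package": "foo"}, {"Package": "bar"}]

shared = list(iter_shared(RECORDS))
copied = list(iter_copied(RECORDS))

print(shared[0] is shared[1])   # False: different objects
print(shared[0]["Package"])     # bar: both see the last record
print(copied[0]["Package"])     # foo: copies keep their own data
```

Anyone who collects the paragraphs into a list before using them gets
the last paragraph repeated, which is why copying is the safer default
even though it is slower.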

Is anybody strongly opposed to making iter_paragraphs not use apt_pkg by
default?  I'm still trying to figure out a way to salvage the output
from apt_pkg in this case, but I'm not having much luck.

-- 
John Wright <jsw at debian.org>




