Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly
John Wright
jsw at debian.org
Sat Jul 25 17:38:57 UTC 2009
On Sat, Jul 25, 2009 at 05:02:16PM +0200, John Wright wrote:
> On Sat, Jul 25, 2009 at 02:05:26PM +0200, sean finney wrote:
> > severity 538376 normal
> > thanks
> >
> > okay i take back what i said about this being a regression, it seems that
> > in previous versions (< 0.1.10) it was treated as a plain text field (i.e.
> > no dictionary at all), which meant that the fields were probably ignored
> > entirely in the patch-tracker but are now partially showing up.
> >
> > still a bug though afaict :)
>
> Gah! This is apt_pkg's fault: it strips off the leading '\n'. This
> means our goal of having the output match the input will in general be
> broken when you use apt_pkg. (Right now, the only place that's used by
> default is when you use iter_paragraphs.)
>
> A temporary workaround is to pass use_apt_pkg=False to the
> iter_paragraphs method. On large files, you'll probably notice a bit of
> a performance hit. I'll see what we can do to preserve this information
> with apt_pkg.
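To make the quoted point concrete, here is a minimal pure-Python sketch of why losing the leading '\n' breaks round-tripping. It does not call apt_pkg or debian_bundle; parse_field, dump_field and the sample Files value are made up for illustration, and the strip_leading_newline flag only imitates the behaviour described above.

```python
def parse_field(block, strip_leading_newline=False):
    """Split a 'Name: value' block; the value may span continuation lines."""
    name, _, value = block.partition(":")
    # Drop the spaces/tabs right after the colon, but keep a leading newline.
    value = value.lstrip(" \t")
    if strip_leading_newline:
        # Imitation of the stripping described above (an assumption, not
        # apt_pkg's actual code): all leading whitespace is discarded.
        value = value.lstrip()
    return name, value


def dump_field(name, value):
    """Write the field back out the way it would appear in a file."""
    if value.startswith("\n"):
        return name + ":" + value
    return name + ": " + value


# Hypothetical multi-line field whose value starts on the next line.
original = "Files:\n abc123 10 foo_1.0.dsc\n def456 20 foo_1.0.tar.gz"

kept = dump_field(*parse_field(original))
stripped = dump_field(*parse_field(original, strip_leading_newline=True))

print(kept == original)      # the preserved value round-trips
print(stripped == original)  # the stripped value does not
```

With the newline preserved the dumped text is identical to the input; with it stripped, the first entry moves up onto the "Files:" line.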

I'm actually inclined to turn off using apt_pkg by default. It's
definitely faster, by a factor of roughly 1.7 to 2.6 in the measurements
below, but we keep running into weird corner cases with the way apt_pkg
parses things.

Using sid's Sources and amd64 Packages files, calling the respective
class's iter_paragraphs method with the specified kwargs and throwing
away the results, like

    for d in cls.iter_paragraphs(f, **kwargs): pass

I get the following run times (five runs each):

Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': True}
0: 0:00:08.664978
1: 0:00:07.747378
2: 0:00:07.743156
3: 0:00:07.961919
4: 0:00:07.758220
Average: 0:00:07.975130
Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': False}
0: 0:00:18.505047
1: 0:00:18.179216
2: 0:00:18.179558
3: 0:00:18.415705
4: 0:00:18.182857
Average: 0:00:18.292476
Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': True}
0: 0:00:07.865666
1: 0:00:07.864537
2: 0:00:07.861713
3: 0:00:07.873949
4: 0:00:07.858093
Average: 0:00:07.864791
Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': False}
0: 0:00:13.710405
1: 0:00:13.262080
2: 0:00:13.260217
3: 0:00:13.245185
4: 0:00:13.251963
Average: 0:00:13.345970
Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True}
0: 0:00:06.283796
1: 0:00:06.414739
2: 0:00:06.323466
3: 0:00:06.320447
4: 0:00:06.264290
Average: 0:00:06.321347
Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': False}
0: 0:00:16.653596
1: 0:00:16.637927
2: 0:00:16.805496
3: 0:00:16.631162
4: 0:00:16.614459
Average: 0:00:16.668528
Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True, 'shared_storage': True}
0: 0:00:02.513985
1: 0:00:02.516300
2: 0:00:02.521496
3: 0:00:02.514919
4: 0:00:02.516548
Average: 0:00:02.516649
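
For anyone who wants to reproduce numbers in the format above, a harness along these lines should do it. This is only a sketch, not the script that produced the timings: time_runs and report are hypothetical names, and the workload here is a stand-in so the example runs without python-debian installed.

```python
import datetime


def time_runs(make_iterator, runs=5):
    """Exhaust make_iterator() `runs` times; return (durations, average)."""
    durations = []
    for _ in range(runs):
        start = datetime.datetime.now()
        for _item in make_iterator():
            pass  # throw away the results
        durations.append(datetime.datetime.now() - start)
    average = sum(durations, datetime.timedelta()) / runs
    return durations, average


def report(label, durations, average):
    """Format the timings as a label, one line per run, then the average."""
    lines = [label]
    for i, d in enumerate(durations):
        lines.append("%d: %s" % (i, d))
    lines.append("Average: %s" % average)
    return "\n".join(lines)


# Stand-in workload. With python-debian, make_iterator would instead be
# something like: lambda: cls.iter_paragraphs(open(path), **kwargs)
# (path is a placeholder; the file has to be reopened for every run).
durations, average = time_runs(lambda: iter(range(100000)))
print(report("range(100000)", durations, average))
```
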

Clearly, using shared storage (which basically just means using
apt_pkg's parser.Section directly) is much faster still: about 2.5 s
versus 6.3 s for Deb822 on the Packages file. But this is already not
the default, since it has the confusing side-effect of making the
objects returned by the iterations (each of which has a different id)
actually share the same data.
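
The side-effect is easy to show without apt_pkg at all. The following is a pure-Python imitation under my own assumptions (Paragraph, iter_shared and iter_copied are made-up names, not the debian_bundle classes): every iteration yields a new wrapper object, but in the shared case they all sit on top of one piece of storage that gets overwritten.

```python
class Paragraph:
    """Thin wrapper around a dict of fields (stand-in for a parsed paragraph)."""

    def __init__(self, storage):
        self._storage = storage

    def __getitem__(self, key):
        return self._storage[key]


def iter_shared(records):
    """Yield a new wrapper each time, all backed by the SAME dict."""
    storage = {}
    for rec in records:
        storage.clear()
        storage.update(rec)  # overwrites what earlier wrappers see
        yield Paragraph(storage)


def iter_copied(records):
    """Yield a new wrapper each time, each with its own copy of the data."""
    for rec in records:
        yield Paragraph(dict(rec))


records = [{"Package": "foo"}, {"Package": "bar"}]

shared = list(iter_shared(records))
copied = list(iter_copied(records))

print(shared[0] is shared[1])            # distinct objects
print([p["Package"] for p in shared])    # but both report the last paragraph
print([p["Package"] for p in copied])    # copies keep their own data
```

Anything that keeps a paragraph around after moving on to the next one gets bitten by this, which is why copying is the safer default even though it costs time.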
Is anybody strongly opposed to making iter_paragraphs not use apt_pkg by
default? I'm still trying to figure out a way to salvage the output
from apt_pkg in this case, but I'm not having much luck.
--
John Wright <jsw at debian.org>