[Piuparts-devel] get_files_owned_by_packages
Herbert Fortes
terberh at gmail.com
Wed May 15 13:22:30 BST 2019
On 5/15/19 2:50 AM, Niels Thykier wrote:
> Niels Thykier:
>> Herbert Fortes:
>>> Hi,
>>>
>>> I did a refactor to get_files_owned_by_packages[0]. I did
>>> 5 versions.
>>>
>>> [0] - https://salsa.debian.org/debian/piuparts/blob/develop/piuparts.py#L1661
>>>
>>> The best version for the programmer is the one with
>>> pathlib and dict.setdefault:
>>>
>>> vdir = Path("var/lib/dpkg/info")
>>> vdict = {}
>>>
>>> for basename in vdir.glob("*.list"):
>>> for line in basename.read_text().split("\n"):
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> This smells a lot like it will read the whole file into memory, then
>> split it into lines (i.e. have the whole file in memory twice for a
>> short while) and the looping over it.
>>
>> I suspect the old one iterated over the lines one-by-one via buffered
>> reads. This /might/ explain the performance difference you see.
Thanks to make it clear to me.
>>
>>
>> I suspect that something like:
>>
>> """
>> with open(path) as fd:
>> for line in fd:
>> ...
>> """
>>
>> Will be considerably faster if /var/.../info has a non-trivial size
>> (though I am unsure if the "basename" variable from your example can be
>> passed to open)
Path obj has a open() feature. pathlib gives comfort, but
one has to choose where to use it.
>>
>
> make that "/var/<...>/info/<...>.list"
>
>>
>>> vdict.setdefault(line.strip(), []).append(basename.stem)
>
> Silly me missing this one.
>
> This one is also a sneaky time consumer in many cases. This statement
> basically creates a list for every single iteration and then throws the
> list away if the key already exists (as every .list file includes
> directories that are common in every package, we will have a
> considerable number of duplicates).
Yes. Even only that costs more.
>
> There are basically two options. Go back to (a variant) of the original
> bulk code or use a defaultdict. Caveat here; the defaultdict can hide
> "KeyError"s (and that can lead to high memory consumption as you are now
> creating a bunch of empty lists that you did not expect).
>
I think the version I did with a generator would be my best
suggestion. IMHO it is a little bit better to read and has
the same performance than the original.
- 100 loops
t_get: 24.94818631599992 segundos (100 loops).
and_gen_if_back: 24.977995010000086 segundos (100 loops).
- 1000 loops
t_get: 248.72175344900006 segundos (1000 loops).
and_gen_if_back: 248.58061235099967 segundos (1000 loops).
base_name = ((os.path.join(vdir, basename), basename[:-len(".list")])
for basename in os.listdir(vdir)
if basename.endswith(".list"))
for basename, pkg in base_name:
for line in readlines_file(basename):
pathname = line.strip()
if pathname in vdict:
vdict[pathname].append(pkg)
else:
vdict[pathname] = [pkg]
Just in case, the file with tests:
https://paste.debian.net/1081560/
Regards,
Herbert
More information about the Piuparts-devel
mailing list