[Piuparts-devel] get_files_owned_by_packages

Niels Thykier niels at thykier.net
Wed May 15 06:50:00 BST 2019


Niels Thykier:
> Herbert Fortes:
>> Hi,
>>
>> I did a refactor to get_files_owned_by_packages[0]. I did
>> 5 versions.
>>
>> [0] - https://salsa.debian.org/debian/piuparts/blob/develop/piuparts.py#L1661
>>
>> The best version for the programmer is the one with
>> pathlib and dict.setdefault:
>>
>> vdir = Path("var/lib/dpkg/info")
>> vdict = {}
>>
>> for basename in vdir.glob("*.list"):
>>     for line in basename.read_text().split("\n"):
>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This smells a lot like it will read the whole file into memory, then
> split it into lines (i.e. have the whole file in memory twice for a
> short while) and then loop over it.
> 
> I suspect the old one iterated over the lines one-by-one via buffered
> reads.  This /might/ explain the performance difference you see.
> 
> 
> I suspect that something like:
> 
> """
>   with open(path) as fd:
>     for line in fd:
>       ...
> """
> 
> Will be considerably faster if /var/.../info has a non-trivial size
> (though I am unsure if the "basename" variable from your example can be
> passed to open)
> 

Make that "/var/<...>/info/<...>.list".
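
For what it is worth, a minimal sketch of the line-by-line variant,
assuming Python 3.6 or later, where open() accepts a Path object
directly (Path.open() should work as well).  The function name and the
info_dir parameter are made up for illustration and are not taken from
piuparts.py:

```python
from pathlib import Path


def iter_list_entries(info_dir):
    """Yield (package, path) pairs from every *.list file in info_dir.

    Hypothetical helper for illustration only.
    """
    for list_file in sorted(Path(info_dir).glob("*.list")):
        # Iterating over the file object uses buffered reads and hands
        # out one line at a time, so the whole file is never held in
        # memory at once.
        with open(list_file) as fd:
            for line in fd:
                yield list_file.stem, line.strip()
```

Whether this actually closes the performance gap would still need to be
measured against a realistic /var/<...>/info.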

> 
>>         vdict.setdefault(line.strip(), []).append(basename.stem)

Silly me for missing this one.

This one is also a sneaky time consumer in many cases.  The statement
creates a new empty list on every single iteration and then throws that
list away if the key already exists (as every .list file includes
directories that are common to every package, we will have a
considerable number of duplicates).
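
To illustrate the point: Python evaluates the arguments before calling
setdefault, so the "[]" is built on every call whether or not the key is
present.  A small sketch, using a hypothetical counting helper in place
of the "[]" literal so that the allocations become visible:

```python
created = 0


def new_list():
    """Stand-in for the `[]` literal that counts how often it runs."""
    global created
    created += 1
    return []


owners = {}
for package in ("bash", "coreutils", "dpkg"):
    # "/usr" is already a key from the second iteration on, but
    # new_list() still runs each time because arguments are evaluated
    # before the call.
    owners.setdefault("/usr", new_list()).append(package)

print(created)         # number of lists built
print(owners["/usr"])  # the one list that was actually kept
```

Three lists get built here, but only the first one ends up in the dict.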

There are basically two options.  Go back to (a variant of) the original
bulk code or use a defaultdict.  One caveat: a defaultdict can hide
"KeyError"s, because looking up a missing key inserts an empty list
instead of raising (and that can lead to high memory consumption, as you
are now creating a bunch of empty lists that you did not expect).
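
A rough sketch of the defaultdict option and of the caveat, with made-up
paths and package names.  Converting to a plain dict at the end is one
possible way to get the KeyError back once the dict has been filled; it
is a suggestion, not something the current code does:

```python
from collections import defaultdict

owners = defaultdict(list)
owners["/usr"].append("bash")       # list is created on first access only
owners["/usr"].append("coreutils")  # existing list is reused

# Caveat: a plain lookup of a missing key does not raise KeyError,
# it silently inserts an empty list.
owners["/no/such/path"]
size_after_lookup = len(owners)     # the stray key is now counted

# Membership tests and .get() do not trigger the factory.
"/other" in owners
owners.get("/other")
size_after_checks = len(owners)     # unchanged

# A plain dict raises KeyError for unknown paths again.
frozen = dict(owners)
```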

Thanks,
~Niels



