[Piuparts-devel] get_files_owned_by_packages

Herbert Fortes terberh at gmail.com
Wed May 15 13:22:30 BST 2019


On 5/15/19 2:50 AM, Niels Thykier wrote:
> Niels Thykier:
>> Herbert Fortes:
>>> Hi,
>>>
>>> I did a refactor to get_files_owned_by_packages[0]. I did
>>> 5 versions.
>>>
>>> [0] - https://salsa.debian.org/debian/piuparts/blob/develop/piuparts.py#L1661
>>>
>>> The best version for the programmer is the one with
>>> pathlib and dict.setdefault:
>>>
>>> vdir = Path("var/lib/dpkg/info")
>>> vdict = {}
>>>
>>> for basename in vdir.glob("*.list"):
>>>     for line in basename.read_text().split("\n"):
>>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> This smells a lot like it will read the whole file into memory, then
>> split it into lines (i.e. have the whole file in memory twice for a
>> short while) and the looping over it.
>>
>> I suspect the old one iterated over the lines one-by-one via buffered
>> reads.  This /might/ explain the performance difference you see.

Thanks to make it clear to me.

>>
>>
>> I suspect that something like:
>>
>> """
>>   with open(path) as fd:
>>     for line in fd:
>>       ...
>> """
>>
>> Will be considerably faster if /var/.../info has a non-trivial size
>> (though I am unsure if the "basename" variable from your example can be
>> passed to open)

Path obj has a open() feature. pathlib gives comfort, but
one has to choose where to use it.

>>
> 
> make that "/var/<...>/info/<...>.list"
> 
>>
>>>         vdict.setdefault(line.strip(), []).append(basename.stem)
> 
> Silly me missing this one.
> 
> This one is also a sneaky time consumer in many cases.  This statement
> basically creates a list for every single iteration and then throws the
> list away if the key already exists (as every .list file includes
> directories that are common in every package, we will have a
> considerable number of duplicates).

Yes. Even only that costs more.

> 
> There are basically two options.  Go back to (a variant) of the original
> bulk code or use a defaultdict.  Caveat here; the defaultdict can hide
> "KeyError"s (and that can lead to high memory consumption as you are now
> creating a bunch of empty lists that you did not expect).
> 

I think the version I did with a generator would be my best
suggestion. IMHO it is a little bit better to read and has
the same performance than the original.

- 100 loops
t_get: 24.94818631599992 segundos (100 loops).
and_gen_if_back: 24.977995010000086 segundos (100 loops).

- 1000 loops
t_get: 248.72175344900006 segundos (1000 loops).
and_gen_if_back: 248.58061235099967 segundos (1000 loops).



base_name = ((os.path.join(vdir, basename), basename[:-len(".list")])
             for basename in os.listdir(vdir)
             if basename.endswith(".list"))

for basename, pkg in base_name:
    for line in readlines_file(basename):
        pathname = line.strip()
        if pathname in vdict:
            vdict[pathname].append(pkg)
        else:
            vdict[pathname] = [pkg]


Just in case, the file with tests:
https://paste.debian.net/1081560/



Regards,
Herbert



More information about the Piuparts-devel mailing list