Bug#980517: debsums: Parallel checking

Witold Baryluk witold.baryluk at gmail.com
Wed Jan 20 14:26:36 GMT 2021


Hi Axel.

I think a reasonably clean approach would be to have a master thread
(producer) that sends chunks of packages or file paths to worker threads
(consumers) by putting them into a global queue; the workers grab chunks
of file names from the queue and process them. That should give reasonably
solid load balancing. The workers can report results directly to stdout,
as it is line buffered and locked, so it should work.
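
To make the idea concrete, here is a minimal sketch of that producer/consumer
layout. It is written in Python rather than Perl, only to show the shape. The
names check_parallel, worker and CHUNK_SIZE, and the list of (path, md5) pairs
used as input, are assumptions made up for this example and not anything
debsums has today. For simplicity the workers collect mismatches in a shared
list instead of printing to stdout.

```python
import hashlib
import queue
import threading

CHUNK_SIZE = 64      # file paths per work item (hypothetical value)
_SENTINEL = None     # tells a worker that no more work is coming


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def worker(work_queue, mismatches, lock):
    """Consumer: take chunks off the queue until the sentinel arrives."""
    while True:
        chunk = work_queue.get()
        if chunk is _SENTINEL:
            break
        for path, expected in chunk:
            try:
                ok = md5_of(path) == expected
            except OSError:
                ok = False  # unreadable or missing files count as failures
            if not ok:
                with lock:
                    mismatches.append(path)


def check_parallel(expected_sums, num_workers=4):
    """Producer: split (path, md5) pairs into chunks and feed the workers.

    Returns the sorted list of paths whose checksum did not match
    or that could not be read.
    """
    # A bounded queue keeps the producer from running far ahead.
    work_queue = queue.Queue(maxsize=num_workers * 2)
    mismatches, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work_queue, mismatches, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    items = list(expected_sums)
    for start in range(0, len(items), CHUNK_SIZE):
        work_queue.put(items[start:start + CHUNK_SIZE])
    for _ in threads:
        work_queue.put(_SENTINEL)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(mismatches)
```

Because every worker pulls the next chunk as soon as it is free, a package
with a few huge files does not stall the others, which is where the load
balancing comes from. The chunk size is a trade-off: larger chunks mean less
queue traffic, smaller chunks even out the tail.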

I am pretty sure libraries for such worker queues are available for Perl.

On Wed, 20 Jan 2021, 12:20 Axel Beckert, <abe at debian.org> wrote:

> Control: severity -1 wishlist
>
> Hi Witold,
>
> Witold Baryluk wrote:
> > On my 32 core system, and a lot of packages installed (~7000), it takes
> > about one hour for the debsum to check all the files. Despite ability to
> > read all files from the storage in about 4 minutes.
>
> Oh, ok. I would have expected the I/O to be the bottleneck.
>
> > The issue is use of just one thread and most likely not optimized
> > md5sum implementation.
>
> It uses Perl's Digest::MD5 module.
>
> > I think it would be very useful to be able to specify number of parallel
> > threads to use when doing checking manually or from cron.
>
> Ack. I see that as a feature request though, and not really as a bug.
> Hence setting the severity to wishlist.
>
> Also I'm currently not sure how to implement this properly. Splitting
> up the list of files or packages to check and then starting debsums
> for 1/n of these as a subprocess, gathering the output and merging it?
>
> Or trying to find a hashing tool which parallelises this for each
> package. But then again, this likely will generate extra overhead for
> the huge number of small packages and hence might outweigh the gains
> from parallelisation.
>
> IO::AIO might be a potential solution. There's even an MD5 example in
> https://metacpan.org/pod/IO::AIO, but it focusses on hashing huge files
> via mmapping and not really on parallelism (I guess).
>
> > I think it would even be good to enable it by default.
>
> Not sure about that as it takes a lot of factors more to choose the
> right value:
>
> * Current load of the system
> * Installation on what kind of medium? (Spinning disk, SATA SSD, NVMe
>   SSD, RAID1, etc.)
> * Number of threads available in the CPU (of course, too :-)
>
> > (for the case of use from cron, usage of nice / schedtool and/or
> > ionice could mitigate any issues on server or laptops).
>
> The cron jobs already use ionice if it is installed. :-)
>
>                 Regards, Axel
> --
>  ,''`.  |  Axel Beckert <abe at debian.org>, https://people.debian.org/~abe/
> : :' :  |  Debian Developer, ftp.ch.debian.org Admin
> `. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
>   `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
>