[sane-devel] avision: Possible patch to reduce memory copying for line-interleaved duplex scans

Mon Aug 24 11:00:57 BST 2020

Hello, sane-backend developers, and especially avision developers such
as Michael Niewöhner who helped me so quickly with my previous avision
bug report.

Olaf, thank you for your answer with your detailed advice on how to
submit a suggested change after all your effort in releasing 1.0.31.

Especially because I am not that familiar with the libsane code base
and this may be my first patch submission, I am posting this patch to
the mailing list first to invite review.  Depending on what feedback I
get, I expect then to follow the process of doing a git fork and merge
request with a possibly revised version of this change, but I will
wait to do that fork, in the hopes of perhaps making the git branch
and merge history a bit clearer with respect to the release events.

Anyhow, here is a one paragraph summary of the change followed by a
slightly more detailed discussion.

I believe that the attached patch should eliminate about 6.5 memory
reads and writes of the image received from the scanner for duplex
scans done by backend/avision.c for scanners that interleave the front
and back images line by line.  Although saving 156MB reads + 156MB
writes per 300dpi color A4 or US letter page may seem substantial, I
expect the performance improvement on a modern desktop processor to be
quite small, because the memory will likely fit into per-core data
cache (about 256kB per stripe in that case).  A benchmark data point
suggests the savings are about half of what I would expect even from
that, which worked out to about a 4% user CPU time saving for that
case (more on this below).

Some duplex scanners supported by the backend/avision.c driver send
the data from the front and back scanners interleaved by line, that is
one line from the front scanner followed by one line from the back
scanner.  However, at least when I scan a US letter document at 300
dots per inch, the data is sent in blocks of 32 lines (the 16 even
lines coming from the front and the 16 odd lines coming from the
back).  avision.c handles each line for the back page by saving the
line elsewhere and then calling memmove to copy up the remaining lines
in the 32-line block to cover up the line that was saved away.

The avision driver does this for each odd numbered line, so, for a 32
line stripe, it does this 16 times, although the last time will
normally be a copy of zero bytes.  The first copy copies the 30
remaining lines; the second copy copies 28 remaining lines; the third
copy copies the 26 remaining lines, and so forth.  That is 15 copy
operations of an average of (30 + 2)/2 lines each, so 240 line copies,
an average of 7.5 (240 / 32) copies of each line.  The attached patch
should reduce the number of lines copied per iteration to two, so the
average number of copies per line should be 0.9375 (2 * 15 / 32).
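
To make the counting above concrete, here is a minimal sketch in C
that models the save-then-memmove scheme and counts how many lines
memmove shifts.  It is a hypothetical model written for this
discussion, not code taken from backend/avision.c or from the attached
patch; the function name, parameter names and buffer layout are all
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the per-line compaction described above; this
 * is NOT code from backend/avision.c.  `stripe` holds `lines`
 * interleaved lines of `bpl` bytes each (even = front, odd = back).
 * Back lines are saved to `back`; front lines end up packed at the
 * start of `stripe`.  Returns the number of lines shifted by memmove.
 */
static size_t
deinterleave_by_memmove (unsigned char *stripe, unsigned char *back,
                         size_t lines, size_t bpl)
{
  size_t remaining = lines;     /* lines still held in stripe */
  size_t moved = 0;
  size_t i;

  assert (lines % 2 == 0);
  for (i = 0; i < lines / 2; i++)
    {
      /* After i removals, original line 2*i+1 sits at index i + 1. */
      unsigned char *hole = stripe + (i + 1) * bpl;
      size_t tail = remaining - (i + 2);        /* lines after the hole */

      memcpy (back + i * bpl, hole, bpl);       /* save the back line */
      memmove (hole, hole + bpl, tail * bpl);   /* close the gap */
      moved += tail;
      remaining--;
    }
  return moved;
}
```

Called with lines = 32, this model returns 240 (30 + 28 + ... + 2 + 0),
which matches the 7.5 copies per line estimated above.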

From trying a couple of runs of 10 sheets (20 pages) US letter sized
at 300dpi with and without this change, it appears that the savings
are about 6 - 7.5 milliseconds per page, which is less than half of
what I had expected from reduced L2 cache traffic, which I estimated
thusly:

~156MB reads + ~156MB writes = ~312MB traffic to L2 cache
L2 cache access time is ~12 cycles, which I presume loads a 64 byte
cache line into L1, so, at 4GHz (12 cycles = 3ns), that's 64 bytes /
3ns = ~21.3333 bytes/ns = ~21.3333 GB / sec.
(~0.312GB L2 cache traffic /page)  / (~21.3333 GB / sec) = ~0.014625
seconds/page = 14.625ms / page
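
The same back-of-the-envelope estimate can be written as a small C
helper, which makes it easy to try other cache figures.  This is only
a sketch of the arithmetic above; the function name and parameters are
made up for illustration, and the model (one cache line transferred
per L2 access time) is the same rough assumption used in the estimate.

```c
#include <assert.h>

/*
 * Estimated milliseconds needed to move `traffic_bytes` through the L2
 * cache, assuming (as in the estimate above) that one cache line of
 * `line_bytes` is transferred every `cycles` clock cycles.
 */
static double
l2_transfer_ms (double traffic_bytes, double line_bytes,
                double cycles, double clock_hz)
{
  double bytes_per_sec;

  assert (cycles > 0.0 && clock_hz > 0.0);
  bytes_per_sec = line_bytes / (cycles / clock_hz);
  return traffic_bytes / bytes_per_sec * 1000.0;
}
```

With the figures used above, l2_transfer_ms (312e6, 64, 12, 4e9)
gives about 14.625.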

I am guessing that this discrepancy of probably more than a factor of
two arises, in part, because the ~12 cycle access time for the L2
cache that I found on the internet might include evicting a dirty L1
cache line to L2.

The total amount of user CPU time used without my patch was 156ms, so
this may be a savings of about 4% user CPU time in the case I tried.

Although I believe that scanimage is already fast enough to run the
scanner without pauses if it is the only thing running, I do get
scanner pauses when I run the rest of my rather inefficient scanning
pipeline, and my efforts to optimize or at least parallelize other
parts of that process seem to have reduced the scanner pauses
considerably, so I am glad to take any savings of CPU and competition
for L2 cache in cases where the core doing that memory copying has
some thread from some other program running on its other
hyperthread.

Anyhow, any comments on the above or on this patch are welcome.
Thanks for reading this far if anyone made it to here.

Adam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: avision-duplex-memmove.diff
Type: application/x-patch
Size: 1584 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/sane-devel/attachments/20200824/85124249/attachment.bin>