Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM

Tue Mar 4 10:49:45 UTC 2014

On Tue, 04 Mar 2014 02:59:44 +0000,
peter green <plugwash at p10link.net> wrote:

> Is there any quality 
> difference from using a fpu vs nonfpu decoder?

Technically, there is. See the numbers below for the generic fpu and
non-fpu code, built with and without --enable-int-quality given to
configure (this enables better rounding for a small performance hit;
you might want to activate it by default).

In numbers, the difference is this:

==> src/mpg123.fpu_accurate.compliance.txt <==

==== Layer 3 ====
--> 16 bit signed integer output
compl.bit:	RMS=4.300914e-06 (PASS) maxdiff=7.688999e-06 (PASS)
--> 32 bit integer output
compl.bit:	RMS=2.152784e-08 (PASS) maxdiff=1.769513e-07 (PASS)
--> 24 bit integer output
compl.bit:	RMS=4.206462e-08 (PASS) maxdiff=1.788139e-07 (PASS)
--> 32 bit floating point output
compl.bit:	RMS=2.153045e-08 (PASS) maxdiff=1.769513e-07 (PASS)

==> src/mpg123.fpu.compliance.txt <==

==== Layer 3 ====
--> 16 bit signed integer output
compl.bit:	RMS=8.907757e-06 (LIMITED) maxdiff=1.531839e-05 (PASS)
--> 32 bit integer output
compl.bit:	RMS=2.152589e-08 (PASS) maxdiff=1.769513e-07 (PASS)
--> 24 bit integer output
compl.bit:	RMS=4.205495e-08 (PASS) maxdiff=1.788139e-07 (PASS)
--> 32 bit floating point output
compl.bit:	RMS=2.153045e-08 (PASS) maxdiff=1.769513e-07 (PASS)

==> src/mpg123.nofpu_accurate.compliance.txt <==

==== Layer 3 ====
--> 16 bit signed integer output
compl.bit:	RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
--> 32 bit integer output
compl.bit:	RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
--> 24 bit integer output
compl.bit:	RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
--> 32 bit floating point output
compl.bit:	RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)

==> src/mpg123.nofpu.compliance.txt <==

==== Layer 3 ====
--> 16 bit signed integer output
compl.bit:	RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
--> 32 bit integer output
compl.bit:	RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
--> 24 bit integer output
compl.bit:	RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
--> 32 bit floating point output
compl.bit:	RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)

With a nofpu decoder, you always get the precision of 16 bit output,
because all other output formats, floating point included, are
converted from the 16 bit result. But, especially so with
--enable-int-quality, this is a fully compliant MPEG audio decoder
with all the precision that you need for "normal" playback situations.
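
To put the PASS and LIMITED labels above into context, here is a small
Python sketch. It assumes the accuracy limits usually quoted from the
ISO/IEC 11172-4 compliance test (full accuracy: RMS below
2^-15/sqrt(12) and maximum difference of at most 2^-14 relative to full
scale; limited accuracy: RMS below 2^-11/sqrt(12)). The function names
are made up for illustration and this is not the actual test program;
it only reproduces the labels for a few of the values listed above.

```python
import math

# Limits as commonly quoted from ISO/IEC 11172-4 (assumption, see text).
FULL_RMS = 2.0 ** -15 / math.sqrt(12)     # about 8.81e-06
LIMITED_RMS = 2.0 ** -11 / math.sqrt(12)  # about 1.41e-04
FULL_MAXDIFF = 2.0 ** -14                 # about 6.10e-05


def rms_label(rms):
    """Classify an RMS difference relative to full scale."""
    if rms < FULL_RMS:
        return "PASS"
    if rms < LIMITED_RMS:
        return "LIMITED"
    return "FAIL"


def maxdiff_label(maxdiff):
    """Classify a maximum absolute difference relative to full scale."""
    return "PASS" if maxdiff <= FULL_MAXDIFF else "FAIL"


if __name__ == "__main__":
    # (name, RMS, maxdiff) for 16 bit output, copied from the listings.
    rows = [
        ("fpu_accurate", 4.300914e-06, 7.688999e-06),
        ("fpu", 8.907757e-06, 1.531839e-05),
        ("nofpu_accurate", 4.344827e-06, 1.275539e-05),
        ("nofpu", 7.927192e-06, 2.676249e-05),
    ]
    for name, rms, maxdiff in rows:
        print(name, rms_label(rms), maxdiff_label(maxdiff))
```

Under these assumed limits, only the plain fpu build lands just above
the full-accuracy RMS limit for 16 bit output, which is consistent with
the single LIMITED entry in the listings.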

MAD claims 24 bit precision with integer math
(just about matching mpg123's 24 bit output with the FPU decoder, see
http://www.underbit.com/resources/mpeg/audio/compliance, RMS=4.906e-08).
I suspect, though, that MAD will be considerably slower than mpg123's
arm_nofpu decoder. On my Core2Duo P8800, madplay with libmad 0.15.1
needs about 7.4 s to 8.5 s for decoding to null output (with either
speed or accuracy optimization). The mpg123 numbers for the generic
variants (accurate == --enable-int-quality):

==> src/mpg123.fpu_accurate.bench.txt <==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder	t_s16/s	t_f32/s
generic	6.16	5.85

==> src/mpg123.fpu.bench.txt <==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder	t_s16/s	t_f32/s
generic	6.05	5.83

==> src/mpg123.nofpu_accurate.bench.txt <==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder	t_s16/s	t_f32/s
generic	6.67	6.81

==> src/mpg123.nofpu.bench.txt <==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder	t_s16/s	t_f32/s
generic	6.01	6.16

You see, there is some hit from accurate rounding, but it is in a
different league compared to the difference between fpu and nofpu on a
NEON-less ARM device (and yes, on an x86 CPU, generic FPU code is faster
when actually producing float output).
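
For a quick feel of how large that hit is, the following Python sketch
computes the relative slowdown from the benchmark listings above. The
timings are copied from those listings; the helper name is only for
illustration.

```python
# User CPU seconds copied from the bench listings: (t_s16, t_f32).
BENCH = {
    "fpu": (6.05, 5.83),
    "fpu_accurate": (6.16, 5.85),
    "nofpu": (6.01, 6.16),
    "nofpu_accurate": (6.67, 6.81),
}


def overhead_percent(base, accurate):
    """Relative slowdown of the accurate build over the plain build."""
    return (accurate - base) / base * 100.0


if __name__ == "__main__":
    for kind in ("fpu", "nofpu"):
        for idx, fmt in enumerate(("s16", "f32")):
            base = BENCH[kind][idx]
            accurate = BENCH[kind + "_accurate"][idx]
            pct = overhead_percent(base, accurate)
            print(f"{kind} {fmt}: +{pct:.1f} %")
```

On these Core2 numbers the accurate rounding costs roughly 2 % or less
with the fpu code and about 11 % with the nofpu code.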

Oh, and remember: This is mpg123 with the handbrake on. Using Taihei's
assembly optimizations, the decoding time is about halved on the Core2.
Similarly, I'd like to see numbers for madplay on ARM (ideally on
machines with and without FPU, to get a picture of what difference we
are talking about):

sh$ time madplay -d -o null convergence_-_points_of_view/*.mp3
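
If the shell's time output is awkward to collect across machines, a
small Python wrapper can report the same user CPU figure that the
mpg123 benchmark tables use. This is only a sketch for Unix-like
systems; the script name cputime.py used below is hypothetical.

```python
import resource
import subprocess
import sys


def user_cpu_seconds(cmd):
    """Run cmd and return the user CPU time its process tree consumed."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    # Decoder chatter is discarded; a non-zero exit status raises an error.
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before


if __name__ == "__main__":
    # Without arguments, time a trivial Python start-up as a smoke test.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(f"{user_cpu_seconds(cmd):.2f} s user CPU")
```

Called from a shell as "python3 cputime.py" followed by the decoder
command line, the shell expands the *.mp3 glob before the script sees
it, so the whole album is timed in one run.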

I don't know offhand how mpg123 nofpu stacks up against that, but there
should be a considerable difference in speed. My guess is that, on
limited hardware without NEON, you'd prefer stutter-free playback with
the least CPU power draw. When utmost theoretical quality really
matters, or you intend extensive post-processing of the data
(especially using an audio player that works with floating point math
internally, like audacious), then I'd expect you to employ a more
capable CPU with NEON. The mpg123 nofpu decoder, according to Riku's
numbers, is still a good choice for systems with an FPU but no NEON,
but the generic floating point decoder is not that far behind in speed
(compared to softfloat) and offers proper floating point accuracy as a
bonus.

Generally, it is a safe bet that any normal person is quite happy with
16 bit accuracy for decoded MP3s. Depending on the initial quality of
the encoding, this might be everything that is sensible anyway (and
it's a challenge to hear any difference to 24 bit even on an
arbitrarily expensive HiFi system). There are people who prefer their
16 bit output rounded using dithering, which mpg123 also offers
(--with-cpu=generic_dither), but that excludes the optimizations for
ARM.
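
For readers unfamiliar with dithering, here is a generic Python sketch
of the idea: add a little noise before rounding to 16 bit, so that the
rounding error no longer depends on the signal. It uses plain
triangular (TPDF) noise and is not mpg123's generic_dither
implementation; the function name and the test level are assumptions
made for the example.

```python
import math
import random


def to_int16(sample, dither=False, rng=random):
    """Convert a float sample in [-1.0, 1.0) to a 16 bit integer."""
    scaled = sample * 32768.0
    if dither:
        # TPDF dither: difference of two uniform values, within +/- 1 LSB.
        scaled += rng.random() - rng.random()
    value = math.floor(scaled + 0.5)  # round to nearest integer
    return max(-32768, min(32767, value))  # clip to the 16 bit range


if __name__ == "__main__":
    rng = random.Random(1)
    level = 0.25 / 32768.0  # a constant signal a quarter LSB above zero
    count = 20000
    plain = sum(to_int16(level) for _ in range(count)) / count
    dithered = sum(to_int16(level, True, rng) for _ in range(count)) / count
    # Plain rounding loses the signal entirely; with dither, the average
    # output stays close to the true level of 0.25 LSB.
    print(plain, dithered)
```

The price is a slightly higher noise floor, which is the trade-off
those listeners accept.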

We are talking about the default setup for the majority of Debian users
here. Any quality choice should be fine for that; after all, we're
talking compliant MPEG quality in any case (sometimes 'limited
precision', but still). Audiophiles wanting the utmost quality from
their setup (as funny as that is to many audiophiles when starting from
a lossy compression;-) will love to tweak things anyway. They can
always do their own build, or use an additional repository (think
Ubuntu PPAs for various such purposes) that provides a different
flavour.

The quality difference between 1 h and 10 h of battery time while
playing music is very much noticeable to anyone, so the choice on armel
should be settled. On armhf, there are cases where arm_nofpu would be
the better choice (decoding to 16 bit without NEON), but an increase in
CPU demand of about 50 % is less dramatic, and it evens out when using
floating point output.

In any case ... Riku: Care to run timings of MAD on your
configurations? I'm interested in how fast it is producing that 24 bit
output on limited CPUs.

> Lennart Sorensen wrote:
> > I think so.  armhf's current debian rules automatically picked arm_fpu
> IMO it's often better to be explicit about this sort of thing. 

I agree. Never trust upstream's defaults in such sensitive matters;-)

Alrighty then,

Thomas
