amd64 arch and sse optimization

Mon Feb 3 16:20:46 UTC 2014

On Mon, Feb 3, 2014 at 1:07 PM, Reinhard Tartler <siretart at gmail.com> wrote:
> On Mon, Feb 3, 2014 at 10:56 AM, Reinhard Tartler <siretart at gmail.com> wrote:
>> On Mon, Feb 3, 2014 at 10:43 AM, Felipe Sateler <fsateler at debian.org> wrote:
>
>>>> Please be careful with adding those flags. They instruct gcc to emit
>>>> code that does not work on some machines. The admittedly only way to use
>>>> them correctly is to ensure that the emitted code is only used on
>>>> machines that actually support that.
>>>
>>> The mtune flag instructs the compiler to optimize for a certain
>>> instruction set, but still provide a fallback for when the instructions
>>> are not available. I don't know if this includes the use of SSE or
>>> other coprocessors.
>>>
>>> http://stackoverflow.com/questions/10559275/gcc-how-is-march-different-from-mtune
>>
>> OK, then we will only see performance impact on machines not matching
>> the optimization target. Which seems OK to me, if the benefit for the
>> targeted architectures justifies this. Sorry for the noise.
>
> Please note that Jaromir proposed to use the following flags:
>
> ifeq ($(DEB_HOST_ARCH_CPU),amd64)
> CFLAGS += -msse -msse2 -mfpmath=sse
> endif
>
>
> None of them fall in the "mtune" category, that is, none of them are
> "safe" to use without further precaution measures!

Hmm, now I'm not sure whether they help, or whether they are harmless. If SSE is
required by the amd64 spec, then gcc should enable it by default (in
-O2?). I can't say whether that is true. Perhaps we should take
this to a more knowledgeable audience? debian-devel or debian-gcc
lists would be better, perhaps.

The docs do suggest they are redundant on amd64 [1]:

===
`sse'

Use scalar floating-point instructions present in the SSE instruction
set. This instruction set is supported by Pentium III and newer chips,
and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips. The
earlier version of the SSE instruction set supports only
single-precision arithmetic, thus the double and extended-precision
arithmetic are still done using 387. A later version, present only in
Pentium 4 and AMD x86-64 chips, supports double-precision arithmetic
too.

For the i386 compiler, you must use -march=cpu-type, -msse or -msse2
switches to enable SSE extensions and make this option effective. For
the x86-64 compiler, these extensions are enabled by default.

The resulting code should be considerably faster in the majority of
cases and avoid the numerical instability problems of 387 code, but
may break some existing code that expects temporaries to be 80 bits.

This is the default choice for the x86-64 compiler.

===

[1] http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/i386-and-x86_002d64-Options.html

-- 

Saludos,
Felipe Sateler