Re: [git pull] x86 fixes

From: H. Peter Anvin
Date: Monday, September 8, 2008 - 10:52 am

Linus,

Please pull the latest x86-fixes-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git x86-fixes-for-linus

 Thanks,

	-hpa

------------------>
H. Peter Anvin (1):
      x86: enable CONFIG_X86_GENERIC by default


 arch/x86/Kconfig.cpu |   19 ++++++++++---------
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2c518fb..46d0acf 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -279,17 +279,18 @@ config GENERIC_CPU
 endchoice
 
 config X86_GENERIC
-	bool "Generic x86 support"
+	bool "Generic x86 support" if EMBEDDED
 	depends on X86_32
+	default y
 	help
-	  Instead of just including optimizations for the selected
-	  x86 variant (e.g. PII, Crusoe or Athlon), include some more
-	  generic optimizations as well. This will make the kernel
-	  perform better on x86 CPUs other than that selected.
-
-	  This is really intended for distributors who need more
-	  generic optimizations.
-
+	  Instead of just including optimizations and workarounds for
+	  the selected x86 variant (e.g. PII, Crusoe or Athlon),
+	  include some more generic optimizations and workarounds as
+	  well.  Without this option, the kernel is not guaranteed to
+	  run on anything other than the exact CPU selected.
+
+	  Disable this if you want to run the kernel on a specific CPU
+	  *only* and want maximum optimizations for that CPU.
 endif
 
 config X86_CPU
--

From: Linus Torvalds
Date: Monday, September 8, 2008 - 11:04 am

Ok, so after having realized that this seems to be more about a bug with 
gcc, I'm really not as convinced any more.

As far as I can tell, there are three issues:

 - "-mtune=core/core2/pentium4/.." is buggy in some gas/gcc versions on 
   x86-32, and makes architectural choices.

   Any actual _released_ versions? Maybe it's just a current SVN issue?

   Workaround: don't use it. And yes, X86_GENERIC=y will do that, although 
   quite frankly that seems to be dubious in itself. But quite frankly, 
   it's a gcc bug, and we should see it as such.

   The better workaround may well be "-Wa,-mtune=generic" as you pointed 
   out.

 - We do the CONFIG_P6_NOPL thing ourselves, and we should just stop 
   doing that on 32-bit. There simply isn't a good enough reason to do so. 
   I already posted the Kconfig.cpu patch to just stop doing it.

 - X86_GENERIC means _other_ things too, like doing a 128-bit cacheline 
   just so that it won't suck horribly on P4's even if it's otherwise 
   tuned for a good microarchitecture.

And they really do seem to be _separate_ issues. Do we really want to tie 
these things together under X86_GENERIC? 

		Linus
--

From: Linus Torvalds
Date: Monday, September 8, 2008 - 11:17 am

Hmm. The only other thing seems to be X86_INTEL_USERCOPY. Which doesn't 
seem to be something we want to force either.

And I have to say, that whole X86_GENERIC -> L1_CACHE_BYTES=128 -> 
cache_line_size() -> SLAB/SLUB/SLOB alignment worries me too. Looking at 
that, I really don't feel like I want to force 128-byte alignment on 
everybody, just because the P4 was a pig in cacheline size.

So NOPL really stands out as being different from the other things that 
X86_GENERIC does.

			Linus
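
To make the alignment concern concrete, here is a minimal sketch in C of how rounding every object up to the cache line size changes its footprint. The helper names `align_up` and `padding_per_object` are made up for this illustration and are not kernel functions; the real slab allocators do considerably more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes of padding added when an object is padded out to a full cache line. */
size_t padding_per_object(size_t obj_size, size_t cache_line)
{
    return align_up(obj_size, cache_line) - obj_size;
}
```

Under these assumptions a 40-byte object padded to a 64-byte line carries 24 bytes of padding, while a 128-byte line raises that to 88 bytes.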
--

From: Andi Kleen
Date: Monday, September 8, 2008 - 3:42 pm

SLAB/SLUB should actually auto detect the cache line at runtime.

Similar feeling here.

-Andi

-- 
ak@linux.intel.com
--

From: H. Peter Anvin
Date: Monday, September 8, 2008 - 11:22 am

As far as I can tell, -Wa,-mtune=generic *should* work.  It doesn't look 
to me as if cc1 will generate the long NOPs.  That one we can do 

Well, the argument in favour would be that if you want a kernel that can 
cross between different microarchitectures, then you want the "don't 
suck horribly on any of them".  We can, of course, divide them down 
further, but is it useful?

The "ideal" way to do any of this would probably be to have checkboxes for 
all the CPUs you want to support and then a drop-down box for the CPU to 
optimize for.  However, the combinatorics of that would be horrible, and 
it would be very unlikely we would avoid bugs.

	-hpa
--

From: Arjan van de Ven
Date: Monday, September 8, 2008 - 11:46 am

On Mon, 08 Sep 2008 11:22:24 -0700

the ideal case would be "support them all"

the second-most ideal case would be "support all as of <year>" I suppose

a third one for advanced users not distros would be "support only
<vendor>" since that would be the biggest part of code to drop

between models of the same vendor.. not too much to win there.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: H. Peter Anvin
Date: Monday, September 8, 2008 - 11:51 am

Not really.  That would include things like the i386, which is a bunch 
of really nasty stuff.

	-hpa
--

From: Ingo Molnar
Date: Monday, September 8, 2008 - 12:02 pm

agreed - especially the verify_area() impact makes it a non-starter.

but 486 and higher is certainly quite reasonable, and is still being 
tested.

... and _in practice_ 99% of all systems that run Linux today understand 
CMOV.

... _and_ in practice 99% of all new Linux systems shipped today are 
Core2 or better.

... and so on it goes with this argument. Everyone has a different 
target audience and there's no firm limit. Maybe what makes more sense 
is to have some sort of time dependency:

  support all x86 CPUs released in the last year
  support all x86 CPUs released in the past 5 years
  support all x86 CPUs released in the past 10 years
  support all x86 CPUs released ever
  [ ... or configure a specific model ]

and people/distributions would use _those_ switches. That means we could 
continuously tweak those targets, as systems become obsolete and new 
CPUs arrive.

	Ingo
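
A rough sketch of what a time-based support window would have to decide, written in C. The function names and the idea of comparing a CPU release month against a kernel build month are assumptions made for illustration; nothing like this exists in Kconfig.

```c
#include <assert.h>
#include <stdbool.h>

/* Whole months from (from_year, from_month) to (to_year, to_month). */
int months_between(int from_year, int from_month, int to_year, int to_month)
{
    assert(from_month >= 1 && from_month <= 12);
    assert(to_month >= 1 && to_month <= 12);
    return (to_year - from_year) * 12 + (to_month - from_month);
}

/* True if a CPU released at (rel_year, rel_month) is still inside a
 * support window of window_years at kernel build time. */
bool cpu_in_window(int rel_year, int rel_month,
                   int build_year, int build_month, int window_years)
{
    int age = months_between(rel_year, rel_month, build_year, build_month);
    return age >= 0 && age <= window_years * 12;
}
```

With a hypothetical CPU released in December 2003, a kernel built in September 2008 sees it as 57 months old and inside a 5-year window; a build in March 2009 sees 63 months and drops it.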
--

From: Linus Torvalds
Date: Monday, September 8, 2008 - 12:30 pm

cmov, cmpxchg and xadd are the noticeable things.

I think there are realistically three classes:

 - _really_ old, to the point of being totally useless for SMP.

   This is really just 386 and clones. We _need_ a working WP for a 
   race-free access_ok(), and we need cmpxchg (and lately xadd).

   SMP cannot really realistically work reasonably (yes, there were SMP 
   machines. No, they don't matter), and you'd have to be insane to care 
   about this as a vendor even on UP. Probably nobody really cares (ie if 
   you have hardware that old, you are likely much better off with an 
   older kernel too)

   Smaller pains even on UP: bswap doesn't exist. invlpg doesn't exist. 

 - old. pre-cmov. i486 and pentium, and some clones.

   It's workable, but code generation differences are really big enough 
   that it's worth having a totally separate architecture option for newer 
   CPUs where the kernel simply won't work.

   And most newer distros probably simply don't care, although there may 
   be individual cases where this makes sense (embedded places still use 
   pentium clones etc, and there are probably a fair amount of individuals 
   that want to still use this)

   Other pains: TSC doesn't necessarily exist.

 - "modern 32-bit": PPro and better. Can take CMOV, MMX and TSC for 
   granted.

Yes, there are gradations to the above, but reasonably, those three are I 
think the "architectural" big versions. The rest should be:

 - pure "tuning" options. A Pentium 4 is different from Core 2 in tuning, 
   and the best code sequences can be very very different, but the binary 
   should work on both.

 - with *dynamic* choices for the differences that are architecturally 
   visible.

   Ie the whole choice of syscall/sysenter/int80 is dynamic, not specified 
   statically at compile time with a config option. So are things like the 
   different XMM versions etc.

Hmm? Doesn't that sound like a sane model?

		Linus
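
The three classes above can be sketched as a small C classifier keyed on the features named in this message. The struct, enum and function names are hypothetical and chosen only for this example; the kernel does not classify CPUs this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Feature set mentioned in the message; all names are illustrative. */
struct cpu_features {
    bool wp;       /* write-protect honoured in supervisor mode */
    bool cmpxchg;
    bool xadd;
    bool cmov;
    bool mmx;
    bool tsc;
};

enum cpu_class {
    CPU_CLASS_386,       /* really old: no working WP, cmpxchg or xadd */
    CPU_CLASS_PRE_CMOV,  /* old: i486, Pentium and some clones */
    CPU_CLASS_MODERN32   /* PPro and better: CMOV, MMX and TSC assumed */
};

enum cpu_class classify_cpu(const struct cpu_features *f)
{
    assert(f != NULL);
    if (!f->wp || !f->cmpxchg || !f->xadd)
        return CPU_CLASS_386;
    if (!f->cmov || !f->mmx || !f->tsc)
        return CPU_CLASS_PRE_CMOV;
    return CPU_CLASS_MODERN32;
}
```

In this sketch a CPU that has cmpxchg, xadd and TSC but lacks CMOV lands in the middle class, whatever its clock speed or release date.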
--

From: Arjan van de Ven
Date: Monday, September 8, 2008 - 12:55 pm

On Mon, 8 Sep 2008 12:30:02 -0700 (PDT)

I'd lump all cpus that don't have cpuid in this bucket too (eg half the
486es) simply because not having cpuid is painful in pretty much the


again makes sense; question is if it makes sense to take PSE and PAE

it does to me; the only question is if we hit a new bucket with the
various fancy string instructions that are in upcoming models; doing
string/copy operations inlined for those guys will make a fourth bucket.


--

From: H. Peter Anvin
Date: Monday, September 8, 2008 - 1:14 pm

Not really.  Detecting CPUID is pretty trivial, and we just initialize 

Well, PAE implies PSE.  Unfortunately Intel released a series of 
Pentium-Ms without PAE support.  We *should* be able to take PSE for 
granted, but there is Xen damage.

	-hpa
--

From: Krzysztof Halasa
Date: Monday, September 8, 2008 - 4:17 pm

VIA C3 (Samuel 2/Ezra, 600 - 1000 MHz?, common on VIA EPIA-*: home
theatres etc) can't CMOV.
-- 
Krzysztof Halasa
--

From: Arjan van de Ven
Date: Monday, September 8, 2008 - 11:42 am

On Tue, 09 Sep 2008 01:17:19 +0200

so your cpu does not fall into this bucket......
no big deal.


--

From: Andi Kleen
Date: Tuesday, September 9, 2008 - 3:24 am

AFAIK they fixed that in newer BIOS with a microcode update. It's 
slow, but it works.

-Andi
--

From: Linus Torvalds
Date: Tuesday, September 9, 2008 - 7:54 am

Well, more practically, the C3 simply _isn't_ a "modern 32-bit" one. It 
would fall into the other category of "pre-PPro, but at least better 
than i386".

		Linus
--

From: H. Peter Anvin
Date: Tuesday, September 9, 2008 - 10:01 am

Yes, but if it's slower than jmp+mov then you actively want to avoid it.

	-hpa
--

From: Mark Lord
Date: Tuesday, September 9, 2008 - 10:17 am

..

Our firewall here uses a Via C3-600 CPU, and CMOV has never worked on it.
But based upon your posting, I have today upgraded the BIOS to the
latest (2004) version.

Now.. how can I check whether CMOV works or not?  It's not listed in /proc/cpuinfo.

Thanks
--

From: H. Peter Anvin
Date: Tuesday, September 9, 2008 - 10:19 am

Compile just about any C program with -march=i686.

	-hpa
--

From: Mark Lord
Date: Tuesday, September 9, 2008 - 10:48 am

..

..

Okay, done.  And the binary does indeed have a ton of CMOV instructions.
When running it, this appears immediately:

    Illegal instruction

So much for the "BIOS upgrade fixes CMOV microcode" theory.

Cheers
--

From: Andi Kleen
Date: Tuesday, September 9, 2008 - 11:40 am

If it's not in cpuinfo it won't work.

-Andi
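
The `cmov` flag in /proc/cpuinfo reflects a CPUID feature bit, so the same check can be made directly from user space. The sketch below assumes GCC or Clang on x86 for `<cpuid.h>` and `__get_cpuid`; the helper names `edx_has_feature` and `cpu_has_cmov` are invented for this example. CMOV is reported in bit 15 of EDX for CPUID leaf 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Bit positions in EDX returned by CPUID leaf 1. */
#define CPUID1_EDX_TSC_BIT   4
#define CPUID1_EDX_CMOV_BIT 15
#define CPUID1_EDX_MMX_BIT  23

/* Test a single feature bit in an EDX value. */
bool edx_has_feature(uint32_t edx, unsigned bit)
{
    assert(bit < 32);
    return ((edx >> bit) & 1u) != 0;
}

#ifdef HAVE_X86_CPUID
/* Returns 1 if the CPU reports CMOV, 0 if it does not,
 * and -1 if CPUID leaf 1 is not available at all. */
int cpu_has_cmov(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_has_feature(edx, CPUID1_EDX_CMOV_BIT) ? 1 : 0;
}
#endif
```

This only reads what the CPU advertises. Whether the instruction is then fast enough to be worth using is a separate question.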

--

From: Adrian Bunk
Date: Tuesday, September 9, 2008 - 9:05 am

We use 3DNow! for bigger memcpy's if the kernel is configured for a K7.


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, September 9, 2008 - 9:15 am

It doesn't. I guess I don't care that much, since explicitly asking for 
some odd-ball case does indicate that you want a very specific kernel. I 
guess that's ok. I'm certainly not violently against it.

Of course, I also suspect that we _could_ fix it so that things like 
memcpy really only have two cases:

 - the special inlined "rep movs" thing. Although I'm not actually sure 
   gcc even does this, and I don't think we force it any more.

 - If doing a function call, we could just fix things up to be more 
   dynamic. Of course, the fixups for the SMP cases are scary (ie we'd 
   probably have to first change it to a one-byte "int $3" instruction, 
   then change the target, and then write the first byte back - and handle 
   any race with another CPU by fixing up the trap).

but I dunno.

		Linus
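
As a rough illustration of the "more dynamic" option, here is a user-space C sketch that selects a copy routine once at start-up through a function pointer. This is a simpler stand-in, not the in-place call patching described above, and every name in it (`copy_fn`, `copy_bytewise`, `copy_libc`, `select_copy`, `dyn_copy`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Portable fallback that works on any CPU. */
void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a CPU-specific fast path (e.g. a 3DNow! or SSE copy). */
void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static copy_fn active_copy = copy_bytewise;

/* Called once after feature detection to pick the implementation. */
void select_copy(int cpu_has_fast_copy)
{
    active_copy = cpu_has_fast_copy ? copy_libc : copy_bytewise;
}

/* All callers go through this wrapper, so one binary runs everywhere. */
void *dyn_copy(void *dst, const void *src, size_t n)
{
    assert(active_copy != NULL);
    return active_copy(dst, src, n);
}
```

The pointer indirection costs an indirect call on every copy. Patching the call target in place avoids that cost, at the price of the SMP fixup races mentioned above.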
--

From: Valdis.Kletnieks
Date: Monday, September 8, 2008 - 1:25 pm

That's just *asking* for flame mail if somebody builds a kernel for a system
that's 4 years 9 months old, and he builds a kernel 6 months later, and it fails
to boot because the CPU is now 3 months out and we've deprecated it...

Quick - what year/month was the CPU you're using now released?  No peeking. ;)

(For the record, I have no *clue* when Intel actually released the Core2 T7200,
which is a whole *nother* can of worms - the chip release date can be quite
some time before the system vendor ships, and when the consumer actually buys
it - it's quite possible that we can write "released in the past 5 years",
a user looks at it and says "I bought this system 4 years 2 months ago", and
thinks he's OK, but he's not because he bought a system released 4 years 9 months
ago that used a chipset released 5 years 6 months ago...)
--

From: Ingo Molnar
Date: Tuesday, September 9, 2008 - 12:27 am

yeah, in terms of precision of the definition it's certainly more 
towards the 'vague' end of the spectrum. OTOH, we do change our defaults 
slowly but surely to match the hardware. So this would give a practical 
definition. If someone _does_ complain legitimately, it doesn't cost us 
much to revert a tweak and delay it some more.

So the idea is to have some sort of independent platform, instead of the 
current practice of distros like Debian choosing pretty much random 
options. No strong opinion though. We can cover 90% of the real 
advantages via dynamic methods, it's quite rare that we have to make 
hard .config choices.

Pretty much the only hardcoded aspect that hurts in practice is the 
cache alignment parameter - all the rest is either dynamic already or 
insignificant. Ever since distros have discovered 
CONFIG_CC_OPTIMIZE_FOR_SIZE=y, even the various compiler optimization 
parameters have less of a role. We just have to wait a year or two for 
P4's to not matter that much anymore, then we can do generic kernels 
with 64 byte alignment and cmov, that will just work almost everywhere 
rather optimally.

	Ingo
--

From: Andi Kleen
Date: Monday, September 8, 2008 - 3:43 pm

Support all from the last 10 years (ok excluding legacy models that
just shipped forever like 486). I think that's quite reasonable
to do and worked for a long time.

-Andi
--

From: Adrian Bunk
Date: Tuesday, September 9, 2008 - 9:57 am

As far as I understood it it's a gas issue, and X86_GENERIC=y would 
therefore *not* fix the bug with gcc < 4.2 and affected binutils
since we pass -mtune=i686 for gcc < 4.2 with X86_GENERIC=y.


cu
Adrian

--

From: H. Peter Anvin
Date: Tuesday, September 9, 2008 - 10:03 am

Well, for one thing, gcc doesn't actually pass the -mtune= option to 
gas, it turns out.

But yes, "-Wa,-march=generic32" is really the proper fix.

	-hpa
--

From: Adrian Bunk
Date: Tuesday, September 9, 2008 - 10:43 am

If I understand the binutils changelog correctly -march=generic32 
support was added one week before the NOP code in question, so all 

cu
Adrian

--

From: H. Peter Anvin
Date: Tuesday, September 9, 2008 - 11:12 am

It doesn't, after all, with the current gcc driver.  A future gcc driver 
may change that.  Of course, now when this has popped up on the radar 

s/-march/-mtune/, but yes.  I suspect it was actually added *in order* 
to support the NOP code.

	-hpa
--
