Re: [rfc][patch 2/3] x86: fix IO write barriers

From: Nick Piggin
Date: Wednesday, October 3, 2007 - 10:21 pm

Hi,

Here's a couple of patches to improve the memory barrier situation on x86.
They probably aren't going upstream until after the x86 merge, however I'm
posting them here for RFC, and in case anybody wants to backport into stable
trees.

---
movnt* instructions are not strongly ordered with respect to other stores,
so if we are to assume stores are strongly ordered in the rest of the x86_64
kernel, we must fence these off (see similar examples in i386 kernel).

[ The AMD memory ordering document seems to say that nontemporal stores can
  also pass earlier regular stores, so maybe we need sfences _before_ movnt*
  everywhere too? ]

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/arch/x86_64/lib/copy_user_nocache.S
===================================================================
--- linux-2.6.orig/arch/x86_64/lib/copy_user_nocache.S
+++ linux-2.6/arch/x86_64/lib/copy_user_nocache.S
@@ -117,6 +117,7 @@ ENTRY(__copy_user_nocache)
 	popq %rbx
 	CFI_ADJUST_CFA_OFFSET -8
 	CFI_RESTORE rbx
+	sfence
 	ret
 	CFI_RESTORE_STATE
 
-
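
The rule behind this patch (fence after non-temporal stores so later ordinary stores cannot pass them) can be sketched in userspace C with SSE2 intrinsics. This is only an illustration under stated assumptions: `copy_words_nocache` is a hypothetical helper, not the kernel's `__copy_user_nocache`, and the `memcpy` fallback is there only so the sketch builds on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti) and _mm_sfence */
#endif

/*
 * Hypothetical helper (not from the patch): copy n 32-bit words with
 * non-temporal stores, then fence so the copied data is ordered before
 * any store the caller issues afterwards.
 */
static void copy_words_nocache(int32_t *dst, const int32_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32((int *)&dst[i], (int)src[i]);  /* weakly ordered store */
    _mm_sfence();  /* plays the role of the sfence added before ret */
#else
    /* Assumption: no movnt-style stores on this target, so copy normally. */
    memcpy(dst, src, n * sizeof(*dst));
#endif
}
```

A caller that sets a "data is ready" flag after `copy_words_nocache()` returns can then rely on the usual x86 store ordering for that flag.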

From: Nick Piggin
Date: Wednesday, October 3, 2007 - 10:22 pm

wmb() on x86 must always include a barrier, because stores can go out of
order in many cases when dealing with devices (eg. WC memory).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -216,6 +216,7 @@ static inline unsigned long get_limit(un
 
 #define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 
 /**
  * read_barrier_depends - Flush all pending reads that subsequents reads
@@ -271,18 +272,14 @@ static inline unsigned long get_limit(un
 
 #define read_barrier_depends()	do { } while(0)
 
-#ifdef CONFIG_X86_OOSTORE
-/* Actually there are no OOO store capable CPUs for now that do SSE, 
-   but make it already an possibility. */
-#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
-#else
-#define wmb()	__asm__ __volatile__ ("": : :"memory")
-#endif
-
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
 #define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() 	wmb()
+#else
+# define smp_wmb()	barrier()
+#endif
 #define smp_read_barrier_depends()	read_barrier_depends()
 #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
 #else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -159,12 +159,8 @@ static inline void write_cr8(unsigned lo
  */
 #define mb() 	asm volatile("mfence":::"memory")
 #define rmb()	asm volatile("lfence":::"memory")
-
-#ifdef CONFIG_UNORDERED_IO
 #define wmb()	asm volatile("sfence" ::: "memory")
-#else
-#define wmb()	asm volatile("" ::: ...
From: Dave Jones
Date: Thursday, October 4, 2007 - 10:32 am

On Thu, Oct 04, 2007 at 07:22:58AM +0200, Nick Piggin wrote:

 > -#ifdef CONFIG_X86_OOSTORE
 > -/* Actually there are no OOO store capable CPUs for now that do SSE, 
 > -   but make it already an possibility. */
 > -#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 > -#else
 > -#define wmb()	__asm__ __volatile__ ("": : :"memory")
 > -#endif
 > -
 >  #ifdef CONFIG_SMP
 >  #define smp_mb()	mb()
 >  #define smp_rmb()	rmb()
 > -#define smp_wmb()	wmb()
 > +#ifdef CONFIG_X86_OOSTORE
 > +# define smp_wmb() 	wmb()
 > +#else
 > +# define smp_wmb()	barrier()
 > +#endif

The only vendor that ever implemented OOSTOREs was Centaur, and they
only did in the Winchip generation of the CPUs.  When they dropped it
from the C3, I asked whether they intended to bring it back, and the
answer was "extremely unlikely".

So we can probably just drop that "just in case" clause above, and just
do..

 #define smp_wmb()  barrier()


	Dave

-- 
http://www.codemonkey.org.uk
-

From: Andi Kleen
Date: Thursday, October 4, 2007 - 10:53 am

Do you know if it made a big performance difference?

But yes we should probably just remove this special case to make 
maintenance easier.

-Andi
-

From: Dave Jones
Date: Thursday, October 4, 2007 - 11:10 am

On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
 > 
 > > The only vendor that ever implemented OOSTOREs was Centaur, and they
 > > only did in the Winchip generation of the CPUs.  When they dropped it
 > > from the C3, I asked whether they intended to bring it back, and the
 > > answer was "extremely unlikely".
 > >
 > 
 > Do you know if it made a big performance difference?

On the winchip, it was a huge win. I can't remember exact numbers,
but pretty much every benchmark I threw at it at the time showed
significant improvement.

 > But yes we should probably just remove this special case to make 
 > maintenance easier.

It's CONFIG_SMP anyway, which none of the winchips were.
SMP+OOSTORE just didn't happen, and I'd be surprised if
any vendor makes it happen any time soon.
(Even if so, it's likely we'd need to make additional changes
 anyway, so adding it back shouldn't be a big deal.)

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Andi Kleen
Date: Thursday, October 4, 2007 - 11:21 am

It's not. And we need memory barriers even without SMP 
when talking to device drivers. Only the smp_*b()s get noped
on UP.

-Andi
-
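
The distinction drawn here (mandatory barriers for devices versus smp_*() barriers between CPUs) can be sketched with simplified userspace stand-ins for the kernel macros. Everything named below is hypothetical and for illustration only: `my_barrier()`, `my_wmb()`, `my_smp_wmb()`, `struct fake_dev`, `post_descriptor()` and `publish()` are not kernel definitions, `struct fake_dev` is ordinary memory standing in for a write-combining device mapping, and the non-x86 fallback uses a GCC/Clang builtin instead of sfence.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier: stops the compiler from reordering memory accesses. */
#define my_barrier() __asm__ __volatile__("" ::: "memory")

/* Mandatory write barrier: a real fence instruction, needed on UP and SMP. */
#if defined(__x86_64__)
# define my_wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
# define my_wmb() __atomic_thread_fence(__ATOMIC_SEQ_CST) /* assumption: non-x86 */
#endif

/*
 * CPU-to-CPU write barrier: stores to cacheable memory are already seen in
 * program order on x86, so only the compiler has to be restrained.
 */
#define my_smp_wmb() my_barrier()

/* Stand-in for a device whose registers live in write-combining memory. */
struct fake_dev {
    volatile uint32_t desc;      /* descriptor the device will read */
    volatile uint32_t doorbell;  /* writing 1 tells the device to go */
};

static void post_descriptor(struct fake_dev *dev, uint32_t value)
{
    assert(dev);
    dev->desc = value;
    my_wmb();            /* descriptor must be visible before the doorbell */
    dev->doorbell = 1;
}

/* Publication to another CPU through ordinary cacheable memory. */
static uint32_t shared_data;
static volatile int shared_ready;

static void publish(uint32_t value)
{
    shared_data = value;
    my_smp_wmb();        /* compiler barrier only */
    shared_ready = 1;
}
```

The point of the split is that `post_descriptor()` keeps its fence no matter how the kernel is configured, while `publish()` never pays for one on in-order-store x86.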

From: Dave Jones
Date: Thursday, October 4, 2007 - 11:41 am

On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
 > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
 > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
 > >  > 
 > >  > > The only vendor that ever implemented OOSTOREs was Centaur, and they
 > >  > > only did in the Winchip generation of the CPUs.  When they dropped it
 > >  > > from the C3, I asked whether they intended to bring it back, and the
 > >  > > answer was "extremely unlikely".
 > >  > >
 > >  > 
 > >  > Do you know if it made a big performance difference?
 > > 
 > > On the winchip, it was a huge win. I can't remember exact numbers,
 > > but pretty much every benchmark I threw at it at the time showed
 > > significant improvement.
 > 
 > Significant as in >10%?

"Worth about 10-20% performance" according to the 2.4.18pre9-ac4
release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN

 > >  > But yes we should probably just remove this special case to make 
 > >  > maintenance easier.
 > > It's CONFIG_SMP anyway, which none of the winchips were.
 > 
 > It's not.

You're right, it isn't now, but Nick's patch seems to change it so that it is.

...

 #ifdef CONFIG_SMP
 #define smp_mb()       mb()
 #define smp_rmb()      rmb()
-#define smp_wmb()      wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb()     wmb()
+#else
+# define smp_wmb()     barrier()
+#endif

 > And we need memory barriers even without SMP 
 > when talking to device drivers. Only the smp_*b()s get noped
 > on UP.

Good point.

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Andi Kleen
Date: Thursday, October 4, 2007 - 11:58 am

That is only for smp_wmb(), which is always SMP-only.

-Andi
-

From: Dave Jones
Date: Thursday, October 4, 2007 - 12:08 pm

On Thu, Oct 04, 2007 at 08:58:27PM +0200, Andi Kleen wrote:
 > On Thursday 04 October 2007 20:41:07 Dave Jones wrote:
 > > On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
 > >  > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
 > >  > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
 > >  > >  > 
 > >  > >  > > The only vendor that ever implemented OOSTOREs was Centaur, and they
 > >  > >  > > only did in the Winchip generation of the CPUs.  When they dropped it
 > >  > >  > > from the C3, I asked whether they intended to bring it back, and the
 > >  > >  > > answer was "extremely unlikely".
 > >  > >  > >
 > >  > >  > 
 > >  > >  > Do you know if it made a big performance difference?
 > >  > > 
 > >  > > On the winchip, it was a huge win. I can't remember exact numbers,
 > >  > > but pretty much every benchmark I threw at it at the time showed
 > >  > > significant improvement.
 > >  > 
 > >  > Significant as in >10%?
 > > 
 > > "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
 > > release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN
 > 
 > Are there numbers for a newer kernel available too?

no idea, my winchips died about 5 years ago.

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Alan Cox
Date: Thursday, October 4, 2007 - 1:52 pm

Got a couple here just need a mainboard 8)
-

From: Nick Piggin
Date: Wednesday, October 3, 2007 - 10:23 pm

According to latest memory ordering specification documents from Intel and
AMD, both manufacturers are committed to in-order loads from cacheable memory
for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(eg. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Special string stores are safe -- their
constituent stores may be out of order, but they must complete in order WRT
surrounding stores. Nontemporal stores to WB memory can go out of order, and so
they should be fenced explicitly to make them appear in-order WRT other stores.
Hence, smp_wmb() may be a simple barrier.

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

In userspace microbenchmarks on a core2 system, fence instructions range
anywhere from around 15 cycles to 50, which may not be totally insignificant
in performance critical paths (code size will go down too).

However the primary motivation for this is to have the canonical barrier
implementation for x86 architecture.

smp_rmb on buggy pentium pros remains a locked op, which is apparently
required.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -274,7 +274,11 @@ static inline unsigned long get_limit(un
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
-#define smp_rmb()	rmb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb()	rmb()
+#else
+# define smp_rmb()	barrier()
+#endif
 #ifdef CONFIG_X86_OOSTORE
 # define smp_wmb() 	wmb()
 #else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ ...
From: Jarek Poplawski
Date: Friday, October 12, 2007 - 1:25 am

...

Great news!

First it looks like a really great thing that it's revealed at last.
But then... there is probably some confusion: did we have to use
ineffective code for so long?

First again, we could try to blame Intel etc. But then, wait a minute:
is it such a mystery knowledge? If this reordering is done there are
some easy rules broken (just like in examples from these manuals). And
if somebody cared to do this for optimization, then this is probably
noticeable optimization, let's say 5 or 10%. Then any test shouldn't
need to take very long to tell the truth in less than 100 loops!

So, maybe linux needs something like this, instead of waiting a few
years with each new model for the vendors' goodwill? IMHO, even for less
popular processors, this could be checked under some debugging option
at system start (after disabling the suspicious barrier for a while,
plus some WARN_ONs).

Thanks,
Jarek P.
-

From: Helge Hafting
Date: Friday, October 12, 2007 - 1:42 am

You could have tried the optimization before, and
gotten better performance. But without solid knowledge that
the optimization is _valid_, you risk having a kernel
that performs great but suffers the occasional glitch and
is therefore unstable and crashes the machine "now and then".
This sort of thing can't really be figured out by experimentation, because
the bad cases might happen only with some processors, some
combinations of memory/chipsets, or with some minimum
number of processors.  Such problems can be very hard
to find, especially considering that other plain bugs also
cause crashes.

Therefore, the "ineffective code" was used because it was
the only safe alternative. Now we know, so now we may optimize.


Helge Hafting
-

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 2:12 am

Sorry, I don't understand this logic at all. Since bad cases
happen independently from any specifications and Intel doesn't
take any legal responsibility for such information, it seems we
should better still not optimize?

Jarek P.
-

From: Nick Piggin
Date: Friday, October 12, 2007 - 2:44 am

We already do so in cases that are probably more critical and more liable
to be problematic (notably, spin_unlock).

So unless there is reasonable information for us to believe this
will be a problem, IMO the best thing to do is stick with the
specs. Intel is pretty reasonable with documenting errata I think.

With memory barriers specifically, I'm sure we have many more bugs
in the kernel than AMD or Intel have in their chips ;)

-

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 3:04 am

On Fri, Oct 12, 2007 at 11:44:27AM +0200, Nick Piggin wrote:

100% right - if there are any specs. But it seems for a few years
this spec was missing or there is some change of mind, I presume?

Jarek P.
-

From: Helge Hafting
Date: Friday, October 12, 2007 - 5:44 am

The point is that we _trust_ intel when they say "this will work".
Therefore, we can use the optimizations. It was never about
legal matters. If we didn't trust intel, then we couldn't
use their processors at all.

We couldn't take the chance before. It was not documented
to work, verification by testing would not be trivial at all for
this case.
Linux is about "stability first, then performance".
Now we _know_ that we can have this optimization without
compromising stability. Nobody knew before!

Helge Hafting
-

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 6:29 am

On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote:

But there was nothing about trust. Usually you don't trust somebody
but somebody's opinions. The problem is there was no valid opinion,

So, you think this would be the first or the least credibly
verified undocumented feature used in linux? Then, it seems
I can try to install this linux on my laptop at last! (...
And, I can trust you, it will not break anything...?)

Thanks,
Jarek P.
-

From: Helge Hafting
Date: Monday, October 15, 2007 - 3:17 am

"Trusting people or their opinions" is only about use of the
English language, and not that interesting to bring up here.
Surely you know that lots of people here have English as
a secondary language only. Interesting for me to know, but
I never claimed that linux will work on your laptop, so no:
You can't take my word for that, because I never gave it!
It is well known that some laptops don't work with linux,
I have no idea if yours will work, I don't even know what kind it is.

I told you the reasoning behind using _this particular optimization_,
the same does _not_ apply to everything else. If you think every
kernel decision is made the same way, then you are mistaken.
Things don't work that way.
First, several people are involved - they think differently.
Second, "what kind of tricks to use" is not an all-or-nothing
approach. If linux were to use every undocumented trick
that might or might not work, then linux would fail on
lots of hardware. It would not be useful.
If linux took the other approach and never used any "tricks",
then it'd be slow and boring.

Some things are much easier to test - you construct a testcase
or just build a test kernel and benchmark it. If all is ok, then
the "trick" is useable. Some cases are a clear win for lots of
machines, and the possible failure cases involves
very rare hardware. So it might get used. Some tricks have
a failure mode that is rare but completely obvious when it happens.
So it gets used, and "troublesome hardware" is added to a blacklist
as needed.

Some "tricks" however, are hard to figure out without docs.
There may be no good way to test. The tricks
may cause instability that will be very hard to track down, and this could
happen on a wide range of hardware. So such tricks don't get used until
adequate documentation appears. In this case, it seems like intel,
who make and design the processors in question and therefore
know them well enough, provided such documentation. That
makes a previously dubious optimization ...
From: Jarek Poplawski
Date: Monday, October 15, 2007 - 4:53 am

Of course, I know this problem: sometimes it's very hard to make people
believe it's my secondary language! But this time I didn't see any
language problem. I simply pointed out that sometimes trusting could be

OK, this was supposed to be a joke... (Btw, can you remember burning
linux laptops?) I thought this "stability first" a bit funny, but this
was a really bad joke, sorry.

Thanks for these additional explanations - you are completely right!

Regards,
Jarek P.
-

From: Nick Piggin
Date: Friday, October 12, 2007 - 1:57 am

I'm not sure exactly what the situation is with the manufacturers,
but maybe they (at least Intel) wanted to keep their options open
WRT their barrier semantics, even if current implementations were

I don't know quite what you're saying... the CPUs could probably get
performance by having weakly ordered loads, OTOH I think the Intel
ones might already do this speculatively so they appear in order but
essentially have the performance of weak order.

If you're just talking about this patch, then it probably isn't much
performance gain. I'm guessing you'd be lucky to measure it from

I don't know if that would be worthwhile. It actually isn't always
trivial to trigger reordering. For example, on my dual-core core2,
in order to see reads pass writes, I have to do work on a set that
exceeds the cache size and does a huge amount of work to ensure it
is going to trigger that. If you can actually come up with a test
case that triggers load/load or store/store reordering, I'm sure
Intel / AMD would like to see it ;)

All existing processors as far as we know are in-order WRT loads vs
loads and stores vs stores. It was just a matter of getting the docs
clarified, which gives us more confidence that we're correct and a
reasonable guarantee of forward compatibility.

So, I think the plan is just to merge these 3 patches during the
current window.

-
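
What "in-order WRT loads vs loads and stores vs stores" buys a lockless algorithm can be sketched as a message-passing pair in userspace C11. This is an illustration, not kernel code: `struct mailbox`, `mailbox_send()` and `mailbox_try_recv()` are hypothetical names. The release store and acquire load express the store/store and load/load ordering that smp_wmb() and smp_rmb() provide; on x86, compilers typically emit plain mov instructions for both, with no fence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int payload;          /* plain data, written before the flag */
    atomic_bool ready;    /* flag, written after the payload */
};

/* Producer: write the payload, then publish it by setting the flag. */
static void mailbox_send(struct mailbox *mb, int value)
{
    assert(mb);
    mb->payload = value;
    /* Store/store ordering: like smp_wmb() followed by the flag store. */
    atomic_store_explicit(&mb->ready, true, memory_order_release);
}

/* Consumer: read the flag first, and the payload only if the flag is set. */
static bool mailbox_try_recv(struct mailbox *mb, int *out)
{
    assert(mb && out);
    /* Load/load ordering: like the flag load followed by smp_rmb(). */
    if (!atomic_load_explicit(&mb->ready, memory_order_acquire))
        return false;
    *out = mb->payload;
    return true;
}
```

If either ordering could be violated, a consumer might see `ready` set and still read a stale `payload`; that is the load/load or store/store reordering test case being asked for above.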

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 2:55 am

I meant: if there is any reordering possible this should be quite
distinctly visible, because why would any vendor enable such nasty
things if not for performance. But now I start to doubt: of course
there is such a possibility someone makes this reordering for some
other reasons which could be so rare it's hard to check. And this
someone knows its processors are seen as less efficient because of eg.

No, it's only about the comment to this patch: "Hence, smp_rmb() may be

Anyway, it seems any heavy testing such as yours should give us the
same information years earlier than any vendor's manual, and then any
gain is multiplied by millions of users. Then only still doubtful
cases could be treated with additional caution and some debugging

After reading this Intel's legal information I don't think you should

And they really should be!

Jarek P.
-

From: Nick Piggin
Date: Friday, October 12, 2007 - 3:42 am

It's not. Not in the cases where it is explicitly allowed and actively
exploited (loads passing stores), but most definitely not distinctly

Yes: it isn't the explicitly allowed reorderings that we care
about here (because obviously we're retaining the barriers for those).
It would be cases of bugs in the CPUs meaning they don't follow the
standard. But how far do you take your mistrust of a CPU? You could
ask gcc to insert locked ops between every load and store operation?

Firstly, while it can be possible to write code to show up reordering,
it is really hard (ie. impossible) to guarantee no reordering happens. For
example, it may have only showed up on SMT+SMP P4 CPUs with some obscure
interactions between threads and cores involving more than 2 threads.

Secondly, even if we were sure that no current implementations reordered
loads, we don't want to go outside the bounds of the specification
because we might break on some future CPUs. This isn't a big performance

Yes, but that's the same way I feel after reading *any* legal "information" ;)

-

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 4:55 am

I'm not sure of your point, but it seems we don't differ here, and

I'm not sure how much this all above is consistent wrt. this earlier

It seems, after testing only (plus no official spec against this idea),
you could be almost sure there is no such test possible. And, if it
were done a few years ago, you think it still should be not enough to
make a decision on changing this smp_rmb because of lack of official
specs? Besides, there is probably so much feature guessing in arch
and drivers sections, this reorder testing should look as solid as a

I don't agree with this - IMO we should care only about currently used

Strange... I feel exactly opposite. Are you sure you've chosen the
right job (...and the right system)?

Jarek P.
-

From: Jarek Poplawski
Date: Friday, October 12, 2007 - 5:10 am

(...plus of course proper smp_rmb & smp_wmb vs. smp_mb interpretation
probably available from Paul McKenney or Davide Libenzi before this
Intel spec, as well...)

Jarek P.
-

From: Linus Torvalds
Date: Friday, October 12, 2007 - 8:13 am

I think the chip manufacturers really wanted to keep their options open.

Having the option to re-order loads in architecturally visible ways was 
something that they probably felt they really wanted to have. On the other 
hand:

 - I bet they had noticed that things break, and some applications depend 
   on fairly strong ordering (not necessarily in Linux-land, but..)

   I suspect hw manufacturers go through life hoping that "software 
   improves". They probably thought that getting rid of the old 16-bit 
   windows would mean that less people depended on undefined behaviour. 

   And I suspect that they started noticing that no, with threads and 
   JVM's and things, *more* people started depending on fairly strong 
   memory ordering.

 - I suspect Intel in particular noticed that they can do a lot of very 
   aggressive re-ordering at a microarchitectural level, but can still 
   guarantee that *architecturally* they never show it (dynamic detection 
   of reordered loads being replayed on cache dirty events etc).

IOW, I suspect that both Intel and AMD noticed that while they had wanted 
to keep their options open, those options weren't really realistic, and 
not something that the market wanted (aggressive use of threading wants 
*stricter* memory ordering, not looser), and they could work well enough 

Quite frankly, even *within* Intel and AMD, there are damn few people who 
understand exactly what the memory ordering requirements and guarantees 
are and historically were for the different CPU's.

I would bet that had you asked a random (but still competent) Intel/AMD 
engineer that wasn't really intimately involved with the actual design of 
the cache protocols and memory pipelines, they would absolutely not have 
been able to tell you how the CPU actually worked.

So no, there's no way a software person could have afforded to say "it 
seems to work on my setup even without the barrier". On a dual-socket 
setup with a shared bus, that says absolutely ...
From: Jarek Poplawski
Date: Monday, October 15, 2007 - 12:44 am

Yes, I still can't believe this, but after some more reading I start
to admit such things can happen in computer "science" too... I've
mentioned a lost performance, but as a matter of fact I've been more
concerned with the problem of truth:

From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Volume 3A:

   "7.2.2 Memory Ordering in P6 and More Recent Processor Families
    ...
    1. Reads can be carried out speculatively and in any order.
    ..."

So, it looks to me like almost the 1-st Commandment. Some people (like
me) did believe this, others tried to check, and it was respected for
years notwithstanding nobody had ever seen such an event.

And then, a few years later, we have this:

From: Intel(R) 64 Architecture Memory Ordering White Paper

    "2 Memory ordering for write-back (WB) memory
     ...
     Intel 64 memory ordering obeys the following principles:
     1. Loads are not reordered with other loads.
     ..."

I know, technically this doesn't have to be a contradiction (for not
WB), but to me it's something like: "OK, Elvis lives and this guy is
not real Paul McCartney too" in an official CIA statement!


I'm still so "dazed and confused" that I can't tell this (or anything)
is right...

Thanks very much for such an extensive and sound explanation,

Jarek P.

PS: Btw, I apologize to Helge for not trusting the "verification by
testing would not be trivial" words.
-

From: Nick Piggin
Date: Monday, October 15, 2007 - 1:09 am

I'd say that's exactly what Intel wanted. It's pretty common (we do
it all the time in the kernel too) to create an API which places a
stronger requirement on the caller than is actually required. It can
make changes much less painful.

Has performance really been much of a problem for you? (even before the
lfence instruction, when you theoretically had to use a locked op)?
I mean, I'd struggle to find a place in the Linux kernel where there
is actually a measurable difference anywhere... and we're pretty
performance critical and I think we have a reasonable amount of lockless
code (I guess we may not have a lot of tight computational loops, though).
I'd be interested to know what, if any, application had found these

The thing is that those documents are not defining what a particular
implementation does, but how the architecture is defined (ie. what
must some arbitrary software/hardware provide and what may it expect).

It's pretty natural that Intel started out with a weaker guarantee
than their CPUs of the time actually supported, and tightened it up
after (presumably) deciding not to implement such relaxed semantics
for the foreseeable future.

-

From: Jarek Poplawski
Date: Monday, October 15, 2007 - 2:10 am

On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:

I'm not performance-words at all, so I can't help you, sorry. But, I
understand people who care about this, and think there is a popular
conviction barriers and locked instructions are costly, so I'm

I'm not sure this is the right way to tell it. If there is no
distinction between what is and what could be, how can I believe in
similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
like they describe some real hardware mechanisms, and sometimes they
mention about possible changes and reserved features too. So, when

As a matter of fact it's not natural for me at all. I expected the
other direction, and I still doubt programmers' intentions could be
"automatically" predicted good enough, so IMHO, it's not for long.
Of course, it doesn't seem to be any help for linux or bsd
programmers, who still have to think about different architectures.

Regards,
Jarek P.
-

From: Jarek Poplawski
Date: Monday, October 15, 2007 - 2:24 am

On Mon, Oct 15, 2007 at 11:09:59AM +0200, Jarek Poplawski wrote:

...performance-wards?!

Looks like serious: I don't even now who I'm not now!

Jarek P.
-

From: Nick Piggin
Date: Monday, October 15, 2007 - 5:50 pm

It's more expensive than nothing, sure. However in real code, algorithmic
complexity, cache misses and cacheline bouncing tend to be much bigger
issues.

I can't think of a place in the kernel where smp_rmb matters _that_ much.
seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
lfence is not going to hurt. Maybe we even gain 0.01% performance in
someone's workload.

Also, remember: if loads are already in-order, then lfence is a noop,
right? (in practice it seems to have to do a _little_ bit of work, but

No. Why are you reading that much into it? I know for a fact that some
non-x86 architectures' actual implementations have stronger ordering than
their ISA allows. It's nothing to do with you "believing" how the hardware

Really? Consider the consequences if, instead of releasing this latest
document tightening consistency, Intel found that out of order loads
were worth 5% more performance and implemented them in their next chip.
The chip could be completely backwards compatible, but all your old code
would break, because it was broken to begin with (because it was outside
the spec).

IMO Intel did exactly the right thing from an engineering perspective,
and so did Linux to always follow the spec.
-
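
Since seqlocks are named above as one of the few places where smp_rmb() sits on a hot path, here is a minimal userspace sketch of a sequence-counter reader and writer in C11. It is an assumption-laden illustration, not the kernel's seqlock: `struct seq_pair`, `seq_pair_write()` and `seq_pair_read()` are hypothetical names, and a single writer is assumed. The acquire and release fences mark where smp_rmb() and smp_wmb() would go; on x86 they typically compile to no instruction at all, which matches smp_rmb() becoming barrier().

```c
#include <assert.h>
#include <stdatomic.h>

struct seq_pair {
    atomic_uint seq;   /* odd while a write is in progress */
    atomic_int a, b;   /* protected data, accessed with relaxed operations */
};

/* Single writer assumed: bump seq to odd, update the data, bump to even. */
static void seq_pair_write(struct seq_pair *p, int a, int b)
{
    assert(p);
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* smp_wmb() in kernel terms */
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader: retry until the data was read under one even, unchanged sequence. */
static void seq_pair_read(struct seq_pair *p, int *a, int *b)
{
    assert(p && a && b);
    unsigned s0, s1;
    do {
        s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
        *a = atomic_load_explicit(&p->a, memory_order_relaxed);
        *b = atomic_load_explicit(&p->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);   /* smp_rmb() */
        s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s0 & 1u) || s0 != s1);
}
```

The reader takes no lock and writes nothing shared, so the cost of its two ordering points is the whole overhead; that is why a cheaper smp_rmb() shows up here if it shows up anywhere.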

From: Jarek Poplawski
Date: Tuesday, October 16, 2007 - 2:00 am

You are right: considering current CPUs there could be no performance
problem at all. Removing LOCKs for older ones should probably matter
more, but as a matter of fact, now I wouldn't bet even on this - it

I have a different opinion on this: I expect any spec to describe the current
implementation. Before issuing new models any changes of
implementation should be made public with proper margin of time. Then
system could be optimally adjusted to a real hardware, instead of
planned only, but possibly never realized (plus doing such not used
things with old means is usually more costly: lock vs. lfence). There
is still the problem of specs' completeness: there are probably often some
things unspecified which could break on a new model, so never 100%

But, if you follow the spec - you don't follow the spec! Why do you
ignore so much this part of Intel's spec:

 "This document contains information which Intel may change at any
  time without notice. Do not finalize a design with this information."

Maybe it's a real Intel intention and not for lawyers only? (Btw, it
seems we have an example.)

Regards,
Jarek P.
-

From: david
Date: Tuesday, October 16, 2007 - 2:14 am

what you don't realize is that Intel (and AMD) have built their business 
on making sure that their new CPU's run existing software with no 
modifications, (and almost always faster than the old versions). remember 
that for most of the world, getting the software modified would mean 
buying a new version, if the vendor bothered to make a different version 
for the new chip.

if they required everyone to buy new software to use a new chip it 
wouldn't work well. In fact Intel tried to do exactly that with the 
itanium and it has been a spectacular failure (or at the very least, not a 

in theory they could change anything at any time, in practice if they 
break old software they won't sell the chips, so the modifications tend to 
be along the lines of this one, adding detail to the specifications so 
that programmers can get more performance.

David Lang
-

From: Jarek Poplawski
Date: Tuesday, October 16, 2007 - 5:49 am

On Tue, Oct 16, 2007 at 02:14:17AM -0700, david@lang.hm wrote:

It's a good point to always consider when you analyze how something
new should work if it's used with older programs too. But with newer
things like SMP or multithreading they probably have more choice, and

The failure of an architecture doesn't mean all specific new
technologies used in itanium were failure too, so they could be back
when needed (and nothing better in reserve) yet.


I don't think 'not breaking' is much of a problem here, but rather how to
use all the new features (which you seem to ignore a bit) to get maximum
performance without breaking older things. Or, as in the current problem:
be rational and remove things that are useless (according to the new specs),
even without a performance gain, or stay 'safe'?

Jarek P.
-

From: David Schwartz
Date: Monday, October 15, 2007 - 7:38 am

When Intel first added speculative loads to the x86 family, they pegged the
speculative load to the cache line. If the cache line is invalidated, so is
the speculative load. As a result, out-of-order reads to normal memory are
invisible to software. If a write to the same memory location on another CPU
would make the fetched value invalid, it will make the cache line invalid,
which invalidates the fetch.

I think it's extremely unlikely that any x86 CPU will do this any
differently. It's hard to imagine Intel and AMD would go to all this trouble
for so long just to stop so late in the line's lifetime.

DS


-
