Hi,

Here's a couple of patches to improve the memory barrier situation on x86.
They probably aren't going upstream until after the x86 merge, however I'm
posting them here for RFC, and in case anybody wants to backport into stable
trees.

---
movnt* instructions are not strongly ordered with respect to other stores,
so if we are to assume stores are strongly ordered in the rest of the x86_64
kernel, we must fence these off (see similar examples in i386 kernel).

[ The AMD memory ordering document seems to say that nontemporal stores can
also pass earlier regular stores, so maybe we need sfences _before_ movnt*
everywhere too? ]

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/arch/x86_64/lib/copy_user_nocache.S
===================================================================
--- linux-2.6.orig/arch/x86_64/lib/copy_user_nocache.S
+++ linux-2.6/arch/x86_64/lib/copy_user_nocache.S
@@ -117,6 +117,7 @@ ENTRY(__copy_user_nocache)
popq %rbx
CFI_ADJUST_CFA_OFFSET -8
CFI_RESTORE rbx
+ sfence
ret
CFI_RESTORE_STATE
-
wmb() on x86 must always include a barrier, because stores can go out of
order in many cases when dealing with devices (eg. WC memory).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -216,6 +216,7 @@ static inline unsigned long get_limit(un
#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
/**
* read_barrier_depends - Flush all pending reads that subsequents reads
@@ -271,18 +272,14 @@ static inline unsigned long get_limit(un
#define read_barrier_depends() do { } while(0)
-#ifdef CONFIG_X86_OOSTORE
-/* Actually there are no OOO store capable CPUs for now that do SSE,
- but make it already an possibility. */
-#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
-#else
-#define wmb() __asm__ __volatile__ ("": : :"memory")
-#endif
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif
#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -159,12 +159,8 @@ static inline void write_cr8(unsigned lo
*/
#define mb() asm volatile("mfence":::"memory")
#define rmb() asm volatile("lfence":::"memory")
-
-#ifdef CONFIG_UNORDERED_IO
#define wmb() asm volatile("sfence" ::: "memory")
-#else
#define wmb() asm volatile("" ::: ...

On Thu, Oct 04, 2007 at 07:22:58AM +0200, Nick Piggin wrote:
> -#ifdef CONFIG_X86_OOSTORE
> -/* Actually there are no OOO store capable CPUs for now that do SSE,
> - but make it already an possibility. */
> -#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
> -#else
> -#define wmb() __asm__ __volatile__ ("": : :"memory")
> -#endif
> -
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> -#define smp_wmb() wmb()
> +#ifdef CONFIG_X86_OOSTORE
> +# define smp_wmb() wmb()
> +#else
> +# define smp_wmb() barrier()
> +#endif

The only vendor that ever implemented OOSTOREs was Centaur, and they
only did in the Winchip generation of the CPUs. When they dropped it
from the C3, I asked whether they intended to bring it back, and the
answer was "extremely unlikely".

So we can probably just drop that "just in case" clause above, and just do..

#define smp_wmb() barrier()

Dave

--
http://www.codemonkey.org.uk
-
Do you know if it made a big performance difference? But yes we should probably just remove this special case to make maintenance easier. -Andi -
On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
>
> > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > only did in the Winchip generation of the CPUs. When they dropped it
> > from the C3, I asked whether they intended to bring it back, and the
> > answer was "extremely unlikely".
>
> Do you know if it made a big performance difference?

On the winchip, it was a huge win. I can't remember exact numbers,
but pretty much every benchmark I threw at it at the time showed
significant improvement.

> But yes we should probably just remove this special case to make
> maintenance easier.

It's CONFIG_SMP anyway, which none of the winchips were.
SMP+OOSTORE just didn't happen, and I'd be surprised if any vendor
makes it happen any time soon. (Even if so, it's likely we'd need to
make additional changes anyway, so adding it back shouldn't be a big deal.)

Dave

--
http://www.codemonkey.org.uk
-
It's not. And we need memory barriers even without SMP when talking to device drivers. Only the smp_*b()s get noped on UP. -Andi -
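The distinction Andi draws here can be sketched in userspace C. This is only an illustration under stated assumptions: `barrier()` and `wmb()` below are C11 stand-ins rather than the kernel's inline-asm macros, and `struct fake_nic` and `post_descriptor()` are hypothetical names invented for the example. The point it tries to show is that a driver ordering two writes to a device uses the mandatory `wmb()`, which stays a real fence on UP, while `smp_wmb()` may collapse to a compiler barrier.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-ins (assumptions): the kernel uses inline asm here. */
#define barrier() atomic_signal_fence(memory_order_seq_cst) /* compiler only */
#define wmb()     atomic_thread_fence(memory_order_seq_cst) /* at least as strong as sfence */

/* SMP-only variant, shaped like the patched i386 header quoted above. */
#if defined(CONFIG_SMP) && defined(CONFIG_X86_OOSTORE)
# define smp_wmb() wmb()
#else
# define smp_wmb() barrier()
#endif

/* Hypothetical device: a descriptor slot plus a doorbell register. */
struct fake_nic {
    volatile uint32_t desc;
    volatile uint32_t doorbell;
};

/*
 * The device must see the descriptor before the doorbell. That ordering
 * matters even on a uniprocessor, so the mandatory wmb() is used here
 * and not smp_wmb().
 */
void post_descriptor(struct fake_nic *nic, uint32_t desc)
{
    nic->desc = desc;
    wmb();
    nic->doorbell = 1;
}
```

Using `smp_wmb()` in `post_descriptor()` would compile to nothing more than a compiler barrier in most configurations, which is the failure mode the WC-memory remark in the patch description is about.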
On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > >
> > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > from the C3, I asked whether they intended to bring it back, and the
> > > > answer was "extremely unlikely".
> > >
> > > Do you know if it made a big performance difference?
> >
> > On the winchip, it was a huge win. I can't remember exact numbers,
> > but pretty much every benchmark I threw at it at the time showed
> > significant improvement.
>
> Significant as in >10%?

"Worth about 10-20% performance" according to the 2.4.18pre9-ac4
release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN

> > > But yes we should probably just remove this special case to make
> > > maintenance easier.
> > It's CONFIG_SMP anyway, which none of the winchips were.
>
> It's not.

You're right it isn't now, but Nick's patch seems to change it so that it is.

...
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif

> And we need memory barriers even without SMP
> when talking to device drivers. Only the smp_*b()s get noped
> on UP.

Good point.

Dave

--
http://www.codemonkey.org.uk
-
That is only for smp_wmb() which are always SMP only -Andi -
On Thu, Oct 04, 2007 at 08:58:27PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:41:07 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> > > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > > > >
> > > > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > > > from the C3, I asked whether they intended to bring it back, and the
> > > > > > answer was "extremely unlikely".
> > > > >
> > > > > Do you know if it made a big performance difference?
> > > >
> > > > On the winchip, it was a huge win. I can't remember exact numbers,
> > > > but pretty much every benchmark I threw at it at the time showed
> > > > significant improvement.
> > >
> > > Significant as in >10%?
> >
> > "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
> > release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN
>
> Are there numbers for a newer kernel available too?

no idea, my winchips died about 5 years ago.

Dave

--
http://www.codemonkey.org.uk
-
Got a couple here just need a mainboard 8) -
According to latest memory ordering specification documents from Intel and
AMD, both manufacturers are committed to in-order loads from cacheable memory
for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(eg. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Special string stores are safe -- their
constituent stores may be out of order, but they must complete in order WRT
surrounding stores. Nontemporal stores to WB memory can go out of order, and so
they should be fenced explicitly to make them appear in-order WRT other stores.
Hence, smp_wmb() may be a simple barrier.

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

In userspace microbenchmarks on a core2 system, fence instructions range
anywhere from around 15 cycles to 50, which may not be totally insignificant
in performance critical paths (code size will go down too).

However the primary motivation for this is to have the canonical barrier
implementation for x86 architecture.

smp_rmb on buggy pentium pros remains a locked op, which is apparently
required.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -274,7 +274,11 @@ static inline unsigned long get_limit(un
#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() rmb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb() rmb()
+#else
+# define smp_rmb() barrier()
+#endif
#ifdef CONFIG_X86_OOSTORE
# define smp_wmb() wmb()
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ ...
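
To make the claim in this patch description concrete, here is a rough kernel-style sketch of the publish/consume pattern that smp_wmb() and smp_rmb() protect. It is not taken from the patch: `payload`, `ready`, `publish()` and `try_consume()` are hypothetical names, and `barrier()` is a C11 stand-in for the kernel's compiler barrier. It assumes an x86 CPU with the ordering described in the documents cited above (ordinary cacheable stores stay in order, loads stay in order); on a weakly ordered architecture, or with nontemporal stores, a compiler barrier alone would not be enough.

```c
#include <stdatomic.h>

/* Compiler-only barrier; stand-in for the kernel's barrier() macro. */
#define barrier() atomic_signal_fence(memory_order_seq_cst)

/* What the patch proposes for x86 without PPRO_FENCE / OOSTORE. */
#define smp_wmb() barrier()
#define smp_rmb() barrier()

/* Hypothetical shared state; volatile keeps the accesses in the code. */
static volatile int payload;
static volatile int ready;

/* Producer: the payload store must be visible before the flag store. */
void publish(int value)
{
    payload = value;
    smp_wmb();   /* only stops the compiler from reordering the stores */
    ready = 1;
}

/* Consumer: the flag load must be ordered before the payload load. */
int try_consume(int *out)
{
    if (!ready)
        return 0;
    smp_rmb();   /* only stops the compiler from reordering the loads */
    *out = payload;
    return 1;
}
```

With both macros reduced to `barrier()`, the hardware ordering is what makes the pattern work, which is why the patch leans on the vendor documents and not on testing.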
... Great news!

First it looks like a really great thing that it's revealed at last.
But then... there is probably some confusion: did we have to use
ineffective code for so long?

First again, we could try to blame Intel etc. But then, wait a minute:
is it such a mystery knowledge? If this reordering is done there are
some easy rules broken (just like in examples from these manuals). And
if somebody cared to do this for optimization, then this is probably
noticeable optimization, let's say 5 or 10%. Then any test shouldn't
need to take very long to tell the truth in less than 100 loops!

So, maybe linux needs something like this, instead of waiting a few
years with each new model for vendors' goodwill? IMHO, even for less
popular processors, this could be checked under some debugging option
at the system start (after disabling suspicious barrier for a while
plus some WARN_ONs).

Thanks,
Jarek P.
-
You could have tried the optimization before, and gotten better
performance. But without solid knowledge that the optimization is
_valid_, you risk having a kernel that performs great but suffers the
occasional glitch and therefore is unstable and crashes the machine
"now and then".

This sort of thing can't really be figured out by experimentation,
because the bad cases might happen only with some processors, some
combinations of memory/chipsets, or with some minimum number of
processors. Such problems can be very hard to find, especially
considering that other plain bugs also cause crashes.

Therefore, the "ineffective code" was used because it was the only
safe alternative. Now we know, so now we may optimize.

Helge Hafting
-
Sorry, I don't understand this logic at all. Since bad cases happen independently from any specifications and Intel doesn't take any legal responsibility for such information, it seems we should better still not optimize? Jarek P. -
We already do in probably more critical and liable to be problematic cases (notably, spin_unlock). So unless there is reasonable information for us to believe this will be a problem, IMO the best thing to do is stick with the specs. Intel is pretty reasonable with documenting errata I think. With memory barriers specifically, I'm sure we have many more bugs in the kernel than AMD or Intel have in their chips ;) -
On Fri, Oct 12, 2007 at 11:44:27AM +0200, Nick Piggin wrote: 100% right - if there are any specs. But it seems for a few years this spec was missing or there is some change of mind, I presume? Jarek P. -
The point is that we _trust_ intel when they say "this will work".
Therefore, we can use the optimizations. It was never about legal
matters. If we didn't trust intel, then we couldn't use their
processors at all.

We couldn't take the chance before. It was not documented to work,
and verification by testing would not be trivial at all for this case.
Linux is about "stability first, then performance". Now we _know_ that
we can have this optimization without compromising stability. Nobody
knew before!

Helge Hafting
-
On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote: But there was nothing about trust. Usually you don't trust somebody but somebody's opinions. The problem is there was no valid opinion, So, you think this would be the first or the least credibly verified undocumented feature used in linux? Then, it seems I can try to install this linux on my laptop at last! (... And, I can trust you, it will not break anything...?) Thanks, Jarek P. -
"Trusting people or their opinions" is only about use of the english language, and not that intersting to bring up here. Surely you know that lots of people here have english as a secondary language only. Intersting for me to know, but I never claimed that linux will work on your laptop, so no: You can't take my word for that, because I never gave it! It is well known that some laptops don't work with linux, I have no idea if yours will work, I don't even know what kind it is. I told you the reasoning behind using _this particular optimization_, the same does _not_ apply to everything else. If you think every kernel decision is made the same way, then you are mistaken. Things don't work that way. First, several people are involved - they think differently. Second, "what kind of tricks to use" is not an all-or-nothing approach. If linux were to use every undocumented trick that might or might not work, then linux would fail on lots of hardware. It would not be useful. If linux took the other approach and never used any "tricks", then it'd be slow and boring. Some things are much easier to test - you construct a testcase or just build a test kernel and benchmark it. If all is ok, then the "trick" is useable. Some cases are a clear win for lots of machines, and the possible failure cases involves very rare hardware. So it might get used. Some tricks have a failure mode that is rare but completely obvious when it happens. So it gets used, and "troublesome hardware" is added to a blacklist as needed. Some "tricks" however, are hard to figure out without docs. There may be no good way to test. The tricks may cause instability that will be very hard to track down, and this could happen on a wide range of hardware. So such don't get used, until adequate documentation appear. In this case, it seems like intel, who make and design the processors in question and therefore know them well enough, provided such documentation. That makes a previously dubious optimization ...
Of course, I know this problem: sometimes it's very hard to make people
believe it's my secondary language! But this time I didn't see any
language problem. I simply pointed out that sometimes trusting could be
OK, this was supposed to be a joke... (Btw, can you remember burning
linux laptops?) I thought this "stability first" a bit funny, but this
was a really bad joke, sorry.

Thanks for these additional explanations - you are completely right!

Regards,
Jarek P.
-
I'm not sure exactly what the situation is with the manufacturers,
but maybe they (at least Intel) wanted to keep their options open
WRT their barrier semantics, even if current implementations were

I don't know quite what you're saying... the CPUs could probably get
performance by having weakly ordered loads, OTOH I think the Intel
ones might already do this speculatively so they appear in order but
essentially have the performance of weak order.

If you're just talking about this patch, then it probably isn't much
performance gain. I'm guessing you'd be lucky to measure it from

I don't know if that would be worthwhile. It actually isn't always
trivial to trigger reordering. For example, on my dual-core core2, in
order to see reads pass writes, I have to do work on a set that exceeds
the cache size and does a huge amount of work to ensure it is going to
trigger that. If you can actually come up with a test case that
triggers load/load or store/store reordering, I'm sure Intel / AMD
would like to see it ;)

All existing processors as far as we know are in-order WRT loads vs
loads and stores vs stores. It was just a matter of getting the docs
clarified, which gives us more confidence that we're correct and a
reasonable guarantee of forward compatibility.

So, I think the plan is just to merge these 3 patches during the
current window.
-
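
For readers who want to see what a "reads pass writes" test looks like, here is a minimal store-buffering litmus sketch in userspace C. It is not the test described in this message: the names `count_both_zero()`, `thread0()` and `thread1()` are hypothetical, it uses C11 atomics and POSIX threads, and it spawns fresh threads per iteration for simplicity, which makes the unfenced reordering rare to observe (consistent with the point above that triggering it takes real effort). What it does show is the shape of the check: each thread stores to one variable and then loads the other, and the interesting outcome is both loads returning 0.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;   /* shared locations, reset every iteration */
static int r0, r1;        /* what each thread observed */
static int use_fence;     /* 1: full fence between the store and the load */

static void *thread0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);   /* mfence-like */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/*
 * Runs the litmus test `iterations` times and returns how often both
 * threads read 0, i.e. how often a load visibly passed the earlier
 * store. Returns -1 if a thread could not be created.
 */
int count_both_zero(int iterations, int fenced)
{
    int hits = 0;

    use_fence = fenced;
    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        if (pthread_create(&a, NULL, thread0, NULL) != 0)
            return -1;
        if (pthread_create(&b, NULL, thread1, NULL) != 0) {
            pthread_join(a, NULL);
            return -1;
        }
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            hits++;
    }
    return hits;
}
```

With `fenced` set to 1 the both-zero outcome is forbidden, so the count stays at 0. With `fenced` set to 0 a nonzero count is allowed on x86 but may well never show up in a short run, and a count of 0 proves nothing about the hardware, which is the argument being made in this thread.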
I meant: if there is any reordering possible this should be quite
distinctly visible, because why would any vendor enable such nasty
things if not for performance. But now I start to doubt: of course
there is such a possibility someone makes this reordering for some
other reasons which could be so rare it's hard to check. And this
someone knows its processors are seen less efficient because of eg.

No, it's only about the comment to this patch: "Hence, smp_rmb() may be

Anyway, it seems any heavy testing such as yours, should give us the
same information years earlier than any vendors manual and then any
gain is multiplied by millions of users. Then only still doubtful
cases could be treated with additional caution and some debugging

After reading this Intel's legal information I don't think you should

And they really should be!

Jarek P.
-
It's not. Not in the cases where it is explicitly allowed and actively exploited (loads passing stores), but most definitely not distinctly Yes: it isn't the explicitly allowed reorderings that we care about here (because obviously we're retaining the barriers for those). It would be cases of bugs in the CPUs meaning they don't follow the standard. But how far do you take your mistrust of a CPU? You could ask gcc to insert locked ops between every load and store operation? Firstly, while it can be possible to write a code to show up reordering, it is really hard (ie. impossible) to guarantee no reordering happens. For example, it may have only showed up on SMT+SMP P4 CPUs with some obscure interactions between threads and cores involving more than 2 threads. Secondly, even if we were sure that no current implementations reordered loads, we don't want to go outside the bounds of the specification because we might break on some future CPUs. This isn't a big performance Yes, but that's the same way I feel after reading *any* legal "information" ;) -
I'm not sure of your point, but it seems we don't differ here, and I'm not sure how much this all above is consistent wrt. this earlier It seems, after testing only (plus no official spec against this idea), you could be almost sure there is no such test possible. And, if it were done a few years ago, you think it still should be not enough to make a decision on changing this smp_rmb because of lack of official specs? Besides, there is probably so much features guessing in arch and drivers sections, this reorder testing should look as solid as a I don't agree with this - IMO we should care only about currently used Strange... I feel exactly opposite. Are you sure you've chosen the right job (...and the right system)? Jarek P. -
(...plus of course proper smp_rmb & smp_wmb vs. smp_mb interpretation probably available from Paul McKenney or Davide Libenzi before this Intel spec, as well...) Jarek P. -
I think the chip manufacturers really wanted to keep their options open.
Having the option to re-order loads in architecturally visible ways was
something that they probably felt they really wanted to have. On the
other hand:

 - I bet they had noticed that things break, and some applications depend
   on fairly strong ordering (not necessarily in Linux-land, but..)

   I suspect hw manufacturers go through life hoping that "software
   improves". They probably thought that getting rid of the old 16-bit
   windows would mean that less people depended on undefined behaviour.

   And I suspect that they started noticing that no, with threads and
   JVM's and things, *more* people started depending on fairly strong
   memory ordering.

 - I suspect Intel in particular noticed that they can do a lot of very
   aggressive re-ordering at a microarchitectural level, but can still
   guarantee that *architecturally* they never show it (dynamic detection
   of reordered loads being replayed on cache dirty events etc).

IOW, I suspect that both Intel and AMD noticed that while they had wanted
to keep their options open, those options weren't really realistic, and
not something that the market wanted (aggressive use of threading wants
*stricter* memory ordering, not looser), and they could work well enough

Quite frankly, even *within* Intel and AMD, there are damn few people who
understand exactly what the memory ordering requirements and guarantees
are and historically were for the different CPU's.

I would bet that had you asked a random (but still competent) Intel/AMD
engineer that wasn't really intimately involved with the actual design of
the cache protocols and memory pipelines, they would absolutely not have
been able to tell you how the CPU actually worked.

So no, there's no way a software person could have afforded to say "it
seems to work on my setup even without the barrier". On a dual-socket
setup with a shared bus, that says absolutely ...
Yes, I still can't believe this, but after some more reading I start
to admit such things can happen in computer "science" too... I've
mentioned a lost performance, but as a matter of fact I've been more
concerned with the problem of truth:
From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Volume 3A:
"7.2.2 Memory Ordering in P6 and More Recent Processor Families
...
1. Reads can be carried out speculatively and in any order.
..."
So, it looks to me like almost the 1st Commandment. Some people (like
me) did believe this, others tried to check, and it was respected for
years notwithstanding nobody had ever seen such an event.
And then, a few years later, we have this:
From: Intel(R) 64 Architecture Memory Ordering White Paper
"2 Memory ordering for write-back (WB) memory
...
Intel 64 memory ordering obeys the following principles:
1. Loads are not reordered with other loads.
..."
I know, technically this doesn't have to be a contradiction (for not
WB), but to me it's something like: "OK, Elvis lives and this guy is
not real Paul McCartney too" in an official CIA statement!
I'm still so "dazed and confused" that I can't tell this (or anything)
is right...
Thanks very much for such an extensive and sound explanation,
Jarek P.
PS: Btw, I apologize to Helge for not trusting her "verification by
testing would not be trivial" words.
-
I'd say that's exactly what Intel wanted. It's pretty common (we do it
all the time in the kernel too) to create an API which places a stronger
requirement on the caller than is actually required. It can make changes
much less painful.

Has performance really been much problem for you? (even before the
lfence instruction, when you theoretically had to use a locked op)?
I mean, I'd struggle to find a place in the Linux kernel where there
is actually a measurable difference anywhere... and we're pretty
performance critical and I think we have a reasonable amount of
lockless code (I guess we may not have a lot of tight computational
loops, though). I'd be interested to know what, if any, application
had found these

The thing is that those documents are not defining what a particular
implementation does, but how the architecture is defined (ie. what
must some arbitrary software/hardware provide and what may it expect).

It's pretty natural that Intel started out with a weaker guarantee
than their CPUs of the time actually supported, and tightened it up
after (presumably) deciding not to implement such relaxed semantics
for the foreseeable future.
-
On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote: I'm not performance-words at all, so I can't help you, sorry. But, I understand people who care about this, and think there is a popular conviction barriers and locked instructions are costly, so I'm I'm not sure this is the right way to tell it. If there is no distinction between what is and what could be, how can I believe in similar Alpha or Itanium stuff? IMHO, these manuals sometimes look like they describe some real hardware mechanisms, and sometimes they mention about possible changes and reserved features too. So, when As a matter of fact it's not natural for me at all. I expected the other direction, and I still doubt programmers' intentions could be "automatically" predicted good enough, so IMHO, it's not for long. Of course, it doesn't seem to be any help for linux or bsd programmers, which still have to think about different architectures. Regards, Jarek P. -
On Mon, Oct 15, 2007 at 11:09:59AM +0200, Jarek Poplawski wrote: ...performance-wards?! Looks like serious: I don't even now who I'm not now! Jarek P. -
It's more expensive than nothing, sure. However in real code, algorithmic complexity, cache misses and cacheline bouncing tend to be much bigger issues. I can't think of a place in the kernel where smp_rmb matters _that_ much. seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the lfence is not going to hurt. Maybe we even gain 0.01% performance in someone's workload. Also, remember: if loads are already in-order, then lfence is a noop, right? (in practice it seems to have to do a _little_ bit of work, but No. Why are you reading that much into it? I know for a fact that some non-x86 architectures actual implementations have stronger ordering than their ISA allows. It's nothing to do with you "believing" how the hardware Really? Consider the consequences if, instead of releasing this latest document tightening consistency, Intel found that out of order loads were worth 5% more performance and implemented them in their next chip. The chip could be completely backwards compatible, but all your old code would break, because it was broken to begin with (because it was outside the spec). IMO Intel did exactly the right thing from an engineering perspective, and so did Linux to always follow the spec. -
You are right: considering current CPUs there could be no performance
problem at all. Removing LOCKs for older ones should probably matter
more, but as a matter of fact, now I wouldn't bet even on this - it

I've different opinion on this: I expect any spec to describe current
implementation. Before issuing new models any changes of
implementation should be made public with proper margin of time.
Then system could be optimally adjusted to a real hardware, instead
of planned only, but possibly never realized (plus doing such not
used things with old means is usually more costly: lock vs. lfence).
There is still problem of specs' completeness: there are probably
often some things unspecified which could break on a new model, so
never 100%

But, if you follow the spec - you don't follow the spec! Why do you
ignore so much this part of Intel's spec:

"This document contains information which Intel may change at any
time without notice. Do not finalize a design with this information."

Maybe it's a real Intel intention and not for lawyers only?
(Btw, it seems we have an example.)

Regards,
Jarek P.
-
what you don't realize is that Intel (and AMD) have built their business
on making sure that their new CPU's run existing software with no
modifications, (and almost always faster than the old versions).

remember that for most of the world, getting the software modified would
mean buying a new version, if the vendor bothered to make a different
version for the new chip.

if they required everyone to buy new software to use a new chip it
wouldn't work well. In fact Intel tried to do exactly that with the
itanium and it has been a spectacular failure (or at the very least, not a

in theory they could change anything at any time, in practice if they
break old software they won't sell the chips, so the modifications tend
to be along the lines of this one, adding detail to the specifications
so that programmers can get more performance.

David Lang
-
On Tue, Oct 16, 2007 at 02:14:17AM -0700, david@lang.hm wrote:

It's a good point to always consider when you analyze how something
new should work if it's used with older programs too. But with newer
things like SMP or multithreading they probably have more choice, and

The failure of an architecture doesn't mean all specific new
technologies used in itanium were failure too, so they could be back
when needed (and nothing better in reserve) yet.

I don't think 'not breaking' is much problem here, rather how to use
all new features (which you seem to ignore a bit) to get maximum of
performance without breaking older things. Or, like current problem:
go rational and remove useless (according to new specs) things, even
without performance gain, or stay 'safe'?

Jarek P.
-
When Intel first added speculative loads to the x86 family, they pegged the speculative load to the cache line. If the cache line is invalidated, so is the speculative load. As a result, out-of-order reads to normal memory are invisible to software. If a write to the same memory location on another CPU would make the fetched value invalid, it will make the cache line invalid, which invalidates the fetch. I think it's extremely unlikely that any x86 CPU will do this any differently. It's hard to imagine Intel and AMD would go to all this trouble for so long just to stop so late in the line's lifetime. DS -
