Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__

Previous thread: Re: Linus 2.6.23-rc1 by Bob Picco on Monday, July 23, 2007 - 8:47 am. (2 messages)

Next thread: Plan 9 Resource Sharing Support - what is it? by Pavel Machek on Monday, July 23, 2007 - 12:54 am. (4 messages)
From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

Hi,

There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
that was unnecessary and not required for the correctness of those APIs.
All that superfluous stuff was also unnecessarily disallowing compiler
optimization possibilities, and making gcc generate code that wasn't as
beautiful as it could otherwise have been. Hence the following series
of cleanups (some trivial, some surprising, in no particular order):

* Bogus extended asm constraints (such as specifying immediate-value
  constraint-modifiers to operands constrained to local registers)
* Obsolete / inapplicable-to-x86 / incorrect comments
* Marking "memory" as clobbered for no good reason
* Volatile-casting of memory addresses
  (wholly unnecessary, makes gcc generate bad code)
* Unwarranted use of __asm__ __volatile__ even when those semantics
  are not required
* Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
  (again, this was like *asking* gcc to generate bad code)


These patches fix no bugs, as there were none (or else we'd have known
long before, obviously). This is just an attempt to clean up / bring down
the bogosity level of that file by several microlenats :-)

Almost all the above are also applicable to include/asm-x86_64/bitops.h
so I will send a patch for that also, later. I have zero knowledge of
other arch's so cannot audit them, sorry.


My testbox boots/works fine with all these patches (uptime half an hour)
and the compressed bzImage is smaller by about ~2 KB for my .config --
don't know about savings in modules. But considering these primitives get
inlined from several places in the tree, I'd expect good savings for
allyesconfig or allmodconfig kind of setups. We've also probably lost
a few cycles from kernel codepaths that use these functions, but I haven't
verified that experimentally.


Satyam
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:06 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[7/8] i386: bitops: Kill needless usage of __asm__ __volatile__

Another oddity I noticed in this file. The semantics of __volatile__
when used to qualify inline __asm__ are that the compiler will not
(1) elid, or, (2) reorder, or, (3) intersperse, our inline asm with
the rest of the generated code.

However, we do not want these guarantees in the unlocked variants of the
bitops functions. Also, note that test_bit() is *always* an unlocked
operation on i386 -- the i386 instruction set disallows the use of the LOCK
prefix with the "btl" instruction anyway, and the CPU will throw an invalid
opcode exception otherwise. Hence, every caller of variable_test_bit() must
have all the required locking implemented at a higher-level anyway and this
operation would necessarily be guarantee-less.

So let's remove the __volatile__ qualification of the inline asm in the
variable_test_bit() function also -- and let's give the compiler a chance
to optimize / combine multiple bitops, if possible.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index f37b8a2..4f1fda5 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -96,7 +96,7 @@ static inline void clear_bit(int nr, unsigned long *addr)
  */
 static inline void __clear_bit(int nr, unsigned long *addr)
 {
-	__asm__ __volatile__(
+	__asm__(
 		"btrl %1,%0"
 		:"=m" (*addr)
 		:"r" (nr));
@@ -123,7 +123,7 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  */
 static inline void __change_bit(int nr, unsigned long *addr)
 {
-	__asm__ __volatile__(
+	__asm__(
 		"btcl %1,%0"
 		:"=m" (*addr)
 		:"r" (nr));
@@ -251,7 +251,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
 ...
From: Andi Kleen
Date: Monday, July 23, 2007 - 9:18 am

I thought so too and did a similar transformation while moving
some string functions out of line. After that recent misadventure
I would be very very careful with this.

Overall I'm sorry to say, but the risk:gain ratio of this
patch is imho totally out of whack.

It took a long time to get bitops.h correct and as far as we know
it is compiled correctly currently. You risk all at for very dubious
improvements.

-Andi



-

From: Andi Kleen
Date: Monday, July 23, 2007 - 9:22 am

BTW if you want to optimize inline asm code a bit -- find_first_bit / 
find_first_zero_bit / for_each_cpu could really benefit from a optimized version 
for cpumask_t sized bitmaps. That would  save a lot of cycles in some of the 
hotter paths of the kernel like the scheduler.

-Andi
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:32 am

Hi,


Ah, ok, so this must be the case we'd stumbled upon recently on the

The patch does look correct to me, we're only killing the use of
__volatile__ from places that don't require it (the guarantee-less
variants). Without losing it, I really don't see how the compiler
can ever combine multiple bitops, which does sound beneficial when
the caller has already implemented higher-level locking around
the usage of these operations.

Satyam
-

From: Jeremy Fitzhardinge
Date: Monday, July 23, 2007 - 9:23 am

"asm volatile" does not mean that at all.  It only guarantees (1), and
only then if the asm is ever reachable.

    J
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:43 am

Hi Jeremy,





<quote>

Similarly, you can't expect a sequence of volatile asm instructions to
remain perfectly consecutive. If you want consecutive output, use a
single asm. Also GCC will perform some optimizations across a volatile
asm instruction, GCC does not "forget everything" when it encounters a
volatile asm instruction the way some other compilers do.

</quote>

I'm reading "Similarly, you can't expect a sequence of volatile asm
instructions to remain perfectly consecutive" to mean they're talking
about something like:

asm volatile(...);
asm volatile(...);
asm volatile(...);

But "If you want consecutive output, use a single asm" probably means:

asm volatile(... multiple instructions here ...);

would actually ensure the code written in there would not be
interspersed ... at least that's how I read it.

[ BTW "Also GCC will perform some optimizations across a volatile
asm instruction ..." is exactly the kind of optimizations we must

Yup, of course.

Satyam
-

From: Jeremy Fitzhardinge
Date: Monday, July 23, 2007 - 10:39 am

I'm not quite sure what your point is.  The paragraph you quoted is
pretty explicit in saying that volatile doesn't prevent an "asm
volatile" from being interspersed with other code, and the example just
before that is explicit in talking about how to use dependencies to
control the ordering of asm volatiles with respect to surrounding code. 
In fact nothing in that section precludes asm volatiles from being
reordered with respect to each other either; you just have to make sure
your dependency constraints are all correct.

    J
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 11:07 am

The (3) as I had originally written / meant was that multiple
instructions in a volatile asm would not get _individually_
interspersed with the rest of the code i.e. be emitted out
_consecutively_. I don't think we need any such guarantees for
the non-atomic variants of those operations, so it's good to
let the compiler have a free hand with what it wants to do,
and optimize/combine multiple bitops as necessary / possible,
which was the original intention.

Satyam
-

From: Jeremy Fitzhardinge
Date: Monday, July 23, 2007 - 11:28 am

No, a single asm statement is always emitted in one piece.  Gcc doesn't
parse the string other than to do %-substitution to insert arguments, so
it has no way to meaningfully split it up.

    J
-

From: Trent Piepho
Date: Monday, July 23, 2007 - 1:29 pm

gcc also tries to count the number of instructions, to guess how large in
bytes the asm block is, as it could make a difference for near vs short
jumps, etc.

I wonder it it also affects the instruction count the inline heuristics
use?
-

From: Jeremy Fitzhardinge
Date: Monday, July 23, 2007 - 1:40 pm

How does it do that?  By looking for \n, ';', etc?

    J
-

From: Trent Piepho
Date: Monday, July 23, 2007 - 2:06 pm

Yes:

    Some targets require that GCC track the size of each instruction used in
    order to generate correct code.  Because the final length of an `asm' is
    only known by the assembler, GCC must make an estimate as to how big it
    will be.  The estimate is formed by counting the number of statements in
    the pattern of the `asm' and multiplying that by the length of the longest
    instruction on that processor.  Statements in the `asm' are identified by
    newline characters and whatever statement separator characters are
    supported by the assembler; on most processors this is the ``;''
    character.
-

From: Andi Kleen
Date: Monday, July 23, 2007 - 2:30 pm

Are you sure? I doubt it. It would need a full asm parser to do this
properly and then even it could be wrong 
(e.g. when the sections are switched like Linux does extensively)  

gcc doesn't have such a parser.

Also on x86 gcc doesn't need to care about long/short anyways because
the assembler takes care of this and the other instructions who cared
about this (loop) isn't generated anymore.

You're probably confusing it with some other compiler, who sometimes
do this. e.g. the Microsoft inline asm syntax requires assembler parsing

AFAIK it counts like one operand.

-Andi
-

From: Nicholas Miell
Date: Monday, July 23, 2007 - 2:48 pm

GCC counts newlines and semicolons and uses that number as the likely
instruction count.

See asm_insn_count() in gcc/gcc/final.c

-- 
Nicholas Miell <nmiell@comcast.net>

-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[5/8] i386: bitops: Contain warnings fallout from the death of volatiles

The wrappers below included from all over tree re-used "volatile" just
because the bitops used them. With them killed, almost every file ends
up crying about:

warning: passing argument 2 of 'set_bit' discards qualifiers from
pointer target type

Silence these bogus warnings by killing volatile casts a second time.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

[ Cscope tells us there are other such wrappers that didn't show up
  in my test .config -- I'll update them shortly.

  Also, I should probably merge this patch with the previous one.
  Otherwise git-bisecters who hit the window between these two
  patches will be flooded with bogus warnings and would probably
  want to kill me :-) ]

 include/linux/cpumask.h  |    4 ++--
 include/linux/nodemask.h |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 23f5514..49f6ed4 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -89,13 +89,13 @@ typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
 extern cpumask_t _unused_cpumask_arg_;
 
 #define cpu_set(cpu, dst) __cpu_set((cpu), &(dst))
-static inline void __cpu_set(int cpu, volatile cpumask_t *dstp)
+static inline void __cpu_set(int cpu, cpumask_t *dstp)
 {
 	set_bit(cpu, dstp->bits);
 }
 
 #define cpu_clear(cpu, dst) __cpu_clear((cpu), &(dst))
-static inline void __cpu_clear(int cpu, volatile cpumask_t *dstp)
+static inline void __cpu_clear(int cpu, cpumask_t *dstp)
 {
 	clear_bit(cpu, dstp->bits);
 }
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 52c54a5..81ba056 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -89,13 +89,13 @@ typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } ...
From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[1/8] i386: bitops: Update/correct comments

Just trying to standardize the look of comments for various functions of the
bitops API, removed some trailing whitespace here and there, give different
kernel-doc description to the atomic functions and their corresponding
unlocked variants, remove/explicitly mention what is inapplicable/applicable
to i386, add kernel-doc comments to functions that lacked them already, and
other janitorial work. Only comments touched in this patch.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |  123 +++++++++++++++++++++++++++++++-------------
 1 files changed, 86 insertions(+), 37 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index a20fe98..ba8e4bb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -10,10 +10,20 @@
 
 /*
  * These have to be done with inline assembly: that way the bit-setting
- * is guaranteed to be atomic. All bit operations return 0 if the bit
- * was cleared before the operation and != 0 if it was not.
+ * is guaranteed to be atomic. All bit test operations return an int type:
+ * 0 if the bit was cleared before the operation and != 0 if it was not.
  *
- * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
+ * However, the operand itself is always of unsigned long type, so care
+ * must be taken to ensure that the int type return of bit test operations
+ * for the != 0 case does not get truncate the '1' bits. This is not an
+ * issue on i386 where sizeof(int) == sizeof(unsigned long), but we avoid
+ * this pitfall anyway by returning with all 1's or LSB set in this case.
+ *
+ * bit 0 is the LSB of addr; bit 31 is the MSB.
+ *
+ * But note that all functions that take a bit-number argument allow it
+ * to be arbitrarily large; these operations are not restricted to acting
+ * on ...
From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

The goal is to let gcc generate good, beautiful, optimized code.

But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
and test_and_change_bit unnecessarily mark all of memory as clobbered,
thereby preventing gcc from doing perfectly valid optimizations.

The case of __test_and_change_bit() is particularly surprising, given
that it's a variant where we don't make any guarantees at all.

Even for the other three cases, the (only) instruction that accesses
shared memory is btsl/btrl/btcl and requires locking and atomicity.
But we handle that properly already by the use of the lock prefix.
Also, "=m" is already specified in the output operands of the asm
for the passed bit-string pointer, so the gcc optimizer knows what
to do and what not to do anyway.

So there's no point clobbering all of memory in these functions.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 0f5634b..f37b8a2 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -164,7 +164,7 @@ static inline int test_and_set_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -208,7 +208,7 @@ static inline int test_and_clear_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__(
 		"btcl %2,%1\n\tsbbl %0,%0"
 ...
From: Andi Kleen
Date: Monday, July 23, 2007 - 9:13 am

The first goal is correct code.

The reason for the memory barrier is to prevent other memory references
from being moved over the atomic reference.

e.g. when a bit is used to communicate with another CPU this might be dangerous.

I don't think it's a good idea to remove them. It'll likely introduce subtle
bugs. Atomic bitops tend to be so slow anyways that it doesn't make much
difference.

-Andi

-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:26 am

Hi Andi,




Yes, but _that_ address (of the bit-string) is protected already -- by the
implicit memory barrier due to the LOCK prefix. We shouldn't really be
caring about any other memory addresses, so it doesn't affect the
correctness of the bitops API at all.

[ and at least the __test_and_change_bit() case is definitely okay. ]

Satyam
-

From: Andi Kleen
Date: Monday, July 23, 2007 - 9:33 am

Compiler barrier != CPU barrier.

The memory clobber is a compiler barrier that prevents its global optimizer
from moving memory references. The CPU memory ordering guarantees are completely 

The problem is the relationship to other operations.

This is not theoretic and we have had bugs because of this.

-Andi

 
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 10:12 am

Hi,


Exactly, but the actual _synchronization_ in all users of the bitops API
should (should, at least, otherwise the bugs lie in the callers) depend
upon the _bit-string_ whose address is passed to us. That could be some
flags/lock/etc in some caller, whatever, but all the synchronization in
those users would be based upon _that_ -- and we're handling it alright
with the full CPU barrier for _that_ address. Also note that in all the
atomic/unatomic variants, we always constrain the passed pointer to
memory and for the atomic/locked variants at least, I don't really see
how compiler optimizations would get us into trouble (again, for the


... I don't want to get argumentative, but IMHO, we don't really need
to care about other (other than the passed memory address) references.
The problem with marking all of memory clobbered is that we're


Those could've been due to buggy users, not necessarily the bitops.

Satyam
-

From: Jeremy Fitzhardinge
Date: Monday, July 23, 2007 - 10:49 am

The compiler doesn't know how large the object at that address is, so
there's limits to how much alias analysis it can do.  If it assumes
you're passing a pointer to a single word then it may fail to order the
bit operation correctly with respect to another operation on another
part of the same bitfield (a memset to clear it, for example).

    J
-

From: Linus Torvalds
Date: Monday, July 23, 2007 - 10:55 am

No. The point is to let gcc generate *correct* code.

It's either "=m" together with "memory", or it's "+m".

You just introduced a bug.

		Linus
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:52 am

In fact, it's more than that... the bitops that return a value are often
used to have hand-made spinlock semantics. I'm sure we would get funky
bugs if loads or stores leaked out of the locked region. I think a full
"memory" clobber should be kept around for those cases.

(That's also why on ppc, we give them a few more barriers)

Ben.

-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 10:24 am

Not helpful.

The CPU ordering constraints for "test_and_set_bit()" and friends are weak 
enough that even if you have a full memory clobber, it still wouldn't work 
as a lock.

That's exactly why we have smp_mb__after_set_bit() and friends. On some 
architectures (arm, mips, parsic, powerpc), *that* is where the CPU memory 
barrier is, because the "test_and_set_bit()" itself is just a 
cache-coherent operation, not an actual barrier.

		Linus
-

From: Trond Myklebust
Date: Tuesday, July 24, 2007 - 10:42 am

That's not what the Documentation/memory-barriers.txt states:

        Any atomic operation that modifies some state in memory and returns information
        about the state (old or new) implies an SMP-conditional general memory barrier
        (smp_mb()) on each side of the actual operation.  These include:
        
        .....
        
        test_and_set_bit();
        test_and_clear_bit();
        test_and_change_bit();
...

Trond

-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 11:13 am

Hmm.. You're right. We only actually need it for the unconditional bitops, 
like the *unlock* side.

IOW, if you do a spinlock with the bitops, the locking side should be able 
to use a "test_and_set_bit()" on its own, but the unlocking side should be

	smp_mb__before_clear_bit();
	clear_bit();

because the ones that don't return a value also don't imply a memory 
barrier.

		Linus
-

From: Trond Myklebust
Date: Tuesday, July 24, 2007 - 11:28 am

Yup, and this is exactly what we currently do in bit_spin_unlock().

Trond

-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:37 pm

Yup. But I much prefer Nick's clear_bit_unlock() :-)

Ben.


-

From: Trond Myklebust
Date: Tuesday, July 24, 2007 - 2:55 pm

If you want to use bitops as spinlocks you should rather be using
<linux/bit_spinlock.h>. That also does the right thing w.r.t.
pre-emption and sparse locking annotations.

Trond

-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 3:32 pm

Heh, I didn't know about those... A bit annoying that I can't override
them in the arch, I might be able to save a barrier or two here. Our
test_and_set_bit() contains both barriers for lock and unlock semantics
to cope with all kind of abuses, but your bit_spinlock obviously doesn't
need that.

Cheers,
Ben.

-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 9:10 pm

I guess the test_and_set_bit_lock / clear_bit_unlock will allow you to
override them in a way.

The big performance problem I see on my powerpc system is not the bit
spinlocks (open-coded or not), but the bit sleep locks.

Anyway, I'll finally send out the lock bitops patches again today...

-- 
SUSE Labs, Novell Inc.
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:36 pm

Well, as I said, our test_and_set_bit() asm (and in general, the asm for
all the atomic ops that -return- a value) have at least some level of
barriers in them because of that. We do that because people are abusing
them as locks. The smp_mb__after_set_bit() I never quite grokked. We do
an mb in there but I suspect we don't need if it's only ever used after
test_and_set_bit() because of the above. The smb_mb__before_clear_bit()
makes more sense as it's supposed to give clear_bit() a spin_unlock
semantic.

But we do need the "memory" clobber as well.

That's one reason why I like Nick's bitop locks patches, providing
-explicit- test_and_set_bit_lock() and clear_bit_unlock(), we can fix a
whole lot of things and make sure they have the right barriers and not
more. (We save a few useless barriers on POWER that way in the page lock
path and it's measureable in his benchmark).

Cheers,
Ben.



-

From: Nick Piggin
Date: Monday, July 23, 2007 - 8:57 pm

__test_and_change_bit is one that you could remove the memory clobber


-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 11:38 pm

Yes, for the atomic versions we don't care if we're asking gcc to
generate trashy code (even though I'd have wanted to only disallow
problematic optimizations -- ones involving the passed bit-string
address -- there, and allow other memory references to be optimized
as and how the compiler feels like it) because the atomic variants
are slow anyway and we probably want to be extra-safe there.

But for the non-atomic variants, it does make sense to remove the
memory clobber (and the unneeded __asm__ __volatile__ that another
patch did -- for the non-atomic variants, again).

OTOH, as per Linus' review it seems we can drop the "memory" clobber
and specify the output operand for the extended asm as "+m". But I
must admit I didn't quite understand that at all.

[ I should probably start reading gcc sources, the docs are said to
  be insufficient/out-of-date, as per the reviews of the patches. ]


Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 12:24 am

No. It has nothing to do with atomicity and all to do with ordering.
For example test_bit, clear_bit, set_bit, etc are all atomic but none
place any restrictions on ordering.

__test_and_change_bit has no restriction on ordering, so as long as
the correct operands are clobbered, a "memory" clobber to enforce a

Yes, but that is for cases where "memory" is being used to say that
some otherwise unspecified memory is actually being changed, rather
than to provide a compiler barrier as is the case for test_and_set_bit

You should read Documentation/atomic_ops.txt and memory-barriers.txt,
which are quite useful and should be uptodate.

-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 1:29 am

The memory clobber, or the volatile asm? There's more than one variable
here ... but still, I don't think either affects _ordering_ in any


But why even for the other operations? Consider (current code of)
test_and_set_bit():

static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
{
	int oldbit;
	__asm__ __volatile__( LOCK_PREFIX
		"btsl %2,%1\n\tsbbl %0,%0"
		:"=r" (oldbit),"+m" (ADDR)
		:"Ir" (nr) : "memory");

	return oldbit;
}

The only memory reference in there is to the passed address, it will
be modified, yes, but that's been made obvious to gcc in the asm
already. So why are we marking all of memory as clobbered, is the
question. (I just read Jeremy's latest reply, but I don't see how
or why the memory clobber helps that case either -- does a memory


I have, of course. Will probably read again, thanks.


Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 1:39 am

Of course, because then the compiler can't assume anything about
the contents of memory after the operation.

   #define barrier() __asm__ __volatile__("": : :"memory")

A memory clobber is equivalent to a compiler barrier.

-- 
SUSE Labs, Novell Inc.
-

From: Trent Piepho
Date: Tuesday, July 24, 2007 - 1:38 am

Speaking of that, why are all the asm functions in arch/i386/lib/string.c
defined as having a memory clobber, even those which don't modify memory
like strcmp, strchr, strlen and so on?

Why are they all volatile too?  Even those with no side effects.

It seems like any asm which writes to or _reads from_ memory that isn't one
of the operands must have a memory clobber.  And any asm with side effects
other than it's operands (and asm("mov $0, (%0)" :  :  "r"(some_pointer))
has side effects) must be volatile.

So something like strlen() needs to have a memory clobber, otherwise a
store to the string could be moved to the other side of the strlen(), but
doesn't need to be volatile, since it has no side effects.
-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 12:39 pm

That's because the memory clobber will serialize the inline asm with 
anything else that reads or writes memory.

So even if we don't actually change any memory, if we cannot describe what 
we *read*, then we need to tell gcc to not re-order us wrt things that 
could *write*. And the simplest way to do that is to say that you clobber 
memory, even if you don't.

So as a more concrete example: imagine that we're doing a "strlen()" on 
some local variable. We need to tell gcc that that variable has to be 
updated in memory before it schedules the asm. The memory clobber does 
that.

(Yes, the "asm volatile" may do so too, but it's very unclear what the 
"volatile" on the asm actually does, so ..)

		Linus
-

From: Andi Kleen
Date: Tuesday, July 24, 2007 - 1:37 pm

Without the volatile they get completely optimized away :/
[tried that, cost a lot of debugging time -- empty string functions
cause a lot of strange side effects]

-Andi
-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 1:08 pm

Sure, that's *one* thing that "volatile" guarantees (it guarantees that 
gcc won't optimize away things where the end result isn't actually visibly 
used).

But gcc docs also talk about the other things volatile means, including 
"not significantly moved".

Is that "not significantly moved" already equivalent to "not moved past 
something that can change memory"? Probably not. Which is why it's a good 
idea to have the "memory" clobber there too, because the semantics of the 
"volatile" just aren't very well defined from a code movement standpoint 
(while a general memory clobber is *very* clear: if gcc moves the inline 
asm across another thing that uses/changes memory, that's a gcc bug, plain 
and simple (*)).

		Linus

(*) Other than pure internal gcc spills/reloads. Spilling/reloading local 
registers that haven't had their address taken obviously cannot be seen as 
memory accesses.
-

From: Jeremy Fitzhardinge
Date: Tuesday, July 24, 2007 - 2:31 pm

Actually, it doesn't.  In fact it goes out of its way to say that "asm
volatile" statements can be moved quite a bit, with respect to other
asms, other code, jumps, basic blocks, etc.  The only reliable way to
control the placement of an asm is have the right dependencies.

    The `volatile' keyword indicates that the instruction has important
    side-effects.  GCC will not delete a volatile `asm' if it is reachable.
    (The instruction can still be deleted if GCC can prove that
    control-flow will never reach the location of the instruction.)  Note
    that even a volatile `asm' instruction can be moved relative to other
    code, including across jump instructions.

also:

    An `asm' instruction without any output operands will be treated
    identically to a volatile `asm' instruction.

So there isn't anything very special about "asm volatile".  It's purely
to stop apparently useless code from being removed.

    J
-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 2:46 pm

Ahh. That's newer.

Historically, gcc manuals used to say "may not be deleted or significantly 
reordered".

So they've weakened what it means, probably exactly because it wasn't 
well-defined before either.

		Linus
-

From: Trent Piepho
Date: Wednesday, July 25, 2007 - 6:07 pm

I went a made a test suite to see what really happened, and this isn't how
it works.  It appears that a memory clobber only tells gcc that the asm
writes to memory.  It does _not_ tell gcc that the asm reads from memory.

It's at http://www.speakeasy.org/~xyzzy/download/opttest.tar.gz
It's only 3k, but there are 16 files so I'm not inlining it.

The suite has a few test c files, which are compiled with various different
functions, inline norm asm, inline volatile asm, inline asm with a memory
clobber, a normal function, a __attribute__((const)) function, and so on.

They are compiled to asm file, and then a perl script scans the asm files to
figure out what optimizations gcc made.

"make test" will compile all the tests and run them through the perl scripts.
"make test1" will just run test1, etc.

It appears that a normal asm, one without volatile or a memory clobber, is
treated like a const function, which returns the output via a struct (not
using pass-by-address).  It has no side-effects, can't read or write global
variables, and can't dereference pointer arguments.

Adding volatile tells gcc the asm has some hidden side-effect.  It still can't
r/w globals or dereference inputs.  But it won't get elided if there are no
used outputs, common subexpression merged, or treated as a loop invariant.

Adding a memory clobber tells gcc that the asm modifies memory.  It doesn't
modify un-aliased local variables in registers.  It does modify aliased local
variables.  It does not read from memory.  gcc will move or elide a memory
write before an asm with a memory clobber if nothing else (besides the asm)
could see the write.  A memory clobber doesn't count as a side-effect either,
a non-volatile asm without unused outputs will be elided, even if has a memory
clobber.

Specifically, check test6_memasm.s.  The C code looks like this:

extern int a; /* keep asm from being elided for having no used output */
static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); }
/* ...
From: Linus Torvalds
Date: Wednesday, July 25, 2007 - 6:18 pm

Hmm. I really think you should take this up with the gcc people. That 
looks like a gcc bug - because there really is nothing that guarantees 
that the asm doesn't change the array that "x" points to, and the asm 

I can't explain it. I do think you've found a gcc bug.

That said, the kernel mostly uses "asm volatile()" _together_ with a 
memory clobber for these kinds of things, so it sounds like the kernel 
wouldn't be impacted. But you're definitely right - the above report makes 

Well, that's likely just a subtle register allocation issue, and 
understandable. Generating perfect code is impossible, you want to 
generate good code on average.

		Linus
-

From: Linus Torvalds
Date: Wednesday, July 25, 2007 - 6:22 pm

Actually, I take that back. I think gcc does the right thing, and yes, 
it's explained by the memory clobber being just a blind "write to memory" 
rather than read memory. My bad.

It does leave us with very few ways of saying that an asm can *read* 
memory, and so it might be good to have it clarified that "volatile" 
implies that (at least with the memory clobber).

Your examples are good, I think.

		Linus
-

From: David Howells
Date: Tuesday, July 24, 2007 - 2:44 am

As I understand it, the "+m" indicates to the compiler a restriction on the
ordering of things that access that particular memory location, whereas a
"memory" indicates a restriction on the orderings of all accesses to memory -
precisely what you need to produce a lock.

There are a number of things that use test_and_set_bit() and co to implement a
lock or other synchronisation.  This means that they must exhibit LOCK-class
barrier effects or better.  LOCK-class barrier effects mean, more or less,
that all memory accesses issued before the lock must happen before all memory
accesses issued after the lock.  But it most happen at both CPU-level and
compiler-level.  The "memory" constraint instructs the compiler in this
regard.

Remember also that this is gcc black magic and so some of it has had to be
worked out empirically - possibly after sacrificing a goat under a full moon.

David
-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 3:02 am

Hi David,


Ok, thanks -- I didn't know gcc's behaviour w.r.t. "+m" at all, but in my

Yes, thanks for laying this out so clearly, again. So combined with what
you explained above, I think I now fully understand why most of this


Satyam
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:06 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

memory barriers around clear_bit() must also implement these two interfaces.
However, for i386, clear_bit() is a strict, locked, atomic and
un-reorderable operation and includes an implicit memory barrier already.

But these two functions have been wrongly defined as "barrier()" which is
a pointless _compiler optimization_ barrier, and only serves to make gcc
not do legitimate optimizations that it could have otherwise done.

So let's make these proper no-ops, because that's exactly what we require
these to be on the i386 platform.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

[ A similar optimization needs to be done in the atomic.h also.
  Will submit that patch shortly. ]

 include/asm-i386/bitops.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 4f1fda5..42999eb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  * Bit operations are already serializing on x86.
  * These must still be defined here for API completeness.
  */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#define smp_mb__before_clear_bit()	do {} while (0)
+#define smp_mb__after_clear_bit()	do {} while (0)
 
 /**
  * __change_bit - Toggle a bit in memory
-

From: Nick Piggin
Date: Monday, July 23, 2007 - 8:53 pm

No. clear_bit is not a compiler barrier on i386, thus smp_mb__before/after


-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 12:34 am

Not so obvious. Why do we require these to be a full compiler barrier
is precisely the question I raised here.

Consider this (the above two functions exist only for clear_bit(),
the atomic variant, as you already know), the _only_ memory reference
we care about is that of the address of the passed bit-string:

(1) The compiler must not optimize / elid it (i.e. we need to disallow
    compiler optimization for that reference) -- but we've already taken
    care of that with the __asm__ __volatile__ and the constraints on
    the memory "addr" operand there, and,
(2) For the i386, it also includes an implicit memory (CPU) barrier
    already.

So I /think/ it makes sense to let the compiler optimize _other_ memory
references across the call to clear_bit(). There's a difference. I think
we'd be safe even if we do this, because the synchronization in callers
must be based upon the _passed bit-string_, otherwise _they_ are the
ones who're buggy.

[ However, elsewhere Jeremy Fitzhardinge mentioned the case of
  some callers, for instance, doing a memset() on an alias of
  the same bit-string. But again, I think that is dodgy/buggy/
  extremely border-line usage on the caller's side itself ...
  *unless* the caller is doing that inside a higher-level lock
  anyway, in which case he wouldn't be needing to use the
  locked variants either ... ]

[ For those interested, I've been looking at the code generated
  for the test kernel I built with these patches, and I don't
  really see anything gcc did that it shouldn't have -- but ok,
  that doesn't mean other versions/toolchains for other setups
  won't. Also, the test box has been up all night, but I'm only
  running Firefox on it anyway, and don't really know how to
  verify if I've introduced any correctness issues / bugs. ]


Satyam
-

From: Jeremy Fitzhardinge
Date: Tuesday, July 24, 2007 - 12:48 am

You miss my point.  If you have:

	memset(&my_bitmask, 0, sizeof(my_bitmask));
	set_bit(my_bitmask, 44);

Then unless the set_bit has a constraint argument which covers the whole
of the (multiword) bitmask, the compiler may see fit to interleave the
memset writes with the set_bit in bad ways.  In other words, plain "+m"
(*(long *)ptr) won't cut it.  You'd need "+m" (my_bitmask), I think.

    J
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 1:31 am

That's a valid point, and looks like it is a bug in the existing i386
macros, doesn't it? We should be clobbering addr + BITOP_WORD(nr).

-- 
SUSE Labs, Novell Inc.
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 1:20 am

Repeating what has been said before: A CPU memory barrier is not a
compiler barrier or vice versa. Seeing as we are talking about
the compiler barrier, it is irrelevant as to whether or not the

Yes it makes sense to let the compiler move memory operations over
clear_bit(), because we have defined the interface to be nice and
relaxed. And this is exactly why we do need to have an additional

correct output != correct input.

Without a barrier there, we _allow_ the compiler to reorder. If it
does not reorder, the missing barrier is still a bug :)

-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 2:21 am

[ Compiler barrier, you mean, that's not true of CPU barriers. ]

In any case, I know that, obviously. I asked "why" not "what" :-) i.e.
why should we care about other addresses / why do we want to extend
the compiler barrier to all memory references -- but Jeremy seems to

I think it is quite relevant, in fact. From Documentation/atomic_ops.txt,
smp_mb__{before,after}_clear_bit(), as the name itself suggests, must
be _CPU barriers_ for those arch's that don't have an implicit
_CPU barrier_ in the clear_bit() itself [ which i386 does have already ].

As for a compiler barrier, the asm there already guarantees the compiler
will not optimize references to _that_ address, but there could still be
the memset()/set{clear}_bit() interspersing pitfalls for example, so yeah
the memory clobber would probably protect us there.


Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 3:25 am

For the purpose of this discussion (Linux memory
barrier semantics, on WB memory), it is true of CPU

Obviously because we want some kind of ordering
guarantee at a given point. All the CPU barriers
in the world are useless if the compiler can reorder

One or both of us still fails to understand the other.

bit_spin_lock(LOCK_NR, &word);
var++;
/* this is bit_spin_unlock(LOCK_NR, &word); */
smp_mb__before_clear_bit();
clear_bit(LOCK_NR, &word);

Are you saying that it is OK for the store to var to
be reordered below the clear_bit? If not, what are you
saying?



      Yahoo!7 Mail has just got even bigger and better with unlimited storage on all webmail accounts.
http://au.docs.yahoo.com/mail/unlimitedstorage.html



-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 4:10 am

On later Intel processors, if the memory address range being referenced
(and say written to) by the (locked) instruction is in the cache of a
CPU, then it will not assert the LOCK signal on the system bus --
thus not assume exclusive use of shared memory. So other CPUs are free
to modify (other) memory at that point. Cache coherency will still




I might be making a radical turn-around here, but all of a
sudden I think it's actually a good idea to put a complete
memory clobber in set_bit/clear_bit and friends themselves,
and leave the "__" variants as they are.


Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 4:32 am

The system bus does not need to be serialised because the CPU already
holds the cacheline in exclusive state. That *is* the cache coherency
protocol.

The memory ordering is enforced by the CPU effectively preventing
speculative loads to pass the locked instruction and ensuring all
stores reach the cache before it is executed. (I say effectively

Why?

-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 4:45 am

Looks like when you said "CPU memory barrier extends to all memory
references" you were probably referring to a _given_ CPU ... yes,

Well, why not. Callers who don't want/need any guarantees whatsoever
can still use __foo() -- for others, it makes sense to just use
foo() and get *both* the compiler and CPU barrier semantics -- I think
that's the behaviour most callers would want anyway.


Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 5:01 am

Firstly, __foo() is non-atomic, secondly set_bit/clear_bit/etc do not
provide any CPU or compiler semantics.

It was set up this way because other CPU ISAs don't couple atomicity
with memory ordering, and with many implementations of those, the cost
to order memory is quite high.

-- 
SUSE Labs, Novell Inc.
-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 10:12 am

No. CPU memory barriers extend to all CPU's. End of discussion.

It's not about "that cacheline". The whole *point* of a CPU memory barrier 
is that it's about independent memory accesses.

Yes, for a memory barrier to be effective, all CPU's involved in the 
transaction have to have the barriers - the same way a lock needs to be 
taken by everybody in order for it to make sense - but the point is, CPU 
barriers are about *global* behaviour, not local ones.

So there's a *huge* difference between

	clear_bit(x,y);

and

	clear_bit(x,y);
	smp_mb__before_after_clear_bit();

and it has absolutely nothing to do with the particular cacheline that "y" 
is in, it's about the *global* memory ordering.

Any write you do after that "smp_mb__before_after_clear_bit()" will be 
guaranteed to be visible to _other_ CPU's *after* they have seen the bit 
being cleared. Yes, those other CPU's need to have a read barrier between 
reading the bit and reading some other thign, but the point is, this hass 
*nothing* to do with cache coherency, and the particular cache line that 
"y" is in.

And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do 
{} while (0)". It needs to be a compiler barrier even when it has no 
actual CPU meaning, unless clear_bit() itself is guaranteed to be a 
compiler barrier (which it isn't, although the "volatile" on the asm in 
practice makes it something *close* to that).

Why? Think of the sequence like this:

	clear_bit(x,y);
	smp_mb__after_clear_bit();
	other_variable = 10;

the whole *point* of this sequence is that if another CPU does

	x = other_variable;
	smp_rmb();
	bit = test_bit(x,y)

then if it sees "x" being 10, then the bit *has* to be clear.

And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs 
to be a compiler barrier:

 - it doesn't matter for the action of the "clear_bit()" itself: that one 
   is locked, and on x86 it thus also happens to be a serializing 
   instruction, and the cache ...
From: Satyam Sharma
Date: Tuesday, July 24, 2007 - 12:01 pm

Ok, thanks much (David and Nick too!) for taking the time to explain this
out clearly. It does look amazingly obvious now that I see it, with
callers using bitops to implement locking schemes who would completely
break otherwise -- in fact 6 and 8 in this series look amazingly obtuse
now ...


Satyam
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[2/8] i386: bitops: Rectify bogus "Ir" constraints

The "I" constraint (on the i386 platform) is used to restrict constants to
the 0..31 range, for use with instructions that must deal with bit numbers.

However:
* The "I" constraint modifier is applicable only to immediate-value operands,
  and combining it with "r" is bogus.
* gcc will just ignore "I" if we specify "Ir" anyway and treat it same as "r".
* Bitops in Linux allow @nr to be arbitrarily large anyway, so we don't want
  to restrict ourselves to @nr < 32 in the first place.

So just kill the bogus "I" constraint modifier.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index ba8e4bb..17a302d 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -43,7 +43,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -63,7 +63,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 	__asm__(
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -81,7 +81,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -101,7 +101,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /*
@@ -128,7 +128,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btcl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" ...
From: Andi Kleen
Date: Monday, July 23, 2007 - 9:10 am

It means I or r, not I modified by r. This means either a immediate constant
0..31 or a register, which is correct.

% cat t18.c 

f()
{
        asm("xxx %0" :: "rI" (10));
        asm("yyy %0" :: "rI" (100));
}
% gcc -O2 -S t18.c
% cat t18.s
...
f:
.LFB2:
#APP
        xxx $10
#NO_APP
        movl    $100, %eax
#APP
        yyy %eax
#NO_APP
        ret
.LFE2:
...

-Andi

-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:21 am

Hi Andi,




Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
just written a test program that used "Ir" with an automatic variable
defined in the inline function (as is the case with these bitops) and
observed that even when I gave > 32 values, it would still work -- hence
my conclusion.

However, the patch still stands, does it not? [ I will modify the
changelog, obviously. ] The thing is that we don't want to limit
@nr to <= 31 in the first place, or am I wrong again? :-)

Thanks,
Satyam
-

From: Andi Kleen
Date: Monday, July 23, 2007 - 9:30 am

These bit operations only allow 8 bit immediates, so 0..255 would
be correct. N might work from the 4.1 docs, but I don't know if it works 
in all old supported gccs (3.2+)

However I is definitely not wrong and most bit numbers are small anyways.

-Andi
-

From: Jan Hubicka
Date: Monday, July 23, 2007 - 9:36 am

'N' constraint was there pretty much forever, originally intended for
in/out operands.  It is in 2.95 docs
http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_16.html#SEC175

Honza
-

From: H. Peter Anvin
Date: Monday, July 23, 2007 - 11:05 am

"I" is correct.  The Intel documentation on this is highly confusing
(and has bugs in it), but it does unambiguously state:

"Some assemblers support immediate bit offsets larger than 31 by using
the immediate bit offset field in combination with the displacement
field in the memory operand ... The processor will ignore the high-order
bits if they are not zero."  AMD processors might be different for all I
know.

So unless gas is capable of doing this transformation (and it's not as
of binutils-2.17.50.0.6) "I" is what's needed here.

	-hpa

-

From: H. Peter Anvin
Date: Monday, July 23, 2007 - 11:28 am

Just tested it on a K8 machine; AMD behaves the same way.  So "I" is
correct, and changing it to "N" would introduce a bug.

The only way to optimize this is by using __builtin_constant_p() and
adjust the offset appropriately.

	-hpa
-

From: Linus Torvalds
Date: Monday, July 23, 2007 - 10:57 am

This is wrong too.

The whole point of a "Ir" modifier is to say that the instruction takes 
*either* an "I" or an "r".

Andrew - the ones I've looked at were all wrong. Please don't take this 
series.

		Linus
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 11:14 am

Yup, sorry about this one, Andi pointed this out earlier. But the "I"
must still go I think, for the third reason in that changelog -- it
unnecessarily limits the bit offset to 0..31, but (at least from the
comment up front in that file) we do allow arbitrarily large @nr (upto

I think I'll rescind the series anyway, a lot of patches turned out to
be wrong -- some due to mis-reading / incorrect gcc docs, others due to
other reasons ... this was just something I did thinking of as a cleanup
anyway, so I don't intend to push or correct this or anything.

Satyam
-

From: Andi Kleen
Date: Monday, July 23, 2007 - 11:32 am

As HPA pointed out that would risk not being correctly assembled by at 

cpumask_t/nodemask_t bitmap optimizations would be useful.

-Andi

-

From: H. Peter Anvin
Date: Monday, July 23, 2007 - 11:39 am

Incidentally, I just noticed the x86-64 bitops have "dIr" as their
constraint set.  "d" would normally be redundant with "r", and as far as
I know, gcc doesn't prefer one over the other without having "?" or "!"
as part of the constraint, so is is "d" a stray or is there some meaning
behind it?

	-hpa
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 11:52 am

Yup, I had noticed that myself. We would need to use "J" if we want
to limit the offsets to 0..63, but "d" sounds weird / stray indeed,
unless there's yet another undocumented/wrongly-documented mystery
behind this one too ... (I'd like to know even if that's the case,
obviously).

Satyam
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>


  Extended asm supports input-output or read-write operands. Use the
  constraint character `+' to indicate such an operand and list it with
  the output operands. You should only use read-write operands when the
  constraints for the operand (or the operand in which only some of the
  bits are to be changed) allow a register.

So, using the "+" constraint modifier for memory, like "+m" is bogus.
We must simply specify "=m" which handles the case correctly.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 17a302d..d623fd9 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -42,7 +42,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -62,7 +62,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__(
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -80,7 +80,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -100,7 +100,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -127,7 +127,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btcl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -145,7 +145,7 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( ...
From: Andi Kleen
Date: Monday, July 23, 2007 - 9:37 am

I checked with Honza (cc'ed) and he stated that the + are really needed
at least in newer gcc.

-Andi
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 10:15 am

That extract is from the latest (4.2.1) manual, but they could've
forgotten to update the documentation, of course.

But even then, I'm not sure they could ever make it "really needed",
that'll be a step that would break all existing code, otherwise! :-)

Satyam
-

From: Linus Torvalds
Date: Monday, July 23, 2007 - 10:46 am

No. This is wrong.

"=m" means that the new value of the memory location is *set*.

Which means that gcc will potentially optimize away any previous stores to 
that memory location.

And yes, it happens, and yes, we've seen it.

The gcc docs are secondary. They're not updated, they aren't correct, they 
don't reflect reality, and they don't matter. The only correct thing to 
use is "+m" for things like this, or alternatively to just use the 
"memory" specifier at the end and make gcc generate really crappy code.

			Linus
-

From: David Howells
Date: Tuesday, July 24, 2007 - 2:22 am

I had a lot of "fun" with this when mucking around with the asm-optimised R/W
semaphores.

David
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 9:05 am

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[4/8] i386: bitops: Kill volatile-casting of memory addresses

All the occurrences of "volatile" that are used to qualify access to the
passed bit-string pointer/address must be removed, because "volatile" is
crazy, doesn't really work anyway, has nothing to do with locking and
atomicity, and is in general wholly unnecessary for these operations:

* In the case of the unlocked variants, we're not providing any guarantees
  whatsoever, so we want the caller to do any necessary locking surrounding
  the use of that function anyway.

* In case of the primitives where we *do* make locking/atomicity/reordering
  (three very different, but related, concepts) guarantees, we're doing that
  properly by using the lock instruction-prefix and "__asm__ __volatile__"
  anyway (and with "addr" always being constrained with "m" in the assembly,
  gcc doesn't bring it into a local register either).

So let's just kill the pointless use of volatile.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   60 +++++++++++++++++++++------------------------
 1 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index d623fd9..0f5634b 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -26,8 +26,6 @@
  * on single-dword quantities.
  */
 
-#define ADDR (*(volatile long *) addr)
-
 /**
  * set_bit - Atomically set a bit in memory
  * @nr: the bit to set
@@ -38,11 +36,11 @@
  *
  * See __set_bit() if you do not require the atomic guarantees.
  */
-static inline void set_bit(int nr, volatile unsigned long * addr)
+static inline void set_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -58,11 +56,11 @@ static inline void set_bit(int nr, ...
From: Linus Torvalds
Date: Monday, July 23, 2007 - 10:52 am

This is wrong.

The "const volatile" is so that you can pass an arbitrary pointer. The 
only kind of abritraty pointer is "const volatile".

In other words, the "volatile" has nothing at all to do with whether the 
memory is volatile or not (the same way "const" has nothing to do with it: 
it's purely a C type *safety* issue, exactly the same way "const" is a 
type safety issue.

A "const" on a pointer doesn't mean that the thing it points to cannot 
change. When you pass a source pointer to "strlen()", it doesn't have to 
be constant. But "strlen()" takes a "const" pointer, because it work son 
constant pointers *too*.

Same deal here.

Admittedly this may be mostly historic, but regardless - the "volatiles" 
are right.

Using volatile on *data* is generally considered incorrect and bad taste, 
but using it in situations like this potentially makes sense.

Of course, if we remove all "volatiles" in data in the kernel (with the 
possible exception of "jiffies"), we can then remove them from function 
declarations too, but it should be done in that order.

			Linus
-

From: Nick Piggin
Date: Monday, July 23, 2007 - 9:19 pm

Well, regardless, it still forces the function to treat the pointer
target as volatile, won't it? It definitely prevents valid optimisations
that would be useful for me in mm/page_alloc.c where page flags are
being set up or torn down or checked with non-atomic bitops.

OK, not the i386 functions as much because they are all in asm anwyay,
but in general (btw. why does i386 or any architecture define their own
non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
is not good enough then surely it is a bug in gcc or that file?)

Anyway by type safety, do you mean it will stop the compiler from
warning if a pointer to a volatile is passed to the bitop? If so, then
why don't we just kill all the volatiles out of here and fix any
warnings that comeup? I doubt there would be many, and of those, some
might show up real synchronisation problems.

-- 
SUSE Labs, Novell Inc.
-

From: Satyam Sharma
Date: Monday, July 23, 2007 - 11:23 pm

Yes, and yes. But I think what he meant there is that we'd need to
audit the kernel for all users of set_bit and friends and see if callers
actually pass in any _data_ that _is_ volatile. So we have to kill them
there first, and then in the function declarations here. I think I'll put

The compiler would start warning for all those cases (passing in
a pointer to volatile data, when the bitops have lost the volatile
casting from their function declarations), actually. Something like
"passing argument discards qualifiers from pointer type" ... but
considering I didn't see *any* of those warnings after these patches,
I'm confused as to what exactly Linus meant here ... and what exactly




Satyam
-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 12:16 am

Because even with an allyesconfig, your compile isn't testing the
entire kernel. So given the relatively minor benefit of removing
the volatiles, I suppose we shouldn't risk slipping a bug in. If
you can make gcc throw an error in that case it would be a different
story.

-- 
SUSE Labs, Novell Inc.
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:49 am

However... What about that:

 - This "volatile" will allow to pass pointers to volatile data to the
bitops.

 - Most users of "volatile" in the kenrel (except maybe jiffies) are
bogus

 - Thus let's remove it -as a type safety thing- to catch more of those
stupid volatile that shouldn't be ? :-)

Besides, as Nick pointed out, it prevents some valid optimizations.

Cheers,
Ben.

-

From: Linus Torvalds
Date: Tuesday, July 24, 2007 - 10:20 am

Quite frankly, I'd really really rather kill the "volatile" data
structures independently. Once that is done, it doesn't really matter any

No it doesn't. Not the ones on the functions that just do an inline asm.

The only valid optimization it might break is for "constant_test_bit()", 
which isn't even using inline asm.

And before we remove that "volatile", we'd better make damn sure that 
there isn't any driver that does

	/* Wait for the command queue to be cleared by DMA */
	while (test_bit(...))
		;

or similar.

Yes, it's annoying, but this is a scary and subtle area. And we sadly 
_have_ had code that does things like that.

		Linus
-

From: Jeff Garzik
Date: Tuesday, July 24, 2007 - 10:39 am

I certainly cannot speak for all the grotty drivers out there, but I've 
never ever seen anything like the above.  I would consider anyone using 
kernel bit operations on DMA memory to be more than a little crazy.  But 
that's just me :)

Usually you will see

	while (1) {
		rmb();
		if (software_index == hardware_index_in_DMA_memory)
			break;
		
		... handle a unit of work ...
	}

Though ISTR being told that even rmb() was not sufficient in all cases 
[nonetheless, that's what I use in net and SATA drivers and email 
recommendations, and people seem happy]

	Jeff


-

From: Nick Piggin
Date: Tuesday, July 24, 2007 - 9:54 pm

The constant case is probably most used (at least for page flags), and
is most important for me. constant_test_bit may not be using inline asm,
but the volatile pointer target means that it reloads the value and can't
do much optimisation over it.

BTW. once volatile goes away, i386 really should start using the C
versions of __set_bit and __clear_bit as well IMO. (at least for the
constant bitnr case), so gcc can potentially optimise better.

-- 
SUSE Labs, Novell Inc.
-

From: Denis Vlasenko
Date: Monday, July 30, 2007 - 10:57 am

Hi Satyam,


[I did read entire thread]

Welcome to the minefield.

This bitops and barrier stuff is complicated. It's very easy to
introduce bugs which are hard to trigger, or happen only with some
specific gcc versions, or only on massively parallel SMP boxes.

You can also make technically correct changes which relax needlessly
strict barrier semantics of some bitops and trigger latent bugs
in code which was unknowingly depending on it.

How you can proceed:

Make a change which you believe is right. Recompile allyesconfig
kernel with and without this change. Find a piece of assembly code
which become different. Check that new code is correct (and smaller
and/or faster). Post your patch together with example(s) of code
fragments that got better. Be as verbose as needed.

Repeat for each change separately.

This can be painfully slow, but less likely to be rejected outright

I vaguely remember that "memory" clobbers are needed in some rather
obscure, non-obvious situations. Google for it - Linus wrote about it




For this kind of things, you really need something more stressing.

At least it proves that _something_ changed.
--
vda
-

From: Satyam Sharma
Date: Monday, July 30, 2007 - 6:07 pm

Hi Denis,




The problem with most of the patches in this series (other than the two
related to vague gcc docs), as you know, turned out to be the fact that
a lot of code/callers out there rely on the bitops as also being
lock/unlock functions -- and hence the need for disallowing both
compiler as well as CPU reordering / optimizations.

There were few valid changes in the patchset too, and probably I'll try
to look at them the way you described above (or probably it also makes
some sense to keep the bitops as-is, "it works" today after all, and
there are other avenues for optimizations/cleanups too).


Thanks again,
Satyam
-

Previous thread: Re: Linus 2.6.23-rc1 by Bob Picco on Monday, July 23, 2007 - 8:47 am. (2 messages)

Next thread: Plan 9 Resource Sharing Support - what is it? by Pavel Machek on Monday, July 23, 2007 - 12:54 am. (4 messages)