Re: [RFC] cpuset relative memory policies - second choice

Previous thread: bttv build error (CONFIG_NET=n) by Randy Dunlap on Tuesday, October 30, 2007 - 10:15 pm. (6 messages)

Next thread: [PATCH 1/2] [POWERPC] mpc5200: Fix Kconfig dependancies on MPC5200 FEC device driver by Grant Likely on Tuesday, October 30, 2007 - 11:28 pm. (2 messages)
From: Paul Jackson
Date: Tuesday, October 30, 2007 - 11:17 pm

From: Paul Jackson <pj@sgi.com>

  RFC only so far - has been built and booted, but has received
  almost no testing.

Add a second choice for how node numbers are interpreted and returned
by the NUMA memory policy system calls mbind, set_mempolicy and
get_mempolicy.

The original choice remains the default, for compatibility.

The second choice overcomes some limitations of the first choice in
the interaction between cpusets and these memory policy calls, that
show up when tasks using these calls are also being moved between
different cpusets, especially between cpusets with varying numbers
of allowed nodes in the cpuset 'mems' file.

A new per-task mode, managed using added get_mempolicy calls, controls
which mode applies to subsequently created memory policies.

See the Documentation/vm/numa_memory_policy.txt section MEMORY
POLICIES AND CPUSETS for an explanation of how both these choices
for node numbering work and interact with cpusets.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee.Schermerhorn@hp.com
Cc: Andi Kleen <ak@suse.de>

---
 Documentation/vm/numa_memory_policy.txt |  140 +++++++++++++++++++++++++-------
 include/linux/mempolicy.h               |   15 +++
 include/linux/sched.h                   |    1 
 mm/mempolicy.c                          |  123 +++++++++++++++++++++++-----
 4 files changed, 229 insertions(+), 50 deletions(-)

--- 2.6.23-mm1.orig/include/linux/mempolicy.h	2007-10-30 18:04:11.000000000 -0700
+++ 2.6.23-mm1/include/linux/mempolicy.h	2007-10-30 18:11:07.000000000 -0700
@@ -21,6 +21,10 @@
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
 
+#define MPOL_F_MODE_DEFAULT (1<<3) /* set cpuset confined nodemask mode */
+#define MPOL_F_MODE_SYS_WIDE (1<<4) /* set system-wide nodemask mode */
+#define MPOL_F_MODE_GET	    (1<<5) /* return number mode: old => 1, new => 0 */
+
 /* ...
From: Christoph Lameter
Date: Wednesday, October 31, 2007 - 11:59 am

Make both policy and the mode char? Could we shorten the mpol_nodemask_mode 

Hmmm... I would rather have numactl manage this flag and specify it in 
calls to set memory policies.

-

From: Paul Jackson
Date: Wednesday, October 31, 2007 - 1:19 pm

Well, the mpol_nodemask_mode already is char.  So I guess you're
asking if we should change 'policy' to type char as well.

Changing 'policy' from a short to a char would reduce the sizeof
(struct mempolicy) on 16-bit systems, as both chars would fit in one

Huh - I don't know what "shorten ... to mode" means.  If it means

Well, numactl is a command line utility.  It doesn't manage this flag
as an -alternative- to some sort of changes to the mbind, set_mempolicy
and get_mempolicy calls, but rather layers on top of changes to those
system calls.  So I'm not sure what you mean, but I guess it is not

This suggestion I do think I understand - good.  However I disagree.
Here's what I think you meant, and why I disagree.

My current patch adds a new per-task modal state, that is manipulated
by additional get_mempolicy calls and options.

I think you are stating a preference for passing an additional flag on
each mbind, set_mempolicy and get_mempolicy call, to be used when you
want to use this new way (so called "Choice B") of numbering nodes.

The question is:

    Should this mode be per-task, or per-system call?

The basic reason that I went with an additional per-task modal
state, rather than a modal flag for each mbind, set_mempolicy and
get_mempolicy call was to reduce the likely rate of bugs in user
level C code using this API.

    Programs that code to this API in C usually first spend some number
    of lines of code preparing bitmasks that represent sets of nodes,
    which they will then pass into these mempolicy system calls.

    The bits are numbered differently between Choices A and B.    

    If a piece of code adapts Choice B numbering, because it is
    better suited to that codes needs, and if we used your suggested
    per-system call flag, then if they miss passing in the new special
    flag on -even-one- mbind, set_mempolicy or get_mempolicy system
    call, they have an obscure code bug.  They will be passing in
    a bitmask that they ...
From: Christoph Lameter
Date: Wednesday, October 31, 2007 - 2:05 pm

No it means shorted the identifier which is a bit long for a structure 





It is a change of behavior of the function call. If some mysterious flag 
somewhere influences that behavior then we have difficulties finding bugs. 
The presence of the flag makes it obvious to the reviewer that we do 
something special here.

-

From: Paul Jackson
Date: Wednesday, October 31, 2007 - 9:47 pm

Ok - but why?


Well, I called it "mpol_nodemask_mode" in the mempolicy struct because
I called it that same thing in the task_struct struct, where the longer
name is quite useful (the task struct covers alot of ground; a brief
'mode' field name for this purpose would be too vague and short.

I guess the 'mpol_' prefix part of this struct mempolicy field name is
unnecessary.  It's needed in the task struct, but not in the mempolicy
struct.  So I can imagine shorting this mempolicy struct name from
"mpol_nodemask_mode" down to "nodemask_mode".  But it really isn't a
generic "mode" field, so I'd be reluctant to make it that short.

Whether names should be long or short depends more to me on how they
are used in the code.  I want the code to be easy to read.  Sometimes
I use 30+ char names; sometimes 1 char names.  There is a tendency for
external function names to be longer than local variable names or

So you would rather have libnuma manage this flag?

As opposed to what else managing it?  I'm still a tad confused on

The ones I described further on in that message, involving computing
bitmasks using one Choice then making some of the mbind/set_mempolicy

I disagree.

There are several mempolicy interfaces in use, including at least:
 1) libnuma, which in turn is based on (5), below
 2) the numactl command line utility (based on libnuma)
 3) libcpuset (which has its own modest mempolicy interface)
 4) the glibc mbind/set_mempolicy/get_mempolicy wrappers, in C code

Hmmm ... yes ... that's a problem.  Either way seems to be a problem.

With the mode bit as in my patch, there are fewer places in the user
code that have to be gotten just right.  With your way, each and
every mbind and *_mempolicy call has to be hacked with the new flag
if one is going to use the new nodemask bit numbering.  Some of these
calls might be inside other routines or libraries that aren't readily
available to be examined or changed.  If you miss one, or forget to
add one when adding more ...
From: Christoph Lameter
Date: Wednesday, October 31, 2007 - 10:25 pm

These are no problem 1/2 are libnuma. No current version of libcpuset



That would be okay from the backwards compat standpoint.
-

From: Paul Jackson
Date: Wednesday, October 31, 2007 - 11:33 pm

> You are managing it in the task struct. No need to. libnuma can handle it.


Wrong.  It has not received wide publication yet, but it has been

I've seen such go by.

    David R - is your use of the mbind and *_mempolicy system calls
    via libnuma, or direct system calls?

    I've fielded questions from others on how to use these system
    calls (directly, not via libnuma) for several years now, both
    from within and without of SGI.

    The glibc folks have spent some effort on refining the system
    call wrappers they provide for these calls, which are not used by
    libnuma (libnuma doesn't use the glibc wrappers for these calls.)
    I presume that glibc work has at least some actual users.

    Andi has encouraged everyone to use libnuma, but there is no
    serious reason why others have to.  It is not like it takes
    thousands of lines of elaborate code to perform basic operations
    using these calls.

    A search of some old SGI release software sitting on an internal
    server just now suggests that products with names including histx,
    gru, libmpi and pcp might be directly invoking these system calls
    ... I didn't actually examine the source to determine whether
    they really use these direct calls -- just got a grep hit.

    I would not be surprised if some of the big batch schedulers
    directly invoke these system calls, but don't know for sure.

In any event, we must assume that some use these system calls

Maybe for you.  Not for the rest of us error prone mere mortals.

Forcing coders to specify the same detail in multiple places, when
there is no way to validate their consistency, doesn't force them
to think or do it right.  It increases the error rate due to
inconsistencies.  If you have to select a binary mode once, you've got
a 50-50 chance of getting it correct with just a random try, and a
half-way decent chance of noticing an error, since it would affect all
such calls.  If you have to say it ten times scattered ...
From: David Rientjes
Date: Thursday, November 1, 2007 - 1:03 am

I hope to be able to use libnuma exclusively once your fix is in place so 
that the interleaving behaves the way we want while attached to a changing 
cpuset.  (And it would be preferable to not have to use a special version 
of it that is hacked to support the fix.)  It's just simpler to use as an 
interface, but I wouldn't consider using the system calls directly an 
error.

		David
-

From: Christoph Lameter
Date: Thursday, November 1, 2007 - 6:07 am

A good argument to leave the API unchanged and not create magic task 

There are always wrappers for system calls. The flags will be set in 
these.
-

From: Paul Jackson
Date: Thursday, November 1, 2007 - 9:06 am

You might be recalling something called libcpumemset, which is about
five years old.  The library known as libcpuset is SGI's current LGPL

We were discussing libnuma here, not glibc.  The system call wrappers
are in glibc.  System call wrappers should not be setting optional
flags.  They should just make the system call -- do whatever magic it
takes to get the provided arguments into the right registers or other
conventionally determined places, and invoke the necessary machine
instruction to trap into the kernel.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Thursday, November 1, 2007 - 10:07 am

The library interface can set flags to modify behavior.
-

From: Paul Jackson
Date: Thursday, November 1, 2007 - 10:26 am

A library such as libnuma can set them, yes, but not everyone uses
libnuma.  Basically everyone uses the standard C library, glibc, which
has the system call wrappers, but these wrappers should not be setting
optional flags.

We're going around in circles here, Christoph.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Thursday, November 1, 2007 - 10:38 am

[Empty message]
From: Christoph Lameter
Date: Thursday, November 1, 2007 - 10:44 am

Yes and you keep missing the point focusing on stuff that is not relevant.
Where did you get the rule that libraries should not be setting flags? 
libraries do all sort of conversion before calling the kernel API.

Look the libraries are a strong argument against the method of setting 
task flags. If you are using multiple libraries some of which have been 
updated and some of those which have not then setting a task flag can 
have bad consequences. You want clean syscall behavior.
-

From: David Rientjes
Date: Thursday, November 1, 2007 - 12:06 pm

I think what would end up happening is that additional functions would be 
added to libnuma that would effect the system-wide nodemask numbering by 
simply OR'ing the MPOL_F_MODE_SYS_WIDE flag as part of the mode actual.

		David
-

From: David Rientjes
Date: Thursday, November 1, 2007 - 12:03 pm

So there is no problem with allowing modal flags to be passed to 
set_mempolicy() because glibc will respect the mode actual and pass it 
right along to the kernel.  That gives the power to the programmer to 
specify whether he or she, based on the updated documentation, wants the 
nodemask to be interpreted in terms of the system or cpuset.

		David
-

From: David Rientjes
Date: Thursday, November 1, 2007 - 12:00 pm

That code is going to have to be hacked anyway to use the new nodemask 
semantics, so it can easily add a flag to the set_mempolicy() call for the 
system-wide node numbering.  This then only requires a small addition to 
the documentation: use MPOL_F_MODE_SYS_WIDE for system-wide node numbering 

That's a very legitimate concern, but those libraries will eventually need 
to be made to support this new extention anyway.  They will be modified 
just like we're modifying the kernel once people want to start using the 
different nodemask semantics.  As mempolicy modes become more popular, 
those libraries are going to start accepting custom mode flags to pass to 
their set_mempolicy() wrappers that will get OR'd with the mempolicy mode 
that is used.  It will be the natural progression of how mempolicies are 
supported in userspace.

		David
-

From: David Rientjes
Date: Thursday, November 1, 2007 - 11:52 am

I think it may be more error prone to accidently leave off the 
get_mempolicy() system call to use the system-wide numbering or for 
user-level code to forget that the mode was already set (and now stored in 
their task_struct) without a subsequent get_mempolicy() to revert back to 
the default behavior.  Both of these problems can be addressed by checking 
the *policy returned by get_mempolicy(), as you've coded it, but many 
applications will probably ignore that overhead.

Allowing the mode to be passed on each set_mempolicy() system call seems 
better, this is where the nodemask is passed anyway so it's legitimate for 
the caller to specify how that nodemask should be interpreted (either 
system-wide or cpuset-wide).  This keeps all semantics of the nodemask to 
a single invocation of a system call instead of setting policy modes with 
get_mempolicy() and confusing the matter later.

I think get_mempolicy() can remain unchanged because it will simply return 
the contextualized nodemask in either case and would not require a mode to 
be passed.

		David
-

Previous thread: bttv build error (CONFIG_NET=n) by Randy Dunlap on Tuesday, October 30, 2007 - 10:15 pm. (6 messages)

Next thread: [PATCH 1/2] [POWERPC] mpc5200: Fix Kconfig dependancies on MPC5200 FEC device driver by Grant Likely on Tuesday, October 30, 2007 - 11:28 pm. (2 messages)