From: Paul Jackson <pj@sgi.com> RFC only so far - has been built and booted, but has received almost no testing. Add a second choice for how node numbers are interpreted and returned by the NUMA memory policy system calls mbind, set_mempolicy and get_mempolicy. The original choice remains the default, for compatibility. The second choice overcomes some limitations of the first choice in the interaction between cpusets and these memory policy calls, that show up when tasks using these calls are also being moved between different cpusets, especially between cpusets with varying numbers of allowed nodes in the cpuset 'mems' file. A new per-task mode, managed using added get_mempolicy calls, controls which mode applies to subsequently created memory policies. See the Documentation/vm/numa_memory_policy.txt section MEMORY POLICIES AND CPUSETS for an explanation of how both these choices for node numbering work and interact with cpusets. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee.Schermerhorn@hp.com Cc: Andi Kleen <ak@suse.de> --- Documentation/vm/numa_memory_policy.txt | 140 +++++++++++++++++++++++++------- include/linux/mempolicy.h | 15 +++ include/linux/sched.h | 1 mm/mempolicy.c | 123 +++++++++++++++++++++++----- 4 files changed, 229 insertions(+), 50 deletions(-) --- 2.6.23-mm1.orig/include/linux/mempolicy.h 2007-10-30 18:04:11.000000000 -0700 +++ 2.6.23-mm1/include/linux/mempolicy.h 2007-10-30 18:11:07.000000000 -0700 @@ -21,6 +21,10 @@ #define MPOL_F_ADDR (1<<1) /* look up vma using address */ #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */ +#define MPOL_F_MODE_DEFAULT (1<<3) /* set cpuset confined nodemask mode */ +#define MPOL_F_MODE_SYS_WIDE (1<<4) /* set system-wide nodemask mode */ +#define MPOL_F_MODE_GET (1<<5) /* return number mode: old => 1, new => 0 */ + /* ...
Make both policy and the mode char? Could we shorten the mpol_nodemask_mode Hmmm... I would rather have numactl manage this flag and specify it in calls to set memory policies. -
Well, the mpol_nodemask_mode already is char. So I guess you're
asking if we should change 'policy' to type char as well.
Changing 'policy' from a short to a char would reduce the sizeof
(struct mempolicy) on 16-bit systems, as both chars would fit in one
Huh - I don't know what "shorten ... to mode" means. If it means
Well, numactl is a command line utility. It doesn't manage this flag
as an -alternative- to some sort of changes to the mbind, set_mempolicy
and get_mempolicy calls, but rather layers on top of changes to those
system calls. So I'm not sure what you mean, but I guess it is not
This suggestion I do think I understand - good. However I disagree.
Here's what I think you meant, and why I disagree.
My current patch adds a new per-task modal state, that is manipulated
by additional get_mempolicy calls and options.
I think you are stating a preference for passing an additional flag on
each mbind, set_mempolicy and get_mempolicy call, to be used when you
want to use this new way (so called "Choice B") of numbering nodes.
The question is:
Should this mode be per-task, or per-system call?
The basic reason that I went with an additional per-task modal
state, rather than a modal flag for each mbind, set_mempolicy and
get_mempolicy call was to reduce the likely rate of bugs in user
level C code using this API.
Programs that code to this API in C usually first spend some number
of lines of code preparing bitmasks that represent sets of nodes,
which they will then pass into these mempolicy system calls.
The bits are numbered differently between Choices A and B.
If a piece of code adapts Choice B numbering, because it is
better suited to that codes needs, and if we used your suggested
per-system call flag, then if they miss passing in the new special
flag on -even-one- mbind, set_mempolicy or get_mempolicy system
call, they have an obscure code bug. They will be passing in
a bitmask that they ...No it means shorted the identifier which is a bit long for a structure It is a change of behavior of the function call. If some mysterious flag somewhere influences that behavior then we have difficulties finding bugs. The presence of the flag makes it obvious to the reviewer that we do something special here. -
Ok - but why? Well, I called it "mpol_nodemask_mode" in the mempolicy struct because I called it that same thing in the task_struct struct, where the longer name is quite useful (the task struct covers alot of ground; a brief 'mode' field name for this purpose would be too vague and short. I guess the 'mpol_' prefix part of this struct mempolicy field name is unnecessary. It's needed in the task struct, but not in the mempolicy struct. So I can imagine shorting this mempolicy struct name from "mpol_nodemask_mode" down to "nodemask_mode". But it really isn't a generic "mode" field, so I'd be reluctant to make it that short. Whether names should be long or short depends more to me on how they are used in the code. I want the code to be easy to read. Sometimes I use 30+ char names; sometimes 1 char names. There is a tendency for external function names to be longer than local variable names or So you would rather have libnuma manage this flag? As opposed to what else managing it? I'm still a tad confused on The ones I described further on in that message, involving computing bitmasks using one Choice then making some of the mbind/set_mempolicy I disagree. There are several mempolicy interfaces in use, including at least: 1) libnuma, which in turn is based on (5), below 2) the numactl command line utility (based on libnuma) 3) libcpuset (which has its own modest mempolicy interface) 4) the glibc mbind/set_mempolicy/get_mempolicy wrappers, in C code Hmmm ... yes ... that's a problem. Either way seems to be a problem. With the mode bit as in my patch, there are fewer places in the user code that have to be gotten just right. With your way, each and every mbind and *_mempolicy call has to be hacked with the new flag if one is going to use the new nodemask bit numbering. Some of these calls might be inside other routines or libraries that aren't readily available to be examined or changed. If you miss one, or forget to add one when adding more ...
These are no problem 1/2 are libnuma. No current version of libcpuset That would be okay from the backwards compat standpoint. -
> You are managing it in the task struct. No need to. libnuma can handle it.
Wrong. It has not received wide publication yet, but it has been
I've seen such go by.
David R - is your use of the mbind and *_mempolicy system calls
via libnuma, or direct system calls?
I've fielded questions from others on how to use these system
calls (directly, not via libnuma) for several years now, both
from within and without of SGI.
The glibc folks have spent some effort on refining the system
call wrappers they provide for these calls, which are not used by
libnuma (libnuma doesn't use the glibc wrappers for these calls.)
I presume that glibc work has at least some actual users.
Andi has encouraged everyone to use libnuma, but there is no
serious reason why others have to. It is not like it takes
thousands of lines of elaborate code to perform basic operations
using these calls.
A search of some old SGI release software sitting on an internal
server just now suggests that products with names including histx,
gru, libmpi and pcp might be directly invoking these system calls
... I didn't actually examine the source to determine whether
they really use these direct calls -- just got a grep hit.
I would not be surprised if some of the big batch schedulers
directly invoke these system calls, but don't know for sure.
In any event, we must assume that some use these system calls
Maybe for you. Not for the rest of us error prone mere mortals.
Forcing coders to specify the same detail in multiple places, when
there is no way to validate their consistency, doesn't force them
to think or do it right. It increases the error rate due to
inconsistencies. If you have to select a binary mode once, you've got
a 50-50 chance of getting it correct with just a random try, and a
half-way decent chance of noticing an error, since it would affect all
such calls. If you have to say it ten times scattered ...I hope to be able to use libnuma exclusively once your fix is in place so that the interleaving behaves the way we want while attached to a changing cpuset. (And it would be preferable to not have to use a special version of it that is hacked to support the fix.) It's just simpler to use as an interface, but I wouldn't consider using the system calls directly an error. David -
A good argument to leave the API unchanged and not create magic task There are always wrappers for system calls. The flags will be set in these. -
You might be recalling something called libcpumemset, which is about
five years old. The library known as libcpuset is SGI's current LGPL
We were discussing libnuma here, not glibc. The system call wrappers
are in glibc. System call wrappers should not be setting optional
flags. They should just make the system call -- do whatever magic it
takes to get the provided arguments into the right registers or other
conventionally determined places, and invoke the necessary machine
instruction to trap into the kernel.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
The library interface can set flags to modify behavior. -
A library such as libnuma can set them, yes, but not everyone uses
libnuma. Basically everyone uses the standard C library, glibc, which
has the system call wrappers, but these wrappers should not be setting
optional flags.
We're going around in circles here, Christoph.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yes and you keep missing the point focusing on stuff that is not relevant. Where did you get the rule that libraries should not be setting flags? libraries do all sort of conversion before calling the kernel API. Look the libraries are a strong argument against the method of setting task flags. If you are using multiple libraries some of which have been updated and some of those which have not then setting a task flag can have bad consequences. You want clean syscall behavior. -
I think what would end up happening is that additional functions would be added to libnuma that would effect the system-wide nodemask numbering by simply OR'ing the MPOL_F_MODE_SYS_WIDE flag as part of the mode actual. David -
So there is no problem with allowing modal flags to be passed to set_mempolicy() because glibc will respect the mode actual and pass it right along to the kernel. That gives the power to the programmer to specify whether he or she, based on the updated documentation, wants the nodemask to be interpreted in terms of the system or cpuset. David -
That code is going to have to be hacked anyway to use the new nodemask semantics, so it can easily add a flag to the set_mempolicy() call for the system-wide node numbering. This then only requires a small addition to the documentation: use MPOL_F_MODE_SYS_WIDE for system-wide node numbering That's a very legitimate concern, but those libraries will eventually need to be made to support this new extention anyway. They will be modified just like we're modifying the kernel once people want to start using the different nodemask semantics. As mempolicy modes become more popular, those libraries are going to start accepting custom mode flags to pass to their set_mempolicy() wrappers that will get OR'd with the mempolicy mode that is used. It will be the natural progression of how mempolicies are supported in userspace. David -
I think it may be more error prone to accidently leave off the get_mempolicy() system call to use the system-wide numbering or for user-level code to forget that the mode was already set (and now stored in their task_struct) without a subsequent get_mempolicy() to revert back to the default behavior. Both of these problems can be addressed by checking the *policy returned by get_mempolicy(), as you've coded it, but many applications will probably ignore that overhead. Allowing the mode to be passed on each set_mempolicy() system call seems better, this is where the nodemask is passed anyway so it's legitimate for the caller to specify how that nodemask should be interpreted (either system-wide or cpuset-wide). This keeps all semantics of the nodemask to a single invocation of a system call instead of setting policy modes with get_mempolicy() and confusing the matter later. I think get_mempolicy() can remain unchanged because it will simply return the contextualized nodemask in either case and would not require a mode to be passed. David -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano |
