Do you mean this?
" - There's a /sys based reservation facility that allows the allocation
of a certain number of hw counters for guaranteed sysadmin access."
Sounds like I can't do that as an ordinary user, even on my own
processes...
I don't want the whole PMU all the time, I just want it while my
monitored process is running, and only on the CPU where it is
running.
No - at least on the machines I'm familiar with, I can count every
single cache miss and hit at every level of the memory hierarchy,
every single TLB miss, every load and store instruction, etc. etc.
I want to be able to work out things like cache hit rates, just as one
example. To do that I need two numbers that are directly comparable
because they relate to the same set of instructions. If I have a
count of L1 Dcache hits for one set of instructions and a count of L1
Dcache misses over some different stretch of instructions, the ratio
of them doesn't mean anything.
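To put made-up numbers on that (the counts and the two "phases" below
are purely hypothetical, not measurements from any machine), here is a
small sketch of what happens when the hit count and the miss count come
from different stretches of instructions:

```python
# Hypothetical (L1 Dcache hits, L1 Dcache misses) for two stretches of
# instructions in the same process. These numbers are invented.
intervals = [
    (9_000, 1_000),   # stretch A: true hit rate 90%
    (2_000, 8_000),   # stretch B: true hit rate 20%
]


def hit_rate(hits, misses):
    """Hit rate, meaningful only if both counts cover the same instructions."""
    return hits / (hits + misses)


# Synchronized: both counters ran over stretch A.
same_window = hit_rate(intervals[0][0], intervals[0][1])

# Unsynchronized: hits counted over stretch A, misses over stretch B.
mixed_window = hit_rate(intervals[0][0], intervals[1][1])

# True rate over both stretches, for comparison.
overall = hit_rate(sum(h for h, _ in intervals), sum(m for _, m in intervals))
```

The mixed ratio comes out at roughly 53%, which matches neither
stretch A, nor stretch B, nor the true overall rate of 55%.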
Your argument about "it's all statistical" is bogus because even if
the things we are measuring are statistical, that's still no excuse
for being sloppy about how we make our estimates. And not being able
to have synchronized counters is just sloppy. The users want it, the
hardware provides it, so that makes it a must-have as far as I am
concerned.
What can you back that assertion up with?
More unsupported assertions, that sound wrong to me...
Only for root, which isn't good enough.
What I was proposing was NOT a rigid notion - you don't have to own
the whole PMU if you are happy to use the events that the kernel knows
about. If you do want the whole PMU, you can have it while the
process you're monitoring is running, and the kernel will
context-switch it between you and other users, who can also have the
whole PMU when their processes are running.
Perhaps you have misunderstood my proposal. A counter-set doesn't
have to be the whole PMU, and you can have multiple counter-sets
active at the same time as long as they fit. You can even have
multiple "whole PMU" counter-sets and the kernel will multiplex them
onto the real PMU.
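As a rough sketch of what I mean (this is an illustration only, not
code from either proposal; the set names, event names and the simple
rotate-and-fit policy are all assumptions), counter-sets could be
placed on the PMU all-or-nothing, with the starting point rotated each
time slice so every set gets a turn:

```python
from collections import deque


def schedule_slice(queue, num_hw_counters):
    """Pick counter-sets for one time slice; a set goes on whole or not at all."""
    free = num_hw_counters
    active = []
    for name, events in queue:
        if len(events) <= free:
            active.append(name)
            free -= len(events)
    return active


def multiplex(counter_sets, num_hw_counters, num_slices):
    """Rotate the queue each slice so every counter-set gets scheduled."""
    queue = deque(counter_sets.items())
    timeline = []
    for _ in range(num_slices):
        timeline.append(schedule_slice(queue, num_hw_counters))
        queue.rotate(-1)
    return timeline


# Hypothetical counter-sets on a hypothetical PMU with 3 counters.
counter_sets = {
    "dcache": ["l1d_access", "l1d_miss"],
    "icache": ["l1i_access", "l1i_miss"],
    "cycles": ["cycles"],
}
timeline = multiplex(counter_sets, num_hw_counters=3, num_slices=3)
```

In this toy run each slice holds one cache set plus the single-event
set, and no set is ever split across slices.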
Well, I'll ignore the patronizing tone (but please try to avoid it in
future).
The PRIMARY reason for wanting counter-sets is because THAT IS WHAT
THE USERS WANT. A "usable" and "sane" interface design that doesn't
do what users want is useless.
Anyway, my proposal is just as "usable" as yours, since users still
have perf_counter_open, exactly as in your proposal. Users with
simpler requirements can do things exactly the same way as with your
proposal.
Handling the counter constraints is indeed a matter of implementation,
and as I noted previously, your current proposed implementation
doesn't handle them.
It's the choice of a single counter as being your "object" that I
object to. :)
Which is not necessarily a good thing. Fundamentally, if you are
trying to measure something, and you get a number, you need to know
what exactly got measured.
For example, suppose I am trying to count TLB misses during the
execution of a program. If my TLB miss counter keeps getting bumped
off because the kernel is scheduling my counter along with a dozen
other counters, then I *at least* want to know about it, and
preferably control it. Otherwise I'll be getting results that vary by
an order of magnitude with no way to tell why.
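A minimal sketch of the kind of information I'd want back with the
count (the function and field names are hypothetical, and the scaled
estimate assumes the event rate was uniform over the whole run, which
is exactly the assumption that breaks when a program has phases):

```python
def report(raw_count, slices_total, slices_counting):
    """Return the raw count together with how much of the run it covers."""
    if slices_counting == 0:
        return {"raw": raw_count, "fraction_counted": 0.0, "scaled_estimate": None}
    return {
        "raw": raw_count,
        "fraction_counted": slices_counting / slices_total,
        # Extrapolation only; valid if the event rate was roughly constant.
        "scaled_estimate": raw_count * slices_total / slices_counting,
    }


# Counter was on the hardware for the whole run.
always_on = report(14_400, slices_total=12, slices_counting=12)

# Counter shared the hardware with other counters and got 1 slice in 12.
bumped = report(1_200, slices_total=12, slices_counting=1)
```

Without the fraction, the two raw counts (14400 and 1200) look like a
twelvefold difference in TLB behaviour rather than a scheduling
artifact.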
For simple things, yes it is simpler. But it can't do the more
complex things in any sort of clean or sane way.
... because the design of your code is wrong at an abstract level ...
OK, here's an example. I have an application whose execution has
several different phases, and I want to measure the L1 Icache hit rate
and the L1 Dcache hit rate as a function of time and make a graph. So
I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache
accesses, and L1 Dcache misses. I want to sample at 1ms intervals.
The CPU I'm running on has two counters.
With your current proposal, I don't see any way to make sure that the
counter scheduler counts L1 Dcache accesses and L1 Dcache misses at
the same time, then schedules L1 Icache accesses and L1 Icache
misses. I could end up with L1 Dcache accesses and L1 Icache
accesses, then L1 Dcache misses and L1 Icache misses - and get a
nonsensical situation like the misses being greater than the accesses.
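Here is that failure mode with invented numbers (the event names, the
per-slice counts and the two schedules are all hypothetical; this is a
toy model of a two-counter PMU, not anybody's scheduler):

```python
# Hypothetical true event counts per 1 ms slice. The program changes
# phase between the slices, so activity jumps in the second one.
slices = [
    {"d_acc": 1_000, "d_miss": 100, "i_acc": 2_000, "i_miss": 50},
    {"d_acc": 50_000, "d_miss": 5_000, "i_acc": 80_000, "i_miss": 4_000},
]


def count(schedule):
    """Sum each event only over the slices where it was on a hardware counter."""
    totals = {}
    for true_counts, active in zip(slices, schedule):
        for event in active:
            totals[event] = totals.get(event, 0) + true_counts[event]
    return totals


# Scheduler pairs events arbitrarily: accesses first, then misses.
ungrouped = count([("d_acc", "i_acc"), ("d_miss", "i_miss")])

# Scheduler keeps each access/miss pair together as one set.
grouped = count([("d_acc", "d_miss"), ("i_acc", "i_miss")])
```

With the arbitrary pairing, the totals show 5000 Dcache misses against
1000 Dcache accesses. With the pairs kept together, each ratio is
taken over a single slice and comes out as a sensible 10% and 5% miss
rate.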
No. Where did you get contexts from? I didn't write anything about
contexts. Please read what I wrote.
Please drop the patronizing tone, again.
What user-space applications want to be able to do is this:
* Ensure that all the counters in a set are counting at the same time.
* Know when counters get scheduled on and off the process so that the
results can be interpreted properly. Either that or be able to
control the scheduling.
* Sophisticated applications want to be able to do things with the PMU
that the kernel doesn't necessarily understand.
You'd rather provide useless numbers to userspace? :)
Your arguments remind me of a filesystem that a colleague of mine once
designed that only had files, but no directories (you could have "/"
characters in the filenames, though). This whole discussion is a bit
like you arguing that directories are an unnecessary complication that
only messes up the interface and adds extra system calls.
But then I don't get context-switching between processes.
It still means we end up having to add something approaching 29,000
lines of code and 320kB to the kernel, just for the IBM 64-bit PowerPC
processors. (I don't guarantee that code is optimal, but that is some
indication of the complexity required.)
I am perfectly happy to add code for the kernel to know about the most
commonly-used, simple events on those processors. But I surely don't
want to have to teach the kernel about every last event and every last
capability of those machines' PMUs.
For example, there is a facility on POWER6 where certain instructions
can be selected (based on (instruction_word & mask) == value) and
marked, and then there are events that allow you to measure how long
marked instructions take in various stages of execution. How would I
make such a feature available for applications to use, within your
framework?
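For concreteness, the selection test itself is just the following
(a sketch of the predicate only; the mask and value here are
hypothetical examples that pick out the 6-bit primary opcode of a
32-bit PowerPC instruction word, and say nothing about how POWER6
actually lays out its control registers):

```python
def is_marked(instruction_word, mask, value):
    """True if the instruction matches: (instruction_word & mask) == value."""
    return (instruction_word & mask) == value


# Hypothetical selection: primary opcode 32 (lwz) in the top 6 bits.
MASK = 0xFC000000
VALUE = 32 << 26

lwz_r3_0_r4 = 0x80640000    # lwz r3,0(r4)
addi_r3_r3_1 = 0x38630001   # addi r3,r3,1

marked = [is_marked(w, MASK, VALUE) for w in (lwz_r3_0_r4, addi_r3_r3_1)]
```

Here the load matches and the add does not. The mask/value pair is
chosen by the application and can select on any bits of the
instruction word, which is why it doesn't reduce to a fixed list of
event names.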
Paul.