Debugging Multiple CPUs

Submitted by Jeremy on October 22, 2007 - 2:19pm.

"Sysrq-p is pretty useless unless you can force the keyboard interrupt and the spinning process onto the same CPU," noted Chuck Ebbert during a discussion about debugging tasks stuck in the running state. The <Alt><SysRq><p> key combination is a debugging aid that dumps registers and flags to the console, but only from the CPU that handles the keypress interrupt. UltraSPARC maintainer David Miller replied, "yes, I find this a painful limitation too," adding:

"Sparc64 used to dump the registers on all active cpus for show_regs() via a cross-call, and this was incredibly useful. But I disabled that as soon as I started playing with Niagara because at 32 cpus and larger the output is just too voluminous to be useful."

David then suggested, "what might be appropriate is just to get a one-line program counter dump on every cpu via some new sysrq keystroke." Chuck recalled that the -mm kernel had carried similar functionality, though it had problems: "IIRC -mm had something like this but it was buggy because we were sending IPIs to each processor asking them to print their state. Maybe it would work if we had a way of making them dump their state to a memory location and then collected and printed it from the CPU that's handling the sysrq."
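
Chuck's proposal separates recording state from printing it: each CPU writes its state into a slot in memory, and only the CPU handling the sysrq does any printing. The following is a minimal userspace sketch of that idea, not kernel code and not the -mm patch. Threads stand in for CPUs, a `threading.Event` stands in for the IPI, and every name in it (`SnapshotBoard`, `cpu_loop`, `collect_and_format`, `demo`) is hypothetical, as are the program counter values.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CpuSnapshot:
    cpu: int
    pc: int    # stand-in for the program counter (made-up value)
    task: str  # name of the task the simulated CPU was running


class SnapshotBoard:
    """Per-CPU slots in shared memory; each slot is written only by its owner."""

    def __init__(self, ncpus: int) -> None:
        self.slots: List[Optional[CpuSnapshot]] = [None] * ncpus
        self.request = threading.Event()       # stand-in for the IPI
        self.filled = threading.Semaphore(0)   # counts slots written so far


def cpu_loop(cpu: int, pc: int, task: str, board: SnapshotBoard) -> None:
    """Simulated CPU: on request, record state in its own slot. It never prints."""
    board.request.wait()
    board.slots[cpu] = CpuSnapshot(cpu, pc, task)
    board.filled.release()


def collect_and_format(board: SnapshotBoard, timeout: float = 1.0) -> List[str]:
    """Simulated sysrq handler: ask every CPU for state, then format one line each."""
    board.request.set()
    for _ in board.slots:
        # acquire() returns False on timeout; a silent CPU is reported below.
        board.filled.acquire(timeout=timeout)
    lines = []
    for cpu, snap in enumerate(board.slots):
        if snap is None:
            lines.append(f"CPU{cpu}: <no response>")
        else:
            lines.append(f"CPU{cpu}: pc={snap.pc:#018x} task={snap.task}")
    return lines


def demo(ncpus: int = 4) -> List[str]:
    board = SnapshotBoard(ncpus)
    workers = [
        threading.Thread(target=cpu_loop,
                         args=(cpu, 0x1000 + 0x10 * cpu, f"task{cpu}", board))
        for cpu in range(ncpus)
    ]
    for w in workers:
        w.start()
    lines = collect_and_format(board)
    for w in workers:
        w.join()
    return lines


if __name__ == "__main__":
    print("\n".join(demo()))
```

Because a single collector does all the formatting, the output is one line per CPU in a fixed order, which also fits David's request for a one-line program counter dump. A CPU that never answers shows up as a gap in the report rather than blocking it.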


From: Jeff Garzik <jeff@...>
Subject: [2.6.23] tasks stuck in running state?
Date: Oct 19, 5:39 pm 2007

On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a 
certain behavior at least once a day.  I'll start a kernel build (make 
-sj5 on this box), and it will "hang" in the following way:

> 31003 ?        S      0:04 sshd: jgarzik@pts/0
> 31004 pts/0    Ss     0:02  \_ -bash
>  8280 pts/0    S+     0:00      \_ make ARCH=i386 -sj4
>  8690 pts/0    Z+     0:00          \_ [rm] <defunct>
>  8691 pts/0    S+     0:00          \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
>  8692 pts/0    R+     6:12              \_ cat include/config/kernel.release

Specifically, the symptom is a process, often a simple one like cat(1) 
or rm(1) or somewhere in check-headers, will stay in the running state, 
accumulating CPU time.

If I Ctrl-C the build, and start over, the build will normally -not- get 
stuck at the same point, but proceed to chew through one of a bazillion 
allmodconfig builds.

I also see this occasionally on my main workstation (also 
2.6.23/x86-64/Fedora-7), though not as frequently.

This is a new behavior since the new scheduler was merged... I think.

Nothing more concrete to report at this time.  I cannot easily reproduce 
the behavior, as it happens [apparently] randomly sometime during the 
day.  Generally, the files these programs are dealing with are -always- 
in the pagecache, if that makes any difference.

	Jeff


-

From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 5:53 pm 2007

On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day. I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>

Can you try to strace the hanging task?

-

From: Jeff Garzik <jeff@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 6:03 pm 2007

Chuck Ebbert wrote:
> On 10/19/2007 05:39 PM, Jeff Garzik wrote:
>> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
>> certain behavior at least once a day. I'll start a kernel build (make
>> -sj5 on this box), and it will "hang" in the following way:
>>
>
> Can you try to strace the hanging task?

Well, to the system it's running, so that doesn't do much of anything...

>
>  8482 pts/0    S+     0:00  \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
>  8484 pts/0    R+     3:10      \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
>  8486 pts/0    S+     0:00      \_ cut -f2 -d<
>  8487 pts/0    S+     0:00      \_ cut -f1 -d>
>  8488 pts/0    S+     0:00      \_ egrep ^linux|^asm

> [jgarzik@pretzel misc-2.6]$ strace -p8484
> Process 8484 attached - interrupt to quit
[sits there, chewing up CPU grepping a 47-line header file]

-

From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 6:18 pm 2007

On 10/19/2007 06:03 PM, Jeff Garzik wrote:
>> [jgarzik@pretzel misc-2.6]$ strace -p8484
>> Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
>

And sysrq-p is pretty useless unless you can force the keyboard
interrupt and the spinning process onto the same CPU.

-

From: David Miller <davem@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 8:01 pm 2007

From: Chuck Ebbert <cebbert@redhat.com>
Date: Fri, 19 Oct 2007 18:18:08 -0400

> On 10/19/2007 06:03 PM, Jeff Garzik wrote:
> >> [jgarzik@pretzel misc-2.6]$ strace -p8484
> >> Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
> >
>
> And sysrq-p is pretty useless unless you can force the keyboard
> interrupt and the spinning process onto the same CPU.

Yes, I find this a painful limitation too.

Sparc64 used to dump the registers on all active cpus for show_regs()
via a cross-call, and this was incredibly useful. But I disabled that
as soon as I started playing with Niagara because at 32 cpus and
larger the output is just too voluminous to be useful.

What might be appropriate is just to get a one-line program counter
dump on every cpu via some new sysrq keystroke.

-

From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 21, 11:59 am 2007

On 10/19/2007 08:01 PM, David Miller wrote:
>
> What might be appropriate is just to get a one-line program counter
> dump on every cpu via some new sysrq keystroke.
>

IIRC -mm had something like this but it was buggy because we were
sending IPIs to each processor asking them to print their state.
Maybe it would work if we had a way of making them dump their state
to a memory location and then collected and printed it from the CPU
that's handling the sysrq.

-

