Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

Previous thread: 32GB SSD on USB1.1 P3/700 == ___HELL___ (2.6.34-rc3) by Andreas Mohr on Sunday, April 4, 2010 - 3:13 pm. (24 messages)

Next thread: none
From: Hans-Peter Jansen
Date: Sunday, April 4, 2010 - 3:49 pm

[Sorry for the cross post, but I don't know where to start to tackle this 
 issue]

Hi,

While trying to move to a current kernel, I ran into an issue where a 
simple du on a reasonably big xfs tree invokes the oom killer: 

Apr  4 23:24:53 tyrex kernel: [  418.913223] XFS mounting filesystem sdd1
Apr  4 23:24:54 tyrex kernel: [  419.774606] Ending clean XFS mount for filesystem: sdd1
Apr  4 23:26:02 tyrex kernel: [  488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
Apr  4 23:26:02 tyrex kernel: [  488.160798] du cpuset=/ mems_allowed=0
Apr  4 23:26:02 tyrex kernel: [  488.160800] Pid: 6397, comm: du Tainted: G        W  2.6.34-rc3-13-vanilla #1
Apr  4 23:26:02 tyrex kernel: [  488.160802] Call Trace:
Apr  4 23:26:02 tyrex kernel: [  488.160808]  [<c02becc7>] dump_header+0x67/0x1a0
Apr  4 23:26:02 tyrex kernel: [  488.160811]  [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
Apr  4 23:26:02 tyrex kernel: [  488.160813]  [<c02bee59>] oom_kill_process+0x59/0x160
Apr  4 23:26:02 tyrex kernel: [  488.160815]  [<c02bf43e>] __out_of_memory+0x4e/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160817]  [<c02bf502>] out_of_memory+0x52/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160819]  [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
Apr  4 23:26:02 tyrex kernel: [  488.160822]  [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
Apr  4 23:26:02 tyrex kernel: [  488.160825]  [<c02ea4a9>] cache_grow+0x249/0x2e0
Apr  4 23:26:02 tyrex kernel: [  488.160838]  [<c02ea748>] cache_alloc_refill+0x208/0x240
Apr  4 23:26:02 tyrex kernel: [  488.160840]  [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160868]  [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160888]  [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160905]  [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160923]  [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
Apr  4 23:26:02 tyrex kernel: ...
From: Dave Chinner
Date: Sunday, April 4, 2010 - 5:49 pm

Oh, this is a highmem box. You ran out of low memory, I think, which
is where all the inodes are cached. Seems like a VM problem or a
highmem/lowmem split config problem to me, not anything to do with
XFS...
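
The low-memory reading can be cross-checked against the gfp_mask in the report (0x802d0). Below is a minimal Python sketch that decodes it. The bit values are an assumption, recalled from include/linux/gfp.h of 2.6.34-era kernels, and only a subset of the flags is listed; verify them against the tree you are actually running. The helper names decode_gfp and may_use_highmem are made up for illustration.

```python
# ASSUMPTION: bit values as recalled from include/linux/gfp.h (2.6.34 era).
# Only a subset of the flags is listed; check your own kernel tree.
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x80000: "__GFP_RECLAIMABLE",
}


def decode_gfp(mask):
    """Return (names of known flags set in mask, bits not covered by the table)."""
    names = []
    known = 0
    for bit, name in sorted(GFP_FLAGS.items()):
        if mask & bit:
            names.append(name)
            known |= bit
    return names, mask & ~known


def may_use_highmem(mask):
    """True if the allocation is allowed to be satisfied from highmem."""
    return bool(mask & 0x02)


if __name__ == "__main__":
    for mask in (0x802d0, 0xd0):
        print(hex(mask), decode_gfp(mask), "highmem ok:", may_use_highmem(mask))
```

Under these assumptions 0x802d0 is GFP_KERNEL (0xd0) plus __GFP_NOWARN and __GFP_RECLAIMABLE, with __GFP_HIGHMEM clear, so the slab page for the inode cache has to come from low memory.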

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Hans-Peter Jansen
Date: Monday, April 5, 2010 - 4:35 am

Might be, I don't have a chance to test this on a different FS. Thanks
for the answer anyway, Dave. I hope you don't mind that I keep you 
copied on this thread.. 

The thing is, I cannot locate the problem from the syslog output. Might
be a "can't see the forest for the trees" syndrome.

Today I repeated that thing with 2.6.34-rc3 as a pae build with openSUSE
patches applied and vm.swappiness, vm.dirty_ratio and vm.dirty_background_ratio 
reset to kernel defaults.

It behaves exactly the same, thus it looks like a generic problem. I ran du -sh 
on the huge tree again; this time gkrellmd triggered the oom killer, while the
du process kept going.

Apr  5 13:09:20 tyrex kernel: [ 1747.524375] XFS mounting filesystem sdd1
Apr  5 13:09:21 tyrex kernel: [ 1747.942048] Ending clean XFS mount for filesystem: sdd1
Apr  5 13:10:27 tyrex kernel: [ 1814.288944] oom_kill_process: 3 callbacks suppressed
Apr  5 13:10:27 tyrex kernel: [ 1814.288946] gkrellmd invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
Apr  5 13:10:27 tyrex kernel: [ 1814.288948] gkrellmd cpuset=/ mems_allowed=0
Apr  5 13:10:27 tyrex kernel: [ 1814.288950] Pid: 4019, comm: gkrellmd Not tainted 2.6.34-rc3-13-pae #1
Apr  5 13:10:27 tyrex kernel: [ 1814.288951] Call Trace:
Apr  5 13:10:27 tyrex kernel: [ 1814.288959]  [<c0206181>] try_stack_unwind+0x1b1/0x200
Apr  5 13:10:27 tyrex kernel: [ 1814.288962]  [<c020507f>] dump_trace+0x3f/0xe0
Apr  5 13:10:27 tyrex kernel: [ 1814.288965]  [<c0205cfb>] show_trace_log_lvl+0x4b/0x60
Apr  5 13:10:27 tyrex kernel: [ 1814.288967]  [<c0205d28>] show_trace+0x18/0x20
Apr  5 13:10:27 tyrex kernel: [ 1814.288970]  [<c05ec570>] dump_stack+0x6d/0x7d
Apr  5 13:10:27 tyrex kernel: [ 1814.288974]  [<c02c758a>] dump_header+0x6a/0x1b0
Apr  5 13:10:27 tyrex kernel: [ 1814.288976]  [<c02c772a>] oom_kill_process+0x5a/0x160
Apr  5 13:10:27 tyrex kernel: [ 1814.288979]  [<c02c7bc6>] __out_of_memory+0x56/0xc0
Apr  5 13:10:27 tyrex kernel: [ 1814.288981]  [<c02c7ca7>] out_of_memory+0x77/0x1b0
Apr  ...
From: Dave Chinner
Date: Monday, April 5, 2010 - 4:06 pm

Well, I have to ask why you are running a 32bit PAE kernel when your
CPU is:

<6>[    0.085062] CPU0: Intel(R) Xeon(R) CPU           X3460  @ 2.80GHz stepping 05


Agreed. And FWIW, don't let your filesystems get near ENOSPC on
2.6.34-rc, either....

(i.e. under sustained write load, 2.6.34-rc will hit the OOM killer
on page cache allocation before the filesystem can report ENOSPC to
the user application.  Test 224 in the xfsqa suite on a VM w/ 1GB
RAM will trigger this with > 90% reliability....)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Hans-Peter Jansen
Date: Tuesday, April 6, 2010 - 7:52 am

Hi Dave,



Sure, but for compatibility reasons with a customer setup that I'm fully 
responsible for and that we strongly depend on, it is still i586. (And it's a 
system that I have full access to only for a few hours on Sundays, which 
punishes my family..).

Dave, I really don't want to disappoint you, but a lengthy bisection session 
points to:

57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
commit 57817c68229984818fea9e614d6f95249c3fb098
Author: Dave Chinner <david@fromorbit.com>
Date:   Sun Jan 10 23:51:47 2010 +0000

    xfs: reclaim all inodes by background tree walks
    
    We cannot do direct inode reclaim without taking the flush lock to
    ensure that we do not reclaim an inode under IO. We check the inode
    is clean before doing direct reclaim, but this is not good enough
    because the inode flush code marks the inode clean once it has
    copied the in-core dirty state to the backing buffer.
    
    It is the flush lock that determines whether the inode is still
    under IO, even though it is marked clean, and the inode is still
    required at IO completion so we can't reclaim it even though it is
    clean in core. Hence the requirement that we need to take the flush
    lock even on clean inodes because this guarantees that the inode
    writeback IO has completed and it is safe to reclaim the inode.
    
    With delayed write inode flushing, we could end up waiting a long
    time on the flush lock even for a clean inode. The background
    reclaim already handles this efficiently, so avoid all the problems
    by killing the direct reclaim path altogether.
    
    Signed-off-by: Dave Chinner <david@fromorbit.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
8e6b6febccba69bc4cdbfd1886d545c369d64c41 M      fs

I will try to prove this by reverting this commit on a 2.6.33.2 build, but

Hmm, thanks for the warning. Will resort to 2.6.33.2 for now on my servers
and keep an eye on the xfs commit logs...

Cheers && greetings ...

Interesting. I did a fair bit of low memory testing when I made that
change (admittedly none on a highmem i386 box), and since then I've
done lots of "millions of files" tree creates, traversals and destroys on
limited memory machines without triggering problems when memory is
completely full of inodes.


I don't think that will work as expected in all situations - the
inode clean check there is not completely valid as the XFS inode
locks aren't held, so it can race with other operations that need
to complete before reclaim is done. This was one of the reasons for
pushing reclaim into the background....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--


OK, if there is page cache pressure (e.g. creating small files or
grepping the resultant tree) or the machine has significant amounts
of memory (e.g. >= 4GB) then I can't reproduce this.

However, if the memory pressure is purely inode cache (creating zero
length files or read-only traversal), then the OOM killer kicks in a
while after the slab cache fills memory.  This doesn't need highmem;
I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
this.  I'll add zero length file tests and traversals to my low
memory testing.
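
A workload of that shape (pure inode cache pressure, no file data) can be sketched in a few lines of Python. This is not the test that was actually run; the directory layout, the counts and the function names are hypothetical. A real reproduction would need millions of files on an XFS mount and a machine with limited RAM, and may well take that machine down.

```python
import os
import tempfile


def create_empty_files(root, n_dirs, files_per_dir):
    """Create n_dirs * files_per_dir zero-length files under root."""
    created = 0
    for d in range(n_dirs):
        subdir = os.path.join(root, "d%05d" % d)
        os.makedirs(subdir, exist_ok=True)
        for f in range(files_per_dir):
            # Open and close without writing: one inode, no data pages.
            with open(os.path.join(subdir, "f%05d" % f), "w"):
                pass
            created += 1
    return created


def traverse(root):
    """Read-only walk that stat()s every file, roughly what du does."""
    seen = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            total_bytes += st.st_size
            seen += 1
    return seen, total_bytes


if __name__ == "__main__":
    # Tiny, harmless run in a temporary directory; scale up at your own risk.
    with tempfile.TemporaryDirectory() as root:
        print(create_empty_files(root, 10, 100), traverse(root))
```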

The best way to fix this, I think, is to trigger a shrinker callback
when memory is low to run the background inode reclaim. The problem
is that these inode caches and the reclaim state are per-filesystem,
not global state, and the current shrinker interface only works with
global state.

Hence there are two patches to this fix - the first adds a context
to the shrinker callout, and the second adds the XFS infrastructure
to track the number of reclaimable inodes per filesystem and
register/unregister shrinkers for each filesystem.
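
The idea of a shrinker callout that carries a context can be illustrated with a small user-space model in Python. This is a toy sketch, not the kernel interface and not the posted patches: ShrinkerRegistry, Filesystem and reclaim_inodes are hypothetical names, and the callback here simply returns the number of objects it freed, which is a simplification of what the real shrinker callback reports.

```python
def reclaim_inodes(fs, nr_to_scan):
    """Shrinker callback: the context argument says which mount to reclaim from."""
    freed = min(fs.reclaimable, nr_to_scan)
    fs.reclaimable -= freed
    return freed


class ShrinkerRegistry:
    """Toy stand-in for the VM's list of registered shrinkers."""

    def __init__(self):
        self._shrinkers = []

    def register(self, callback, context):
        entry = (callback, context)
        self._shrinkers.append(entry)
        return entry

    def unregister(self, entry):
        self._shrinkers.remove(entry)

    def shrink_all(self, nr_to_scan):
        """Under memory pressure, ask every registered shrinker to free objects."""
        return sum(cb(ctx, nr_to_scan) for cb, ctx in self._shrinkers)


class Filesystem:
    """Hypothetical per-mount state: just a count of reclaimable inodes."""

    def __init__(self, name, registry):
        self.name = name
        self.reclaimable = 0
        self._registry = registry
        # One shrinker per mounted filesystem, with the mount as its context.
        self._handle = registry.register(reclaim_inodes, self)

    def unmount(self):
        self._registry.unregister(self._handle)
```

Without the context argument the callback would have no way to tell which filesystem it is being asked to shrink, which is the limitation of a global-state-only interface described above.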

With these patches, my reproducible test case which locked the
machine up with an OOM panic in a couple of minutes has been running
for over half an hour. I have much more confidence in this change
with limited testing than the reverting of the background inode
reclaim as the revert introduces 

The patches below apply to the xfs-dev tree, which is currently at
34-rc1. If they don't apply, let me know and I'll redo them against
a vanilla kernel tree. Can you test them to see if the problem goes
away? If the problem is fixed, I'll push them for a proper review
cycle...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

I'm glad, that you're able to reproduce it. My initial failure was during 

I see, the first one will be interesting to get into mainline, given the 

Of course, you did the original patch for a reason... Therefore I would love 
to test your patches. I've tried to apply them to 2.6.33.2, but after 
fixing the same reject as noted below, I'm stuck here:

/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c: 
In function 'xfs_reclaim_inode_shrink':
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
error: implicit declaration of function 'xfs_perag_get'
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
warning: assignment makes pointer from integer without a cast
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807: 
error: implicit declaration of function 'xfs_perag_put'

Now I see that the offending functions were renamed, but they have also 
grown a radix_tree structure and locking. How do I handle that?

BTW, your patches do not apply to Linus' current git tree either:
patching file fs/xfs/quota/xfs_qm.c
Hunk #1 succeeded at 72 (offset 3 lines).
Hunk #2 FAILED at 2120.
1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
I'm able to resolve this, but 2.6.34-current gives me some other 
trouble that I need to get past (PS/2 keyboard stops working eventually)..

Anyway, thanks for your great support, Dave. This is much appreciated.

Cheers,
Pete
--


With difficulty. I'd need to backport it to match the .33 code,

Yeah, there's another patch in my xfs-dev tree that changes that.
I'll rebase it on a clean linux tree before I post it again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--


Dave, may I ask you kindly for briefly elaborating on the worst consequences 

I've briefly tested this with a codebase somewhere between -rc3 and -rc4, 
and it survived the du test, but it suffered from some strange network 
dropouts that aren't funny on an NFS server...

Will retest your patches after opensuse-current has caught up with -rc4. 
Hopefully, the most blatant stability issues are fixed by then.

Cheers,
Pete
--


Well, given that it is the new shrinker code generating the warnings,
reverting/removing that hunk will render the patch useless :0

I'll get you a working 2.6.33 patch tomorrow - it's dinner time
now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--


Excuse me, I didn't express myself well. I'm after the consequences of 

Cool, thanks. 

Have a nice dinner,
Pete



--


Obviously, and not totally unexpectedly, really fixing this is going to take 
more time.

FYI, 2.6.33.2 is still affected by this issue. 

Greg, you might search for a server using xfs filesystems and an i586 
kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well), log in as an 
ordinary user, do a "du" on /usr, and wait for the other users screaming...

BTW, all affected kernels available from 
http://download.opensuse.org/repositories/home:/frispete: have the 
offending patch reverted (see subject) and run fine for me (in this 
respect).

Will you guys pass by another round of stable fixes without doing anything 
on this issue?

Dave, this is why I'm kindly asking you: what might be the worst 
consequences if we just do the revert for now (at least for 2.6.33), until 
you and Nick have come to a final decision on how to solve this issue in the 
future?

Just a brief note would be fine.

Cheers,
Pete
--


I did precisely that, and didn't notice anything special (du on kernel
source tree): kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
array, 8GB RAM.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
--


I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g. 
the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff 
from later kernels. The offending patch was included in linux-2.6.33 
between -rc4 and -rc5:

Committer:   Alex Elder <aelder@sgi.com>
Author:      Dave Chinner <david@fromorbit.com>
Author date: 11.01.10 00:51
Parent:      xfs: Avoid inodes in reclaim when flushing from inode cache
Child:       xfs: Remove inode iolock held check during allocation
Branch:      master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
Branch:      2.6.33.1 (Linux 2.6.33)
Follows:     v2.6.33-rc4 (Linux 2.6.33-rc4)
Precedes:    v2.6.33-rc5 (Linux 2.6.33-rc5)

Cheers,
Pete
--


is this bisectable?
from what I remember with 2.6.33
looking at the bugreports I don't recall
any issue in regards with firmware related stuff
for radeon(but could be wrong).

Keep in mind, I don't have your card, but I do have the X1600
which had no issues so far(running the latest HEAD).
does changing the .config work for you?
(in regards to what the thread I posted
had mentioned)

as for the open suse SP1 of SLE11..glad there is
an option to load the latest kernel.

Justin P. Mattock
--


sh*t.. my bad just replied to the wrong
thread..

I'm tired.. please ignore this.

Justin P. Mattock
--


No, I always use pristine unpatched kernel.org releases, no SELinux, no
nothing. Just another confirmation I should go on this way :)

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
--


Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as are in
2.6.32.12-rc2.)

thanks,

greg k-h
--


Yes.  It's not even in mainline yet as Nick doesn't like the trivial
core VM fix required to solve this in a clean way.
--


Hm, Nick, why?  This seems like a real problem, easily reproduced.  Is
it solved some other way in Linus's tree that we could backport to the
-stable series?

thanks,

greg k-h
--


The fix is not in Linus' kernel yet, Greg. So once that is done,
I'll have to backport the fix back to those stable kernels as well.
It's not a trivial fix, so it will miss this round of stable
releases....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--


Ok, that's fine, just checking.  I'm in no rush, I have plenty of other
patches to queue up for the next stable releases :)

thanks,

greg k-h
--


The problem is that the fix I did has been rejected by the upstream
VM guys, and the stable rules are that fixes have to be in mainline
before they can be put in a stable release.  So, until we get a fix

Yet there's only been one report of the problem. While that doesn't
make it any less serious, I don't think the problem you're reporting
is as widespread as you are making it out to be. We'll get the fix
done and upstream, and then it will go back to the stable kernel.

You could always apply the *tested* patches I posted that fix


If the process of getting the fix upstream takes longer than another
stable release cycle, then yes. I'm sorry, but I can't control the
process, and if someone takes a week to NACK a fix, then you're just
going to have to wait longer. Feel free to run the fix in the
meantime - testing it, even if it was NACKed, will still help us
because if it fixes your problem we know that we are fixing the
_right problem_.

If you can't live with this, then you shouldn't be running the

I've already told you - you could be reintroducing all the really
hard to reproduce inode reclaim problems (oops, hangs, panics,
potentially even fs corruption) that the patch in question was part
of the fix for.  You're running code that changes reclaim in very
subtle ways and has not been tested upstream in any way - if it
breaks you get to keep all the broken pieces to yourself...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
