[Sorry for the cross post, but I don't know where to start to tackle this issue]
Hi, in an attempt to get to a current kernel, I suffer from an issue where a simple du on a reasonably big xfs tree leads to invoking the oom killer:
Apr 4 23:24:53 tyrex kernel: [ 418.913223] XFS mounting filesystem sdd1
Apr 4 23:24:54 tyrex kernel: [ 419.774606] Ending clean XFS mount for filesystem: sdd1
Apr 4 23:26:02 tyrex kernel: [ 488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
Apr 4 23:26:02 tyrex kernel: [ 488.160798] du cpuset=/ mems_allowed=0
Apr 4 23:26:02 tyrex kernel: [ 488.160800] Pid: 6397, comm: du Tainted: G W 2.6.34-rc3-13-vanilla #1
Apr 4 23:26:02 tyrex kernel: [ 488.160802] Call Trace:
Apr 4 23:26:02 tyrex kernel: [ 488.160808] [<c02becc7>] dump_header+0x67/0x1a0
Apr 4 23:26:02 tyrex kernel: [ 488.160811] [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
Apr 4 23:26:02 tyrex kernel: [ 488.160813] [<c02bee59>] oom_kill_process+0x59/0x160
Apr 4 23:26:02 tyrex kernel: [ 488.160815] [<c02bf43e>] __out_of_memory+0x4e/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160817] [<c02bf502>] out_of_memory+0x52/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160819] [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
Apr 4 23:26:02 tyrex kernel: [ 488.160822] [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
Apr 4 23:26:02 tyrex kernel: [ 488.160825] [<c02ea4a9>] cache_grow+0x249/0x2e0
Apr 4 23:26:02 tyrex kernel: [ 488.160838] [<c02ea748>] cache_alloc_refill+0x208/0x240
Apr 4 23:26:02 tyrex kernel: [ 488.160840] [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160868] [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160888] [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160905] [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160923] [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
Apr 4 23:26:02 tyrex kernel: ...
Oh, this is a highmem box. You ran out of low memory, I think, which is where all the inodes are cached. Seems like a VM problem or a highmem/lowmem split config problem to me, not anything to do with XFS... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
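One way to check the low memory theory on a 32-bit highmem kernel is to watch the LowTotal/LowFree fields of /proc/meminfo while the du runs. The following is only a minimal sketch: the function names are made up for illustration, the sample numbers are invented, and the Low*/High* fields are only present when the kernel is built with highmem support, so on a 64-bit kernel the report comes back empty.

```python
import re

# Fields that a highmem kernel adds to /proc/meminfo (absent otherwise).
WANTED = ("LowTotal", "LowFree", "HighTotal", "HighFree")


def lowmem_report(meminfo_text):
    """Extract the low/high memory counters (in kB) from /proc/meminfo text."""
    report = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if m and m.group(1) in WANTED:
            report[m.group(1)] = int(m.group(2))
    return report


def lowmem_free_fraction(report):
    """Fraction of low memory still free, or None without a highmem split."""
    if not report.get("LowTotal"):
        return None
    return report["LowFree"] / report["LowTotal"]


# Invented sample, only to show the shape of the input.
SAMPLE = """MemTotal:  8000000 kB
HighTotal: 7200000 kB
HighFree:  6000000 kB
LowTotal:   800000 kB
LowFree:      8000 kB
"""

sample_report = lowmem_report(SAMPLE)
print(sample_report, lowmem_free_fraction(sample_report))
```

On a live system the text would come from reading /proc/meminfo periodically. A LowFree value collapsing while HighFree stays large would be consistent with the diagnosis above, though it does not prove it.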
Might be, I don't have a chance to test this on a different FS. Thanks for the answer anyway, Dave. I hope you don't mind that I keep you copied on this thread..
The matter is, I cannot locate the problem from the syslog output. Might be a "can't see the forest for the trees" syndrome.
Today I repeated that thing with 2.6.34-rc3 as a pae build with openSUSE patches applied and vm.swappiness, vm.dirty_ratio and vm.dirty_background_ratio reset to kernel defaults. It behaves exactly the same, thus it looks like a generic problem. du -sh on the huge tree, this time gkrellmd triggered the oom killer, while the du process kept going.
Apr 5 13:09:20 tyrex kernel: [ 1747.524375] XFS mounting filesystem sdd1
Apr 5 13:09:21 tyrex kernel: [ 1747.942048] Ending clean XFS mount for filesystem: sdd1
Apr 5 13:10:27 tyrex kernel: [ 1814.288944] oom_kill_process: 3 callbacks suppressed
Apr 5 13:10:27 tyrex kernel: [ 1814.288946] gkrellmd invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
Apr 5 13:10:27 tyrex kernel: [ 1814.288948] gkrellmd cpuset=/ mems_allowed=0
Apr 5 13:10:27 tyrex kernel: [ 1814.288950] Pid: 4019, comm: gkrellmd Not tainted 2.6.34-rc3-13-pae #1
Apr 5 13:10:27 tyrex kernel: [ 1814.288951] Call Trace:
Apr 5 13:10:27 tyrex kernel: [ 1814.288959] [<c0206181>] try_stack_unwind+0x1b1/0x200
Apr 5 13:10:27 tyrex kernel: [ 1814.288962] [<c020507f>] dump_trace+0x3f/0xe0
Apr 5 13:10:27 tyrex kernel: [ 1814.288965] [<c0205cfb>] show_trace_log_lvl+0x4b/0x60
Apr 5 13:10:27 tyrex kernel: [ 1814.288967] [<c0205d28>] show_trace+0x18/0x20
Apr 5 13:10:27 tyrex kernel: [ 1814.288970] [<c05ec570>] dump_stack+0x6d/0x7d
Apr 5 13:10:27 tyrex kernel: [ 1814.288974] [<c02c758a>] dump_header+0x6a/0x1b0
Apr 5 13:10:27 tyrex kernel: [ 1814.288976] [<c02c772a>] oom_kill_process+0x5a/0x160
Apr 5 13:10:27 tyrex kernel: [ 1814.288979] [<c02c7bc6>] __out_of_memory+0x56/0xc0
Apr 5 13:10:27 tyrex kernel: [ 1814.288981] [<c02c7ca7>] out_of_memory+0x77/0x1b0
Apr ...
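The gfp_mask values in the two OOM reports (0x802d0 for du, 0xd0 for gkrellmd) carry a hint that is easy to miss in the syslog: neither has the highmem bit set, so both allocations had to be satisfied from low memory. A small decoder, offered only as a sketch: the bit values below are my recollection of include/linux/gfp.h from kernels of roughly this vintage and are an assumption that should be checked against the headers of the kernel actually running. The table is deliberately partial, and unknown bits are reported as hex.

```python
import re

# ASSUMPTION: bit values recalled from include/linux/gfp.h around 2.6.3x.
# Verify against the running kernel's headers before relying on them.
GFP_BITS = {
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x200: "__GFP_NOWARN",
    0x80000: "__GFP_RECLAIMABLE",
}


def decode_gfp(mask):
    """Return the names of the known flag bits set in mask, lowest bit first."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    known = sum(bit for bit in GFP_BITS if mask & bit)
    leftover = mask & ~known
    if leftover:
        names.append(hex(leftover))  # bits this partial table does not cover
    return names


def gfp_mask_from_log(line):
    """Pull the gfp_mask value out of an 'invoked oom-killer' log line."""
    m = re.search(r"gfp_mask=0x([0-9a-fA-F]+)", line)
    return int(m.group(1), 16) if m else None


def may_use_highmem(mask):
    """True if the allocation is allowed to be served from highmem."""
    return bool(mask & 0x02)


line = "du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0"
mask = gfp_mask_from_log(line)
print(decode_gfp(mask), may_use_highmem(mask))
```

If the assumed bit values hold, 0xd0 is the plain GFP_KERNEL combination, and the du allocation additionally marks its slab pages as reclaimable.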
Well, I have to ask why you are running a 32bit PAE kernel when your CPU is: <6>[ 0.085062] CPU0: Intel(R) Xeon(R) CPU X3460 @ 2.80GHz stepping 05 Agreed. And FWIW, don't let your filesystems get near ENOSPC on 2.6.34-rc, either.... (i.e. under sustained write load, 2.6.34-rc will hit the OOM killer on page cache allocation before the filesystem can report ENOSPC to the user application. Test 224 in the xfsqa suite on a VM w/ 1GB RAM will trigger this with > 90% reliability....) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Hi Dave,
Sure, but for compatibility reasons with a customer setup that I'm fully
responsible for and that we strongly depend on, it is still i586 (and it's a
system that I have full access to only for a few hours on Sundays, which
punishes my family..).
Dave, I really don't want to disappoint you, but a lengthy bisection session
points to:
57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
commit 57817c68229984818fea9e614d6f95249c3fb098
Author: Dave Chinner <david@fromorbit.com>
Date: Sun Jan 10 23:51:47 2010 +0000
xfs: reclaim all inodes by background tree walks
We cannot do direct inode reclaim without taking the flush lock to
ensure that we do not reclaim an inode under IO. We check the inode
is clean before doing direct reclaim, but this is not good enough
because the inode flush code marks the inode clean once it has
copied the in-core dirty state to the backing buffer.
It is the flush lock that determines whether the inode is still
under IO, even though it is marked clean, and the inode is still
required at IO completion so we can't reclaim it even though it is
clean in core. Hence the requirement that we need to take the flush
lock even on clean inodes because this guarantees that the inode
writeback IO has completed and it is safe to reclaim the inode.
With delayed write inode flushing, we could end up waiting a long
time on the flush lock even for a clean inode. The background
reclaim already handles this efficiently, so avoid all the problems
by killing the direct reclaim path altogether.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
8e6b6febccba69bc4cdbfd1886d545c369d64c41 M fs
I will try to prove this by reverting this commit on a 2.6.33.2 build, but
Hmm, thanks for the warning. Will resort to 2.6.33.2 for now on my servers
and keep an eye on the xfs commit logs...
Cheers && greetings ...
Interesting. I did a fair bit of low memory testing when I made that change (admittedly none on a highmem i386 box), and since then I've done lots of "millions of files" tree creates, traversals and destroys on limited memory machines without triggering problems when memory is completely full of inodes.
I don't think that will work as expected in all situations - the inode clean check there is not completely valid as the XFS inode locks aren't held, so it can race with other operations that need to complete before reclaim is done. This was one of the reasons for pushing reclaim into the background.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
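A tree create and traversal workload of the kind described here can be approximated with a short script. This is only a sketch and not the test Dave ran: the function names, directory layout and parameters are invented for illustration. With file_size=0 the files are zero length, so the traversal puts pressure almost purely on the inode cache; a non-zero file_size adds data pages as well.

```python
import os
import tempfile


def create_tree(root, dirs, files_per_dir, file_size=0):
    """Create dirs * files_per_dir files under root (hypothetical layout).

    file_size=0 produces zero-length files; larger values write that many
    bytes into each file.
    """
    payload = b"x" * file_size
    for d in range(dirs):
        path = os.path.join(root, "d%05d" % d)
        os.makedirs(path, exist_ok=True)
        for i in range(files_per_dir):
            with open(os.path.join(path, "f%05d" % i), "wb") as fh:
                fh.write(payload)


def traverse(root):
    """Stat every file below root, roughly what du does.

    Returns (number of files, total apparent size in bytes).
    """
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            count += 1
            total += st.st_size
    return count, total


# Tiny demonstration run in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_root:
    create_tree(demo_root, dirs=3, files_per_dir=4)
    print(traverse(demo_root))
```

To get anywhere near the behaviour discussed in this thread, the tree would have to live on the XFS filesystem under test and hold far more inodes than fit in memory, and the traversal should start with cold caches.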
OK, if there is page cache pressure (e.g. creating small files or grepping the resultant tree) or the machine has significant amounts of memory (e.g. >= 4GB) then I can't reproduce this. However, if the memory pressure is purely inode cache (creating zero length files or read-only traversal), then the OOM killer kicks in a while after the slab cache fills memory. This doesn't need highmem; I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce this. I'll add zero length file tests and traversals to my low memory testing.
The best way to fix this, I think, is to trigger a shrinker callback when memory is low to run the background inode reclaim. The problem is that these inode caches and the reclaim state are per-filesystem, not global state, and the current shrinker interface only works with global state. Hence there are two patches to this fix - the first adds a context to the shrinker callout, and the second adds the XFS infrastructure to track the number of reclaimable inodes per filesystem and register/unregister shrinkers for each filesystem.
With these patches, my reproducible test case which locked the machine up with an OOM panic in a couple of minutes has been running for over half an hour. I have much more confidence in this change with limited testing than the reverting of the background inode reclaim as the revert introduces
The patches below apply to the xfs-dev tree, which is currently at 34-rc1. If they don't apply, let me know and I'll redo them against a vanilla kernel tree. Can you test them to see if the problem goes away? If the problem is fixed, I'll push them for a proper review cycle... Cheers, Dave. -- Dave Chinner david@fromorbit.com
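The idea behind the two patches, a shrinker callout that carries a context so each mounted filesystem can account for and reclaim its own inodes, can be modelled in a few lines. This is a toy model in Python and not the kernel code: every class and function name here is hypothetical, and the convention that a scan count of 0 means "just report how many objects are reclaimable" is my understanding of how kernel shrinkers of that era behaved, so treat it as an assumption.

```python
class Filesystem:
    """Stand-in for one mounted filesystem with its own reclaimable inodes."""

    def __init__(self, name, reclaimable=0):
        self.name = name
        self.reclaimable = reclaimable


def reclaim_inodes_shrink(fs, nr_to_scan):
    """Hypothetical per-filesystem callback; fs is the context.

    nr_to_scan == 0 only reports the reclaimable count (assumed convention);
    otherwise up to nr_to_scan inodes are reclaimed first.
    """
    if nr_to_scan:
        fs.reclaimable -= min(nr_to_scan, fs.reclaimable)
    return fs.reclaimable


class ShrinkerRegistry:
    """Toy registry: callbacks are stored together with their context."""

    def __init__(self):
        self._shrinkers = []

    def register(self, callback, context):
        self._shrinkers.append((callback, context))

    def unregister(self, context):
        self._shrinkers = [(cb, ctx) for cb, ctx in self._shrinkers
                           if ctx is not context]

    def shrink_all(self, nr_to_scan):
        """Simulate memory pressure: call every shrinker with its context."""
        return {ctx.name: cb(ctx, nr_to_scan) for cb, ctx in self._shrinkers}


registry = ShrinkerRegistry()
sdd1 = Filesystem("sdd1", reclaimable=100)
sda2 = Filesystem("sda2", reclaimable=5)
registry.register(reclaim_inodes_shrink, sdd1)   # done at mount time
registry.register(reclaim_inodes_shrink, sda2)
print(registry.shrink_all(10))
registry.unregister(sda2)                        # done at unmount time
```

Without the context argument the callback would have no way to know which filesystem it is being asked to shrink, which is the limitation of a global-state-only interface that the first patch is described as removing.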
I'm glad that you're able to reproduce it. My initial failure was during
I see, the first one will be interesting to get into mainline, given the
Of course, you did the original patch for a reason... Therefore I would love to test your patches. I've tried to apply them to 2.6.33.2, but after fixing the same reject as noted below, I'm stuck here:
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c: In function 'xfs_reclaim_inode_shrink':
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: error: implicit declaration of function 'xfs_perag_get'
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: warning: assignment makes pointer from integer without a cast
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807: error: implicit declaration of function 'xfs_perag_put'
Now I see that the offending functions were renamed, and that they have also grown a radix_tree structure and locking. How do I handle that?
BTW, your patches do not apply to Linus' current git tree either:
patching file fs/xfs/quota/xfs_qm.c
Hunk #1 succeeded at 72 (offset 3 lines).
Hunk #2 FAILED at 2120.
1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
I'm able to resolve this, but 2.6.34-current does give me some other trouble that I need to get past (PS2 keyboard stops working eventually).. Anyway, thanks for your great support, Dave. This is much appreciated. Cheers, Pete --
With difficulty. I'd need to backport it to match the .33 code, Yeah, there's another patch in my xfs-dev tree that changes that. I'll rebase it on a clean linux tree before I post it again. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Dave, may I ask you kindly for briefly elaborating on the worst consequences
I've briefly tested this with a codebase somewhere between -rc3 and -rc4, and it survived the du test, but it suffered from some strange network dropouts, which aren't funny on an NFS server... Will retest your patches after opensuse-current has caught up with -rc4. Hopefully, the most blatant stability issues are fixed by then. Cheers, Pete --
Well, given that is the new shrinker code generating the warnings, reverting/removing that hunk will render the patch useless :0 I'll get you a working 2.6.33 patch tomorrow - it's dinner time now.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Excuse me, I didn't express myself well. I'm after the consequences of Cool, thanks. Have a nice dinner, Pete --
Obviously, and not totally unexpectedly, really fixing this is going to take more time. FYI, 2.6.33.2 is still affected by this issue. Greg, you might search for a server using xfs filesystems and an i586 kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well), log in as an ordinary user, do a "du" on /usr, and wait for the other users screaming... BTW, all affected kernels available from http://download.opensuse.org/repositories/home:/frispete: have the offending patch reverted (see subject) and do run fine for me (in this respect). Will you guys pass by another round of stable fixes without doing anything on this issue? Dave, this is why I'm kindly asking you: what might be the worst consequences if we just do the revert for now (at least for 2.6.33), until you and Nick have come to a final decision on how to solve this issue in the future? Just a brief note would be fine. Cheers, Pete --
I did precisely that, and didn't notice anything special (du on kernel
source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
array, 8GB RAM.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
--
I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g. the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff from later kernels. The offending patch was included in linux-2.6.33 between -rc4 and -rc5:
Committer Alex Elder <aelder@sgi.com>
Author Dave Chinner <david@fromorbit.com>
Author date 11.01.10 00:51
Parent xfs: Avoid inodes in reclaim when flushing from inode cache
Child xfs: Remove inode iolock held check during allocation
Branch master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
Branch 2.6.33.1 (Linux 2.6.33)
Follows v2.6.33-rc4 (Linux 2.6.33-rc4)
Precedes v2.6.33-rc5 (Linux 2.6.33-rc5)
Cheers, Pete --
is this bisectable? from what I remember with 2.6.33 looking at the bugreports I don't recall any issue in regards with firmware related stuff for radeon(but could be wrong). Keep in mind, I don't have your card, but I do have the X1600 which had no issues so far(running the latest HEAD). does changing the .config work for you? (in regards to what the thread I posted had mentioned) as for the open suse SP1 of SLE11..glad there is an option to load the latest kernel. Justin P. Mattock --
sh*t.. my bad just replied to the wrong thread.. I'm tired.. please ignore this. Justin P. Mattock --
No, I always use pristine unpatched kernel.org releases, no SELinux, no
nothing. Just another confirmation I should go on this way :)
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
--
Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as are in 2.6.32.12-rc2.) thanks, greg k-h --
Yes. It's not even in mainline yet as Nick doesn't like the trivial core VM fix required to solve this in a clean way. --
Hm, Nick, why? This seems like a real problem, easily reproduced. Is it solved some other way in Linus's tree that we could backport to the -stable series? thanks, greg k-h --
The fix is not in Linus' kernel yet, Greg. So once that is done, I'll have to backport the fix back to those stable kernels as well. It's not a trivial fix, so it will miss this round of stable releases.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Ok, that's fine, just checking. I'm in no rush, I have plenty of other patches to queue up for the next stable releases :) thanks, greg k-h --
The problem is that the fix I did has been rejected by the upstream VM guys, and the stable rules are that fixes have to be in mainline before they can be put in a stable release. So, until we get a fix
Yet there's only been one report of the problem. While that doesn't make it any less serious, I don't think the problem you're reporting is as widespread as you are making it out to be. We'll get the fix done and upstream, and then it will go back to the stable kernel. You could always apply the *tested* patches I posted that fix
If the process of getting the fix upstream takes longer than another stable release cycle, then yes. I'm sorry, but I can't control the process, and if someone takes a week to NACK a fix, then you're just going to have to wait longer. Feel free to run the fix in the meantime - testing it, even if it was NACKed will still help us because if it fixes your problem we know that we are fixing the _right problem_. If you can't live with this, then you shouldn't be running the
I've already told you - you could be reintroducing all the really hard to reproduce inode reclaim problems (oops, hangs, panics, potentially even fs corruption) that the patch in question was part of the fix for. You're running code that changes reclaim in very subtle ways and has not been tested upstream in any way - if it breaks you get to keep all the broken pieces to yourself... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
