-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi all,

Got the following on a 64-bit Intel NFS client while a process was
running heavy reads and writes to a large (40G) file on a 64-bit Intel
NFS server. Both client and server were running 2.6.34 from kernel.org.

The client never recovered and had to be power cycled to correct the
issue.

Here is the mount info:

drbd:/data/export on /home type nfs (rw,rsize=32768,wsize=32768,nfsvers=3,bg,intr,addr=xxx.xx.xx.xxx)

Here's the log:

May 21 16:50:55 tovirtcore1 kernel: BUG: soft lockup - CPU#3 stuck for 61s! [qemu-system-x86:6340]
May 21 16:50:55 tovirtcore1 kernel: Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc 8021q bridge stp kvm_intel kvm parport_pc i2c_i801 rtc_cmos rtc_core rtc_lib parport psmouse i2c_core evdev serio_raw button processor intel_agp pcspkr ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid1 md_mod sd_mod ide_pci_generic ide_core ata_generic pata_marvell ata_piix ohci1394 ieee1394 uhci_hcd ehci_hcd pata_acpi firewire_ohci firewire_core crc_itu_t libata e1000 scsi_mod e1000e usbcore thermal [last unloaded: scsi_wait_scan]
May 21 16:50:55 tovirtcore1 kernel: CPU 3
May 21 16:50:55 tovirtcore1 kernel: Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc 8021q bridge stp kvm_intel kvm parport_pc i2c_i801 rtc_cmos rtc_core rtc_lib parport psmouse i2c_core evdev serio_raw button processor intel_agp pcspkr ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid1 md_mod sd_mod ide_pci_generic ide_core ata_generic pata_marvell ata_piix ohci1394 ieee1394 uhci_hcd ehci_hcd pata_acpi firewire_ohci firewire_core crc_itu_t libata e1000 scsi_mod e1000e usbcore thermal [last unloaded: scsi_wait_scan]
May 21 16:50:55 tovirtcore1 kernel:
May 21 16:50:55 tovirtcore1 kernel: Pid: 6340, comm: qemu-system-x86 Not tainted 2.6.34-0-xeon-actusa #1 DQ965GF/
May 21 16:50:55 tovirtcore1 kernel: RIP: 0010:[<ffffffff8107005a>] [<ffffffff8107005a>] ...
Do you see any more NFS traffic to the server when the above hang
occurs? I'm wondering if we don't need something like the following
patch.

Cheers
  Trond
--------------------------------------------------------------------------------
From 0b574497e05f62fd49cfe26f1b97e3669525446c Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Sat, 22 May 2010 11:49:19 -0400
Subject: [PATCH] NFS: Ensure that we mark the inode as dirty if we exit early from commit

If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc
call has been completed, we must re-mark the inode as dirty. Otherwise,
future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to
ensure that the data is on the disk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
---
 fs/nfs/write.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3aea3ca..b8a6d7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1386,7 +1386,7 @@ static int nfs_commit_inode(struct inode *inode, int how)
 	int res = 0;

 	if (!nfs_commit_set_lock(NFS_I(inode), may_wait))
-		goto out;
+		goto out_mark_dirty;
 	spin_lock(&inode->i_lock);
 	res = nfs_scan_commit(inode, &head, 0, 0);
 	spin_unlock(&inode->i_lock);
@@ -1398,9 +1398,18 @@ static int nfs_commit_inode(struct inode *inode, int how)
 			wait_on_bit(&NFS_I(inode)->flags, NFS_INO_COMMIT,
 					nfs_wait_bit_killable,
 					TASK_KILLABLE);
+		else
+			goto out_mark_dirty;
 	} else
 		nfs_commit_clear_lock(NFS_I(inode));
-out:
+	return res;
+	/* Note: If we exit without ensuring that the commit is complete,
+	 * we must mark the inode as dirty. Otherwise, future calls to
+	 * sync_inode() with the WB_SYNC_ALL flag set will fail to ensure
+	 * that the data is on the disk.
+	 */
+out_mark_dirty:
+	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	return res;
 }

--
1.7.0.1
--
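The reasoning in the patch description above can be illustrated with a
small userspace model. This is only a sketch under stated assumptions, not
kernel code: `struct toy_inode`, `toy_commit_inode()`, `toy_sync_inode()`
and `toy_flush()` are hypothetical names invented here, the "dirty" flag
stands in for I_DIRTY_DATASYNC, and the model assumes the competing commit
does not cover the page we are waiting on.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an inode; NOT the kernel's struct inode. */
struct toy_inode {
	bool dirty;         /* models I_DIRTY_DATASYNC in i_state */
	bool commit_locked; /* models NFS_INO_COMMIT held by another commit */
	int uncommitted;    /* requests written but not yet COMMITted */
};

/* Models nfs_commit_inode(); 'remark' selects the patched behaviour. */
int toy_commit_inode(struct toy_inode *ino, bool remark)
{
	if (ino->commit_locked) {
		/* Early exit: we could not take the commit lock. */
		if (remark)
			ino->dirty = true; /* patched: stay visible to sync */
		return 0;
	}
	int res = ino->uncommitted;
	ino->uncommitted = 0; /* COMMIT completed in this model */
	return res;
}

/* Models sync_inode(): it only does work if the inode is marked dirty. */
int toy_sync_inode(struct toy_inode *ino, bool remark)
{
	if (!ino->dirty)
		return 0; /* writeback sees nothing to do */
	ino->dirty = false;
	return toy_commit_inode(ino, remark);
}

/*
 * Call toy_sync_inode() until nothing is left uncommitted.
 * Returns the number of rounds needed, or -1 if max_rounds is exhausted.
 */
int toy_flush(bool remark, int max_rounds)
{
	struct toy_inode ino = {
		.dirty = true, .commit_locked = true, .uncommitted = 1,
	};

	assert(max_rounds > 0);
	for (int round = 1; round <= max_rounds; round++) {
		toy_sync_inode(&ino, remark);
		if (ino.uncommitted == 0)
			return round;
		ino.commit_locked = false; /* the competing commit finishes */
	}
	return -1;
}
```

In this model, toy_flush(false, 100) returns -1 because the first early
exit leaves the inode clean and every later sync is a no-op, while
toy_flush(true, 100) returns 2 because the re-marked inode is picked up on
the next round.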
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Trond,

When it occurred, it continued to throw those errors in the log, and
all access to the NFS mount stalled until I hard reset the client
system.

Do you want me to apply the patch and see if I can recreate the
condition?

Stu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAEBCAAGBQJL+APxAAoJEFKVLITDJSGSL4cP/1h+O1kL+PMvo/0ocOjKSSVE
g19VUwW/Mj3Pj4lFP4Grp+3KZvKNDPMILJjxH0erzjviWq2YJCAsjULvv65+2dYE
xgFJXOB2Pa0bMPa1OhMDtKHd4QZjIu4nutDfsrfv8dM895tT65k6+X3l2j9Unfoj
7pq/q20TTTHNRYbZbxOfcIcOmgl7NZOJv4y/whwixUvj9YoxW1cdQnQbiNs6lsIC
BCD9kJprgTMpR85tgw28W0I6g+RfJiwuXn8C0qQ6ZIGI3zxOyyqt83SySwsF5yRn
8Y1I3Z5qq3uEvCQ//TGtxohmzdUIxDVIXPSYevupuno6M+1cDWvV5K3E/2BpnPC3
toHUSM0F26/9LMyWyhRHCnAmJHEwrQY2gVv238qInQH63ubgnl1ObSUZmy3wSyRN
msc7VwsqUhK64OXo713DwhLVJfTwEWNEWRmLA+2WAlhESgRB9s2XRFOY7ubir17M
DLpb2AbGSMvDrSbWOG7e6ReGn07yd1yYGkOMoxxddYiA3Jq7iyrAyJeEmM4sOSTa
Tsy7VCojt3Ibwgq7dbylhl1PthmYq6xMLe5XjmTTtN8UfAu9Ag+1vOEGkKAsSiyq
2nn9Ct49Wi8ZcUxHdHKjS2PWGvZLEpk5YANcbQTxWKS2NR80QJDbWYTNryQpc7UB
7C2/jbdiQ7wrz13B/yWP
=qY2g
-----END PGP SIGNATURE-----
--
Yes, please do. Could you also apply the following debugging patch on
top of the above one, and see if the WARN_ON() triggers when both
patches are applied?
Cheers
Trond
------------------------------------------------------------------------------------------------
From 9883e35957468987f4338525c1d800d637bc05b7 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Sat, 22 May 2010 10:46:41 -0400
Subject: [PATCH 2/2] NFS: debugging code for nfs_wb_page()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
fs/nfs/write.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b8a6d7a..0558fab 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1519,12 +1519,21 @@ int nfs_wb_page(struct inode *inode, struct page *page)
 	int ret;

 	while(PagePrivate(page)) {
+		unsigned dirty;
+		int syncing;
+
 		wait_on_page_writeback(page);
 		if (clear_page_dirty_for_io(page)) {
 			ret = nfs_writepage_locked(page, &wbc);
 			if (ret < 0)
 				goto out_error;
+			continue;
 		}
+
+		dirty = inode->i_state & I_DIRTY;
+		syncing = inode->i_state & I_SYNC;
+		WARN_ON(!syncing && !dirty && PagePrivate(page));
+
 		ret = sync_inode(inode, &wbc);
 		if (ret < 0)
 			goto out_error;
--
1.7.0.1
--
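One way to read the WARN_ON() in the debugging patch above: it fires when
the page still has a request attached (PagePrivate) but the inode is
neither dirty nor being synced, in which case sync_inode() would have
nothing to do and the while loop could keep spinning. The userspace sketch
below models that reading only. The flag values TOY_I_DIRTY and TOY_I_SYNC
are made up for illustration (the kernel's I_DIRTY and I_SYNC may differ),
and `struct toy_state`, `toy_warn_condition()`, `toy_sync()` and
`toy_wb_page()` are hypothetical names, not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration only. */
#define TOY_I_DIRTY 0x07u
#define TOY_I_SYNC  0x08u

struct toy_state {
	unsigned i_state;  /* models inode->i_state */
	bool page_private; /* models PagePrivate(page) */
};

/* Same predicate as the WARN_ON() in the debugging patch. */
bool toy_warn_condition(unsigned i_state, bool page_private)
{
	unsigned dirty = i_state & TOY_I_DIRTY;
	unsigned syncing = i_state & TOY_I_SYNC;

	return !syncing && !dirty && page_private;
}

/* Models sync_inode(): it makes progress only when the inode is dirty. */
void toy_sync(struct toy_state *s)
{
	if (s->i_state & TOY_I_DIRTY) {
		s->i_state &= ~TOY_I_DIRTY;
		s->page_private = false; /* request committed and released */
	}
}

/*
 * Models the while (PagePrivate(page)) loop in nfs_wb_page().
 * Returns the number of iterations, or -1 if max_iter is reached,
 * which in this model means the loop would never terminate.
 * *warnings counts how often the WARN_ON() predicate was true.
 */
int toy_wb_page(struct toy_state *s, int max_iter, int *warnings)
{
	int iter = 0;

	*warnings = 0;
	while (s->page_private) {
		if (iter == max_iter)
			return -1;
		if (toy_warn_condition(s->i_state, s->page_private))
			(*warnings)++;
		toy_sync(s);
		iter++;
	}
	return iter;
}
```

In this model a dirty inode leaves the loop after one iteration with no
warnings, while a clean, non-syncing inode with a private page hits the
iteration limit and trips the predicate on every pass.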
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
The problem seems to be fixed with this, but I'm not seeing (or don't
know where to find) the WARN_ON() messages. If they are supposed to be in
the syslog, then there weren't any.
I'm rolling back to the unpatched kernel to verify that I can still
reproduce the problem natively.
Will follow up on Monday.
Stu
- --
If you took all the girls I knew When I was single And brought
them all together for one night I know theyd never match My sweet
imagination And everything looks worse in black and white
-- Paul Simon - "Kodachrome Lyrics"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAEBCAAGBQJL+WPYAAoJEFKVLITDJSGSU0sP/jdJt9NiliGFJ0IB3I4Pt6o/
aHI5wzWWG8Uxzn5UXBumEp79hWXZ8D79kLZ4L8zh9/9hpi8rbExz1ci9IJmjU1LW
qWQVDqmv36uKX9YUzmj8d4505G0Czf9BkU6vsOy4elZ4pAy/Q9EXKFVS5mtirO7u
9FeYJebvbhvdICJTaLDbryugpxWYV6P6bGVglowdbqVWBnKo5QXevWnm6s3Lc1Jd
girpqkQ2f4NddfeW1TbITtBr0bEPYuhK4s4XMdWiYIHNIaRSBJDF5Hlues8LWxu2
++4xz1G7n/K59hRBRX+giBGaSXXl/GSGib87RfwSCrg5qEytNbSRKQX0WuFFxARS
tTbU+zwDpUF7SSvYJZGDh2EEPr2QNfOVCCxVmf1Oe4JAs0OJku1z1ReKpr+CoZg2
lgIFl59bPBNjMcx8GNynnJgTW1IMXWsM8UpTpiAfwTpffXaW3YEH2V855Px9Mkqt
ONuvCll3CWEbwdmisWqvqRAix7oHNh4VMnDOTfb1eYf5ytw5mLfxZaznXhwL8FjE
pUvHZG4TRMUENjucs38dvwx4Vx63DEdxMSK5C0GpdsI16xh0hMKa3ohWaVtSgwIE
Emf1HU2G0vdxl+zMI1IyerNp1T+oxu1rr7eOYWzl3HO8bQc+9ua7yCntpni/c7Dz
Ge8hUPZTZAVmFRRy9zBO
=2+R2
-----END PGP SIGNATURE-----
--
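On where to look for WARN_ON() output: it goes to the kernel log, so it
should show up in dmesg and in whatever file syslog writes kernel messages
to. The sketch below is a minimal filter for such a log, assuming that a
WARN_ON() report on kernels of this era contains a line with the text
"WARNING: at" followed by the file and line number. That marker is an
assumption to adjust if your kernel prints something different, and the
function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumed marker for a WARN_ON() report; adjust if your kernel differs. */
#define TOY_WARN_MARKER "WARNING: at"

/* True if this log line looks like the header of a WARN_ON() report. */
bool toy_is_warn_line(const char *line)
{
	return strstr(line, TOY_WARN_MARKER) != NULL;
}

/*
 * Count matching lines in a log stream, for example a saved copy of
 * the dmesg output. Lines longer than the buffer are read in pieces,
 * which is good enough for a quick check.
 */
int toy_count_warn_lines(FILE *log)
{
	char buf[1024];
	int hits = 0;

	while (fgets(buf, sizeof buf, log) != NULL)
		if (toy_is_warn_line(buf))
			hits++;
	return hits;
}
```

A count of zero from a log that covers the test run would be consistent
with the WARN_ON() never having triggered.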
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

OK, I was able to reproduce the problem, and then repeated the test with
the patched kernel. With the patched kernel the system did not crash,
but I did not see any of the WARN_ON() messages in the logs. Also,
although the client did not crash, I did notice long 'hangs' in the
process...

Just a note: when I upgraded from 2.6.30.5 to 2.6.33.2, I noticed that
our overall network activity for NFS increased by about 40%. I also saw
longer delays when using commands like ls on an NFS-mounted partition.
Not sure if anyone else noticed this or not, but I thought it worth
mentioning... :)

Stu
- --
Spider Pig, Spider Pig, Does whatever a Spider Pig does. Can he
swing, from a web, no he can't, he's a pig. Look Oouuuut! He is
a Spider Pig....
  -- Matt Groening - "Spider Pig Lyrics"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAEBCAAGBQJL+ptvAAoJEFKVLITDJSGSjGEP/27W6Eb19ACogC1F4OY/KsEj
dMC0qKZeV2pJFV9UH4c8pH0TvTwNGLrcFk36wzzyuTKqyaYaBKB9iXEsy4XrLc2i
pSLh5/fjsd8zuf44ZEu9jEUvpUjpZoMnvWbYdZ0FMeCEcg+h8enf7OTVxCMtJeY6
1tiSifcrrWxiwfE6rsS7dAldWLuic4vQw4dC53RC1RcwHTewVmK3penHlPbaGZoe
UHYL1nuHek8pmlluMvATaY4qwu5teBGqru1c8Xrm7RsaSZCU4gSZNG83uv+hRKfk
TRBIIi1kZ54CV7h6flNf/Byb4xw7uOGtQ9mgM1Nqupdh1faoAAbnXEhz2AOLEg51
yadnctCrTOHXIvwblPVRz8o77JB8UU8EU4KJjTBg/Dy4JJsbe3XNNY2gusy6QCPQ
7n0oq5x0eZGdy5CGAYqm9L3zhaxIPs/2Wxkav+snh7GsED+tcnbwv7gdtTN8bRrW
dH5fsX7X/vdmr6hRpQhFshiZTjbZD5Zn2q0FZDspVyMHLuE+mjUY+G/82P17SfGi
OAC4mclbsPkoGvcHLBuXgCFdCEmWsJxSZNTykfCj34+DaDMSZ+s1zcGvd63f8PiG
xkxU+ADBqfNcRcJ6XuHOqWFGuVuN3MDbriIidiCZwI7uFW/qx7X0yR78aRgaXqCV
19zpRpMOBaVPmtdLdadQ
=zl3B
-----END PGP SIGNATURE-----
--
