When a Linux user reported a persistently high load average on an idle desktop machine and bisected the problem to a patch titled "user of the jiffies rounding code: JBD", Andrew Morton replied, "this is unexpected. High load average is due to either a task chewing a lot of CPU time or a task stuck in uninterruptible sleep." Linus Torvalds disagreed, explaining:
"We saw high loadaverages with the timer bogosity with 'gettimeofday()' and 'select()' not agreeing, so they would do things like
'date = time(..); select(.. , timeout = <time + 1> );' and when 'date' wasn't taking the jiffies offset into account, and thus mixing these kinds of different time sources, the select ended up returning immediately because they effectively used different clocks, and suddenly we had some applications chewing up 30% CPU time, because they were in a loop that *tried* to sleep."
Linus offered what he described as an "idiotic patch" that changes the load-average sampling interval from exactly 5 seconds to one tick more than 5 seconds, so that the calculation cannot stay in sync with something else that wakes up every 5 seconds. He noted, "the load average is not calculated every tick, because that's not just expensive, but we also want to have some time-based decay." Arjan van de Ven doubted that this would help: "I mean, the load gets only updated in actual timer interrupts... and on a tickless system there's very few of those around..... and usually at places round_jiffies() already put a timer on." Linus agreed with this reasoning, suggesting, "maybe Anders' problem stems partly from the fact that he really is using the tweaks to make that tickless theory more true than it tends to be on most systems?" Arjan pointed out that a lot of work had gone into making tickless kernels wake up less often: "we fixed a TON of stuff over the last months.. standard desktops (F8 / next Ubuntu) will be around 10 wakeups/sec, in a lab environment you can get below 2 ;)" Anders later reported that the patch fixed the load average on his machine.
From: Anders <anders@...>
Subject: PROBLEM: high load average when idle
Date: Oct 2, 5:37 pm 2007

Hi!

My computer suffers from high load average when the system is idle,
introduced by commit 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 .

Long story:

2.6.20 and all later versions I've tested, including 2.6.21 and
2.6.22, make the load average high. Even when the computer is totally
idle (I've tested in single user mode), the load average end up
at ~0.30. The computer is still responsive, and the only fault seems
to be the too high load average. All versions up to and including
2.6.19.7 is fine, and don't suffer from the problem.

I git bisect between 2.6.19 and 2.6.20 gave me
44d306e1508fef6fa7a6eb15a1aba86ef68389a6 "[PATCH] user of the jiffies
rounding code: JBD" as the first patch with the
problem. 2.6.20 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 reverted
works fine. 2.6.23-rc8 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6
reverted also works fine.

This fixes the problem:

-------------------------- fs/jbd/transaction.c -----------------------------
index cceaf57..d38e0d5 100644
@@ -55,7 +55,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
     spin_lock_init(&transaction->t_handle_lock);

     /* Set up the commit timer for the new transaction. */
-    journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
+    journal->j_commit_timer.expires = transaction->t_expires;
     add_timer(&journal->j_commit_timer);

     J_ASSERT(journal->j_running_transaction == NULL);

I've only seen this problem on my home desktop computer. My work
desktop computer and several other computers at work don't suffer from
this problem. However, all other computers I've tested on is using
AMD64 as architecture, and not i386 as my home desktop computer.

Please let me know how I can assist in further debugging of this, if
needed.

System info:

A Debian stable system with ABIT KV7 MB, VIA KT600 chipset, Athlon XP
1500+ CPU, GeForce DDR and Atheros AR5212 wlan board. Details below.
I've tested without nvidia and the madwifi modules listed below, with the same results. eckert:/usr/src/linux-2.6>sh scripts/ver_linux If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. Linux eckert.bostrom.dyndns.org 2.6.20noload #1 Mon Oct 1 21:36:19 CEST 2007 i686 GNU/Linux Gnu C 4.1.2 Gnu make 3.81 binutils 2.17 util-linux 2.12r mount 2.12r module-init-tools 3.3-pre2 e2fsprogs 1.40-WIP Linux C Library 2.3.6 Dynamic linker (ldd) 2.3.6 Procps 3.2.7 Net-tools 1.60 Console-tools 0.2.3 Sh-utils 5.97 udev 105 wireless-tools 28 Modules Loaded nls_iso8859_1 nls_cp437 nvidia wlan_tkip iptable_filter ip_tables x_tables softdog snd_via82xx snd_ac97_codec ac97_bus snd_mpu401_uart snd_seq_midi snd_rawmidi wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal eckert:~> cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) XP 1500+ stepping : 2 cpu MHz : 1383.971 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow ts bogomips : 2769.67 clflush size : 32 eckert:~> cat /proc/ioports 0000-001f : dma1 0020-0021 : pic1 0040-0043 : timer0 0050-0053 : timer1 0060-006f : keyboard 0070-0077 : rtc 0080-008f : dma page reg 00a0-00a1 : pic2 00c0-00df : dma2 00f0-00ff : fpu 0170-0177 : 0000:00:0f.1 0170-0177 : libata 01f0-01f7 : 0000:00:0f.1 01f0-01f7 : libata 0295-0296 : w83627hf 0376-0376 : 0000:00:0f.1 0376-0376 : libata 03c0-03df : vesafb 03f6-03f6 : 0000:00:0f.1 03f6-03f6 : libata 0cf8-0cff : PCI conf1 4000-407f : motherboard 4000-4003 : ACPI PM1a_EVT_BLK 4004-4005 : ACPI PM1a_CNT_BLK 4008-400b : ACPI PM_TMR 4010-4015 : ACPI CPU throttle 4020-4023 : ACPI GPE0_BLK 5000-500f : motherboard 5000-5007 : vt596_smbus c000-c007 : 0000:00:0f.0 c000-c007 
: sata_via c400-c403 : 0000:00:0f.0 c400-c403 : sata_via c800-c807 : 0000:00:0f.0 c800-c807 : sata_via cc00-cc03 : 0000:00:0f.0 cc00-cc03 : sata_via d000-d00f : 0000:00:0f.0 d000-d00f : sata_via d400-d4ff : 0000:00:0f.0 d400-d4ff : sata_via d800-d80f : 0000:00:0f.1 d800-d80f : libata dc00-dc1f : 0000:00:10.0 dc00-dc1f : uhci_hcd e000-e01f : 0000:00:10.1 e000-e01f : uhci_hcd e400-e41f : 0000:00:10.2 e400-e41f : uhci_hcd e800-e81f : 0000:00:10.3 e800-e81f : uhci_hcd ec00-ecff : 0000:00:11.5 ec00-ecff : VIA8237 eckert:~> cat /proc/iomem 00000000-0009f3ff : System RAM 0009f400-0009ffff : reserved 000a0000-000bffff : Video RAM area 000c0000-000cbbff : Video ROM 000f0000-000fffff : System ROM 00100000-1feeffff : System RAM 00100000-00302afc : Kernel code 00302afd-003a7b53 : Kernel data 1fef0000-1fef2fff : ACPI Non-volatile Storage 1fef3000-1fefffff : ACPI Tables e0000000-e7ffffff : PCI Bus #01 e0000000-e7ffffff : 0000:01:00.0 e0000000-e1ffffff : vesafb e8000000-e9ffffff : PCI Bus #01 e8000000-e8ffffff : 0000:01:00.0 e8000000-e8ffffff : nvidia e9000000-e900ffff : 0000:01:00.0 ea000000-ebffffff : 0000:00:00.0 ec000000-ec00ffff : 0000:00:0b.0 ec000000-ec00ffff : ath ec010000-ec0100ff : 0000:00:10.4 ec010000-ec0100ff : ehci_hcd fec00000-fec00fff : reserved fee00000-fee00fff : reserved ffff0000-ffffffff : reserved eckert:~> cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDT722525DLA380 Rev: V44O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDT722525DLA380 Rev: V44O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: _NEC Model: DVD_RW ND-1300A Rev: 1.0B Type: CD-ROM ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 01 Lun: 00 Vendor: HL-DT-ST Model: CD-RW GCE-8240B Rev: 1.07 Type: CD-ROM ANSI SCSI revision: 05 Host: scsi3 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: IC35L120AVV207-0 Rev: V24O Type: Direct-Access ANSI SCSI revision: 
05 eckert:~# lspci -vvv 00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge (rev 80) Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- Latency: 8 Region 0: Memory at ea000000 (32-bit, prefetchable) [size=32M] Capabilities: [80] AGP version 3.5 Status: RQ=32 Iso- ArqSz=0 Cal=2 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4 Command: RQ=1 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit- FW+ Rate=x4 Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI Bridge (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: 0000f000-00000fff Memory behind bridge: e8000000-e9ffffff Prefetchable memory behind bridge: e0000000-e7ffffff Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B- Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:0b.0 Ethernet controller: Atheros Communications, Inc. 
AR5212 802.11abg NIC (rev 01) Subsystem: Global Sun Technology Inc Trust Speedshare Turbo Pro Wireless PCI Adapter Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 18 Region 0: Memory at ec000000 (32-bit, non-prefetchable) [size=64K] Capabilities: [44] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=2 PME- 00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80) Subsystem: ABIT Computer Corp. KV7 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 Interrupt: pin B routed to IRQ 16 Region 0: I/O ports at c000 [size=8] Region 1: I/O ports at c400 [size=4] Region 2: I/O ports at c800 [size=8] Region 3: I/O ports at cc00 [size=4] Region 4: I/O ports at d000 [size=16] Region 5: I/O ports at d400 [size=256] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP]) Subsystem: ABIT Computer Corp. 
Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 Interrupt: pin A routed to IRQ 16 Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8] Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1] Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8] Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1] Region 4: I/O ports at d800 [size=16] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81) (prog-if 00 [UHCI]) Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 17 Region 4: I/O ports at dc00 [size=32] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81) (prog-if 00 [UHCI]) Subsystem: ABIT Computer Corp. 
Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 17 Region 4: I/O ports at e000 [size=32] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81) (prog-if 00 [UHCI]) Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 32 bytes Interrupt: pin B routed to IRQ 17 Region 4: I/O ports at e400 [size=32] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81) (prog-if 00 [UHCI]) Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 32 bytes Interrupt: pin B routed to IRQ 17 Region 4: I/O ports at e800 [size=32] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86) (prog-if 20 [EHCI]) Subsystem: ABIT Computer Corp. 
Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 64 bytes Interrupt: pin C routed to IRQ 17 Region 0: Memory at ec010000 (32-bit, non-prefetchable) [size=256] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South] Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0 Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60) Subsystem: ABIT Computer Corp. Unknown device 1408 Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Interrupt: pin C routed to IRQ 19 Region 0: I/O ports at ec00 [size=256] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 01:00.0 VGA compatible controller: nVidia Corporation NV10DDR [GeForce 256 DDR] (rev 10) (prog-if 00 [VGA]) Subsystem: LeadTek Research Inc. 
WinFast GeForce 256 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 248 (1250ns min, 250ns max) Interrupt: pin A routed to IRQ 20 Region 0: Memory at e8000000 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at e0000000 (32-bit, prefetchable) [size=128M] [virtual] Expansion ROM at e9000000 [disabled] [size=64K] Capabilities: [60] Power Management version 1 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [44] AGP version 2.0 Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4 Command: RQ=32 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit- FW+ Rate=x4 eckert:~# / Anders
From: Andrew Morton <akpm@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 2, 6:07 pm 2007

On Tue, 02 Oct 2007 23:37:31 +0200 (CEST)
Anders Boström <anders@bostrom.dyndns.org> wrote:

> My computer suffers from high load average when the system is idle,
> introduced by commit 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 .
>
> Long story:
>
> 2.6.20 and all later versions I've tested, including 2.6.21 and
> 2.6.22, make the load average high. Even when the computer is totally
> idle (I've tested in single user mode), the load average end up
> at ~0.30. The computer is still responsive, and the only fault seems
> to be the too high load average. All versions up to and including
> 2.6.19.7 is fine, and don't suffer from the problem.
>
> I git bisect between 2.6.19 and 2.6.20 gave me
> 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 "[PATCH] user of the jiffies
> rounding code: JBD" as the first patch with the
> problem. 2.6.20 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 reverted
> works fine. 2.6.23-rc8 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6
> reverted also works fine.
>
> This fixes the problem:
>
> -------------------------- fs/jbd/transaction.c -----------------------------
> index cceaf57..d38e0d5 100644
> @@ -55,7 +55,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
>      spin_lock_init(&transaction->t_handle_lock);
>
>      /* Set up the commit timer for the new transaction. */
> -    journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
> +    journal->j_commit_timer.expires = transaction->t_expires;
>      add_timer(&journal->j_commit_timer);
>
>      J_ASSERT(journal->j_running_transaction == NULL);
>
>
> I've only seen this problem on my home desktop computer. My work
> desktop computer and several other computers at work don't suffer from
> this problem. However, all other computers I've tested on is using
> AMD64 as architecture, and not i386 as my home desktop computer.
>
> Please let me know how I can assist in further debugging of this, if
> needed.

This is unexpected. High load average is due to either a task chewing a
lot of CPU time or a task stuck in uninterruptible sleep.

Can you please work out which of these is happening? Run `top' on an idle
system. Is the CPU less than 1% loaded?

Run

    ps aux | grep " D"

or something like that on an idle system, see if you can spot a task which is
spending time in D state.

If there's a task which is spending time in D state, try running

    echo w > /proc/sysrq-trigger ; dmesg -c > foo

then check "foo" to see if it has a task in D state (search foo for " D ").
If it's not there, do the sysrq again, repeat until you've managed to capture
a trace of the blocked task.

If it turns out that the CPU really is spending excess amounts of time
being busy then a kernel profile would be a good way of finding out where
it is spinning. Or run sysrq-P from the keyboard a few times.
From: Linus Torvalds <torvalds@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 2, 6:32 pm 2007

On Tue, 2 Oct 2007, Andrew Morton wrote:
>
> This is unexpected. High load average is due to either a task chewing a
> lot of CPU time or a task stuck in uninterruptible sleep.

Not necessarily.

We saw high loadaverages with the timer bogosity with "gettimeofday()" and
"select()" not agreeing, so they would do things like

    date = time(..)
    select(.. , timeout = <time + 1> )

and when "date" wasn't taking the jiffies offset into account, and thus
mixing these kinds of different time sources, the select ended up
returning immediately because they effectively used different clocks, and
suddenly we had some applications chewing up 30% CPU time, because they
were in a loop that *tried* to sleep.

And I wonder if the same kind of thing is effectively happening here: the
code is written so that it *tries* to sleep, but the rounding of the clock
basically means that it's trying to sleep using a different clock than the
one we're using to wake things up with, so some percentage of the time it
doesn't sleep at all!

I wonder if the whole "round_jiffies()" thing should be written so that it
never rounds down, or at least never rounds down to before the current
second!

I have to say, I also think it's a bit iffy to do "round_jiffies()" at all
in that per-CPU kind of way. The "per-cpu" thing is quite possibly going
to change by the time we actually add the timer, so the goal of trying to
get wakeups to happen in "bunches" per CPU should really be done by
setting a flag on the timer itself - so that we could do that rounding
when the timer is actually added to the per-cpu queues!

Now, I think the JBD "t_expires" field should never be "near" in seconds,
so I do find it a bit surprising that this rounding can have any effect,
but on the other hand it clearly *does* have some effect, so.. It might
just be interacting with some other use, of course.

            Linus
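Linus's worry that rounding can move a timeout earlier is easier to see with a small userspace sketch. This is not the kernel's round_jiffies(); the function name, the `now` parameter, and the quarter-second round-down threshold are assumptions chosen for illustration. It also ignores the per-CPU skew mentioned in the message above.

```c
#include <assert.h>

/*
 * Hypothetical userspace model of second-granularity timer rounding.
 *   j   - absolute expiry time, in ticks
 *   now - current time, in ticks
 *   hz  - ticks per second
 * Assumption: an expiry less than a quarter second past a second
 * boundary is rounded DOWN to that boundary; anything later is
 * rounded UP to the next boundary.
 */
static unsigned long round_jiffies_sketch(unsigned long j,
                                          unsigned long now,
                                          unsigned long hz)
{
    unsigned long rem, rounded;

    assert(hz > 0);
    rem = j % hz;

    if (rem < hz / 4)
        rounded = j - rem;        /* round down: timer fires earlier */
    else
        rounded = j - rem + hz;   /* round up: timer fires later */

    /* Never return a time that has already passed; keep the original. */
    return rounded <= now ? j : rounded;
}
```

With an assumed hz of 250, an expiry of 1260 ticks is pulled back to 1250, while 1400 is pushed out to 1500. The final check is one way to get the "never before the current second" property Linus asks about, at the cost of leaving such timers unrounded.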
From: Chuck Ebbert <cebbert@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 2, 6:33 pm 2007

On 10/02/2007 06:07 PM, Andrew Morton wrote:
> On Tue, 02 Oct 2007 23:37:31 +0200 (CEST)
> Anders Boström <anders@bostrom.dyndns.org> wrote:
>
>> My computer suffers from high load average when the system is idle,
>> introduced by commit 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 .
>>
>> Long story:
>>
>> 2.6.20 and all later versions I've tested, including 2.6.21 and
>> 2.6.22, make the load average high. Even when the computer is totally
>> idle (I've tested in single user mode), the load average end up
>> at ~0.30. The computer is still responsive, and the only fault seems
>> to be the too high load average. All versions up to and including
>> 2.6.19.7 is fine, and don't suffer from the problem.
>>
>> I git bisect between 2.6.19 and 2.6.20 gave me
>> 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 "[PATCH] user of the jiffies
>> rounding code: JBD" as the first patch with the
>> problem. 2.6.20 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6 reverted
>> works fine. 2.6.23-rc8 with 44d306e1508fef6fa7a6eb15a1aba86ef68389a6
>> reverted also works fine.
>>
>> This fixes the problem:
>>
>> -------------------------- fs/jbd/transaction.c -----------------------------
>> index cceaf57..d38e0d5 100644
>> @@ -55,7 +55,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
>>      spin_lock_init(&transaction->t_handle_lock);
>>
>>      /* Set up the commit timer for the new transaction. */
>> -    journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
>> +    journal->j_commit_timer.expires = transaction->t_expires;
>>      add_timer(&journal->j_commit_timer);
>>
>>      J_ASSERT(journal->j_running_transaction == NULL);
>>
>>
>> I've only seen this problem on my home desktop computer. My work
>> desktop computer and several other computers at work don't suffer from
>> this problem. However, all other computers I've tested on is using
>> AMD64 as architecture, and not i386 as my home desktop computer.
>>
>> Please let me know how I can assist in further debugging of this, if
>> needed.
>
> This is unexpected. High load average is due to either a task chewing a
> lot of CPU time or a task stuck in uninterruptible sleep.
>

Or, everybody wakes up at once right when we are taking a sample. :)
From: Arjan van de Ven <arjan@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 2, 7:26 pm 2007

On Tue, 02 Oct 2007 18:33:58 -0400

> Or, everybody wakes up at once right when we are taking a sample. :)

nice try but we sample every timer tick; this code being timer driven
makes it what you say it is regardless of *which* timer tick it
happens at ;)
From: Chuck Ebbert <cebbert@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 1:32 pm 2007

On 10/02/2007 07:26 PM, Arjan van de Ven wrote:
> On Tue, 02 Oct 2007 18:33:58 -0400
>> Or, everybody wakes up at once right when we are taking a sample. :)
>
> nice try but we sample every timer tick; this code being timer driven
> makes it what you say it is regardless of *which* timer tick it
> happens at ;)
>

But we reduce the number of samples because some ticks just never
happen when the timers get rounded:

No rounding:

tick ............... tick
1 running            1 running

Rounded:

tick
2 running

In the first case the average is 1, but it's 2 in the second.
From: Linus Torvalds <torvalds@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 2:02 pm 2007

On Wed, 3 Oct 2007, Chuck Ebbert wrote:
>
> But we reduce the number of samples because some ticks just never
> happen when the timers get rounded:
>
> No rounding:
>
> tick ............... tick
> 1 running            1 running
>
> Rounded:
>
> tick
> 2 running
>
> In the first case the average is 1, but it's 2 in the second.

In fact, I think this is it!

The load average is not calculated every tick, because that's not just
expensive, but we also want to have some time-based decay. So it's
calculated every LOAD_FREQ ticks.

And guess what: LOAD_FREQ is defined to be exactly five seconds.

So imagine if the timer gets to be in sync with another event that happens
every five seconds - let's pick at random a 5-second JBD transaction
thing?

Anders - does this idiotic patch make a difference for you?

Without this, I can easily imagine that the rounding code tends to try to
round to an even second, and the load-average code generally also runs at
even seconds!

            Linus

---
 include/linux/sched.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a01ac6d..643de0f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -113,7 +113,7 @@ extern unsigned long avenrun[];     /* Load averages */

 #define FSHIFT      11          /* nr of bits of precision */
 #define FIXED_1     (1<<FSHIFT) /* 1.0 as fixed-point */
-#define LOAD_FREQ   (5*HZ)      /* 5 sec intervals */
+#define LOAD_FREQ   (5*HZ+1)    /* ~5 sec intervals */
 #define EXP_1       1884        /* 1/exp(5sec/1min) as fixed-point */
 #define EXP_5       2014        /* 1/exp(5sec/5min) */
 #define EXP_15      2037        /* 1/exp(5sec/15min) */
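The synchronization effect Linus describes can be reproduced in a toy userspace model. FSHIFT, FIXED_1 and EXP_1 are taken from the patch context above; everything else is an assumption made for illustration. That includes the function names, the tick rate used in the usage note, and the idea that a single task is runnable for a few ticks at the start of every 5-second period. The model only looks at the phase relationship between the sampler and the periodic task. It does not settle whether the real sampler observes timer-driven work that runs in the same interrupt.

```c
#include <assert.h>

#define FSHIFT   11                /* nr of bits of precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 as fixed-point */
#define EXP_1    1884UL            /* 1/exp(5sec/1min) as fixed-point */

/* One exponential-decay step of the 1-minute average (fixed-point). */
static unsigned long calc_load(unsigned long load, unsigned long n_running)
{
    load *= EXP_1;
    load += n_running * FIXED_1 * (FIXED_1 - EXP_1);
    return load >> FSHIFT;
}

/*
 * Hypothetical model: one task is runnable for `busy` ticks at the
 * start of every `period` ticks, and the load is sampled once every
 * `load_freq` ticks. Returns the fixed-point 1-minute load average
 * after `samples` samples, starting from zero.
 */
static unsigned long simulate_load(unsigned long period, unsigned long busy,
                                   unsigned long load_freq,
                                   unsigned long samples)
{
    unsigned long load = 0;
    unsigned long k;

    assert(period > 0 && load_freq > 0 && busy <= period);

    for (k = 1; k <= samples; k++) {
        unsigned long tick = k * load_freq;
        /* Is the periodic task runnable at the sampling instant? */
        unsigned long running = (tick % period) < busy ? 1 : 0;

        load = calc_load(load, running);
    }
    return load;
}
```

Assuming 250 ticks per second, `simulate_load(5*250, 5, 5*250, 200)` samples the task every single time and ends up close to FIXED_1 (a load of about 1.0), although the task is runnable for only 5 ticks out of every 1250. With the sampling interval stretched by one tick, `simulate_load(5*250, 5, 5*250 + 1, 200)` catches the task only on the first few samples and then decays to zero. This toy case overshoots the ~0.30 Anders reported, so it illustrates the mechanism rather than his exact numbers.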
From: Arjan van de Ven <arjan@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 2:20 pm 2007

Linus Torvalds wrote:
> Without this, I can easily imagine that the rounding code tends to try to
> round to an even second, and the load-average code generally also runs at
> even seconds!
>
>             Linus
>
> ---
>  include/linux/sched.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a01ac6d..643de0f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -113,7 +113,7 @@ extern unsigned long avenrun[];     /* Load averages */
>
>  #define FSHIFT      11          /* nr of bits of precision */
>  #define FIXED_1     (1<<FSHIFT) /* 1.0 as fixed-point */
> -#define LOAD_FREQ   (5*HZ)      /* 5 sec intervals */
> +#define LOAD_FREQ   (5*HZ+1)    /* ~5 sec intervals */

not sure this is going to help; I mean, the load gets only updated in actual
timer interrupts... and on a tickless system there's very few of those
around..... and usually at places round_jiffies() already put a timer on.

(also.. one thing that might make Chuck's theory wrong is that the sampling
code doesn't sample timer activity since that's run just after the sampler
in the same irq)
From: Linus Torvalds <torvalds@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 2:28 pm 2007

On Wed, 3 Oct 2007, Arjan van de Ven wrote:
>
> not sure this is going to help; I mean, the load gets only updated in actual
> timer interrupts... and on a tickless system there's very few of those
> around..... and usually at places round_jiffies() already put a timer on.

Yeah, you're right. Although in practice, at least on a system running
X, I'd expect that there still is lots of other timers going on, hiding
the issue.

Hmm. Maybe Anders' problem stems partly from the fact that he really is
using the tweaks to make that tickless theory more true than it tends to
be on most systems?

            Linus
From: Arjan van de Ven <arjan@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 2:29 pm 2007

Linus Torvalds wrote:
>
> On Wed, 3 Oct 2007, Arjan van de Ven wrote:
>> not sure this is going to help; I mean, the load gets only updated in actual
>> timer interrupts... and on a tickless system there's very few of those
>> around..... and usually at places round_jiffies() already put a timer on.
>
> Yeah, you're right. Although in practice, at least on a system running
> X, I'd expect that there still is lots of other timers going on, hiding
> the issue.

eh not really; on a normal distro desktop you maybe have 10 wakeups/sec or
so; on a tuned one you have 2 or less.

>
> Hmm. Maybe Anders' problem stems partly from the fact that he really is
> using the tweaks to make that tickless theory more true than it tends to
> be on most systems?

we fixed a TON of stuff over the last months.. standard desktops (F8 / next
Ubuntu) will be around 10 wakeups/sec, in a lab environment you can get
below 2 ;)
From: Anders <anders@...>
Subject: Re: PROBLEM: high load average when idle
Date: Oct 3, 4:15 pm 2007

>>>>> "LT" == Linus Torvalds <torvalds@linux-foundation.org> writes:

 LT> On Wed, 3 Oct 2007, Chuck Ebbert wrote:
 >>
 >> But we reduce the number of samples because some ticks just never
 >> happen when the timers get rounded:
 >>
 >> No rounding:
 >>
 >> tick ............... tick
 >> 1 running            1 running
 >>
 >> Rounded:
 >>
 >> tick
 >> 2 running
 >>
 >> In the first case the average is 1, but it's 2 in the second.

 LT> In fact, I think this is it!

 LT> The load average is not calculated every tick, because that's not just
 LT> expensive, but we also want to have some time-based decay. So it's
 LT> calculated every LOAD_FREQ ticks.

 LT> And guess what: LOAD_FREQ is defined to be exactly five seconds.

 LT> So imagine if the timer gets to be in sync with another event that happens
 LT> every five seconds - let's pick at random a 5-second JBD transaction
 LT> thing?

 LT> Anders - does this idiotic patch make a difference for you?

Yes, it does, it fixes the load average!!!

I guess we have something here! Why does this problem only show up on
my computer? Any idea?

/ Anders

 LT> Without this, I can easily imagine that the rounding code tends to try to
 LT> round to an even second, and the load-average code generally also runs at
 LT> even seconds!

 LT> Linus

 LT> ---
 LT>  include/linux/sched.h |    2 +-
 LT>  1 files changed, 1 insertions(+), 1 deletions(-)

 LT> diff --git a/include/linux/sched.h b/include/linux/sched.h
 LT> index a01ac6d..643de0f 100644
 LT> --- a/include/linux/sched.h
 LT> +++ b/include/linux/sched.h
 LT> @@ -113,7 +113,7 @@ extern unsigned long avenrun[];     /* Load averages */

 LT>  #define FSHIFT      11          /* nr of bits of precision */
 LT>  #define FIXED_1     (1<<FSHIFT) /* 1.0 as fixed-point */
 LT> -#define LOAD_FREQ   (5*HZ)      /* 5 sec intervals */
 LT> +#define LOAD_FREQ   (5*HZ+1)    /* ~5 sec intervals */
 LT>  #define EXP_1       1884        /* 1/exp(5sec/1min) as fixed-point */
 LT>  #define EXP_5       2014        /* 1/exp(5sec/5min) */
 LT>  #define EXP_15      2037        /* 1/exp(5sec/15min) */
