I've moved recently one of my proxies(squid and some compressing application)
from 2.6.21 to 2.6.22, and notice huge performance drop. I think this is
important, cause it can cause serious regression on some other workloads like
busy web-servers and etc.
After some analysis of different options i can bring more exact numbers:
2.6.21 able to process 500-550 requests/second and 15-20 Mbit/s of traffic,
and working great without any slowdown or instability.
2.6.22 able to process only 250-300 requests and 8-10 Mbit/s of traffic, ssh
and console is "freezing" (there is delay even for typing characters).
Both proxies is on identical hardware(Sun Fire X4100), configuration(small
system, LFS-like, on USB flash), different only kernel.
I tried to disable/enable various options and optimisations - it doesn't
change anything, till i reach SLUB/SLAB option.
I've loaded proxy configuration to gentoo PC with 2.6.22 (then upgraded it to
2.6.23-rc8), and having same effect.
Additionally, when load reaching maximum i can notice whole system slowdown,
for example ssh and scp takes much more time to run, even i do nice -n -5 for
them.
But even choosing 2.6.23-rc8+SLAB i noticed same "freezing" of ssh (and sure
it slowdown other kind of network performance), but much less comparing with
SLUB. On top i am seeing ksoftirqd taking almost 100% (sometimes ksoftirqd/0,
sometimes ksoftirqd/1).
I tried also different tricks with scheduler (/proc/sys/kernel/sched*), but
it's also didn't help.
When it freezes it looks like:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7 root 15 -5 0 0 0 R 64 0.0 2:47.48 ksoftirqd/1
5819 root 20 0 134m 130m 596 R 57 3.3 4:36.78 globax
5911 squid 20 0 1138m 1.1g 2124 R 26 28.9 2:24.87 squid
10 root 15 -5 0 0 0 S 1 0.0 0:01.86 events/1
6130 root 20 0 3960 2416 1592 S 0 0.1 0:08.02 oprofiled
Oprofile results:
Thats oprofile ...Could you try with SLUB but disabling CONFIG_SLUB_DEBUG ? -
Hi pi ~ # zcat /proc/config.gz |grep SLUB # CONFIG_SLUB_DEBUG is not set CONFIG_SLUB=y 21:27:20 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 21:27:21 all 5.53 21.61 8.54 0.00 0.50 5.03 0.00 58.79 7147.52 21:27:22 all 6.40 15.76 8.87 0.00 0.49 20.20 0.00 48.28 6327.00 21:27:23 all 6.53 20.60 12.06 0.00 0.00 52.76 0.00 8.04 456.44 21:27:24 all 7.00 17.50 10.50 0.00 0.00 53.00 0.00 12.00 439.00 21:27:25 all 7.50 18.00 12.50 0.50 0.50 52.50 0.00 8.50 464.00 21:27:26 all 6.64 16.11 11.37 0.00 0.00 58.77 0.00 7.11 401.90 21:27:27 all 0.00 0.40 0.40 0.00 0.00 99.20 0.00 0.00 325.20 21:27:28 all 0.00 0.50 0.00 0.00 0.00 99.50 0.00 0.00 453.47 21:27:29 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 495.10 21:27:30 all 0.00 0.50 0.00 0.00 0.00 99.50 0.00 0.00 625.74 21:27:31 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 600.97 21:27:32 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 604.95 64 bytes from 127.0.0.1: icmp_seq=15 ttl=64 time=0.031 ms 64 bytes from 127.0.0.1: icmp_seq=16 ttl=64 time=229 ms 64 bytes from 127.0.0.1: icmp_seq=17 ttl=64 time=253 ms 64 bytes from 127.0.0.1: icmp_seq=18 ttl=64 time=257 ms 64 bytes from 127.0.0.1: icmp_seq=19 ttl=64 time=279 ms 64 bytes from 127.0.0.1: icmp_seq=20 ttl=64 time=246 ms 64 bytes from 127.0.0.1: icmp_seq=21 ttl=64 time=246 ms 64 bytes from 127.0.0.1: icmp_seq=22 ttl=64 time=234 ms 64 bytes from 127.0.0.1: icmp_seq=23 ttl=64 time=235 ms Problem remains. It is another server. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Hi Denys, thanks for reporting (btw. please reply-to-all when replying on lkml). You say that SLAB is better than SLUB on an otherwise identical kernel, but I didn't see if you quantified the actual numbers? It sounds like there is still a regression with SLAB? -
Hi Nick There is no numbers for sure now, it is just instability. I am bisecting kernel now (from 2.6.21 to 2.6.22-rc1), i think it is caused by some patch, and maybe not related to CFS. Anyways, let's see when i finish bisect. Just i was thinking, maybe anyone else have similar issues or some ideas in mind. The only one case - it is very difficult to track, and i have to do it on live server with thousands of users. At night load is much lower, so possible by mistake i will miss a point (it can detect bug easily, IF softirq in mpstat 1 will jump up to 30+%. On 2.6.21 it is always staying +-10%, on "buggy" version it jumps up to 100%. Now cause less load at night up to 30- 50%, so on hardware it is not very noticeable (thats why maybe not much feedbacks), i am trying to utilise hardware maximum as possible - so i feel regressions much more. Anyway, possible i will have news after 2-3 hours max. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
You could use oprofile to find where the CPU time is spent -Andi -
Already i did, and it didn't show real failure point. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
There was no difference between the profile of the working kernel and the high load kernel? Perhaps you just have legitimate higher load? -Andi -
Hi No major differences. Some procedures higher a bit, some lower a bit, but everything looks fine on both kernels. Thats what i tried first - oprofile, and only one keypoint i was able to catch - in top softirqd/0 or /1 shows usage 100% cpu, and in mpstat - soft% also reaching 100%. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Then you likely just have higher load. What makes you think this is a kernel problem? -Andi -
That 2.6.21 works fine and 2.6.22 MUCH worse on multiple servers with similar workload. Plus all userspace becoming extremely unstable, and definitely ping 127.0.0.1 must not reach 300ms. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Hi I got pi linux-git # git bisect bad Bisecting: 0 revisions left to test after this [f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use nsec resolution instead of usec I will make sure and will try to reverse this patch on 2.6.22 But it seems "that's it". -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Well... thats interesting...
No problem here on bigger servers, so I CC David Miller and netdev on this one.
AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying
hardware functions on PC and no performance problem should happen here.
(relevant part of this patch :
@ -1521,7 +1515,6 @@ __u32 secure_ip_id(__be32 daddr)
__u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
__be16 sport, __be16 dport)
{
- struct timeval tv;
__u32 seq;
__u32 hash[4];
struct keydata *keyptr = get_keyptr();
@@ -1543,12 +1536,11 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32
daddr,
* As close as possible to RFC 793, which
* suggests using a 250 kHz clock.
* Further reading shows this assumes 2 Mb/s networks.
- * For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+ * For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
* That's funny, Linux has one built in! Use it!
* (Networks are faster now - should this be increased?)
*/
- do_gettimeofday(&tv);
- seq += tv.tv_usec + tv.tv_sec * 1000000;
+ seq += ktime_get_real().tv64;
-
From: Eric Dumazet <dada1@cosmosbay.com> One thing that jumps out at me is that on 32-bit (and to a certain extent on 64-bit) there is a lot of stack accesses and missed optimizations because all of the work occurs, and gets expanded, inside of ktime_get_real(). The timespec_to_ktime() inside of there constructs the ktime_t return value on the stack, then returns that as an aggregate to the caller. That cannot be without some cost. ktime_get_real() is definitely a candidate for inlining especially in these kinds of cases where we'll happily get computations in local registers instead of all of this on-stack nonsense. And in several cases (if the caller only needs the tv_sec value, for example) computations can be elided entirely. It would be constructive to experiment and see if this is in fact part of the problem. -
Well, i can play a bit more on "live" servers. I have now hot-swap server with full gentoo, where i can rebuild any kernel you want, with any applied patch. But it looks more like not overhead, load becoming high too "spiky", and it is not just permantenly higher. Also it is not normal that all system becoming unresposive (for example ping 127.0.0.1 becoming 300ms for period, when usage softirq jumps to 100%). -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Could you try a pristine 2.6.22.9 and some patch in secure_tcp_sequence_number() like : --- drivers/char/random.c.orig 2007-10-01 10:18:42.000000000 +0200 +++ drivers/char/random.c 2007-10-01 10:19:58.000000000 +0200 @@ -1554,7 +1554,7 @@ * That's funny, Linux has one built in! Use it! * (Networks are faster now - should this be increased?) */ - seq += ktime_get_real().tv64; + seq += ktime_get_real().tv64 / 1000; #if 0 printk("init_seq(%lx, %lx, %d, %d) = %d\n", saddr, daddr, sport, dport, seq); -
On 32 bits machine, replace the divide by a shift to avoid a linker error (undefined reference to `__divdi3'): seq += ktime_get_real().tv64 >> 10; -
Not able to compile kernel with patch drivers/built-in.o: In function `secure_tcp_sequence_number': (.text+0x3ad02): undefined reference to `__divdi3' make: *** [.tmp_vmlinux1] Error 1 -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Resend for maillists (was discareded cause of encoding issues as SPAM). Everything looks fine, for sure. Confirmed on second server. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
So maybe the following patch is necessary... I believe IPV6 & DCCP are immune to this problem. Thanks again Denys for spotting this. Eric [PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock. In order to follow network speed increase, we can use a faster clock, but we should limit this clock so that the delay between two rollovers is greater than MSL (TCP Maximum Segment Lifetime : 2 minutes) Choosing a 64 nsec clock should be OK, since the rollovers occur every 274 seconds. Problem spotted by Denys Fedoryshchenko Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: Eric Dumazet <dada1@cosmosbay.com> Thanks a lot Eric for bringing closure to this. I'll apply this and add a reference in the commit message to the changeset that introduced this problem, since it might help others who look at this. -
Hi I can confirm, that reversing this patch on 2.6.23-rc8 fixes the problem. Checked on 3 different servers. softirq not jumping anymore on top of "top" and mpstat also looks stable, reasonable and nice. pi linux-git # git bisect bad Bisecting: 0 revisions left to test after this [f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use nsec resolution instead of usec -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
Just a bit more details about hardware: Sun Fire X4100 (AMD Opteron 252), chipset looks like AMD-8111/AMD-8131 chips. There is no HPET detected, and by default acpi_pm used, which is seems more CPU intensive(based on oprofile results) than TSC. Choosing TSC over /sys doesn't make much difference. Workload is http requests coming from customers, noticeable slowdown happening on >250 requests/second, then routed to squid, then marked at squid ToS and depends on ToS routed over netfilter to required TCP port. At required TCP port "satellite accelerator" intercepting connection (so it is loopback, and it means "double" tcp work), and encapsulate stuff to UDP. Sure receiving UDP (stream around 10-20Mbit/s), and decapsulating ... and same way but in reverse to customer. So summary: 2x Opteron 2.4 Ghz Processing incoming TCP requests rate 250-500 req/s Incoming TCP bandwidth around 3-5 Mbit/s Outgoing TCP bandwidth around 33-35Mbit/s Internally routed TCP also around this number (-cached content) Incoming UDP bandwidth arounn 20-25Mbit/s Outgoing UDP bandwidth around 500Kbit/s At 2x Dual Core Opteron 2.6 Ghz (means total 4 cores) and similar load i cannot notice big slowdown, so it is noticable only when hardware used nearby it's limits i think. BUT! I can notice in mpstat spikes of softirq, from normal value 7-8% to 50-60%, but it didn't cause any noticeable slowdown on ssh or system operations. I am expecting 2.6.21 to serve 600-700 req/s. Maybe there is some additional overhead in this calculations, or improper irq locking? I am not guru in such stuff. I guess who have similar workload can check mpstat 1, and see if there is -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
