2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

Previous thread: [PATCH] robust futex thread exit race by Martin Schwidefsky on Sunday, September 30, 2007 - 8:02 am. (10 messages)

Next thread: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression by Denys on Sunday, September 30, 2007 - 8:22 am. (1 message)
From: Denys
Date: Sunday, September 30, 2007 - 7:48 am

I've moved recently one of my proxies(squid and some compressing application) 
from 2.6.21 to 2.6.22, and notice huge performance drop. I think this is 
important, cause it can cause serious regression on some other workloads like 
busy web-servers and etc.

After some analysis of different options i can bring more exact numbers:

2.6.21 able to process 500-550 requests/second and 15-20 Mbit/s of traffic, 
and working great without any slowdown or instability.

2.6.22 able to process only 250-300 requests and 8-10 Mbit/s of traffic, ssh 
and console is "freezing" (there is delay even for typing characters).

Both proxies is on identical hardware(Sun Fire X4100), configuration(small 
system, LFS-like, on USB flash), different only kernel.

I tried to disable/enable various options and optimisations - it doesn't 
change anything, till i reach SLUB/SLAB option.

I've loaded proxy configuration to gentoo PC with 2.6.22 (then upgraded it to 
2.6.23-rc8), and having same effect.
Additionally, when load reaching maximum i can notice whole system slowdown, 
for example ssh and scp takes much more time to run, even i do nice -n -5 for 
them.

But even choosing 2.6.23-rc8+SLAB i noticed same "freezing" of ssh (and sure 
it slowdown other kind of network performance), but much less comparing with 
SLUB. On top i am seeing ksoftirqd taking almost 100% (sometimes ksoftirqd/0, 
sometimes ksoftirqd/1).

I tried also different tricks with scheduler (/proc/sys/kernel/sched*), but 
it's also didn't help.

When it freezes it looks like:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    7 root      15  -5     0    0    0 R   64  0.0   2:47.48 ksoftirqd/1
 5819 root      20   0  134m 130m  596 R   57  3.3   4:36.78 globax
 5911 squid     20   0 1138m 1.1g 2124 R   26 28.9   2:24.87 squid
   10 root      15  -5     0    0    0 S    1  0.0   0:01.86 events/1
 6130 root      20   0  3960 2416 1592 S    0  0.1   0:08.02 oprofiled


Oprofile results:


Thats oprofile ...
From: Eric Dumazet
Date: Sunday, September 30, 2007 - 10:48 am

Could you try with SLUB but disabling CONFIG_SLUB_DEBUG ?


-

From: Denys
Date: Sunday, September 30, 2007 - 11:45 am

Hi

pi ~ # zcat /proc/config.gz |grep SLUB
# CONFIG_SLUB_DEBUG is not set
CONFIG_SLUB=y


21:27:20     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal
   %idle    intr/s
21:27:21     all    5.53   21.61    8.54    0.00    0.50    5.03    0.00   
58.79   7147.52
21:27:22     all    6.40   15.76    8.87    0.00    0.49   20.20    0.00   
48.28   6327.00
21:27:23     all    6.53   20.60   12.06    0.00    0.00   52.76    0.00    
8.04    456.44
21:27:24     all    7.00   17.50   10.50    0.00    0.00   53.00    0.00   
12.00    439.00
21:27:25     all    7.50   18.00   12.50    0.50    0.50   52.50    0.00    
8.50    464.00
21:27:26     all    6.64   16.11   11.37    0.00    0.00   58.77    0.00    
7.11    401.90
21:27:27     all    0.00    0.40    0.40    0.00    0.00   99.20    0.00    
0.00    325.20
21:27:28     all    0.00    0.50    0.00    0.00    0.00   99.50    0.00    
0.00    453.47
21:27:29     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    
0.00    495.10
21:27:30     all    0.00    0.50    0.00    0.00    0.00   99.50    0.00    
0.00    625.74
21:27:31     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    
0.00    600.97
21:27:32     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    
0.00    604.95

64 bytes from 127.0.0.1: icmp_seq=15 ttl=64 time=0.031 ms
64 bytes from 127.0.0.1: icmp_seq=16 ttl=64 time=229 ms
64 bytes from 127.0.0.1: icmp_seq=17 ttl=64 time=253 ms
64 bytes from 127.0.0.1: icmp_seq=18 ttl=64 time=257 ms
64 bytes from 127.0.0.1: icmp_seq=19 ttl=64 time=279 ms
64 bytes from 127.0.0.1: icmp_seq=20 ttl=64 time=246 ms
64 bytes from 127.0.0.1: icmp_seq=21 ttl=64 time=246 ms
64 bytes from 127.0.0.1: icmp_seq=22 ttl=64 time=234 ms
64 bytes from 127.0.0.1: icmp_seq=23 ttl=64 time=235 ms

Problem remains. It is another server.




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Nick Piggin
Date: Saturday, September 29, 2007 - 9:25 pm

Hi Denys, thanks for reporting (btw. please reply-to-all when replying
on lkml).

You say that SLAB is better than SLUB on an otherwise identical kernel,
but I didn't see if you quantified the actual numbers? It sounds like there
is still a regression with SLAB?



-

From: Denys
Date: Sunday, September 30, 2007 - 3:04 pm

Hi Nick

There is no numbers for sure now, it is just instability.

I am bisecting kernel now (from 2.6.21 to 2.6.22-rc1), i think it is caused 
by some patch, and maybe not related to CFS. Anyways, let's see when i finish 
bisect. Just i was thinking, maybe anyone else have similar issues or some 
ideas in mind.

The only one case - it is very difficult to track, and i have to do it on 
live server with thousands of users. At night load is much lower, so possible 
by mistake i will miss a point (it can detect bug easily, IF softirq in 
mpstat 1 will jump up to 30+%. On 2.6.21 it is always staying +-10%, on 
"buggy" version it jumps up to 100%. Now cause less load at night up to 30-
50%, so on hardware it is not very noticeable (thats why maybe not much 
feedbacks), i am trying to utilise hardware maximum as possible - so i feel 
regressions much more.

Anyway, possible i will have news after 2-3 hours max.




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Andi Kleen
Date: Monday, October 1, 2007 - 3:01 am

You could use oprofile to find where the CPU time is spent

-Andi
-

From: Denys
Date: Monday, October 1, 2007 - 3:30 am

Already i did, and it didn't show real failure point.



--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Andi Kleen
Date: Monday, October 1, 2007 - 4:14 am

There was no difference between the profile of the working kernel
and the high load kernel?  Perhaps you just have legitimate higher load?

-Andi
-

From: Denys
Date: Monday, October 1, 2007 - 4:52 am

Hi

No major differences. 
Some procedures higher a bit, some lower a bit, but everything looks fine on
both kernels.

Thats what i tried first - oprofile, and only one keypoint i was able to catch
- in top softirqd/0 or /1 shows usage 100% cpu, and in mpstat - soft% also
reaching 100%. 





--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Andi Kleen
Date: Monday, October 1, 2007 - 4:57 am

Then you likely just have higher load. What makes you think this
is a kernel problem?

-Andi
-

From: Denys
Date: Monday, October 1, 2007 - 5:04 am

That 2.6.21 works fine and 2.6.22 MUCH worse on multiple servers with similar
workload.
Plus all userspace becoming extremely unstable, and definitely ping 127.0.0.1
must not reach 300ms.




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Denys
Date: Sunday, September 30, 2007 - 3:35 pm

Hi 

I got

pi linux-git # git bisect bad
Bisecting: 0 revisions left to test after this
[f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use 
nsec resolution instead of usec

I will make sure and will try to reverse this patch on 2.6.22

But it seems "that's it".




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Eric Dumazet
Date: Sunday, September 30, 2007 - 10:59 pm

Well... thats interesting...

No problem here on bigger servers, so I CC David Miller and netdev on this one.

AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying 
hardware functions on PC and no performance problem should happen here.

(relevant part of this patch :

@ -1521,7 +1515,6 @@ __u32 secure_ip_id(__be32 daddr)
  __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
                                  __be16 sport, __be16 dport)
  {
-       struct timeval tv;
         __u32 seq;
         __u32 hash[4];
         struct keydata *keyptr = get_keyptr();
@@ -1543,12 +1536,11 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 
daddr,
          *      As close as possible to RFC 793, which
          *      suggests using a 250 kHz clock.
          *      Further reading shows this assumes 2 Mb/s networks.
-        *      For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+        *      For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
          *      That's funny, Linux has one built in!  Use it!
          *      (Networks are faster now - should this be increased?)
          */
-       do_gettimeofday(&tv);
-       seq += tv.tv_usec + tv.tv_sec * 1000000;
+       seq += ktime_get_real().tv64;



-

From: David Miller
Date: Monday, October 1, 2007 - 12:12 am

From: Eric Dumazet <dada1@cosmosbay.com>

One thing that jumps out at me is that on 32-bit (and to a certain
extent on 64-bit) there is a lot of stack accesses and missed
optimizations because all of the work occurs, and gets expanded,
inside of ktime_get_real().

The timespec_to_ktime() inside of there constructs the ktime_t return
value on the stack, then returns that as an aggregate to the caller.

That cannot be without some cost.

ktime_get_real() is definitely a candidate for inlining especially in
these kinds of cases where we'll happily get computations in local
registers instead of all of this on-stack nonsense.  And in several
cases (if the caller only needs the tv_sec value, for example)
computations can be elided entirely.

It would be constructive to experiment and see if this is in fact part
of the problem.

-

From: Denys
Date: Monday, October 1, 2007 - 1:07 am

Well, i can play a bit more on "live" servers. I have now hot-swap server with
full gentoo,  where i can rebuild any kernel you want, with any applied patch.
But it looks more like not overhead, load becoming high too "spiky", and it is
not just permantenly higher. Also it is not normal that all system becoming
unresposive (for example ping 127.0.0.1 becoming 300ms for period, when usage
softirq jumps to 100%).



--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Eric Dumazet
Date: Monday, October 1, 2007 - 1:20 am

Could you try a pristine 2.6.22.9 and some patch in 
secure_tcp_sequence_number() like :

--- drivers/char/random.c.orig 2007-10-01 10:18:42.000000000 +0200
+++ drivers/char/random.c 2007-10-01 10:19:58.000000000 +0200
@@ -1554,7 +1554,7 @@
* That's funny, Linux has one built in! Use it!
* (Networks are faster now - should this be increased?)
*/
- seq += ktime_get_real().tv64;
+ seq += ktime_get_real().tv64 / 1000;
#if 0
printk("init_seq(%lx, %lx, %d, %d) = %d\n",
saddr, daddr, sport, dport, seq);





-

From: Eric Dumazet
Date: Monday, October 1, 2007 - 1:35 am

On 32 bits machine, replace the divide by a shift  to avoid a linker 
error (undefined reference to `__divdi3'):  

seq += ktime_get_real().tv64 >> 10;





-

From: Denys
Date: Monday, October 1, 2007 - 5:10 am

Not able to compile kernel with patch

drivers/built-in.o: In function `secure_tcp_sequence_number':
(.text+0x3ad02): undefined reference to `__divdi3'
make: *** [.tmp_vmlinux1] Error 1



--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Denys
Date: Monday, October 1, 2007 - 6:26 am

Resend for maillists (was discareded cause of encoding issues as SPAM).

Everything looks fine, for sure. Confirmed on second server.



--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Eric Dumazet
Date: Monday, October 1, 2007 - 1:10 pm

So maybe the following patch is necessary...

I believe IPV6 & DCCP are immune to this problem.

Thanks again Denys for spotting this.

Eric

[PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock

TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock.
In order to follow network speed increase, we can use a faster clock, but
we should limit this clock so that the delay between two rollovers is
greater than MSL (TCP Maximum Segment Lifetime : 2 minutes)

Choosing a 64 nsec clock should be OK, since the rollovers occur every
274 seconds.

Problem spotted by Denys Fedoryshchenko

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

From: David Miller
Date: Monday, October 1, 2007 - 1:57 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Thanks a lot Eric for bringing closure to this.

I'll apply this and add a reference in the commit message to the
changeset that introduced this problem, since it might help
others who look at this.
-

From: Denys
Date: Sunday, September 30, 2007 - 4:24 pm

Hi 

I can confirm, that reversing this patch on 2.6.23-rc8 fixes the problem.
Checked on 3 different servers.
softirq not jumping anymore on top of "top" and mpstat also looks stable, 
reasonable and nice.


pi linux-git # git bisect bad
Bisecting: 0 revisions left to test after this
[f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use 
nsec resolution instead of usec




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

From: Denys
Date: Sunday, September 30, 2007 - 11:43 pm

Just a bit more details about hardware:

Sun Fire X4100 (AMD Opteron 252), chipset looks like AMD-8111/AMD-8131 chips.
There is no HPET detected, and by default acpi_pm used, which is seems more 
CPU intensive(based on oprofile results) than TSC. Choosing TSC over /sys 
doesn't make much difference.

Workload is http requests coming from customers, noticeable slowdown 
happening on >250 requests/second, then routed to squid, then marked at squid 
ToS and depends on ToS routed over netfilter to required TCP port. At 
required TCP port "satellite accelerator" intercepting connection (so it is 
loopback, and it means "double" tcp work), and encapsulate stuff to UDP. Sure 
receiving UDP (stream around 10-20Mbit/s), and decapsulating ... and same way 
but in reverse to customer. So summary:

2x Opteron 2.4 Ghz
Processing incoming TCP requests rate 250-500 req/s
Incoming TCP bandwidth around 3-5 Mbit/s
Outgoing TCP bandwidth around 33-35Mbit/s 
Internally routed TCP also around this number (-cached content)
Incoming UDP bandwidth arounn 20-25Mbit/s
Outgoing UDP bandwidth around 500Kbit/s


At 2x Dual Core Opteron 2.6 Ghz (means total 4 cores) and similar load i 
cannot notice big slowdown, so it is noticable only when hardware used nearby 
it's limits i think. BUT! I can notice in mpstat spikes of softirq, from 
normal value 7-8% to 50-60%, but it didn't cause any noticeable slowdown on 
ssh or system operations. I am expecting 2.6.21 to serve 600-700 req/s. Maybe 
there is some additional overhead in this calculations, or improper irq 
locking? I am not guru in such stuff.
I guess who have similar workload can check mpstat 1, and see if there is 
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

-

Previous thread: [PATCH] robust futex thread exit race by Martin Schwidefsky on Sunday, September 30, 2007 - 8:02 am. (10 messages)

Next thread: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression by Denys on Sunday, September 30, 2007 - 8:22 am. (1 message)