Threading Benchmarks, NetBSD versus FreeBSD

Submitted by Jeremy on October 7, 2007 - 8:24pm.
FreeBSD news

Andrew Doran posted some threading benchmark results to NetBSD's tech-kern mailing list, following up on benchmarks he had posted earlier. The results compared NetBSD -current with FreeBSD -current and the Linux 2.6.21 kernel. Kris Kennaway was surprised by the results and ran his own benchmarks with minimal configuration changes, summarizing, "this measurement shows that FreeBSD is performing 70-80% better than NetBSD in this 4 CPU configuration. This is in contrast to Andrew's findings which seem to show NetBSD performing 10% better than FreeBSD on a 4 CPU system (a very old one though)." He added, "the drop-off above 8 threads on FreeBSD is due to non-scalability of mysql itself. i.e. it comes from pthread mutex contention in userland."

Kris ran additional benchmarks with PostgreSQL instead of MySQL, which showed much better scalability above 8 threads: "postgresql is much more scalable than mysql on this workload and doesn't have silly scaling bottlenecks inside the application (cf the tail of the FreeBSD curve for mysql which is where pthread mutex contention kicked in)." He continued his testing and found that on older 4-CPU P3 hardware NetBSD did outperform FreeBSD, "but only by 3-4% (in particular I am not seeing the ~10% difference that Andrew observes on his 4*p3 700MHz). Given the age of the hardware and the fact that I am not seeing it on other workloads or on modern hardware it might just be due to a small scheduling difference on this configuration."


From: Andrew Doran <ad@...>
Subject: Thread benchmarks, round 2
Date: Oct 4, 7:04 pm 2007

So, I learned a few things since I put up the previous set of benchmarks:

- The erratic behaviour from Linux is due to the glibc memory allocator.
  Using Google's tcmalloc, the problem disappears.

- I missed a few things when porting jemalloc from FreeBSD. One of them
  was fairly major. Due to my mistake jemalloc on NetBSD was, basically,
  single threaded. That said it did show a noticeable improvement over
  phkmalloc.

- There was a nasty performance bug in NetBSD's pthread mutexes, which
  is now fixed. libpthread has also had a couple more tweaks for performance
  that have had a positive impact.

- The memory allocator used has a significant effect on sysbench itself:
  it needs to be multithreaded.

- Mindaugas has made more improvements to his scheduler and these are
  showing a really positive effect.

So after making some changes to NetBSD, and changes to how I'm benchmarking
the systems, I have rerun them. In contrast to the previous runs, this one
is done locally:

	http://www.netbsd.org/~ad/sysbench2/4cpu.png 

Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect
that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket
I/O causing contention on kernel_lock. It will be interesting to see.

Thanks,
Andrew

From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 5:18 am 2007

Andrew Doran wrote:
> So, I learned a few things since I put up the previous set of benchmarks:
>
> - The erratic behaviour from Linux is due to the glibc memory allocator.
>   Using Google's tcmalloc, the problem disappears.

Well you have to be careful there, tcmalloc apparently defers frees, and
is not really a general purpose malloc. The Linux performance problems
are (were? I haven't tried recent kernels) real though.

> - I missed a few things when porting jemalloc from FreeBSD. One of them
>   was fairly major. Due to my mistake jemalloc on NetBSD was, basically,
>   single threaded. That said it did show a noticeable improvement over
>   phkmalloc.
>
> - There was a nasty performance bug in NetBSD's pthread mutexes, which
>   is now fixed. libpthread has also had a couple more tweaks for performance
>   that have had a positive impact.
>
> - The memory allocator used has a significant effect on sysbench itself:
>   it needs to be multithreaded.
>
> - Mindaugas has made more improvements to his scheduler and these are
>   showing a really positive effect.
>
> So after making some changes to NetBSD, and changes to how I'm benchmarking
> the systems, I have rerun them. In contrast to the previous runs, this one
> is done locally:
>
> 	http://www.netbsd.org/~ad/sysbench2/4cpu.png

I am somewhat surprised by this, because on FreeBSD it is really not
spending much time in the kernel (only ~20% system time), so there does
not seem to be much scope for a 10% performance difference. Also it took
quite a lot of work to optimize locking of various kernel subsystems
that are used by this workload, and until that point there was
significant kernel lock contention which reduced performance by tens of
percent. I would have expected this to matter on NetBSD - even with the
vmlocking work there is still more to go. I will try to reproduce this
on my own hardware (see below).

> Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect
> that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket
> I/O causing contention on kernel_lock. It will be interesting to see.

Here is the initial run with CVS HEAD sources (I took out the obvious
things from GENERIC.MP like I386_CPU support, etc, and removed the
default datasize and stack size limits). Same benchmark config that
Andrew is using, etc.

	http://people.freebsd.org/~kris/scaling/netbsd.png

There are a couple of things to note:

* the drop-off above 8 threads on FreeBSD is due to non-scalability of
  mysql itself. i.e. it comes from pthread mutex contention in userland.
  This is the only relevant lock contention point in the FreeBSD kernel
  on this workload. There are some things we can do in libpthread to
  mitigate the performance loss in the over-contended pthread situation,
  but we haven't done them yet.

* The tail end of the graph is somewhat noisy, which is the reason for
  the jump at 19 threads (I only graphed a single run). The distribution
  at 20 clients looks like:

[ministat distribution plot of the 20 samples; column alignment lost]

    N           Min           Max        Median           Avg        Stddev
x  20       2326.01       2758.86       2586.47      2572.856     116.69937

Next, to try and reproduce Andrew's result, I disabled 4 CPUs (using
cpuctl in NetBSD) and compared FreeBSD and NetBSD again. I didn't do a
full graph yet, but the results are consistent with what I saw on 8 CPUs.

NetBSD:
   4 threads:  1137.83  1135.49  1138.80  1138.06
  20 threads:  1101.84  1068.56  1075.32   998.49

Note that these are lower but not too different from the NetBSD values
when all 8 CPUs are in use.

FreeBSD:
   4 threads:  1985.48  1997.13  1997.43
  20 threads:  1813.02  1817.73  1824.59

The 4 thread performance is basically identical to the 8 CPU case,
showing that the FreeBSD scaling graphed on 8 CPUs is the same as on 4
CPUs (but without the tail since mysql contention is now rate-limited),
i.e. FreeBSD is continuing to scale linearly.

This measurement shows that FreeBSD is performing 70-80% better than
NetBSD in this 4 CPU configuration. This is in contrast to Andrew's
findings which seem to show NetBSD performing 10% better than FreeBSD on
a 4 CPU system (a very old one though).

I will try later with the experimental kernel Andrew sent me (which
includes the new scheduler). If it indeed gives a 100% performance
improvement that would be a significant result :-)

Kris
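
The 70-80% figure can be checked directly against the numbers Kris posted for the 4 CPU configuration. The following is a minimal sketch, not part of the original thread: it averages the runs he listed for each system and thread count and reports how far FreeBSD is ahead. The helper name `percent_faster` and the choice of the arithmetic mean are assumptions made for illustration; the mail does not say how the 70-80% range was derived.

```python
from statistics import mean

# Benchmark figures copied from the 4 CPU runs in the mail above,
# keyed by number of client threads. Units are whatever sysbench
# reported; the mail does not state them.
netbsd = {
    4: [1137.83, 1135.49, 1138.80, 1138.06],
    20: [1101.84, 1068.56, 1075.32, 998.49],
}
freebsd = {
    4: [1985.48, 1997.13, 1997.43],
    20: [1813.02, 1817.73, 1824.59],
}


def percent_faster(a, b):
    """Return how much higher mean(a) is than mean(b), in percent."""
    return (mean(a) / mean(b) - 1.0) * 100.0


if __name__ == "__main__":
    for threads in (4, 20):
        gain = percent_faster(freebsd[threads], netbsd[threads])
        print(f"{threads:2d} threads: FreeBSD ahead by {gain:.1f}%")
```

With these inputs the gap comes out at roughly 75% for 4 threads and roughly 71% for 20 threads, which is consistent with the range quoted in the mail.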

From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 3:08 pm 2007

Kris Kennaway wrote:

> The 4 thread performance is basically identical to the 8 CPU case,
> showing that the FreeBSD scaling graphed on 8 CPUs is the same as on 4
> CPUs (but without the tail since mysql contention is now rate-limited),
> i.e. FreeBSD is continuing to scale linearly.
>
> This measurement shows that FreeBSD is performing 70-80% better than
> NetBSD in this 4 CPU configuration. This is in contrast to Andrew's
> findings which seem to show NetBSD performing 10% better than FreeBSD on
> a 4 CPU system (a very old one though).
>
> I will try later with the experimental kernel Andrew sent me (which
> includes the new scheduler). If it indeed gives a 100% performance
> improvement that would be a significant result :-)

OK, I have repeated the benchmarking in two additional cases:

1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
gave me (based on the vmlocking branch). This is using the new scheduler.

2) As above with experimental libc and libpthread also given to me by
Andrew. I dunno what changes these contain either :)

I was only able to run in the 8 CPU configuration because when I tried
to disable CPUs with cpuctl, processes would hang under load. This is
probably a scheduler issue.

	http://people.freebsd.org/~kris/scaling/netbsd.png

This shows some improvement but not much, relatively speaking. In
particular performance at 4 threads is still significantly below FreeBSD
performance, which (given what I measured previously) suggests that
there is still a performance deficit with 4 CPUs on NetBSD. It would be
nice to be able to test this directly though, maybe Andrew can give me a
kernel that has MAXCPU=4 or whatever the NetBSD version is.

Kris

From: Andrew Doran <ad@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 3:39 pm 2007

On Fri, Oct 05, 2007 at 09:08:07PM +0200, Kris Kennaway wrote:

> OK, I have repeated the benchmarking in two additional cases:
>
> 1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
> gave me (based on the vmlocking branch). This is using the new scheduler.
>
> 2) As above with experimental libc and libpthread also given to me by
> Andrew. I dunno what changes these contain either :)

It's actually GENERIC.MP from current, with SCHED_M2. No vmlocking code
involved - would you be able to update the labels? The libc has jemalloc,
and libpthread is simply an up to date copy.

> I was only able to run in the 8 CPU configuration because when I tried
> to disable CPUs with cpuctl, processes would hang under load. This is
> probably a scheduler issue.

Right, I doubt that bit has been well tested since the scheduler is so new.

> http://people.freebsd.org/~kris/scaling/netbsd.png
>
> This shows some improvement but not much, relatively speaking. In
> particular performance at 4 threads is still significantly below FreeBSD
> performance, which (given what I measured previously) suggests that
> there is still a performance deficit with 4 CPUs on NetBSD. It would be
> nice to be able to test this directly though, maybe Andrew can give me a
> kernel that has MAXCPU=4 or whatever the NetBSD version is.

Interesting. :-). Thanks for running this. I'm still optimistic about the 4
CPU case so I'm very interested in seeing what the results would be. I'll
have a look into the offline problem this evening.

Thanks,
Andrew

From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 5:38 pm 2007

Andrew Doran wrote:
> On Fri, Oct 05, 2007 at 09:08:07PM +0200, Kris Kennaway wrote:
>
>> OK, I have repeated the benchmarking in two additional cases:
>>
>> 1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
>> gave me (based on the vmlocking branch). This is using the new scheduler.
>>
>> 2) As above with experimental libc and libpthread also given to me by
>> Andrew. I dunno what changes these contain either :)
>
> It's actually GENERIC.MP from current, with SCHED_M2. No vmlocking code
> involved - would you be able to update the labels? The libc has jemalloc,
> and libpthread is simply an up to date copy.

Done.

>> I was only able to run in the 8 CPU configuration because when I tried
>> to disable CPUs with cpuctl, processes would hang under load. This is
>> probably a scheduler issue.
>
> Right, I doubt that bit has been well tested since the scheduler is so new.
>
>> http://people.freebsd.org/~kris/scaling/netbsd.png
>>
>> This shows some improvement but not much, relatively speaking. In
>> particular performance at 4 threads is still significantly below FreeBSD
>> performance, which (given what I measured previously) suggests that
>> there is still a performance deficit with 4 CPUs on NetBSD. It would be
>> nice to be able to test this directly though, maybe Andrew can give me a
>> kernel that has MAXCPU=4 or whatever the NetBSD version is.
>
> Interesting. :-). Thanks for running this. I'm still optimistic about the 4
> CPU case so I'm very interested in seeing what the results would be. I'll
> have a look into the offline problem this evening.

OK thanks.

In the meantime I ran sysbench with postgresql 8.2. Same NetBSD configs
as before (except I built my own kernel with the sched_m2 patches since
I needed to tweak the sysv ipc parameters).

	http://people.freebsd.org/~kris/scaling/netbsd-pgsql.png

postgresql is much more scalable than mysql on this workload and doesn't
have silly scaling bottlenecks inside the application (cf the tail of
the FreeBSD curve for mysql which is where pthread mutex contention
kicked in).

Kris

From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 6, 12:20 pm 2007

Kris Kennaway wrote:

> In the meantime I ran sysbench with postgresql 8.2. Same NetBSD configs
> as before (except I built my own kernel with the sched_m2 patches since
> I needed to tweak the sysv ipc parameters).
>
> http://people.freebsd.org/~kris/scaling/netbsd-pgsql.png
>
> postgresql is much more scalable than mysql on this workload and doesn't
> have silly scaling bottlenecks inside the application (cf the tail of
> the FreeBSD curve for mysql which is where pthread mutex contention
> kicked in).

Here are some more graphs. This one is on the 4 CPU P3 500 MHz and shows
postgresql 8.2. FreeBSD is about 15-20% higher throughput.

	http://people.freebsd.org/~kris/scaling/4cpu-pgsql.png

This one shows mysql on the same system

	http://people.freebsd.org/~kris/scaling/4cpu-mysql.png

In that test NetBSD does outperform FreeBSD but only by 3-4% (in
particular I am not seeing the ~10% difference that Andrew observes on
his 4*p3 700MHz). Given the age of the hardware and the fact that I am
not seeing it on other workloads or on modern hardware it might just be
due to a small scheduling difference on this configuration.

Kris


I wonder why glibc guys

October 7, 2007 - 9:16pm

I wonder why glibc guys don't throw away the current slow memory allocator and replace it with something newer, like the one by Google, or jemalloc, or something like these?

Glibc? There is just one in

October 7, 2007 - 10:28pm
Anonymous (not verified)

Glibc? There is just one in Linux, there is no glibc in BSD.

Which doesn't invalidate the

October 7, 2007 - 11:40pm
Anonymous (not verified)

Which doesn't invalidate the question though. Why is the glibc memory allocation/deallocation so slow and yet not replaced by something faster?

indeed

October 8, 2007 - 12:36am

Exactly. Why are we still using glibc everywhere if it performs this poorly?

One reason would be that

October 8, 2007 - 6:12am
Anonymous (not verified)

One reason would be that tcmalloc does not ever give memory back to the system. I haven't looked at jemalloc so no idea about that one.

Because the glibc

October 8, 2007 - 8:08am
Anonymous (not verified)

Because the glibc maintainers don't care about you.

Just look at Drepper's various agendas with Joe Average.

They cater to the companies.

That's quite irrelevant: the

October 8, 2007 - 9:02am
Anonymous (not verified)

That's quite irrelevant: the average Joe doesn't run MySQL on a 4-CPU server.

Speed is not everything. For

October 8, 2007 - 9:44am
Anonymous (not verified)

Speed is not everything. For a library like glibc, robustness and portability are also essential.

Apart from that, I think glibc malloc performs pretty well. Where are the benchmarks that show that it is "so slow"?

http://goog-perftools.sourcef

October 8, 2007 - 9:55am
Anonymous (not verified)

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

Caveat: these results may be quite different for libc 2.6.

I have just looked at glibc

October 9, 2007 - 2:49am
Anonymous (not verified)

I have just looked at the glibc CVS tree and found that the malloc they use is still ptmalloc2 and not ptmalloc3. However, the glibc maintainers have certainly made some modifications to the original ptmalloc2 to improve speed or space usage.
Also, ptmalloc2 is based on Doug Lea's malloc (dlmalloc) 2.7, which is quite old. Ptmalloc3 is based on the more recent dlmalloc 2.8.3, which dates from 2005.
Other scalable allocators worth mentioning are Hoard and Nedmalloc. The latter is also based on dlmalloc.

Highly useful information.

October 9, 2007 - 9:13am
Anonymous (not verified)

Highly useful information. Thank you.

because..

October 8, 2007 - 1:12am
Anonymous (not verified)

http://goog-perftools.sourceforge.net/doc/tcmalloc.html
Caveats

For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.

TCMalloc may be somewhat more memory hungry than other mallocs, though it tends not to have the huge blowups that can happen with other mallocs. In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades a little bit of speed for more space efficiency.

TCMalloc currently does not return any memory to the system.

Don't try to load TCMalloc into a running binary (e.g., using JNI in Java programs). The binary will have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able to handle such objects.

License

October 8, 2007 - 9:29am
Nony mouse (not verified)

It's released under a BSD license (?)

glibc 2.6 has some type of

October 8, 2007 - 9:53am
Anonymous (not verified)

glibc 2.6 has some type of improved allocator compared to the earlier ones (ptmalloc3 according to some googling). Some of the worst-case behaviour of the current one should already be mitigated. Unfortunately the graphs do not say which malloc they tested against, maybe they did not have glibc 2.6...

On the other hand, if these results were derived with glibc 2.6, we should seriously look into improving it.

tcmalloc website seems to be comparing against ptmalloc2, so the behaviour reported there should no longer reflect the current version. They explain how they derived these results: someone should try replicating them with a libc 2.6.

I doubt the impact of the allocator is that great for most of us, but it does look like there's room for improvement.

Glibc version

October 9, 2007 - 12:08am
BigChris (not verified)

As noted in one of Andrew Doran's emails to the tech-kern mailing list, he is running Fedora Core 7 for the Linux tests. He is running with the stock kernel and glibc that comes with FC7, and using something like the LD_PRELOAD trick to run MySQL with the tcmalloc allocator.

Chris
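
For readers unfamiliar with the LD_PRELOAD trick mentioned above: on Linux the dynamic linker loads any library named in the `LD_PRELOAD` environment variable before the program's normal libraries, so its `malloc` and `free` take precedence over glibc's. The following is a minimal sketch of launching a process that way. The library path and both helper names are hypothetical and chosen for illustration only; the thread does not say exactly how MySQL was started.

```python
import os
import subprocess

# Hypothetical path: the real location of the tcmalloc shared library
# depends on how it was installed.
TCMALLOC_LIB = "/usr/local/lib/libtcmalloc.so"


def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep anything that was already preloaded, but put our library first.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_with_allocator(cmd, lib_path=TCMALLOC_LIB):
    """Start cmd (a list of arguments) with lib_path preloaded."""
    return subprocess.Popen(cmd, env=preload_env(lib_path))
```

The stock glibc stays installed and untouched; only the process started this way, and its children that inherit the environment, use the replacement allocator.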

graph generation

October 7, 2007 - 10:27pm
Anonymous (not verified)

anyone gimme a clue as to what tool they're using to generate the graphs? rrdtool?

Looks like the venerable

October 7, 2007 - 10:56pm
moltonel (not verified)

Looks like the venerable gnuplot.

Gnuplot?

October 7, 2007 - 11:00pm

Hmm those are nice looking graphs for gnuplot.... Can anyone tell me what font that is? My gnuplot graphs are nowhere near as pretty as that :(

I think you get something

October 7, 2007 - 11:46pm
Erik Wikström (not verified)

I think you get something like that when you use "set terminal png small" or "set terminal png tiny". The small/tiny bit refers to the fonts to be used. Don't you just love an application where the fonts that can be used depend on the format you save the output in? Don't get me wrong, gnuplot is one of the best plotting tools out there, it is just not very user friendly or logical.
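
As a rough illustration of the terminal setting described above, here is a small sketch that builds a gnuplot script using "set terminal png small". It is not the script used for the graphs in the article; the function name, file names, axis labels and series title are all assumptions.

```python
def gnuplot_script(data_file, out_png, font_size="small"):
    """Build a gnuplot script that plots column 2 against column 1 as a PNG.

    font_size is passed to the png terminal ("tiny", "small", ...), which
    selects one of its built-in bitmap fonts.
    """
    lines = [
        f"set terminal png {font_size}",
        f'set output "{out_png}"',
        'set xlabel "Threads"',     # label text is an assumption
        'set ylabel "Throughput"',  # label text is an assumption
        f'plot "{data_file}" using 1:2 with linespoints title "run"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; pipe the result into gnuplot to render it.
    print(gnuplot_script("results.dat", "results.png"), end="")
```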

wx

October 8, 2007 - 12:25am

Compile gnuplot with wxwidgets support and the fonts should be available. I used to have that problem as well until about a week ago.

postgresql and threads

October 8, 2007 - 9:49am
Anonymous (not verified)

How is postgresql relevant for a threading benchmark when it does not use threads?

At a guess...

October 8, 2007 - 10:28am
Flewellyn (not verified)

I'd say the idea was to compare the performance of multithreading (MySQL's approach) with plain old multiprocessing with shared memory (PostgreSQL). The idea, I imagine, is that ideally a multithreading application would be faster than a multiprocessing one, because of the reduced overhead of threads versus processes.

I'm not sure how valid this assumption is, mind you. I'm just guessing what their reasoning was.

threaded vs non-threaded

October 8, 2007 - 11:06am
Kris Kennaway (not verified)

The "threading benchmark" is something the author of this article made up :) My work is to do with measuring kernel performance on SMP systems, and trying to identify and fix performance bottlenecks on various common workloads. Threaded vs non-threaded applications aren't really the point, except that they tend to exercise different parts of the kernel.

made up

October 8, 2007 - 1:02pm

Hi Kris,

Thanks for your clarification. However, I'm still unclear on a two or three word summary that would make a descriptive title for what these tests are measuring. I chose "threading benchmark" because I saw a series of benchmarks in which the X and Y axes seemed to indicate we were graphing the performance of threads. Would "performance benchmarking" be better? Or just "measuring performance"?

postgresql and threads

October 8, 2007 - 11:53am
Anonymous (not verified)

This article is not only about POSIX threads (pthreads); it is about threading performance in general. In the kernel, the scheduler and other subsystems often work only with threads called light-weight processes (LWPs). It does not matter whether the application under test is threaded (MySQL) or not (PostgreSQL); it is based on LWPs anyway. Just keep in mind that the benchmarks measure the OS, not the databases.

There is talk about "old systems". It does not matter that the 4-CPU machine is so old; the general principle of SMP is the same. It shows that NetBSD has improved SMP performance very much on both the old and the new machines. Of course there are still a lot of improvements to make.

FreeBSD is a slight bit

October 8, 2007 - 11:20pm
Anonymous (not verified)

FreeBSD is a slight bit slower than the others on the PIII-based machine. I think it is because the kernel is larger and the PIII just cannot handle it effectively. I don't care about the PIII at all; I even threw mine away some days ago.
