Linux: The Really Fair Scheduler

Submitted by Jeremy
on August 30, 2007 - 8:07pm

During the many threads discussing Ingo Molnar's recently merged Completely Fair Scheduler, Roman Zippel has repeatedly questioned the complexity of the new process scheduler. In a recent posting to the Linux Kernel mailing list he offered a simpler scheduler named the 'Really Fair Scheduler' saying, "as I already tried to explain previously CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer, why I could so easily skip over Ingo's long explanation of all the tricks CFS uses to keep the computational overhead low - I simply don't need them." He offered a mathematical overview of how his new scheduler works, included some benchmarks, and reflected back to earlier discussions on the lkml asking, "Ingo, from this point now I need your help, you have to explain to me, what is missing now relative to CFS. I tried to ask questions, but that wasn't very successful..." He also noted:

"The basic idea of this scheduler is somewhat different than CFS. Where the old scheduler maintains fixed time slices, CFS still maintains a dynamic per task time slice. This model does away with it completely, instead it puts the task on a virtual (normalized) time line, where only the relative distance between any two task is relevant."


From:	Roman Zippel [email blocked]
Subject: [ANNOUNCE/RFC] Really Fair Scheduler
Date:	Fri, 31 Aug 2007 04:05:12 +0200 (CEST)

Hi,

I'm glad to announce a working prototype of the basic algorithm I
already suggested last time.
As I already tried to explain previously CFS has a considerable
algorithmic and computational complexity. This patch should now make it
clearer, why I could so easily skip over Ingo's long explanation of all
the tricks CFS uses to keep the computational overhead low - I simply
don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
kernel, the first 3 runs are without patch, the last 3 runs are with the
patch:

Xeon 1.86GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
suse      Linux 2.6.23- 0.9500 1.3900 1.1700 1.6500 1.1700 1.68000 1.22000
suse      Linux 2.6.23- 0.9700 1.3600 1.1400 1.6100 1.1500 1.67000 1.18000
suse      Linux 2.6.23- 0.9800 1.4000 1.1500 1.6600 1.1400 1.70000 1.19000
suse      Linux 2.6.23- 0.7500 1.2000 1.0000 1.4800 0.9800 1.52000 1.06000
suse      Linux 2.6.23- 0.7800 1.2200 0.9800 1.4600 1.0100 1.53000 1.05000
suse      Linux 2.6.23- 0.8300 1.2200 1.0000 1.4800 0.9800 1.52000 1.05000

Celeron 2.66GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
devel6    Linux 2.6.23- 3.1300 3.3800 5.9000 5.8300   18.1 9.33000 18.0
devel6    Linux 2.6.23- 3.1400 3.2800 7.0700 6.0900   17.6 9.18000 17.8
devel6    Linux 2.6.23- 3.1400 3.6000 3.5400 6.1300   17.2 9.02000 17.2
devel6    Linux 2.6.23- 2.7400 3.0300 5.5200 6.4900   16.6 8.70000 17.0
devel6    Linux 2.6.23- 2.7400 3.2700 8.7700 6.8700   17.1 8.79000 17.4
devel6    Linux 2.6.23- 2.7500 3.3100 5.5600 6.1700   16.8 9.07000 17.2


Besides these numbers I can also provide a mathematical foundation for
it, I tried the same for CFS, but IMHO it's not really sanely possible.
This model is far more accurate than CFS is and doesn't add an error
over time, thus there are no more underflow/overflow anymore within the
described limits.
The small example program also demonstrates how it can be easily scaled
down to millisecond resolution to completely get rid of the 64bit math.
This may be interesting for a few archs even if they have a finer clock
resolution as most scheduling decisions are done on a millisecond basis.

The basic idea of this scheduler is somewhat different than CFS. Where
the old scheduler maintains fixed time slices, CFS still maintains a
dynamic per task time slice. This model does away with it completely,
instead it puts the task on a virtual (normalized) time line, where only
the relative distance between any two task is relevant.

So here all the mathematical details necessary to understand what the
scheduler does, so anyone can judge for himself how solid this design
is. First some basics:

(1)	time = sum_{t}^{T}(time_{t})
(2)	weight_sum = sum_{t}^{T}(weight_{t})

Time per task is calculated with:

(3)	time_{t} = time * weight_{t} / weight_sum

This can be also written as: 

(4)	time_{t} / weight_{t} = time / weight_sum

This way we have the normalized time:

(5)	time_norm = time / weight_sum
(6)	time_norm_{t} = time_{t} / weight_{t}

If every task got its share they are all same. Using time_norm one can
calculate the time tasks should get based on their weight:

(7)	sum_{t}^{T}(time_{t}) = sum_{t}^{T}(round(time / weight_sum) * weight_{t})

This is bascially what CFS currently does and it demonstrates the basic
problem it faces. It rounds the normalized time and the rounding error
is also distributed to the time a task gets, so there is a difference
between the time which is distributed to the tasks and the time consumed
by them. On the upside the error is distributed equally to all tasks
relatively to their weight (so it isn't immediately visible via top). On
the downside the error itself is weighted too and so a small error can
be become quite large and the higher the weight the more it contributes
to the error and the more likely it hits one of the limits. Once it hits
a limit, the overflow/underflow time is simply thrown away and is lost
for accounting and the task doesn't get the time it's supposed to get.

An alternative approach is to not use time_norm at all to distribute the
time. Any task can be used to calculate the time any other task needs
relative to it. For this the normalized time per task is maintained
based on (6):

(8)	time_norm_{t} * 2^16 = time_{t} * round(2^16 / weight_{t})

Using the difference between the normalized times of two tasks, one can
calculate the time it needs to equalize the normalized time.
This has the advantage that round(2^16 / weight_{t}) is constant (unless
reniced) and thus also the error due to the rounding. The time one task
gets relative to another is based on these constants. As only the delta
of these times are needed, the absolute value can simply overflow and
the limit of maximum time delta is:

(9)	time_delta_max = KCLOCK_MAX / (2^16 / weight_min)


The global normalized time is still needed and useful (e.g. waking
tasks) and thus this faces the same issue as CFS right now - managing
the rounding error. This means one can't directly use the real
weight_{t} value anymore without producing new errors, so either one
uses this approximate weight:

(10)	weight_app_{t} = 2^16 / round(2^16 / weight_{t})

or even better would be to get rid of this completely.

Based on (5) and (6) one can calculate the global normalized time as:

(11)	time_norm = sum_{t}^{T}(time_{t}) / sum_{t}^{T}(weight_app_{t})
	          = sum_{t}^{T}(time_norm_{t} * weight_app_{t}) / sum_{t}^{T}(weight_app_{t})

This is now a weighted average and provides the possibility to get rid
of weight_app_{t} by simply replacing it:

(12)	time_norm_app = sum_{t}^{T}(time_norm_{t} * weight_{t}) / sum_{t}^{T}(weight_{t})

This produces only a approximate normalized time, but if all
time_norm_{t} are equal (i.e. all tasks got their share), the result is
the same, thus the error is only temporary. If one already approximates
this value this means value other replacements are possible too. In the
previous example program I simply used 1:

(13)	time_norm_app = sum_{t}^{T}(time_norm_{t}) / T

Another approximation is to use a shift:

(14)	time_norm_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) / sum_{t}^{T}(2^weight_shift_{t})

This helps to avoid a possible full 64 bit multiply and makes other
operations elsewhere simpler too and the result should be close enough.
So by maintaining these two sums one can calculate an approximate
normalized time value:

(15)	time_sum_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
(16)	weight_sum_app = sum_{t}^{T}(2^weight_shift_{t})

Using (6) the dividend can also be written as:

(17)	time_sum_app * 2^16 = sum_{t}^{T}(time_{t} * round(2^16 / weight_{t}) * 2^weight_shift_{t})

The multiplier "round(2^16 / weight_{t}) * 2^weight_shift_{t}" is off at
most by 1.42 (or e(.5*l(2))), so this approximate time sum value grows
almost linearly with the real time and thus the maximum of this sum is
reached approximately after:

(18)	time_max = KCLOCK_MAX / 2^16 / 1.42

The problem now is to avoid this overflow, for this the sum is regularly
updated by splitting it:

(19)	time_sum_app = time_sum_base + time_sum_off

Using this (14) can be written as:

(20)	time_norm_app = (time_sum_base + time_sum_off) / weight_sum_app
	              = time_sum_base / weight_sum_app + time_sum_off / weight_sum_app
	              = time_norm_base + time_sum_off / weight_sum_app

time_norm_base is maintained incrementally be defining this increment:

(21)	time_norm_inc = time_sum_max / weight_sum_app

Everytime time_sum_off exceeds time_sum_max, time_sum_off and
time_norm_base are adjusted appropriately. time_sum_max is scaled so
that the required update frequency is reduced to a minimum, but also so
that time_sum_off can be easily scaled down to a 32 bit value when
needed.

This basically describes the static system, but to allow for sleeping
and waking these sums need adjustments to preserve a proper average:

(22)	weight_app_sum += 2^weight_shift_{new}
(23)	time_sum_max += time_norm_inc * 2^weight_shift_{new}
(24)	time_sum_off += (time_norm_{new} - time_norm_base) * 2^weight_shift_{new}

The last one is a little less obvious, it can be derived from (15) and
(19):

(25)	time_sum_base + time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
	time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_sum_base
	             = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_norm_base * weight_sum_app
	             = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - sum_{t}^{T}(time_norm_base * 2^weight_shift_{t})
	             = sum_{t}^{T}((time_norm_{t} - time_norm_base) * 2^weight_shift_{t})

As one can see here time_norm_base is only interesting relative to the
normalized task time and so the same limits apply.

The average from (20) can now be used to calculate the normalized time
for the new task in (24). It can be a given a bonus relative to the
other task or it might be still within a certain limit, as it hasn't
slept long enough. The limit (9) still applies here, so a simple
generation counter may still be needed for long sleeps.
The time_sum_off value used to calculate the average can be scaled down
as mentioned above. As it contains far more resolution than needed for
short time scheduling decisions, the lower bits can be thrown away to
get a 32 bit value. The scaling of time_sum_max makes sure that one
knows the location of the most significant bit, so that the 32 bit is
used as much as possible.


Finally a few more notes about the patch. The current task is not kept
in the tree (just this saves a lot of tree updates), so I faced a
similiar problem as FAIR_GROUP_SCHED, that enqueue_task/dequeue_task can
be called for the current task, so maintaining the current task pointer
for the class is interesting. Instead of adding a set_curr_task it would
be IMO simpler to further reduce the number of indirect calls, e.g. the
deactivate_task/put_prev_task sequence can be replaced with a single
call (and I don't need the sleep/wake arguments anymore, so it can be
reused for that).
I disabled the usage of cpu load as its calculation is also rather 64bit
heavy and I suspect it could be easily scaled down, but this way it's
not my immediate concern.

Ingo, from this point now I need your help, you have to explain to me,
what is missing now relative to CFS. I tried to ask questions, but that
wasn't very successful...

bye, Roman

---
 include/linux/sched.h |   21 -
 kernel/sched.c        |  158 +++++----
 kernel/sched_debug.c  |   26 -
 kernel/sched_norm.c   |  812 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 910 insertions(+), 107 deletions(-)

Related Links:

well I've just tested it,

J-oe (not verified)
on
September 2, 2007 - 4:25am

well I've just tested it, but I'm quite a n00b over responsiveness benchmarks, dunnoh, just did what first came out to my mind: compile kernel while opening firefox, and lots of 00.org apps, everything fine ;<) impressive, but I must warn! one cannot considerate this as an acquerate test (wish I knew more of Con's responsiveness benchmark tool), Gtkperf which is a simple text drawing performed 2 second faster, but again the test can be tweaked because it's wasn't performed under the same conditions.

Gamming, blah blah (not interested in that subject)

Anyway really worth trying. (Someone over the kernel mailing list merged it over a complete patchset).

regards, have fun.

IO scheduler is MUCH MORE IMPORTANT

eha11 (not verified)
on
September 1, 2007 - 3:23pm

IO matters much more because it is much more likely to be a bottleneck. Set your priorities straight.

From http://linux.slashdot.org/comments.pl?sid=286045&cid=20435985 :

Screw the CPU scheduler at this point. The kernel folks are missing the obvious and utter brokenness of the IO scheduling. These bugs have been outstanding about a year now!! And it's not just AMD64 anymore either. Quoth the kernel bug report:

"Now, as far as this bug being AMD64 only. We develop a portable data analysis
tool and we run it on Intel Core Mobile systems (Sony UX series, Panasonic
Toughbook series) and see this bug or one almost exactly like it on those
platforms as well.
"

http://bugzilla.kernel.org/show_bug.cgi?id=7372 [kernel.org]
http://bugzilla.kernel.org/show_bug.cgi?id=8636 [kernel.org]
http://www.nabble.com/IO-activity-brings-my-deskto p-to-its-knees-(2.6.22.1-ck1)-t4192136.html [nabble.com]
http://forums.gentoo.org/viewtopic-t-482731-start- 500.html [gentoo.org]

At first, deadline IO was touted as an answer, but that doesn't completely fix things.
Some say Native Command Queueing is broken. One person claims deadline + NCQ disabled helps.
Some say the kernel's vfs_cache_pressure settings help, while others refute it (compare kernel bug report versus page 21 of the gentoo forum thread). But no one understands what's really broken in the kernel.

Can we please get Ingo working on IO scheduling? PLEASE?

Answered in the same

Anonymous (not verified)
on
September 2, 2007 - 5:54am

Answered in the same slashdot thread. (link)

So the kernel devs know about this and are working on a solution and have a fix for it. All you have to do is to check out the BDI patches (try the -mm kernel, it has them included) and provide feedback to the kernel list.

WEE, Ingo working on IO scheduling?

J-oe (not verified)
on
September 2, 2007 - 3:39pm

certainly, you didn't read the patch changelog, it's not a new scheduler and for what Roman writes, I guess he is trying to merge it to Ingo sched ASAP, but IT'S N0t a new scheduler it's a reimplementation, just some rewrites, bits reallocation, new lines, another deletions, etc.

As for the math goes, clueless, I 'm waiting for some one to come up with a plain english explanation.
Anyways, you just were dropping buying, and felt in need of saying "whatever"???

WHy don't you just try the d4mm patch instead of saying the first thing that crosses through your brain, maybe it will answer this question of yours:

>> Can we please get
>> Ingo working on IO scheduling? PLEASE?

I have a dream!

zza (not verified)
on
September 4, 2007 - 6:30am

The scheduler problem …
It is still hard to believe that in this day and age with duel core and massive amounts memory on a computer that I cannot run my MP3 player and download at the same time?
I do agree that IO is the bottle neck..
But all of this comes from the fact that OS systems now where designed and built to run on one CPU.. Now the hardware has changed and we added threading and duel cores(even more cores coming).. But yet we are still running the same kernels that where built back in the 90’s?
I think Linux is missing its chance to overtake windows by not totally re-writing it’s kernel from the ground up… Have the ability to detect what the user likes to run and have AI built into the OS that can then change to different types of schedulers.. This will create a lot of bugs and problems but the trade off will be huge.. We would have a OS written with threading in mind and with all the new Mobil technology..
But I guess this is my dream.. till then I am stuck running out of date kernels that are slow..

Whaaaaaaa? I believe you,

Anonymous (not verified)
on
September 4, 2007 - 2:30pm

Whaaaaaaa? I believe you, your machine or the setup is the problem, haven't seen that in a looooooooooong time...

Sure it's not a userspace problem?

Comment viewing options