A university student studying operating systems asked about why the Linux kernel uses two chained lists in its LRU (least recently used) page replacement algorithm. Andrea Arcangeli [interview [1]], whose virtual memory subsystem was merged into the 2.4.10 kernel, explained, "back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set." He went on to add, "a page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too."
Rik van Riel [interview [2]], author of the reverse mapping virtual memory code that was merged into the 2.5 kernel [story [3]] noted, "since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time." In response to a proposal to split the LRU into two parts, one for the page cache and the other for mapped pages, Nick Piggin [interview [4]] replied, "I actually had patches to do 'split active lists' a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is." Rik offered some ideas on to how to further tune it, "for each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced. That way we can determine which list has the largest fraction of 'idle' pages sitting around and consequently which list should be scanned more aggressively."
From: Niki Hammler [email blocked] To: linux-kernel Subject: Why active list and inactive list? Date: Tue, 23 Jan 2007 01:10:46 +0100 Dear Linux Developers/Enthusiasts, For a course at my university I'm implementing parts of an operating system where I get most ideas from the Linux Kernel (which I like very much). One book I gain information from is [1]. Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained lists - one active list and one inactive list. I implemented my PRA this way too. Now the big question is: WHY do I need *two* lists? Isn't it just overhead/more work? Are there any reasons for that? In my opinion, it would be better to have just one just; pop frames to be swapped out from the end of the list and push new frames in front of it. Then just evaluate the frames and shift them around in the list. Is there any explanation why Linux uses two lists? Thanks you very much in advance! Best regards, Niki [1] Linux Kernelarchitektur. Konzepte, Strukturen und Algorithmen von Kernel 2.6. Wolfang Mauerer, Hanser, ISBN 3-446-22566-8
From: Andrea Arcangeli [5] [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 01:39:39 +0100 On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: > Dear Linux Developers/Enthusiasts, > > For a course at my university I'm implementing parts of an operating > system where I get most ideas from the Linux Kernel (which I like very > much). One book I gain information from is [1]. > > Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained > lists - one active list and one active list. > I implemented my PRA this way too. > > No the big question is: WHY do I need *two* lists? Isn't it just > overhead/more work? Are there any reasons for that? > > In my opinion, it would be better to have just one just; pop frames to > be swapped out from the end of the list and push new frames in front of > it. Then just evaluate the frames and shift them around in the list. > > Is there any explanation why Linux uses two lists? Back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set. The idea is that the pollution will enter and exit the inactive list without ever being elected to the active list because by definition it will never generate a cache hit. The working set will instead trigger cache hits during page faults or repeated reads, and it will be preserved better by electing it to enter the active list. A page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too. If your server "fits in ram" you'll find lots of cache to be active and so the I/O activity not part of the working set will be collected without affecting the working set much.
From: Balbir Singh [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 07:01:33 +0530 Andrea Arcangeli wrote: > On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: >> >> Is there any explanation why Linux uses two lists? > > Back then I designed it with two lru lists because by splitting the > active from the inactive cache allows to detect the cache pollution > before it starts discarding the working set. The idea is that the > pollution will enter and exit the inactive list without ever being > elected to the active list because by definition it will never > generate a cache hit. The working set will instead trigger cache hits > during page faults or repeated reads, and it will be preserved better > by electing it to enter the active list. > > A page in the inactive list will be collected much more quickly than a > page in the active list, so the pollution will be collected more > quickly than the working set. Then the VM while freeing cache tries to > keep a balance between the size of the two lists to avoid being too > unfair, obviously at some point the active list have to be > de-activated too. If your server "fits in ram" you'll find lots of > cache to be active and so the I/O activity not part of the working set > will be collected without affecting the working set much. This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. Of course I might be missing something (some piece of history) -- Balbir Singh Linux Technology Center IBM, ISTL
From: Andrea Arcangeli [6] [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 03:13:27 +0100 On Tue, Jan 23, 2007 at 07:01:33AM +0530, Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme When I tried that long time ago, I recall I had troubles, but there wasn't the reclaim_mapped based on static values so it was even harder. However it would be still a problem today to decide when to switch from the unmapped to the mapped lru. When reclaim_mapped is set, you'll still have to shrink some unmapped page, and by splitting you literally lose age information to save some cpu. Eventually you risk spending time in trying to free unfreeable pinned pages that sits in the unmapped list before finally jumping to the mapped list. So you've to add yet another list to get rid of the pinned stuff in the unmapped list and I stopped when I had to refile pages from the "pinned" list to the unmapped list in irq I/O completion context, now it's all spin_lock_irq so it would be more natural at least... > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. BTW I'm unsure about the cache limit thread, the overhead of the vm collection shouldn't be an issue, and those tends to hide vm inefficiencies. For example Neil has a patch to reduce the writeback cache to 10M-50M (much lower than the current 1% minimum) to hide huge unfariness in the writeback cache. I think they should mount the fs with -o sync instead of using that patch until the unfariness is fixed or tunable. The patch itself is fine though but for that problem it only looks a workaround. So I at least try to be always quite skeptical when I hear about cache "fixed size limiting" patches that improve responsiveness or performance ;) > Of course I might be missing something (some piece of history) Partly ;) Code was very different back then, today it would be easier thanks to reclaim_mapped but the partial loss of age information and potential loss of cpu in a pinned&unmapped walk would probably remain.
From: Rik van Riel [7] [email blocked] Subject: Re: Why active list and inactive list? Date: Mon, 22 Jan 2007 20:42:59 -0500 Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits Unlikely. I have seen several workloads fall over because they did not throw out mapped pages soon enough. If the kernel does not keep the most frequently accessed pages resident, hit rates will suffer. Sometimes (well, usually) those are the mapped pages, but this is not true in all workloads. Some workloads are very page cache heavy and it pays to keep the more frequently accessed page cache pages resident while discarding the less frequently accessed ones. Since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time. > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme How do you classify a mapped page cache page? Another issue is that you'll want to make sure that the page cache pages that are referenced more frequently than the least referenced mapped (I assume you mean anonymous?) pages in memory, while swapping out those least used anonymous pages. One way to do this could be to compare the scan rates, list sizes and referenced percentage of both lists, to find out which of the two caches is hotter. > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. If the lists split out, there is no reason to limit the page cache size because you can easily reclaim them. Right? > Of course I might be missing something (some piece of history) http://linux-mm.org/AdvancedPageReplacement [8] -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
From: Nick Piggin [9] [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 15:17:32 +1100 Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. > > Of course I might be missing something (some piece of history) I actually had patches to do "split active lists" a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is. They actually got a noticable speedup of the swapping kbuild workload, but at this stage there were some more basic improvements needed, so the difference could be smaller today. The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com [10]
From: Rik van Riel [11] [email blocked] Subject: Re: Why active list and inactive list? Date: Mon, 22 Jan 2007 23:34:27 -0500 Nick Piggin wrote: > The other nice thing about it was that it didn't have a hard > cutoff that the current reclaim_mapped toggle does -- you could > opt to scan the mapped list at a lower ratio than the unmapped > one. Of course, it also has some downsides too, and would > require retuning... Here's a simple idea for tuning. For each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced That way we can determine which list has the largest fraction of "idle" pages sitting around and consequently which list should be scanned more aggressively. For each list we can calculate how frequently the pages in the list are being used: pressure = referenced percentage * scan rate / list size The VM can equalize the pressure by scanning the list with lower usage less than the other list. This way the VM can give the right amount of memory to each type. Of course, each list needs to be divided into inactive and active like the current VM, in order to make sure that the pages which are used once cannot push the real working set of that list out of memory. There is a more subtle problem when the list's working set is larger than the amount of memory the list has. In that situation the VM will be faulting pages back in just after they got evicted. Something like my /proc/refaults code can detect that and adjust the size of the undersized list accordingly. Of course, once we properly distinguish between the more frequently and less frequently accessed pages within each of the page sets (mapped/anonymous vs. unmapped) and have the pressure between the lists equalized, why do we need to keep them separate again? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
From: Balbir Singh [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 11:21:38 +0530 Rik van Riel wrote: > Nick Piggin wrote: > >> The other nice thing about it was that it didn't have a hard >> cutoff that the current reclaim_mapped toggle does -- you could >> opt to scan the mapped list at a lower ratio than the unmapped >> one. Of course, it also has some downsides too, and would >> require retuning... > > Here's a simple idea for tuning. > > For each list we keep track of: > 1) the size of the list > 2) the rate at which we scan the list > 3) the fraction of (non new) pages that get > referenced > > That way we can determine which list has the largest > fraction of "idle" pages sitting around and consequently > which list should be scanned more aggressively. > > For each list we can calculate how frequently the pages > in the list are being used: > > pressure = referenced percentage * scan rate / list size > > The VM can equalize the pressure by scanning the list with > lower usage less than the other list. This way the VM can > give the right amount of memory to each type. > This sounds like a good thing to start with. I think we can then use swappiness to decide what to evict. > Of course, each list needs to be divided into inactive and > active like the current VM, in order to make sure that the > pages which are used once cannot push the real working set > of that list out of memory. > Yes, that makes sense. > There is a more subtle problem when the list's working set > is larger than the amount of memory the list has. In that > situation the VM will be faulting pages back in just after > they got evicted. Something like my /proc/refaults code > can detect that and adjust the size of the undersized list > accordingly. > > Of course, once we properly distinguish between the more > frequently and less frequently accessed pages within each > of the page sets (mapped/anonymous vs. unmapped) and have > the pressure between the lists equalized, why do we need > to keep them separate again? > :-) -- Balbir Singh Linux Technology Center IBM, ISTL
Related Links:
- Archive of above thread [12]
- KernelTrap interview with Andrea Arcangeli [13]
- KernelTrap interview with Rik van Riel [14]
- KernelTrap interview with Nick Piggin [15]