The current page replacement scheme in Linux has a number of problems,
which can be boiled down to:
- Sometimes the kernel evicts the wrong pages, which can result in
  bad performance.
- The kernel scans over pages that should not be evicted. On systems
  with a few GB of RAM, this can result in the VM using an annoying
  amount of CPU. On systems with >128GB of RAM, this can knock the
  system out for hours since excess CPU use is compounded with lock
  contention and other issues.

This patch series tries to address the issues by splitting the LRU
lists into two sets, one for swap/ram backed pages ("anon") and
one for filesystem backed pages ("file").

The current version only has the infrastructure. Large changes to
the page replacement policy will follow later.

More details can be found on this page:

http://linux-mm.org/PageReplacementDesign

TODO:
- have any mlocked and ramfs pages live off of the LRU list,
  so we do not need to scan these pages
- switch to SEQ replacement for the anon LRU lists, so the
  worst case number of pages to scan is reduced greatly.
- figure out if the file LRU lists need page replacement
  changes to help with worst case scenarios
- implement and benchmark a scalable non-resident page
  tracking implementation in the radix tree, this may make
  the anon/file balancing algorithm more stable and could
  allow for further simplifications in the balancing
  algorithm

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-
Define page_file_cache() function to answer the question:
is page backed by a file?
Originally part of Rik van Riel's split-lru patch. Extracted
to make available for other, independent reclaim patches.
Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.
Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU. Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -1,3 +1,23 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache(@page)
+ * Returns !0 if @page is page cache page backed by a regular file,
+ * or 0 if @page is anonymous, tmpfs or otherwise swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to propagate to whereever the page is last deleted from the LRU.
+ */
+static inline int page_file_cache(struct page *page)
+{
+ if (PageSwapBacked(page))
+ return 0;
+
+ /* The page is page cache backed by a normal filesystem. */
+ return 2;
+}
+
static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
@@ -38,3 +58,4 @@ del_page_from_lru(struct zone *zone, str
}
}
+#endif
Index: linux-2.6.23-mm1/mm/shmem.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/shmem.c
+++ linux-2.6.23-mm1/mm/shmem.c
@@ -1267,6 +1267,7 @@ repeat:
goto failed;
}
+ SetPageSwapBacked(filepage);
spin_lock(&info->lock);
entry = ...
Well it's not clear what is meant by a file in the first place. By file you mean disk space in contrast to ram based filesystems?

I think we could add a flag to the bdi to indicate whether the backing store is a disk file. In fact you can also deduce if a device has

The bdi may avoid that extra flag.
-
On Tue, 6 Nov 2007 18:23:44 -0800 (PST)

Yes. I have improved the comment over page_file_cache() a bit:

/**
 * page_file_cache(@page)
 * Returns !0 if @page is page cache page backed by a regular filesystem,
 * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
 *
 * We would like to get this info without a page flag, but the state
 * needs to survive until the page is last deleted from the LRU, which
 * could be as far down as __page_cache_release.
 */

The bdi will no longer be accessible by the time a page makes it to free_hot_cold_page, which is one place in the kernel where this information is needed.
-
At that point you need only information about which list the page was put on. Don't we need something like PageLRU -> PageFileLRU and PageMemLRU?

The page may change its nature I think? What if a page becomes swap backed?
-
On Tue, 6 Nov 2007 19:02:47 -0800 (PST)

That is exactly why we need a page flag. If you have a better name for the page flag, please let me know.

Note that the kind of page needs to be separate from PageLRU, since pages are taken off of and put back onto LRUs all the time.

Every anonymous, tmpfs or shared memory segment page is potentially swap backed. That is the whole point of the PG_swapbacked flag.

A page from a filesystem like ext3 or NFS cannot suddenly turn into a swap backed page. This page "nature" is not changed during the lifetime of a page.
-
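The argument for a dedicated flag can be illustrated with a small userspace sketch. This is not kernel code: struct mock_page, the MOCK_PG_SWAPBACKED bit and the mock_* helpers are stand-ins invented for this example, and only the shape of page_file_cache() follows the patch earlier in the thread. The point it tries to show is that the answer stays the same after the mapping pointer has been cleared, which a classification derived from other page fields could not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for PG_swapbacked. */
#define MOCK_PG_SWAPBACKED (1UL << 0)

/* Hypothetical, heavily reduced stand-in for struct page. */
struct mock_page {
    unsigned long flags; /* flag bits, survive until the page is freed */
    void *mapping;       /* may already be NULL late in the page's life */
};

/* Set once when the page is created as anon/tmpfs/shmem; never cleared. */
static void mock_set_page_swap_backed(struct mock_page *page)
{
    assert(page != NULL);
    page->flags |= MOCK_PG_SWAPBACKED;
}

/*
 * Same logic as page_file_cache() in the patch: 0 for swap backed pages,
 * nonzero (2 in that version of the patch) for filesystem backed pages.
 * Only the flag is consulted, never page->mapping.
 */
static int mock_page_file_cache(const struct mock_page *page)
{
    assert(page != NULL);
    if (page->flags & MOCK_PG_SWAPBACKED)
        return 0;
    return 2;
}
```

A caller would mark a tmpfs or anonymous page once with mock_set_page_swap_backed() and could then ask mock_page_file_cache() at any later point, including after page->mapping has been reset to NULL.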
One of the current issues with anonymous pages is the accounting when they become file backed and get dirty. There are performance issues with swap writeout because we are not doing it in file order and on a page by page basis.

Well COW sort of does that but then it's a new page.
-
On Tue, 6 Nov 2007 19:26:33 -0800 (PST)

What are you talking about?

That is one of the reasons everything that is ram/swap backed goes onto a different set of LRU lists from everything that is

Since ramfs pages cannot be evicted from memory at all, they

Exactly. As far as I know, a page never changes from a file page into an anonymous page, or the other way around.
-
On Wed, 7 Nov 2007 10:06:10 -0800 (PST)

With the patch set from last weekend, the file LRU.

With the patch set later this week, they'll be in the "noreclaim" page set, which is never scanned by the VM.
-
If they are swap backed then they have a backing store on disk. They are

That sounds better.
-
Debug whether we end up classifying the wrong pages as
filesystem backed. This has not triggered in stress
tests on my system, but who knows...
Signed-off-by: Rik van Riel <riel@redhat.com>
Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -1,6 +1,8 @@
#ifndef LINUX_MM_INLINE_H
#define LINUX_MM_INLINE_H
+#include <linux/fs.h> /* for struct address_space */
+
/**
* page_file_cache(@page)
* Returns !0 if @page is page cache page backed by a regular file,
@@ -9,11 +11,19 @@
* We would like to get this info without a page flag, but the state
* needs to propagate to whereever the page is last deleted from the LRU.
*/
+extern const struct address_space_operations shmem_aops;
static inline int page_file_cache(struct page *page)
{
+ struct address_space * mapping = page_mapping(page);
+
if (PageSwapBacked(page))
return 0;
+ /* These pages should all be marked PG_swapbacked */
+ WARN_ON(PageAnon(page));
+ WARN_ON(PageSwapCache(page));
+ WARN_ON(mapping && mapping->a_ops && mapping->a_ops == &shmem_aops);
+
/* The page is page cache backed by a normal filesystem. */
return 2;
}
Index: linux-2.6.23-mm1/mm/shmem.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/shmem.c
+++ linux-2.6.23-mm1/mm/shmem.c
@@ -180,7 +180,7 @@ static inline void shmem_unacct_blocks(u
}
static const struct super_operations shmem_ops;
-static const struct address_space_operations shmem_aops;
+const struct address_space_operations shmem_aops;
static const struct file_operations shmem_file_operations;
static const struct inode_operations shmem_inode_operations;
static const struct inode_operations shmem_dir_inode_operations;
@@ -2344,7 +2344,7 @@ static void destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}
-static ...
Use an indexed array for LRU variables. This makes the rest
of the split VM code a lot cleaner.
V1 -> V2 [lts]:
+ Remove extraneous __dec_zone_state(zone, NR_ACTIVE) pointed
  out by Mel G.
From clameter@sgi.com Wed Aug 29 11:39:51 2007
Currently we are defining explicit variables for the inactive
and active list. An indexed array can be more generic and avoid
repeating similar code in several places in the reclaim code.
We are saving a few bytes in terms of code size:
Before:
   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux
After:
   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux
Having an easy way to add new lru lists may ease future work on
the reclaim code.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
include/linux/mm_inline.h | 34 ++++++++---
include/linux/mmzone.h | 17 +++--
mm/page_alloc.c | 9 +--
mm/swap.c | 2
mm/vmscan.c | 132 ++++++++++++++++++++++------------------------
mm/vmstat.c | 3 -
6 files changed, 107 insertions(+), 90 deletions(-)
Index: linux-2.6.23-rc8-mm2-vm/include/linux/mmzone.h
===================================================================
--- linux-2.6.23-rc8-mm2-vm.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc8-mm2-vm/include/linux/mmzone.h
@@ -82,8 +82,8 @@ struct zone_padding {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
- NR_INACTIVE,
- NR_ACTIVE,
+ NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
+ NR_ACTIVE, /* " " " " " */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -107,6 +107,13 @@ enum zone_stat_item {
#endif
...
Swapin_readahead can read in a lot of data that the processes in
memory never need. Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.
This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.
In short, this patch needs testing.
Signed-off-by: Rik van Riel <riel@redhat.com>
Index: linux-2.6.23-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/swap_state.c
+++ linux-2.6.23-mm1/mm/swap_state.c
@@ -370,7 +370,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
- lru_cache_add_active_anon(new_page);
+ lru_cache_add_anon(new_page);
swap_readpage(NULL, new_page);
return new_page;
}
-
Make the LRU arithmetic more explicit. Hopefully this will make
the code a little easier to read and less prone to future errors.
Signed-off-by: Rik van Riel <riel@redhat.com>
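Note for readers: the index arithmetic this patch makes explicit can be sketched in isolation as below. The numeric values of LRU_BASE, LRU_ACTIVE and LRU_FILE and the exact layout of enum lru_list are assumptions here, since their definitions are not shown in this thread; they are chosen to be consistent with the earlier "return 2" in page_file_cache(). The helper lru_index() is hypothetical and only mirrors what isolate_pages_global() does in the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values; the real definitions are not part of this thread. */
#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

/* Assumed layout: each list index is a sum of the offsets above. */
enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    NR_LRU_LISTS
};

/*
 * Hypothetical helper: build a list index from the two independent
 * properties of a page, the same way the patched code adds offsets.
 */
static enum lru_list lru_index(bool active, bool file)
{
    int l = LRU_BASE;

    if (active)
        l += LRU_ACTIVE;
    if (file)
        l += LRU_FILE;

    assert(l >= 0 && l < NR_LRU_LISTS);
    return (enum lru_list)l;
}
```

With named offsets, adding or reordering lists would only require touching the enum, not every place that previously subtracted one list constant from another.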
Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -28,7 +28,7 @@ static inline int page_file_cache(struct
return 0;
/* The page is page cache backed by a normal filesystem. */
- return (LRU_INACTIVE_FILE - LRU_INACTIVE_ANON);
+ return LRU_FILE;
}
static inline void
Index: linux-2.6.23-mm1/mm/swap.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/swap.c
+++ linux-2.6.23-mm1/mm/swap.c
@@ -180,12 +180,12 @@ void fastcall activate_page(struct page
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
- int l = LRU_INACTIVE_ANON;
+ int l = LRU_BASE;
l += page_file_cache(page);
del_page_from_lru_list(zone, page, l);
SetPageActive(page);
- l += LRU_ACTIVE_ANON - LRU_INACTIVE_ANON;
+ l += LRU_ACTIVE;
add_page_to_lru_list(zone, page, l);
__count_vm_event(PGACTIVATE);
mem_cgroup_move_lists(page_get_page_cgroup(page), true);
Index: linux-2.6.23-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/vmscan.c
+++ linux-2.6.23-mm1/mm/vmscan.c
@@ -786,11 +786,11 @@ static unsigned long isolate_pages_globa
struct mem_cgroup *mem_cont,
int active, int file)
{
- int l = LRU_INACTIVE_ANON;
+ int l = LRU_BASE;
if (active)
- l += LRU_ACTIVE_ANON - LRU_INACTIVE_ANON;
+ l += LRU_ACTIVE;
if (file)
- l += LRU_INACTIVE_FILE - LRU_INACTIVE_ANON;
+ l += LRU_FILE;
return isolate_lru_pages(nr, &z->list[l], dst, scanned, order,
mode, !!file);
}
@@ -842,7 +842,7 @@ int isolate_lru_page(struct page *page)
spin_lock_irq(&zone->lru_lock);
...
move isolate_lru_page() to vmscan.c
Against 2.6.23-rc4-mm1
V1 -> V2 [lts]:
+ fix botched merge -- add back "get_page_unless_zero()"
From: Nick Piggin <npiggin@suse.de>
To: Linux Memory Management <linux-mm@kvack.org>
Subject: [patch 1/4] mm: move and rework isolate_lru_page
Date: Mon, 12 Mar 2007 07:38:44 +0100 (CET)
isolate_lru_page logically belongs in vmscan.c rather than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a subsequent
patch needs to make use of it in the core mm, so we can happily move it to
vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
include/linux/migrate.h | 3 ---
mm/internal.h | 2 ++
mm/mempolicy.c | 10 ++++++++--
mm/migrate.c | 47 ++++++++++-------------------------------------
mm/vmscan.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 61 insertions(+), 42 deletions(-)
Index: Linux/include/linux/migrate.h
===================================================================
--- Linux.orig/include/linux/migrate.h 2007-07-08 19:32:17.000000000 -0400
+++ Linux/include/linux/migrate.h 2007-09-20 10:21:52.000000000 -0400
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct
return 1;
}
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
extern int putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *);
@@ -42,8 +41,6 @@
...
Reviewed-by: Christoph Lameter <clameter@sgi.com> -
Make lumpy reclaim and the split VM code work together better, by
allowing both file and anonymous pages to be reclaimed together.
Will be merged into patch 6/10 soon, split out for the benefit of
people who have looked at the older code in the past.
Signed-off-by: Rik van Riel <riel@redhat.com>
Index: linux-2.6.23-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/vmscan.c
+++ linux-2.6.23-mm1/mm/vmscan.c
@@ -752,10 +752,6 @@ static unsigned long isolate_lru_pages(u
cursor_page = pfn_to_page(pfn);
- /* Don't lump pages of different types: file vs anon */
- if (!PageLRU(page) || (file != !!page_file_cache(cursor_page)))
- break;
-
/* Check that we have not crossed a zone boundary. */
if (unlikely(page_zone_id(cursor_page) != zone_id))
continue;
@@ -799,16 +795,22 @@ static unsigned long isolate_pages_globa
* clear_active_flags() is a helper for shrink_active_list(), clearing
* any active bits from the pages in the list.
*/
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+ unsigned int *count)
{
int nr_active = 0;
+ int lru;
struct page *page;
- list_for_each_entry(page, page_list, lru)
+ list_for_each_entry(page, page_list, lru) {
+ lru = page_file_cache(page);
if (PageActive(page)) {
+ lru += LRU_ACTIVE;
ClearPageActive(page);
nr_active++;
}
+ count[lru]++;
+ }
return nr_active;
}
@@ -876,24 +878,25 @@ static unsigned long shrink_inactive_lis
unsigned long nr_scan;
unsigned long nr_freed;
unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+ ISOLATE_BOTH : ISOLATE_INACTIVE;
nr_taken = sc->isolate_pages(sc->swap_cluster_max,
- &page_list, &nr_scan, sc->order,
- (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
- ISOLATE_BOTH : ...
Rik van Riel's patch to free swap space on swap-in/activation,
forward ported by Lee Schermerhorn.
Against: 2.6.23-rc2-mm2 atop:
+ lts' convert anon_vma list lock to reader/write lock patch
+ Nick Piggin's move and rework isolate_lru_page() patch
Patch Description: quick attempt by lts
Free swap cache entries when swapping in pages if vm_swap_full()
[swap space > 1/2 used?]. Uses new pagevec to reduce pressure
on locks.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
include/linux/pagevec.h | 1 +
mm/swap.c | 18 ++++++++++++++++++
mm/vmscan.c | 16 +++++++++++-----
3 files changed, 30 insertions(+), 5 deletions(-)
Index: linux-2.6.23-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-rc6-mm1.orig/mm/vmscan.c 2007-09-25 15:20:05.000000000 -0400
+++ linux-2.6.23-rc6-mm1/mm/vmscan.c 2007-09-25 15:25:04.000000000 -0400
@@ -613,6 +613,9 @@ free_it:
continue;
activate_locked:
+ /* Not a candidate for swapping, so reclaim swap space. */
+ if (PageSwapCache(page) && vm_swap_full())
+ remove_exclusive_swap_page(page);
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -1142,14 +1145,13 @@ force_reclaim_mapped:
}
}
__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+ spin_unlock_irq(&zone->lru_lock);
pgdeactivate += pgmoved;
- if (buffer_heads_over_limit) {
- spin_unlock_irq(&zone->lru_lock);
- pagevec_strip(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
+ if (buffer_heads_over_limit)
+ pagevec_strip(&pvec);
pgmoved = 0;
+ spin_lock_irq(&zone->lru_lock);
while (!list_empty(&l_active)) {
page = lru_to_page(&l_active);
prefetchw_prev_lru_page(page, &l_active, flags);
@@ -1163,6 +1165,8 @@ force_reclaim_mapped:
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
+ if ...
Why are we dropping the lock here now? There would be less activity

Same here. Maybe the spin_unlock and the spin_lock can go into pagevec_swap_free?
-
The memory controller code is still quite simple, so don't do
anything fancy for now trying to make it work better with the
split VM code.
Will be merged into 6/10 soon.
Signed-off-by: Rik van Riel <riel@redhat.com>
Index: linux-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/memcontrol.c
+++ linux-2.6.23-mm1/mm/memcontrol.c
@@ -210,7 +210,6 @@ unsigned long mem_cgroup_isolate_pages(u
struct list_head *src;
struct page_cgroup *pc;
-//TODO: memory container maintain separate file/anon lists?
if (active)
src = &mem_cont->active_list;
else
@@ -222,6 +221,9 @@ unsigned long mem_cgroup_isolate_pages(u
page = pc->page;
VM_BUG_ON(!pc);
+ /*
+ * TODO: play better with lumpy reclaim, grabbing anything.
+ */
if (PageActive(page) && !active) {
__mem_cgroup_move_lists(pc, true);
scan--;
@@ -240,6 +242,9 @@ unsigned long mem_cgroup_isolate_pages(u
if (page_zone(page) != z)
continue;
+ if (file != !!page_file_cache(page))
+ continue;
+
/*
* Check if the meta page went away from under us
*/
-
Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon"). The latter includes tmpfs.
Eventually mlocked pages will be taken off the LRUs altogether.
A patch for that already exists and just needs to be integrated
into this series.
This patch mostly has the infrastructure and a basic policy to
balance how much we scan the anon lists and how much we scan
the file lists. Fancy policy changes will be in separate patches.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Index: linux-2.6.23-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.23-mm1.orig/fs/proc/proc_misc.c
+++ linux-2.6.23-mm1/fs/proc/proc_misc.c
@@ -149,43 +149,47 @@ static int meminfo_read_proc(char *page,
* Tagged format, for easy grepping and expansion.
*/
len = sprintf(page,
- "MemTotal: %8lu kB\n"
- "MemFree: %8lu kB\n"
- "Buffers: %8lu kB\n"
- "Cached: %8lu kB\n"
- "SwapCached: %8lu kB\n"
- "Active: %8lu kB\n"
- "Inactive: %8lu kB\n"
+ "MemTotal: %8lu kB\n"
+ "MemFree: %8lu kB\n"
+ "Buffers: %8lu kB\n"
+ "Cached: %8lu kB\n"
+ "SwapCached: %8lu kB\n"
+ "Active(anon): %8lu kB\n"
+ "Inactive(anon): %8lu kB\n"
+ "Active(file): %8lu kB\n"
+ "Inactive(file): %8lu kB\n"
#ifdef CONFIG_HIGHMEM
- "HighTotal: %8lu kB\n"
- "HighFree: %8lu kB\n"
- "LowTotal: %8lu kB\n"
- "LowFree: %8lu kB\n"
-#endif
- "SwapTotal: %8lu kB\n"
- "SwapFree: %8lu kB\n"
- "Dirty: %8lu kB\n"
- "Writeback: %8lu kB\n"
- "AnonPages: %8lu kB\n"
- "Mapped: %8lu kB\n"
- "Slab: %8lu kB\n"
- "SReclaimable: %8lu kB\n"
- "SUnreclaim: %8lu kB\n"
- "PageTables: %8lu kB\n"
- "NFS_Unstable: %8lu kB\n"
- "Bounce: %8lu kB\n"
- "CommitLimit: %8lu ...
If we split the memory backed from the disk backed pages then they are no longer competing with one another on equal terms? So the file LRU may run faster than the memory LRU?

The patch looks awfully large.
-
On Tue, 6 Nov 2007 18:28:19 -0800 (PST)

The file LRU probably *should* run faster than the memory LRU most of the time, since we stream the readahead data for many sequentially accessed files through the file LRU.

We adjust the rates at which the two LRUs are scanned depending on the fraction of referenced pages found when scanning each list.

Making it smaller would probably result in something that does not work right.
-
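The balancing idea described in that reply can be sketched roughly as follows. This is an illustration only, not the code from the patch: struct lru_scan_stats and file_scan_percent() are hypothetical names, and the formula (weight each list by the fraction of scanned pages that were not referenced) is an assumption about one plausible way to turn "fraction of referenced pages" into a scan ratio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-list counters gathered during recent scanning. */
struct lru_scan_stats {
    unsigned long scanned;    /* pages looked at on this list */
    unsigned long referenced; /* of those, pages found referenced */
};

/*
 * Hypothetical helper: share of scanning effort (0..100) to direct at
 * the file lists. A list whose pages are mostly referenced is scanned
 * less; a list full of unreferenced pages is scanned more. The +1
 * terms keep every factor nonzero, so neither list is starved and the
 * division below is always defined.
 */
static unsigned int file_scan_percent(const struct lru_scan_stats *anon,
                                      const struct lru_scan_stats *file)
{
    unsigned long long anon_ref, file_ref;
    unsigned long long anon_cold, anon_total, file_cold, file_total;

    assert(anon != NULL && file != NULL);

    /* Clamp so that inconsistent counters cannot underflow. */
    anon_ref = anon->referenced > anon->scanned ? anon->scanned : anon->referenced;
    file_ref = file->referenced > file->scanned ? file->scanned : file->referenced;

    anon_cold  = anon->scanned - anon_ref + 1;
    anon_total = anon->scanned + 1ULL;
    file_cold  = file->scanned - file_ref + 1;
    file_total = file->scanned + 1ULL;

    /* file_frac / (anon_frac + file_frac), cross-multiplied to stay in integers. */
    return (unsigned int)(100 * file_cold * anon_total /
                          (anon_cold * file_total + file_cold * anon_total));
}
```

For example, if 900 of 1000 scanned anon pages were referenced but only 100 of 1000 scanned file pages were, this sketch directs most of the scanning at the file lists, which matches the expectation that the file LRU "should run faster" under streaming I/O.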
Hmmmm.. I'd rather see where we are going. One other way of addressing many of these issues is to allow large page sizes on the LRU which will reduce the number of entities that have to be managed. Both approaches

We do not have an accepted standard load. So how would we figure that one out?
-
On Tue, 6 Nov 2007 18:11:39 -0800 (PST)

Linus seems to have vetoed that (unless I am mistaken), so the chances of that happening soon are probably not very large.

Also, a factor 16 increase in page size is not going to help if memory sizes also increase by a factor 16, since we already

For some workloads this is the most urgent change, indeed. Since the patches for this already exist, integrating them is at the top of my list. Expect this to be integrated into

The current worst case is where we need to scan all of memory, just to find a few pages we can swap out. With the effects of lock contention figured in, this can take hours on huge systems.

In order to make the VM more scalable, we need to find acceptable pages to swap out with low complexity in the VM. The "worst case" above refers to the upper bound on how much work the VM needs to do in order to get something evicted from the page cache or swapped out.
-
Note that a factor 16 increase usually goes hand in hand with more processors. The synchronization of multiple processors becomes a concern. If you have an 8p and each of them tries to get the zone locks for reclaim then we are already in trouble. And given the immaturity of the handling of cacheline contention in current commodity hardware this

A bit sparse but limiting the scanning if we cannot do much is certainly the right thing to do. The percentage of memory taken up by anonymous pages varies depending on the load. HPC applications may consume all of memory with anonymous pages. But there the pain is already so bad that

Right but I think this looks like a hopeless situation regardless of the algorithm if you have a couple of million pages and are trying to free one. Now imagine a series of processors going on the hunt for the few pages that can be reclaimed.
-
On Tue, 6 Nov 2007 18:40:46 -0800 (PST)

Which is why we need to greatly reduce the number of pages

An algorithm that only clears the referenced bit and then moves the anonymous page from the active to the inactive list will do a lot less work than an algorithm that needs to scan the *whole* active list because all of the pages on it are referenced.

This is not a theoretical situation: every anonymous page starts out referenced!

Add in a relatively small inactive list on huge memory systems, and we could have something of an acceptable algorithmic complexity.
-
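The difference in work between the two approaches can be made concrete with a toy model. The sketch below is a userspace simulation under simplifying assumptions: the active list is an array of referenced bits, no page is re-referenced while scanning, and both function names are invented for this example. It counts how many pages the scanner must look at before the first page can leave the active list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the rotating behaviour: a referenced page gets its bit
 * cleared and goes back onto the active list; only a page whose bit is
 * already clear is deactivated. Returns the number of pages examined
 * before the first deactivation.
 */
static size_t scans_before_deactivation_rotate(bool *referenced, size_t n)
{
    size_t scanned = 0;
    size_t i = 0;

    assert(referenced != NULL || n == 0);
    if (n == 0)
        return 0;

    for (;;) {
        scanned++;
        if (!referenced[i])
            return scanned;   /* found a deactivation candidate */
        referenced[i] = false; /* clear the bit and rotate past the page */
        i = (i + 1) % n;
    }
}

/*
 * Toy model of the proposed behaviour for anon pages: clear the
 * referenced bit and move the page to the inactive list right away,
 * so every page examined is also a page deactivated.
 */
static size_t scans_before_deactivation_clear_and_move(bool *referenced, size_t n)
{
    assert(referenced != NULL || n == 0);
    if (n == 0)
        return 0;

    referenced[0] = false;
    return 1;
}
```

With N pages that all start out referenced, the rotating model examines N + 1 pages before anything moves, while the clear-and-move model examines one. The cost of the first model grows with memory size, which is the scalability concern raised above.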
It strikes me that splitting one list into two lists will not provide sufficient improvement in search efficiency to do that. I mean, a naive guess would be that it will, on average, halve the amount of work which needs to be done. But we need multiple-orders-of-magnitude improvements to address the pathological worst-cases which you're looking at there. Where is this coming from? Or is the problem which you're seeing due to scanning of mapped pages at low "distress" levels? Would be interested in seeing more details on all of this, please. -
On Wed, 7 Nov 2007 09:59:45 -0800

Well, if you look at the typical problem systems today, you will see that most of the pages being allocated and evicted are in the page cache, while most of the pages in memory are actually anonymous pages.

Not having to scan over that 80% of memory that contains anonymous pages and shared memory segments to get at the 20% page cache pages is much more than a factor two

Replacing page cache pages is easy. If they were referenced once (typical), we can just evict the page the first time we scan it.

Anonymous pages have a similar optimization: every anonymous page starts out referenced, so moving referenced pages back to the front of the active list is unneeded work. However, we cannot just place referenced anonymous pages onto an inactive list that is shared with page cache pages, because of the difference in replacement cost and relative importance

http://linux-mm.org/PageReplacementDesign
-
