[RFC PATCH 1/10] move isolate_lru_page to vmscan.c

From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:42 pm

The current page replacement scheme in Linux has a number of problems,
which can be boiled down to:
- Sometimes the kernel evicts the wrong pages, which can result in
  bad performance.
- The kernel scans over pages that should not be evicted.  On systems
  with a few GB of RAM, this can result in the VM using an annoying
  amount of CPU.  On systems with >128GB of RAM, this can knock the
  system out for hours since excess CPU use is compounded with lock
  contention and other issues.

This patch series tries to address the issues by splitting the LRU
lists into two sets, one for swap/ram backed pages ("anon") and
one for filesystem backed pages ("file").

The current version only has the infrastructure.  Large changes to
the page replacement policy will follow later.

More details can be found on this page:

	http://linux-mm.org/PageReplacementDesign

TODO:
- have any mlocked and ramfs pages live off of the LRU list,
  so we do not need to scan these pages
- switch to SEQ replacement for the anon LRU lists, so the
  worst case number of pages to scan is reduced greatly.
- figure out if the file LRU lists need page replacement
  changes to help with worst case scenarios
- implement and benchmark a scalable non-resident page
  tracking implementation in the radix tree, this may make
  the anon/file balancing algorithm more stable and could
  allow for further simplifications in the balancing algorithm

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-

From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:55 pm

Define page_file_cache() function to answer the question:
	is page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted
to make available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.  

Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU.  Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.
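
As a rough illustration of the flag-based test described above, here is a
minimal user-space sketch.  The one-field struct page, the bit position
chosen for PG_swapbacked and the example function are assumptions made for
this sketch only; they are not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct page: just a flags word. */
struct page {
	unsigned long flags;
};

/* Bit position is an assumption for this sketch, not the kernel's value. */
enum { PG_swapbacked = 0 };

static inline void SetPageSwapBacked(struct page *page)
{
	page->flags |= 1UL << PG_swapbacked;
}

static inline bool PageSwapBacked(const struct page *page)
{
	return (page->flags & (1UL << PG_swapbacked)) != 0;
}

/* Returns 0 for anon/tmpfs/swap backed pages, non-zero for file pages. */
static inline int page_file_cache(const struct page *page)
{
	if (PageSwapBacked(page))
		return 0;

	/* The page is page cache backed by a normal filesystem. */
	return 2;
}

/* Usage: the answer depends only on the flag, not on other page state. */
void page_file_cache_example(void)
{
	struct page file_page = { 0 };
	struct page shmem_page = { 0 };

	SetPageSwapBacked(&shmem_page);
	assert(page_file_cache(&file_page) != 0);
	assert(page_file_cache(&shmem_page) == 0);
}
```

Because the flag travels with the page, the test keeps working after the
mapping has been torn down, which is the property the description above
asks for.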


Signed-off-by:  Rik van Riel <riel@redhat.com>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>


Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -1,3 +1,23 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache(@page)
+ * Returns !0 if @page is page cache page backed by a regular file,
+ * or 0 if @page is anonymous, tmpfs or otherwise swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to propagate to wherever the page is last deleted from the LRU.
+ */
+static inline int page_file_cache(struct page *page)
+{
+	if (PageSwapBacked(page))
+		return 0;
+
+	/* The page is page cache backed by a normal filesystem. */
+	return 2;
+}
+
 static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
@@ -38,3 +58,4 @@ del_page_from_lru(struct zone *zone, str
 	}
 }
 
+#endif
Index: linux-2.6.23-mm1/mm/shmem.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/shmem.c
+++ linux-2.6.23-mm1/mm/shmem.c
@@ -1267,6 +1267,7 @@ repeat:
 				goto failed;
 			}
 
+			SetPageSwapBacked(filepage);
 			spin_lock(&info->lock);
 			entry = ...
From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:23 pm

Well, it's not clear what is meant by a file in the first place.
By file, do you mean disk space, in contrast to RAM-based filesystems?

I think we could add a flag to the bdi to indicate whether the backing 
store is a disk file. In fact you can also deduce if a device has

The bdi may avoid that extra flag.
-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 7:55 pm

On Tue, 6 Nov 2007 18:23:44 -0800 (PST)

Yes.  I have improved the comment over page_file_cache() a bit:

/**
 * page_file_cache(@page)
 * Returns !0 if @page is page cache page backed by a regular filesystem,
 * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
 *
 * We would like to get this info without a page flag, but the state
 * needs to survive until the page is last deleted from the LRU, which
 * could be as far down as __page_cache_release.

The bdi will no longer be accessible by the time a page
makes it to free_hot_cold_page, which is one place in the
kernel where this information is needed.

-

From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 8:02 pm

At that point you need only information about which list the page
was put on. Don't we need something like PageLRU -> PageFileLRU
and PageMemLRU?

The page may change its nature I think? What if a page becomes
swap backed?
-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 8:17 pm

On Tue, 6 Nov 2007 19:02:47 -0800 (PST)

That is exactly why we need a page flag.  If you have a better
name for the page flag, please let me know.

Note that the kind of page needs to be separate from PageLRU,
since pages are taken off of and put back onto LRUs all the

Every anonymous, tmpfs or shared memory segment page is potentially
swap backed. That is the whole point of the PG_swapbacked flag.

A page from a filesystem like ext3 or NFS cannot suddenly turn into
a swap backed page.  This page "nature" is not changed during the
lifetime of a page.

-

From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 8:26 pm

One of the current issues with anonymous pages is the accounting when 
they become file backed and get dirty. There are performance issues with 
swap writeout because we are not doing it in file order and on a page by 
page basis.


Well, COW sort of does that, but then it's a new page.

-

From: Rik van Riel
Date: Wednesday, November 7, 2007 - 7:35 am

On Tue, 6 Nov 2007 19:26:33 -0800 (PST)

What are you talking about?


That is one of the reasons everything that is ram/swap backed
goes onto a different set of LRU lists from everything that is

Since ramfs pages cannot be evicted from memory at all, they

Exactly.  As far as I know, a page never changes from a file
page into an anonymous page, or the other way around.

-

From: Christoph Lameter
Date: Wednesday, November 7, 2007 - 11:06 am

Which LRU do they go on?

-

From: Rik van Riel
Date: Wednesday, November 7, 2007 - 11:17 am

On Wed, 7 Nov 2007 10:06:10 -0800 (PST)


With the patch set from last weekend, the file LRU.

With the patch set later this week, they'll be in the 
"noreclaim" page set, which is never scanned by the VM.

-

From: Christoph Lameter
Date: Wednesday, November 7, 2007 - 11:18 am

If they are swap backed then they have a backing store on disk. They are 


That sounds better.

-

From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:55 pm

Debug whether we end up classifying the wrong pages as
filesystem backed.  This has not triggered in stress
tests on my system, but who knows...

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -1,6 +1,8 @@
 #ifndef LINUX_MM_INLINE_H
 #define LINUX_MM_INLINE_H
 
+#include <linux/fs.h>  /* for struct address_space */
+
 /**
  * page_file_cache(@page)
  * Returns !0 if @page is page cache page backed by a regular file,
@@ -9,11 +11,19 @@
  * We would like to get this info without a page flag, but the state
  * needs to propagate to wherever the page is last deleted from the LRU.
  */
+extern const struct address_space_operations shmem_aops;
 static inline int page_file_cache(struct page *page)
 {
+	struct address_space * mapping = page_mapping(page);
+
 	if (PageSwapBacked(page))
 		return 0;
 
+	/* These pages should all be marked PG_swapbacked */
+	WARN_ON(PageAnon(page));
+	WARN_ON(PageSwapCache(page));
+	WARN_ON(mapping && mapping->a_ops && mapping->a_ops == &shmem_aops);
+
 	/* The page is page cache backed by a normal filesystem. */
 	return 2;
 }
Index: linux-2.6.23-mm1/mm/shmem.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/shmem.c
+++ linux-2.6.23-mm1/mm/shmem.c
@@ -180,7 +180,7 @@ static inline void shmem_unacct_blocks(u
 }
 
 static const struct super_operations shmem_ops;
-static const struct address_space_operations shmem_aops;
+const struct address_space_operations shmem_aops;
 static const struct file_operations shmem_file_operations;
 static const struct inode_operations shmem_inode_operations;
 static const struct inode_operations shmem_dir_inode_operations;
@@ -2344,7 +2344,7 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
-static ...
From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:56 pm

Use an indexed array for LRU variables.  This makes the rest
of the split VM code a lot cleaner.

V1 -> V2 [lts]:
+ Remove extraneous  __dec_zone_state(zone, NR_ACTIVE) pointed
  out by Mel G.

From clameter@sgi.com Wed Aug 29 11:39:51 2007

Currently we are defining explicit variables for the inactive
and active list. An indexed array can be more generic and avoid
repeating similar code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on
the reclaim code.
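
A reduced sketch of the indexed-array idea follows.  The index names follow
the LRU_[IN]ACTIVE comment in the mmzone.h hunk; NR_LRU_LISTS, struct
zone_model and the lru_* helpers are hypothetical names for this
illustration, not the patch's actual definitions.

```c
#include <assert.h>

/* One index per LRU list; adding a list means adding one enumerator. */
enum lru_list {
	LRU_INACTIVE,
	LRU_ACTIVE,
	NR_LRU_LISTS
};

/* Hypothetical, heavily reduced zone: one page counter per LRU list. */
struct zone_model {
	unsigned long nr_pages[NR_LRU_LISTS];
};

void lru_add(struct zone_model *zone, enum lru_list l)
{
	zone->nr_pages[l]++;
}

void lru_del(struct zone_model *zone, enum lru_list l)
{
	zone->nr_pages[l]--;
}

/* One generic helper replaces separate active/inactive code paths. */
void lru_move(struct zone_model *zone, enum lru_list from, enum lru_list to)
{
	lru_del(zone, from);
	lru_add(zone, to);
}

/* Usage: activate a page by moving it between indexed lists. */
void lru_array_example(void)
{
	struct zone_model zone = { { 0 } };

	lru_add(&zone, LRU_INACTIVE);
	lru_move(&zone, LRU_INACTIVE, LRU_ACTIVE);
	assert(zone.nr_pages[LRU_INACTIVE] == 0);
	assert(zone.nr_pages[LRU_ACTIVE] == 1);
}
```

With the list chosen by index, code that used to be duplicated for the
active and inactive lists can be written once and parameterized.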

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

 include/linux/mm_inline.h |   34 ++++++++---
 include/linux/mmzone.h    |   17 +++--
 mm/page_alloc.c           |    9 +--
 mm/swap.c                 |    2 
 mm/vmscan.c               |  132 ++++++++++++++++++++++------------------------
 mm/vmstat.c               |    3 -
 6 files changed, 107 insertions(+), 90 deletions(-)

Index: linux-2.6.23-rc8-mm2-vm/include/linux/mmzone.h
===================================================================
--- linux-2.6.23-rc8-mm2-vm.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc8-mm2-vm/include/linux/mmzone.h
@@ -82,8 +82,8 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,
-	NR_ACTIVE,
+	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE,	/*  "     "     "   "       "         */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -107,6 +107,13 @@ enum zone_stat_item {
 #endif
 ...
From: Rik van Riel
Date: Saturday, November 3, 2007 - 4:06 pm

Swapin_readahead can read in a lot of data that the processes in
memory never need.  Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.

This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.

In short, this patch needs testing.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/swap_state.c
+++ linux-2.6.23-mm1/mm/swap_state.c
@@ -370,7 +370,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active_anon(new_page);
+			lru_cache_add_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
-

From: Rik van Riel
Date: Saturday, November 3, 2007 - 4:02 pm

Make the LRU arithmetic more explicit.  Hopefully this will make
the code a little easier to read and less prone to future errors.
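
The offset scheme can be sketched as below.  The numeric values are an
assumption inferred from the earlier "return 2" in page_file_cache(); the
lru_index() helper is hypothetical and only mirrors the arithmetic used in
the hunks that follow.

```c
#include <assert.h>

/* Assumed values: a base index plus two independent offsets. */
#define LRU_BASE	0
#define LRU_ACTIVE	1
#define LRU_FILE	2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	NR_LRU_LISTS
};

/* Hypothetical helper: build a list index from two yes/no properties. */
int lru_index(int active, int file)
{
	int l = LRU_BASE;

	if (active)
		l += LRU_ACTIVE;
	if (file)
		l += LRU_FILE;
	return l;
}

/* Usage: each combination of properties lands on exactly one list. */
void lru_index_example(void)
{
	assert(lru_index(0, 0) == LRU_INACTIVE_ANON);
	assert(lru_index(1, 0) == LRU_ACTIVE_ANON);
	assert(lru_index(0, 1) == LRU_INACTIVE_FILE);
	assert(lru_index(1, 1) == LRU_ACTIVE_FILE);
}
```

Named offsets make the intent visible at the call site, where the older
form needed the reader to work out a difference of two enum values.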

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.23-mm1/include/linux/mm_inline.h
@@ -28,7 +28,7 @@ static inline int page_file_cache(struct
 		return 0;
 
 	/* The page is page cache backed by a normal filesystem. */
-	return (LRU_INACTIVE_FILE - LRU_INACTIVE_ANON);
+	return LRU_FILE;
 }
 
 static inline void
Index: linux-2.6.23-mm1/mm/swap.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/swap.c
+++ linux-2.6.23-mm1/mm/swap.c
@@ -180,12 +180,12 @@ void fastcall activate_page(struct page 
 
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
-		int l = LRU_INACTIVE_ANON;
+		int l = LRU_BASE;
 		l += page_file_cache(page);
 		del_page_from_lru_list(zone, page, l);
 
 		SetPageActive(page);
-		l += LRU_ACTIVE_ANON - LRU_INACTIVE_ANON;
+		l += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, l);
 		__count_vm_event(PGACTIVATE);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
Index: linux-2.6.23-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/vmscan.c
+++ linux-2.6.23-mm1/mm/vmscan.c
@@ -786,11 +786,11 @@ static unsigned long isolate_pages_globa
 					struct mem_cgroup *mem_cont,
 					int active, int file)
 {
-	int l = LRU_INACTIVE_ANON;
+	int l = LRU_BASE;
 	if (active)
-		l += LRU_ACTIVE_ANON - LRU_INACTIVE_ANON;
+		l += LRU_ACTIVE;
 	if (file)
-		l += LRU_INACTIVE_FILE - LRU_INACTIVE_ANON;
+		l += LRU_FILE;
 	return isolate_lru_pages(nr, &z->list[l], dst, scanned, order,
 								mode, !!file);
 }
@@ -842,7 +842,7 @@ int isolate_lru_page(struct page *page)
 
 		spin_lock_irq(&zone->lru_lock);
 ...
From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:54 pm

move isolate_lru_page() to vmscan.c

Against 2.6.23-rc4-mm1

V1 -> V2 [lts]:
+  fix botched merge -- add back "get_page_unless_zero()"

From: Nick Piggin <npiggin@suse.de>
To: Linux Memory Management <linux-mm@kvack.org>
Subject: [patch 1/4] mm: move and rework isolate_lru_page
Date:	Mon, 12 Mar 2007 07:38:44 +0100 (CET)

isolate_lru_page logically belongs in vmscan.c rather than migrate.c.

It is a tough call, because we don't need that function without memory
migration, so there is a valid argument to have it in migrate.c. However,
a subsequent patch needs to make use of it in the core mm, so we can
happily move it to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.

	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

 include/linux/migrate.h |    3 ---
 mm/internal.h           |    2 ++
 mm/mempolicy.c          |   10 ++++++++--
 mm/migrate.c            |   47 ++++++++++-------------------------------------
 mm/vmscan.c             |   41 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 61 insertions(+), 42 deletions(-)

Index: Linux/include/linux/migrate.h
===================================================================
--- Linux.orig/include/linux/migrate.h	2007-07-08 19:32:17.000000000 -0400
+++ Linux/include/linux/migrate.h	2007-09-20 10:21:52.000000000 -0400
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -42,8 +41,6 @@ ...
From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:13 pm

Reviewed-by: Christoph Lameter <clameter@sgi.com>


-

From: Rik van Riel
Date: Saturday, November 3, 2007 - 4:03 pm

Make lumpy reclaim and the split VM code work together better, by
allowing both file and anonymous pages to be reclaimed together.

Will be merged into patch 6/10 soon, split out for the benefit of
people who have looked at the older code in the past.
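
Once a batch of isolated pages can mix file and anonymous pages, the
per-list statistics have to be tallied page by page.  The sketch below
models that counting in user space.  struct page_model, the array-based
list and the numeric LRU offsets are assumptions for illustration; only
the shape of the loop follows the clear_active_flags() hunk in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed offsets, matching the anon/file x inactive/active layout. */
#define LRU_ACTIVE	1
#define LRU_FILE	2
#define NR_LRU_LISTS	4

/* Hypothetical page: just the two properties that select an LRU list. */
struct page_model {
	int active;
	int swap_backed;
};

/*
 * Clear the active flag on every page and count each page under the
 * list it was taken from.  Returns the number of pages that were active.
 */
unsigned long clear_active_flags(struct page_model *pages, size_t n,
				 unsigned int *count)
{
	unsigned long nr_active = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int lru = pages[i].swap_backed ? 0 : LRU_FILE;

		if (pages[i].active) {
			lru += LRU_ACTIVE;
			pages[i].active = 0;
			nr_active++;
		}
		count[lru]++;
	}
	return nr_active;
}

/* Usage: a mixed batch is split back into per-list counts. */
void clear_active_flags_example(void)
{
	struct page_model batch[] = { { 1, 1 }, { 0, 0 }, { 1, 0 } };
	unsigned int count[NR_LRU_LISTS] = { 0 };

	assert(clear_active_flags(batch, 3, count) == 2);
	assert(count[LRU_ACTIVE] == 1);			/* active anon */
	assert(count[LRU_FILE] == 1);			/* inactive file */
	assert(count[LRU_FILE + LRU_ACTIVE] == 1);	/* active file */
}
```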

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/vmscan.c
+++ linux-2.6.23-mm1/mm/vmscan.c
@@ -752,10 +752,6 @@ static unsigned long isolate_lru_pages(u
 
 			cursor_page = pfn_to_page(pfn);
 
-			/* Don't lump pages of different types:  file vs anon */
-			if (!PageLRU(page) || (file != !!page_file_cache(cursor_page)))
-				break;
-
 			/* Check that we have not crossed a zone boundary. */
 			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				continue;
@@ -799,16 +795,22 @@ static unsigned long isolate_pages_globa
  * clear_active_flags() is a helper for shrink_active_list(), clearing
  * any active bits from the pages in the list.
  */
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+					unsigned int *count)
 {
 	int nr_active = 0;
+	int lru;
 	struct page *page;
 
-	list_for_each_entry(page, page_list, lru)
+	list_for_each_entry(page, page_list, lru) {
+		lru = page_file_cache(page);
 		if (PageActive(page)) {
+			lru += LRU_ACTIVE;
 			ClearPageActive(page);
 			nr_active++;
 		}
+		count[lru]++;
+	}
 
 	return nr_active;
 }
@@ -876,24 +878,25 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_scan;
 		unsigned long nr_freed;
 		unsigned long nr_active;
+		unsigned int count[NR_LRU_LISTS] = { 0, };
+		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
-			     &page_list, &nr_scan, sc->order,
-			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
-					     ISOLATE_BOTH : ...
From: Rik van Riel
Date: Saturday, November 3, 2007 - 3:54 pm

Rik van Riel's patch to free swap space on swap-in/activation,
forward ported by Lee Schermerhorn.

Against:  2.6.23-rc2-mm2 atop:
+ lts' convert anon_vma list lock to reader/write lock patch
+ Nick Piggin's move and rework isolate_lru_page() patch

Patch Description:  quick attempt by lts

Free swap cache entries when swapping in pages if vm_swap_full()
[swap space > 1/2 used?].  Uses new pagevec to reduce pressure
on locks.
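
The "more than half used" condition can be sketched as a plain function.
This is a user-space model of the bracketed note above; the function name
and its parameters are hypothetical, and the kernel's vm_swap_full() may
be defined differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True once more than half of the configured swap space is in use,
 * i.e. when fewer than half of the swap pages are still free.
 * Assumes free_swap_pages <= total_swap_pages.
 */
bool swap_more_than_half_used(unsigned long free_swap_pages,
			      unsigned long total_swap_pages)
{
	return free_swap_pages * 2 < total_swap_pages;
}

/* Usage: exactly half used does not yet count as "full". */
void swap_full_example(void)
{
	assert(!swap_more_than_half_used(100, 100));	/* nothing used */
	assert(!swap_more_than_half_used(50, 100));	/* exactly half */
	assert(swap_more_than_half_used(49, 100));	/* just over half */
}
```

Gating the swap cache cleanup on this test means swap entries are kept
around, and can be reused without another write, while swap is plentiful.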

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

 include/linux/pagevec.h |    1 +
 mm/swap.c               |   18 ++++++++++++++++++
 mm/vmscan.c             |   16 +++++++++++-----
 3 files changed, 30 insertions(+), 5 deletions(-)

Index: linux-2.6.23-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.23-rc6-mm1.orig/mm/vmscan.c	2007-09-25 15:20:05.000000000 -0400
+++ linux-2.6.23-rc6-mm1/mm/vmscan.c	2007-09-25 15:25:04.000000000 -0400
@@ -613,6 +613,9 @@ free_it:
 		continue;
 
 activate_locked:
+		/* Not a candidate for swapping, so reclaim swap space. */
+		if (PageSwapCache(page) && vm_swap_full())
+			remove_exclusive_swap_page(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1142,14 +1145,13 @@ force_reclaim_mapped:
 		}
 	}
 	__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+	spin_unlock_irq(&zone->lru_lock);
 	pgdeactivate += pgmoved;
-	if (buffer_heads_over_limit) {
-		spin_unlock_irq(&zone->lru_lock);
-		pagevec_strip(&pvec);
-		spin_lock_irq(&zone->lru_lock);
-	}
 
+	if (buffer_heads_over_limit)
+		pagevec_strip(&pvec);
 	pgmoved = 0;
+	spin_lock_irq(&zone->lru_lock);
 	while (!list_empty(&l_active)) {
 		page = lru_to_page(&l_active);
 		prefetchw_prev_lru_page(page, &l_active, flags);
@@ -1163,6 +1165,8 @@ force_reclaim_mapped:
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
+			if ...
From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:20 pm

Why are we dropping the lock here now? There would be less activity

Same here. Maybe the spin_unlock and the spin_lock can go into
pagevec_swap_free?
-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 7:48 pm

[Empty message]
From: Rik van Riel
Date: Saturday, November 3, 2007 - 4:04 pm

The memory controller code is still quite simple, so don't do
anything fancy for now trying to make it work better with the
split VM code.

Will be merged into 6/10 soon.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/memcontrol.c
+++ linux-2.6.23-mm1/mm/memcontrol.c
@@ -210,7 +210,6 @@ unsigned long mem_cgroup_isolate_pages(u
 	struct list_head *src;
 	struct page_cgroup *pc;
 
-//TODO:  memory container maintain separate file/anon lists?
 	if (active)
 		src = &mem_cont->active_list;
 	else
@@ -222,6 +221,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		page = pc->page;
 		VM_BUG_ON(!pc);
 
+		/*
+		 * TODO: play better with lumpy reclaim, grabbing anything.
+		 */
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
 			scan--;
@@ -240,6 +242,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (page_zone(page) != z)
 			continue;
 
+		if (file != !!page_file_cache(page))
+			continue;
+
 		/*
 		 * Check if the meta page went away from under us
 		 */
-

From: Rik van Riel
Date: Saturday, November 3, 2007 - 4:01 pm

Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon").  The latter includes tmpfs.

Eventually mlocked pages will be taken off the LRUs altogether.
A patch for that already exists and just needs to be integrated
into this series.

This patch mostly has the infrastructure and a basic policy to
balance how much we scan the anon lists and how much we scan
the file lists. Fancy policy changes will be in separate patches.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.23-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.23-mm1.orig/fs/proc/proc_misc.c
+++ linux-2.6.23-mm1/fs/proc/proc_misc.c
@@ -149,43 +149,47 @@ static int meminfo_read_proc(char *page,
 	 * Tagged format, for easy grepping and expansion.
 	 */
 	len = sprintf(page,
-		"MemTotal:     %8lu kB\n"
-		"MemFree:      %8lu kB\n"
-		"Buffers:      %8lu kB\n"
-		"Cached:       %8lu kB\n"
-		"SwapCached:   %8lu kB\n"
-		"Active:       %8lu kB\n"
-		"Inactive:     %8lu kB\n"
+		"MemTotal:       %8lu kB\n"
+		"MemFree:        %8lu kB\n"
+		"Buffers:        %8lu kB\n"
+		"Cached:         %8lu kB\n"
+		"SwapCached:     %8lu kB\n"
+		"Active(anon):   %8lu kB\n"
+		"Inactive(anon): %8lu kB\n"
+		"Active(file):   %8lu kB\n"
+		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		"HighTotal:    %8lu kB\n"
-		"HighFree:     %8lu kB\n"
-		"LowTotal:     %8lu kB\n"
-		"LowFree:      %8lu kB\n"
-#endif
-		"SwapTotal:    %8lu kB\n"
-		"SwapFree:     %8lu kB\n"
-		"Dirty:        %8lu kB\n"
-		"Writeback:    %8lu kB\n"
-		"AnonPages:    %8lu kB\n"
-		"Mapped:       %8lu kB\n"
-		"Slab:         %8lu kB\n"
-		"SReclaimable: %8lu kB\n"
-		"SUnreclaim:   %8lu kB\n"
-		"PageTables:   %8lu kB\n"
-		"NFS_Unstable: %8lu kB\n"
-		"Bounce:       %8lu kB\n"
-		"CommitLimit:  %8lu ...
From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:28 pm

If we split the memory backed from the disk backed pages then
they are no longer competing with one another on equal terms? So the file LRU 
may run faster than the memory LRU?

The patch looks awfully large.
-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 8:00 pm

On Tue, 6 Nov 2007 18:28:19 -0800 (PST)

The file LRU probably *should* run faster than the memory LRU most
of the time, since we stream the readahead data for many sequentially
accessed files through the file LRU.

We adjust the rates at which the two LRUs are scanned depending on
the fraction of referenced pages found when scanning each list.
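
To make the idea concrete, here is one hypothetical way to turn the
referenced fractions into a scan split.  It is not the balancing code from
this series: struct lru_scan_stats, both function names and the formula
itself are assumptions chosen only to show the direction of the feedback
(a list with fewer referenced pages gets more of the scanning).

```c
#include <assert.h>

/* Hypothetical per-list feedback gathered during recent scans. */
struct lru_scan_stats {
	unsigned long scanned;
	unsigned long referenced;
};

/* Per-mille share of scanned pages that were found unreferenced. */
static unsigned long unreferenced_permille(const struct lru_scan_stats *s)
{
	unsigned long referenced = s->referenced;

	if (s->scanned == 0)
		return 1000;	/* no data yet: treat the list as scannable */
	if (referenced > s->scanned)
		referenced = s->scanned;
	return (s->scanned - referenced) * 1000 / s->scanned;
}

/* Percentage (0..100) of the scan effort to spend on the file lists. */
unsigned int file_scan_percent(const struct lru_scan_stats *anon,
			       const struct lru_scan_stats *file)
{
	unsigned long a = unreferenced_permille(anon);
	unsigned long f = unreferenced_permille(file);

	if (a + f == 0)
		return 50;	/* everything referenced: split evenly */
	return (unsigned int)(f * 100 / (a + f));
}

/* Usage: mostly-referenced anon pages push scanning toward file pages. */
void file_scan_percent_example(void)
{
	struct lru_scan_stats anon = { 100, 90 };
	struct lru_scan_stats file = { 100, 10 };

	assert(file_scan_percent(&anon, &file) == 90);
}
```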

Making it smaller would probably result in something that does
not work right.

-

From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:11 pm

Hmmmm.. I'd rather see where we are going. One other way of addressing 
many of these issues is to allow large page sizes on the LRU which will
reduce the number of entities that have to be managed. Both approaches 



We do not have an accepted standard load. So how would we figure that one 
out?

-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 7:23 pm

On Tue, 6 Nov 2007 18:11:39 -0800 (PST)


Linus seems to have vetoed that (unless I am mistaken), so the
chances of that happening soon are probably not very large.

Also, a factor 16 increase in page size is not going to help
if memory sizes also increase by a factor 16, since we already 
 

For some workloads this is the most urgent change, indeed.
Since the patches for this already exist, integrating them
is at the top of my list.  Expect this to be integrated into


The current worst case is where we need to scan all of memory, 
just to find a few pages we can swap out.  With the effects of
lock contention figured in, this can take hours on huge systems.

In order to make the VM more scalable, we need to find acceptable
pages to swap out with low complexity in the VM.  The "worst case"
above refers to the upper bound on how much work the VM needs to
do in order to get something evicted from the page cache or swapped
out.

-

From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 7:40 pm

Note that a factor 16 increase usually goes hand in hand with
more processors. The synchronization of multiple processors becomes a 
concern. If you have an 8p and each of them tries to get the zone locks 
for reclaim then we are already in trouble. And given the immaturity
of the handling of cacheline contention in current commodity hardware this 


A bit sparse but limiting the scanning if we cannot do much is certainly 
the right thing to do. The percentage of memory taken up by anonymous 
pages varies depending on the load. HPC applications may consume all of 
memory with anonymous pages. But there the pain is already so bad that 

Right, but I think this looks like a hopeless situation regardless of the 
algorithm if you have a couple of million pages and are trying to free 
one. Now imagine a series of processors going on the hunt for the few pages 
that can be reclaimed.
-

From: Rik van Riel
Date: Tuesday, November 6, 2007 - 7:51 pm

On Tue, 6 Nov 2007 18:40:46 -0800 (PST)

Which is why we need to greatly reduce the number of pages

An algorithm that only clears the referenced bit and then
moves the anonymous page from the active to the inactive
list will do a lot less work than an algorithm that needs
to scan the *whole* active list because all of the pages
on it are referenced.

This is not a theoretical situation: every anonymous page
starts out referenced!

Add in a relatively small inactive list on huge memory
systems, and we could have something of an acceptable
algorithmic complexity.

-

From: Andrew Morton
Date: Wednesday, November 7, 2007 - 10:59 am

It strikes me that splitting one list into two lists will not provide
sufficient improvement in search efficiency to do that.  I mean, a naive
guess would be that it will, on average, halve the amount of work which
needs to be done.

But we need multiple-orders-of-magnitude improvements to address the
pathological worst-cases which you're looking at there.  Where is this
coming from?

Or is the problem which you're seeing due to scanning of mapped pages
at low "distress" levels?

Would be interested in seeing more details on all of this, please.
-

From: Rik van Riel
Date: Wednesday, November 7, 2007 - 11:16 am

On Wed, 7 Nov 2007 09:59:45 -0800

Well, if you look at the typical problem systems today, you
will see that most of the pages being allocated and evicted
are in the page cache, while most of the pages in memory are
actually anonymous pages.

Not having to scan over that 80% of memory that contains
anonymous pages and shared memory segments to get at the
20% page cache pages is much more than a factor two

Replacing page cache pages is easy.  If they were referenced
once (typical), we can just evict the page the first time we
scan it.

Anonymous pages have a similar optimization: every anonymous
page starts out referenced, so moving referenced pages back
to the front of the active list is unneeded work.

However, we cannot just place referenced anonymous pages onto
an inactive list that is shared with page cache pages, because
of the difference in replacement cost and relative importance

http://linux-mm.org/PageReplacementDesign

-
