drop_caches: Allow unmapping pages

Message ID 20190107130239.3417-1-vincent.whitchurch@axis.com
State New, archived
Series drop_caches: Allow unmapping pages

Commit Message

Vincent Whitchurch Jan. 7, 2019, 1:02 p.m. UTC
drop_caches does not drop pages which are currently mapped.  Add an
option to try to unmap and drop even these pages.  This provides a
simple way to obtain a rough estimate of how many file pages are used in
a particular use case: drop everything and check how much gets read
back.

 # cat /proc/meminfo | grep file
 Active(file):      16608 kB
 Inactive(file):    23424 kB
 # echo 3 > /proc/sys/vm/drop_caches && cat /proc/meminfo | grep file
 Active(file):      10624 kB
 Inactive(file):    15060 kB
 # echo 11 > /proc/sys/vm/drop_caches && cat /proc/meminfo | grep file
 Active(file):        240 kB
 Inactive(file):     2344 kB

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
---
 Documentation/sysctl/vm.txt |  4 ++++
 fs/drop_caches.c            |  3 ++-
 include/linux/fs.h          | 10 ++++++++--
 kernel/sysctl.c             |  4 ++--
 mm/truncate.c               | 39 ++++++++++++++++++++++++-------------
 5 files changed, 41 insertions(+), 19 deletions(-)

Comments

Matthew Wilcox (Oracle) Jan. 7, 2019, 2:15 p.m. UTC | #1
On Mon, Jan 07, 2019 at 02:02:39PM +0100, Vincent Whitchurch wrote:
> +++ b/Documentation/sysctl/vm.txt
> @@ -222,6 +222,10 @@ To increase the number of objects freed by this operation, the user may run
>  number of dirty objects on the system and create more candidates to be
>  dropped.
>  
> +By default, pages which are currently mapped are not dropped from the
> +pagecache.  If you want to unmap and drop these pages too, echo 9 or 11 instead
> +of 1 or 3 respectively (set bit 4).

Typically we number bits from 0, so this would be bit 3, not 4.  I do see
elsewhere in this file somebody else got this wrong:

: with your system.  To disable them, echo 4 (bit 3) into drop_caches.

but that should also be fixed.
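
For reference, the values involved, with illustrative macro names (the
kernel itself uses bare integers here):

	#define DROP_PAGECACHE	(1 << 0)	/* echo 1 */
	#define DROP_SLAB	(1 << 1)	/* echo 2 */
	#define DROP_UNMAP	(1 << 3)	/* echo 8: bit 3, value 8 */

so "echo 11" sets bits 0, 1 and 3: 1 + 2 + 8 = 11.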

> +static int __invalidate_inode_page(struct page *page, bool unmap)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +	if (!mapping)
> +		return 0;
> +	if (PageDirty(page) || PageWriteback(page))
> +		return 0;
> +	if (page_mapped(page)) {
> +		if (!unmap)
> +			return 0;
> +		if (!try_to_unmap(page, TTU_IGNORE_ACCESS))
> +			return 0;

You're going to get data corruption doing this.  try_to_unmap_one() does:

                /* Move the dirty bit to the page. Now the pte is gone. */
                if (pte_dirty(pteval))
                        set_page_dirty(page);

so PageDirty() can be false above, but made true by calling try_to_unmap().
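
Spelling out the interleaving against the hunk above (annotations mine):

	if (PageDirty(page) || PageWriteback(page))
		return 0;	/* not taken: only the PTE is dirty */
	if (page_mapped(page)) {
		if (!unmap)
			return 0;
		if (!try_to_unmap(page, TTU_IGNORE_ACCESS))
			return 0;
		/* on success, pte_dirty() has just become PageDirty() */
	}
	/* ... and now we try to drop a dirty page: */
	return invalidate_complete_page(mapping, page);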

I also think the way you've done this is expedient at the cost of
efficiency and layering violations.  I think you should first tear
down the mappings of userspace processes (which will reclaim a lot
of pages allocated to page tables), then you won't need to touch the
invalidate_inode_pages paths at all.
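
One way to do that (a sketch, not something in this patch) would be to
zap the userspace PTEs per inode with unmap_mapping_range() and leave
the invalidate path alone:

	static void drop_pagecache_sb(struct super_block *sb, void *unused)
	{
		/* ... per-inode loop as in drop_pagecache_sb() today ... */
		if (unmap)
			/* holelen == 0 means "to the end of the file" */
			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
		/* ... */
	}

Note this only zaps the PTEs; it does not free the page-table pages
themselves.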
Dave Hansen Jan. 7, 2019, 7:25 p.m. UTC | #2
On 1/7/19 6:15 AM, Matthew Wilcox wrote:
> You're going to get data corruption doing this.  try_to_unmap_one()
> does:
> 
>          /* Move the dirty bit to the page. Now the pte is gone. */
>          if (pte_dirty(pteval))
>                  set_page_dirty(page);
> 
> so PageDirty() can be false above, but made true by calling
> try_to_unmap().

I don't think that PageDirty() check is _required_ for correctness.  You
can always safely try_to_unmap() no matter the state of the PTE.  We
can't lock out the hardware from setting the Dirty bit, ever.

It's also just fine to unmap PageDirty() pages, as long as, when the PTE
is destroyed, we move the dirty bit from the PTE into the 'struct page'
(which try_to_unmap() does, as you noticed).

> I also think the way you've done this is expedient at the cost of
> efficiency and layering violations.  I think you should first tear
> down the mappings of userspace processes (which will reclaim a lot
> of pages allocated to page tables), then you won't need to touch the
> invalidate_inode_pages paths at all.

By "tear down the mappings", do you mean something analogous to munmap()
where the VMA goes away?  Or madvise(MADV_DONTNEED) where the PTE is
destroyed but the VMA remains?

Last time I checked, we only did free_pgtables() when tearing down VMAs,
but not for pure unmappings like reclaim or MADV_DONTNEED.  I've thought
it might be fun to make a shrinker that scanned page tables looking for
zero'd pages, but I've never run into a system where empty page table
pages were actually causing a noticeable problem.
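
The skeleton such a shrinker would take, for reference (the page-table
walk is the hard part and is left out here):

	static unsigned long pt_count(struct shrinker *s,
				      struct shrink_control *sc)
	{
		return 0;	/* nothing cheap to count without scanning */
	}

	static unsigned long pt_scan(struct shrinker *s,
				     struct shrink_control *sc)
	{
		/* walk page tables, free PTE pages that map nothing */
		return SHRINK_STOP;
	}

	static struct shrinker pt_shrinker = {
		.count_objects	= pt_count,
		.scan_objects	= pt_scan,
		.seeks		= DEFAULT_SEEKS,
	};

It would be hooked up with register_shrinker(&pt_shrinker).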
Matthew Wilcox (Oracle) Jan. 7, 2019, 7:37 p.m. UTC | #3
On Mon, Jan 07, 2019 at 11:25:16AM -0800, Dave Hansen wrote:
> On 1/7/19 6:15 AM, Matthew Wilcox wrote:
> > You're going to get data corruption doing this.  try_to_unmap_one()
> > does:
> > 
> >          /* Move the dirty bit to the page. Now the pte is gone. */
> >          if (pte_dirty(pteval))
> >                  set_page_dirty(page);
> > 
> > so PageDirty() can be false above, but made true by calling
> > try_to_unmap().
> 
> I don't think that PageDirty() check is _required_ for correctness.  You
> can always safely try_to_unmap() no matter the state of the PTE.  We
> can't lock out the hardware from setting the Dirty bit, ever.
> 
> It's also just fine to unmap PageDirty() pages, as long as, when the PTE
> is destroyed, we move the dirty bit from the PTE into the 'struct page'
> (which try_to_unmap() does, as you noticed).

Right, but the very next thing the patch does is call
invalidate_complete_page(), which calls __remove_mapping() which ... oh,
re-checks PageDirty() and refuses to drop the page.  So this isn't the
data corruptor that I had thought it was.
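
The saving check in __remove_mapping() (mm/vmscan.c) looks roughly like:

	if (!page_ref_freeze(page, refcount))
		goto cannot_free;
	/* the PageDirty() that try_to_unmap() set is seen here */
	if (unlikely(PageDirty(page))) {
		page_ref_unfreeze(page, refcount);
		goto cannot_free;
	}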

> > I also think the way you've done this is expedient at the cost of
> > efficiency and layering violations.  I think you should first tear
> > down the mappings of userspace processes (which will reclaim a lot
> > of pages allocated to page tables), then you won't need to touch the
> > invalidate_inode_pages paths at all.
> 
> By "tear down the mappings", do you mean something analogous to munmap()
> where the VMA goes away?  Or madvise(MADV_DONTNEED) where the PTE is
> destroyed but the VMA remains?
> 
> Last time I checked, we only did free_pgtables() when tearing down VMAs,
> but not for pure unmappings like reclaim or MADV_DONTNEED.  I've thought
> it might be fun to make a shrinker that scanned page tables looking for
> zero'd pages, but I've never run into a system where empty page table
> pages were actually causing a noticeable problem.

A few hours ago, when I thought this patch had the ordering of checking
PageDirty() the wrong way round, I had the madvise analogy in mind: the
PTEs would get destroyed and the dirty information transferred to the
struct page before we tried to drop the pages.

Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..6ea06c2c973b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -222,6 +222,10 @@ To increase the number of objects freed by this operation, the user may run
 number of dirty objects on the system and create more candidates to be
 dropped.
 
+By default, pages which are currently mapped are not dropped from the
+pagecache.  If you want to unmap and drop these pages too, echo 9 or 11 instead
+of 1 or 3 respectively (set bit 4).
+
 This file is not a means to control the growth of the various kernel caches
 (inodes, dentries, pagecache, etc...)  These objects are automatically
 reclaimed by the kernel when memory is needed elsewhere on the system.
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 82377017130f..9faaa1e3a672 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,6 +17,7 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	bool unmap = sysctl_drop_caches & 8;
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
@@ -30,7 +31,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inode_list_lock);
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
+		__invalidate_mapping_pages(inode->i_mapping, 0, -1, unmap);
 		iput(toput_inode);
 		toput_inode = inode;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..503e176654ce 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2675,8 +2675,14 @@ extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
-					pgoff_t start, pgoff_t end);
+unsigned long __invalidate_mapping_pages(struct address_space *mapping,
+					pgoff_t start, pgoff_t end, bool unmap);
+
+static inline unsigned long invalidate_mapping_pages(struct address_space *mapping,
+						     pgoff_t start, pgoff_t end)
+{
+	return __invalidate_mapping_pages(mapping, start, end, false);
+}
 
 static inline void invalidate_remote_inode(struct inode *inode)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e85feb8..f12c2a8d84fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -125,7 +125,7 @@ static int __maybe_unused neg_one = -1;
 static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
-static int __maybe_unused four = 4;
+static int __maybe_unused fifteen = 15;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
 static int one_thousand = 1000;
@@ -1431,7 +1431,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 		.extra1		= &one,
-		.extra2		= &four,
+		.extra2		= &fifteen,
 	},
 #ifdef CONFIG_COMPACTION
 	{
diff --git a/mm/truncate.c b/mm/truncate.c
index 798e7ccfb030..613b02e02146 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -245,6 +245,22 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page)
 }
 EXPORT_SYMBOL(generic_error_remove_page);
 
+static int __invalidate_inode_page(struct page *page, bool unmap)
+{
+	struct address_space *mapping = page_mapping(page);
+	if (!mapping)
+		return 0;
+	if (PageDirty(page) || PageWriteback(page))
+		return 0;
+	if (page_mapped(page)) {
+		if (!unmap)
+			return 0;
+		if (!try_to_unmap(page, TTU_IGNORE_ACCESS))
+			return 0;
+	}
+	return invalidate_complete_page(mapping, page);
+}
+
 /*
  * Safely invalidate one page from its pagecache mapping.
  * It only drops clean, unused pages. The page must be locked.
@@ -253,16 +269,10 @@ EXPORT_SYMBOL(generic_error_remove_page);
  */
 int invalidate_inode_page(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
-	if (!mapping)
-		return 0;
-	if (PageDirty(page) || PageWriteback(page))
-		return 0;
-	if (page_mapped(page))
-		return 0;
-	return invalidate_complete_page(mapping, page);
+	return __invalidate_inode_page(page, false);
 }
 
+
 /**
  * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -532,16 +542,17 @@ EXPORT_SYMBOL(truncate_inode_pages_final);
  * @mapping: the address_space which holds the pages to invalidate
  * @start: the offset 'from' which to invalidate
  * @end: the offset 'to' which to invalidate (inclusive)
+ * @unmap: try to unmap pages
  *
  * This function only removes the unlocked pages, if you want to
  * remove all the pages of one inode, you must call truncate_inode_pages.
  *
  * invalidate_mapping_pages() will not block on IO activity. It will not
- * invalidate pages which are dirty, locked, under writeback or mapped into
- * pagetables.
+ * invalidate pages which are dirty, locked, under writeback or, if unmap is
+ * false, mapped into pagetables.
  */
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
-		pgoff_t start, pgoff_t end)
+unsigned long __invalidate_mapping_pages(struct address_space *mapping,
+		pgoff_t start, pgoff_t end, bool unmap)
 {
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -591,7 +602,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				}
 			}
 
-			ret = invalidate_inode_page(page);
+			ret = __invalidate_inode_page(page, unmap);
 			unlock_page(page);
 			/*
 			 * Invalidation is a hint that the page is no longer
@@ -608,7 +619,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	}
 	return count;
 }
-EXPORT_SYMBOL(invalidate_mapping_pages);
+EXPORT_SYMBOL(__invalidate_mapping_pages);
 
 /*
  * This is like invalidate_complete_page(), except it ignores the page's