[RFC] page cache drop-behind

Message ID 20200610023939.GI19604@bombadil.infradead.org (mailing list archive)
State New, archived
Series [RFC] page cache drop-behind

Commit Message

Matthew Wilcox June 10, 2020, 2:39 a.m. UTC
Andres reported a problem recently where reading a file several times
the size of memory causes intermittent stalls.  My suspicion is that
page allocation eventually runs into the low watermark and starts to
do reclaim.  Some shrinkers take a long time to run and have a low chance
of actually freeing a page (eg the dentry cache needs to free 21 dentries
which all happen to be on the same pair of pages to free those two pages).

This patch attempts to free pages from the file that we're currently
reading from if there are no pages readily available.  If that doesn't
work, we'll run all the shrinkers just as we did before.
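
For a sense of the semantics, this is roughly the behaviour a streaming
reader can already ask for from user space with posix_fadvise(); a
minimal sketch, where the 128kB block size and 8MB keep-window are
arbitrary:

#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[128 * 1024];
	off_t done = 0;
	ssize_t n;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		done += n;
		/* Keep an 8MB window of recently-read pages; ask the
		 * kernel to drop everything behind that window.
		 */
		if (done > 8 << 20)
			posix_fadvise(fd, 0, done - (8 << 20),
					POSIX_FADV_DONTNEED);
	}
	close(fd);
	return 0;
}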

This should solve Andres' problem, although it's a bit narrow in scope.
It might be better to look through the inactive page list, regardless
of which file the pages were allocated for.  That could solve the
"weekly backup" problem with lots of little files.

I'm not really set up to do performance testing at the moment, so this
is just me thinking hard about the problem.

Comments

Andres Freund June 11, 2020, 8:30 p.m. UTC | #1
Hi Matthew,

That's great!


On 2020-06-09 19:39:39 -0700, Matthew Wilcox wrote:
> Andres reported a problem recently where reading a file several times
> the size of memory causes intermittent stalls.  My suspicion is that
> page allocation eventually runs into the low watermark and starts to
> do reclaim.  Some shrinkers take a long time to run and have a low chance
> of actually freeing a page (eg the dentry cache needs to free 21 dentries
> which all happen to be on the same pair of pages to free those two pages).

That meshes with some of what I saw in profiles, IIRC.

There's a related issue that I don't think this would solve, but which
could be addressed by some form of write-behind: in particular, I've
observed that reaching the dirty data limits often leads to a pretty
random selection of pages being written back.
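
By write-behind I mean something like the sketch below: start writeback
of each block as soon as it is written, so pages get cleaned roughly in
the order they were dirtied rather than in whatever order the dirty
limits happen to pick.  Illustrative only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write one block and immediately kick off asynchronous writeback for
 * it.  SYNC_FILE_RANGE_WRITE starts the I/O but does not wait for it.
 */
static ssize_t write_behind(int fd, const void *buf, size_t len,
			    off_t offset)
{
	ssize_t n = pwrite(fd, buf, len, offset);

	if (n > 0)
		sync_file_range(fd, offset, n, SYNC_FILE_RANGE_WRITE);
	return n;
}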


> This patch attempts to free pages from the file that we're currently
> reading from if there are no pages readily available.  If that doesn't
> work, we'll run all the shrinkers just as we did before.

> This should solve Andres' problem, although it's a bit narrow in scope.
> It might be better to look through the inactive page list, regardless
> of which file the pages were allocated for.  That could solve the
> "weekly backup" problem with lots of little files.

I wonder if there are cases where that'd cause problems:
1) If the pages selected are still dirty, doesn't this have a
   significant potential for additional stalls?
2) For some database/postgres tasks it's pretty common to occasionally
   do in-place writes where parts of the data are already in the page
   cache, but others aren't. A bit worried that this'd be more
   aggressive about throwing away pages that were already cached before
   the writes.


> I'm not really set up to do performance testing at the moment, so this
> is just me thinking hard about the problem.

The workload where I was observing the issue was creating backups of
larger postgres databases. I've attached the test program we've used.

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
  rm -f /srv/dev/bench/test* && \
  echo 3 |sudo tee /proc/sys/vm/drop_caches && \
  perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
     /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=0 --fallocate=0 --fadvise=0 --sequential=0 --filesize=$((100*1024*1024*1024)) /srv/dev/bench/test{1,2,3,4}

That was the last invocation I found in my shell history :)

I found that sync_file_range=1, fadvise=1 often gave considerably better
performance. Here's a pointer to the thread with more details (see also
my response downthread):
https://postgr.es/m/20200503023643.x2o2244sa4vkikyb%40alap3.anarazel.de
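
Roughly, those two switches make the test program do something like the
following after each block (a sketch of the pattern, not the attached
source):

#define _GNU_SOURCE
#include <fcntl.h>

/* Wait for writeback of an already-written range to complete, then
 * drop the now-clean pages.  POSIX_FADV_DONTNEED only discards clean
 * pages, hence the synchronous sync_file_range() first.
 */
static void drop_written_block(int fd, off_t offset, size_t len)
{
	sync_file_range(fd, offset, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}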

Greetings,

Andres Freund

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..3531e1808e24 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -111,9 +111,24 @@  int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 	}
 	return ret;
 }
-
 EXPORT_SYMBOL(read_cache_pages);
 
+/*
+ * Attempt to detect a streaming workload which exceeds memory and
+ * handle it by dropping the page cache behind the active part of the
+ * file.
+ */
+static void discard_behind(struct file *file, struct address_space *mapping)
+{
+	unsigned long keep = file->f_ra.ra_pages * 2;
+
+	if (mapping->nrpages < 1000)
+		return;
+	if (file->f_ra.start < keep)
+		return;
+	invalidate_mapping_pages(mapping, 0, file->f_ra.start - keep);
+}
+
 static void read_pages(struct readahead_control *rac, struct list_head *pages,
 		bool skip_page)
 {
@@ -179,6 +194,7 @@  void page_cache_readahead_unbounded(struct address_space *mapping,
 {
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	gfp_t light_gfp = gfp_mask & ~__GFP_DIRECT_RECLAIM;
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = file,
@@ -219,7 +235,11 @@  void page_cache_readahead_unbounded(struct address_space *mapping,
 			continue;
 		}
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(light_gfp);
+		if (!page) {
+			discard_behind(file, mapping);
+			page = __page_cache_alloc(gfp_mask);
+		}
 		if (!page)
 			break;
 		if (mapping->a_ops->readpages) {