
[0/2] Use multi-index entries in the page cache

Message ID 20200629152033.16175-1-willy@infradead.org (mailing list archive)

Message

Matthew Wilcox June 29, 2020, 3:20 p.m. UTC
Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
but it turns out to be hard to support range writeback with the pagecache
using multiple single entries.  Of course, this isn't a problem for
shmem because it doesn't have a real backing device (only swap), but
real filesystems need to start writeback at arbitrary file offsets and
have problems if we don't notice that the first (huge) page in the range
is dirty.

Hugh, I would love it if you could test this.  It didn't introduce any new
regressions to the xfstests, but shmem does exercise different paths and
of course I don't have a clean xfstests run yet, so there could easily
still be bugs.

I'd like this to be included in mmotm, but it is less urgent than the
previous patch series that I've sent.  As far as risk, I think it only
affects shmem/tmpfs.

Matthew Wilcox (Oracle) (2):
  XArray: Add xas_split
  mm: Use multi-index entries in the page cache

 Documentation/core-api/xarray.rst |  16 ++--
 include/linux/xarray.h            |   2 +
 lib/test_xarray.c                 |  41 ++++++++
 lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
 mm/filemap.c                      |  42 ++++----
 mm/huge_memory.c                  |  21 +++-
 mm/khugepaged.c                   |  12 ++-
 mm/shmem.c                        |  11 +--
 8 files changed, 245 insertions(+), 53 deletions(-)

Comments

William Kucharski June 30, 2020, 3:16 a.m. UTC | #1
Another nice cleanup.

For the series:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>

> On Jun 29, 2020, at 9:20 AM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
> 
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
> 
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
> 
> Matthew Wilcox (Oracle) (2):
>  XArray: Add xas_split
>  mm: Use multi-index entries in the page cache
> 
> Documentation/core-api/xarray.rst |  16 ++--
> include/linux/xarray.h            |   2 +
> lib/test_xarray.c                 |  41 ++++++++
> lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
> mm/filemap.c                      |  42 ++++----
> mm/huge_memory.c                  |  21 +++-
> mm/khugepaged.c                   |  12 ++-
> mm/shmem.c                        |  11 +--
> 8 files changed, 245 insertions(+), 53 deletions(-)
> 
> -- 
> 2.27.0
> 
>
Hugh Dickins July 4, 2020, 8:20 p.m. UTC | #2
On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:

> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
> 
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
> 
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
> 
> Matthew Wilcox (Oracle) (2):
>   XArray: Add xas_split
>   mm: Use multi-index entries in the page cache
> 
>  Documentation/core-api/xarray.rst |  16 ++--
>  include/linux/xarray.h            |   2 +
>  lib/test_xarray.c                 |  41 ++++++++
>  lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
>  mm/filemap.c                      |  42 ++++----
>  mm/huge_memory.c                  |  21 +++-
>  mm/khugepaged.c                   |  12 ++-
>  mm/shmem.c                        |  11 +--
>  8 files changed, 245 insertions(+), 53 deletions(-)
> 
> -- 
> 2.27.0

I have been trying it, and it's not good yet. I had hoped to work
out what's wrong and send a patch, but I'm not making progress,
so better hand back to you with what I've found.

First problem was that it did not quite build on top of 5.8-rc3
plus your 1-7 THP prep patches: was there some other series we
were supposed to add in too? If so, that might make a big difference,
but I fixed up __add_to_page_cache_locked() as in the diff below
(and I'm not bothering about hugetlbfs, so haven't looked to see if
its page indexing is or isn't still exceptional with your series).

Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
after warning from shmem_evict_inode(). Surprisingly, that first
happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
while doing an "rm -rf".

I've seen it on other machines since, with THP enabled, doing other
things; but have not seen it in the last couple of days, which is
intriguing (though partly consequence of narrowing down to xfstests).

The original non-THP machine ran the same load for
ten hours yesterday, but hit no problem. The only significant
difference in what ran successfully, is that I've been surprised
by all the non-zero entries I saw in xarray nodes, exceeding
total entry "count" (I've also been bothered by non-zero "offset"
at root, but imagine that's just noise that never gets used).
So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
kmem_cache_zalloc()s, as in the diff below: not suggesting that
as necessary, just a temporary precaution in case something is
not being initialized as intended.

I've also changed shmem_evict_inode()'s warning to BUG to crash
sooner, as in the diff below; but not hit it since doing so.

Here's a script to run in the xfstests directory, just running
the tests I've seen frequent problems from: maybe you were not
choosing the "huge=always" option to tmpfs?  An alternative is
"echo force >/sys/kernel/mm/transparent_hugepage/shmem_enabled":
that's good when running random tests that you cannot poke into;
but can cause crashes if the graphics driver e.g. i915 is using
shmem, and then gets hugepages it was not expecting.

sync
export FSTYP=tmpfs
export DISABLE_UDF_TEST=1
export TEST_DEV=tmpfs1:
export TEST_DIR=/xft		# I have that, you'll have somewhere else
export SCRATCH_DEV=tmpfs2:
export SCRATCH_MNT=/mnt
export TMPFS_MOUNT_OPTIONS="-o size=1088M,huge=always"
mount -t $FSTYP $TMPFS_MOUNT_OPTIONS $TEST_DEV $TEST_DIR || exit $?
./check generic/083 generic/224 generic/228 generic/269 generic/476
umount /xft /mnt 2>/dev/null

These problems were either mm/filemap.c:1565 find_lock_entry()
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
(at least the ones that I went on to investigate) turned out also to be
find_lock_entry(), circling around with page_mapping(page) != mapping.
It seems that find_get_entry() is sometimes supplying the wrong page,
and you will work out why much quicker than I shall.  (One tantalizing
detail of the bad offset crashes: very often page pgoff is exactly one
less than the requested offset.)

I have modified find_lock_entry() as in the diff below, to be stricter:
we all prefer to investigate a crash than a hang, and I think it's been
unnecessarily forgiving of page->mapping. I might be wrong (fuse? page
migration? splice stealing? shmem swap? THP collapse? THP splitting?)
but I believe they are all okay with a simple head->mapping check there.

Hugh

--- 5083w/lib/radix-tree.c	2020-06-14 15:13:01.406042750 -0700
+++ 5083wh/lib/radix-tree.c	2020-07-04 11:35:11.433181700 -0700
@@ -249,7 +249,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		 * cache first for the new node to get accounted to the memory
 		 * cgroup.
 		 */
-		ret = kmem_cache_alloc(radix_tree_node_cachep,
+		ret = kmem_cache_zalloc(radix_tree_node_cachep,
 				       gfp_mask | __GFP_NOWARN);
 		if (ret)
 			goto out;
@@ -257,7 +257,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		/*
 		 * Provided the caller has preloaded here, we will always
 		 * succeed in getting a node here (and never reach
-		 * kmem_cache_alloc)
+		 * kmem_cache_zalloc)
 		 */
 		rtp = this_cpu_ptr(&radix_tree_preloads);
 		if (rtp->nr) {
@@ -272,7 +272,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		kmemleak_update_trace(ret);
 		goto out;
 	}
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 out:
 	BUG_ON(radix_tree_is_internal_node(ret));
 	if (ret) {
@@ -334,7 +334,7 @@ static __must_check int __radix_tree_pre
 	rtp = this_cpu_ptr(&radix_tree_preloads);
 	while (rtp->nr < nr) {
 		local_unlock(&radix_tree_preloads.lock);
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		local_lock(&radix_tree_preloads.lock);
--- 5083w/mm/filemap.c	2020-06-30 11:38:58.094010948 -0700
+++ 5083wh/mm/filemap.c	2020-07-04 11:35:11.437181729 -0700
@@ -824,6 +824,7 @@ static int __add_to_page_cache_locked(st
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	unsigned int nr = thp_nr_pages(page);
 	int huge = PageHuge(page);
 	int error;
 	void *old;
@@ -859,12 +860,13 @@ static int __add_to_page_cache_locked(st
 		xas_store(&xas, page);
 		if (xas_error(&xas))
 			goto unlock;
+
 		mapping->nrexceptional -= exceptional;
 		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge)
-			__inc_lruvec_page_state(page, NR_FILE_PAGES);
+			__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -1548,19 +1550,26 @@ out:
  */
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
-
+	struct page *page, *head;
+	pgoff_t pgoff;
 repeat:
 	page = find_get_entry(mapping, offset);
 	if (page && !xa_is_value(page)) {
-		lock_page(page);
+		head = compound_head(page);
+		lock_page(head);
 		/* Has the page been truncated? */
-		if (unlikely(page_mapping(page) != mapping)) {
-			unlock_page(page);
+		if (unlikely(!head->mapping)) {
+			unlock_page(head);
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+		pgoff = head->index + (page - head);
+		if (PageSwapCache(head) ||
+		    head->mapping != mapping || pgoff != offset) {
+			printk("page=%px pgoff=%lx offset=%lx headmapping=%px mapping=%px\n",
+				page, pgoff, offset, head->mapping, mapping);
+			VM_BUG_ON_PAGE(1, page);
+		}
 	}
 	return page;
 }
--- 5083w/mm/shmem.c	2020-06-30 11:38:58.098010985 -0700
+++ 5083wh/mm/shmem.c	2020-07-04 11:35:11.441181757 -0700
@@ -1108,7 +1108,12 @@ static void shmem_evict_inode(struct ino
 	}
 
 	simple_xattrs_free(&info->xattrs);
-	WARN_ON(inode->i_blocks);
+	if (inode->i_blocks) {
+		printk("mapping=%px i_blocks=%lx nrpages=%lx nrexcept=%lx\n",
+			inode->i_mapping, (unsigned long)inode->i_blocks,
+			inode->i_data.nrpages, inode->i_data.nrexceptional);
+		BUG();
+	}
 	shmem_free_inode(inode->i_sb);
 	clear_inode(inode);
 }
Matthew Wilcox July 6, 2020, 2:43 p.m. UTC | #3
On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:
> > Hugh, I would love it if you could test this.  It didn't introduce any new
> > regressions to the xfstests, but shmem does exercise different paths and
> > of course I don't have a clean xfstests run yet, so there could easily
> > still be bugs.
> 
> I have been trying it, and it's not good yet. I had hoped to work
> out what's wrong and send a patch, but I'm not making progress,
> so better hand back to you with what I've found.

Thank you so much!  I've made some progress.

> First problem was that it did not quite build on top of 5.8-rc3
> plus your 1-7 THP prep patches: was there some other series we
> were supposed to add in too? If so, that might make a big difference,
> but I fixed up __add_to_page_cache_locked() as in the diff below
> (and I'm not bothering about hugetlbfs, so haven't looked to see if
> its page indexing is or isn't still exceptional with your series).

Oops.  I shifted some patches around and clearly didn't get it quite
right.  I'll fix it up.

> Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
> after warning from shmem_evict_inode(). Surprisingly, that first
> happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
> while doing an "rm -rf".

I've seen that occasionally too.

> The original non-THP machine ran the same load for
> ten hours yesterday, but hit no problem. The only significant
> difference in what ran successfully, is that I've been surprised
> by all the non-zero entries I saw in xarray nodes, exceeding
> total entry "count" (I've also been bothered by non-zero "offset"
> at root, but imagine that's just noise that never gets used).
> So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
> kmem_cache_zalloc()s, as in the diff below: not suggesting that
> as necessary, just a temporary precaution in case something is
> not being initialized as intended.

Umm.  ->count should always be accurate and match the number of non-NULL
entries in a node.  The zalloc shouldn't be necessary, and will probably
break the workingset code.  Actually, it should BUG because we have both
a constructor and an instruction to zero the allocation, and they can't
both be right.
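
A minimal sketch, not from the thread, of the conflict being pointed out
here: the node constructor leaves node->private_list as an initialised,
self-pointing list head, which the workingset shadow-node LRU relies on,
while zero-filling every allocation wipes that state, so list_empty() no
longer reports the list as empty and the list debug checks trip on the
next list_add()/list_del().  The names below are made up for illustration;
"demo_node" stands in for struct xa_node.

#include <linux/bug.h>
#include <linux/list.h>
#include <linux/string.h>

struct demo_node {
	struct list_head private_list;	/* linked into the workingset LRU */
};

static void demo_zeroing_breaks_ctor_state(struct demo_node *node)
{
	/* What the slab constructor establishes for each object. */
	INIT_LIST_HEAD(&node->private_list);
	WARN_ON(!list_empty(&node->private_list));	/* empty: next == &list */

	/* What switching to kmem_cache_zalloc() adds on every allocation. */
	memset(node, 0, sizeof(*node));
	WARN_ON(!list_empty(&node->private_list));	/* fires: next is NULL */
}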

You're right that ->offset is never used at root.  I had plans to
repurpose that to support smaller files more efficiently, but never
got round to implementing those plans.

> These problems were either mm/filemap.c:1565 find_lock_entry()
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> (at least the ones that I went on to investigate) turned out also to be
> find_lock_entry(), circling around with page_mapping(page) != mapping.
> It seems that find_get_entry() is sometimes supplying the wrong page,
> and you will work out why much quicker than I shall.  (One tantalizing
> detail of the bad offset crashes: very often page pgoff is exactly one
> less than the requested offset.)

I added this:

@@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
                goto repeat;
        }
        page = find_subpage(page, offset);
+       if (page_to_index(page) != offset) {
+               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
+               dump_page(page, "index mismatch");
+               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
+       }
 out:
        rcu_read_unlock();
 
and I have a good clue now:

1322 offset 631 xas index 631 offset 48
1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
1322 page dumped because: index mismatch
1322 xa_load 000000008c9a9bc3

0x276 is decimal 630.  So we're looking up a tail page and getting its
erstwhile head page.  I'll dig in and figure out exactly how that's
happening.
Matthew Wilcox July 6, 2020, 6:50 p.m. UTC | #4
On Mon, Jul 06, 2020 at 03:43:20PM +0100, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to be
> > find_lock_entry(), circling around with page_mapping(page) != mapping.
> > It seems that find_get_entry() is sometimes supplying the wrong page,
> > and you will work out why much quicker than I shall.  (One tantalizing
> > detail of the bad offset crashes: very often page pgoff is exactly one
> > less than the requested offset.)
> 
> I added this:
> 
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>                 goto repeat;
>         }
>         page = find_subpage(page, offset);
> +       if (page_to_index(page) != offset) {
> +               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
> +               dump_page(page, "index mismatch");
> +               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +       }
>  out:
>         rcu_read_unlock();
>  
> and I have a good clue now:
> 
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
> 
> 0x276 is decimal 630.  So we're looking up a tail page and getting its
> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

@@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
                nr = thp_nr_pages(page);
        }
 
+       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
        page_ref_add(page, nr);
        page->mapping = mapping;
        page->index = offset;
@@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
                goto error;
        }
 
+       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
        trace_mm_filemap_add_to_page_cache(page);
        return 0;
 error:

The second one triggers with generic/051 (running against xfs with the
rest of my patches).  So I deduce that we have a shadow entry which
takes up multiple indices, then when we store the page, we now have
a multi-index entry which refers to a single page.  And this explains
basically all the accounting problems.

Now I'm musing how best to fix this.

1. When removing a compound page from the cache, we could store only
a single entry.  That seems bad because we could hit somewhere else in
the compound page and we'd have the wrong information about workingset
history (or worse believe that a shmem page isn't in swap!)

2. When removing a compound page from the cache, we could split the
entry and store the same entry N times.  Again, this seems bad for shmem
because then all the swap entries would be the same, and we'd fetch the
wrong data from swap.

3 & 4. When adding a page to the cache, we delete any shadow entry which was
previously there, or replicate the shadow entry.  Same drawbacks as the
above two, I feel.

5. Use the size of the shadow entry to allocate a page of the right size.
We don't currently have an API to find the right size, so that would
need to be written.  And what do we do if we can't allocate a page of
sufficient size?

So that's five ideas with their drawbacks as I see them.  Maybe you have
a better idea?
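
Not part of the series, but to make idea 5 concrete: a sketch of what the
missing "how big is this entry" API could look like, counting the run of
sibling slots the way the XArray lays them out (mainline later grew an
xa_get_order() along these lines).  demo_entry_order() is a made-up name,
and it assumes the private xa_node/xa_entry()/xa_is_sibling() definitions
from <linux/xarray.h> are visible, as they are inside lib/xarray.c.

#include <linux/xarray.h>

/* Illustration only: report the order of the entry at @index, if any. */
static unsigned int demo_entry_order(struct xarray *xa, unsigned long index)
{
	XA_STATE(xas, xa, index);
	unsigned int order = 0;
	void *entry;

	rcu_read_lock();
	entry = xas_load(&xas);
	if (!entry || !xas.xa_node)
		goto unlock;

	/* Count how far the run of sibling slots extends beyond this one. */
	for (;;) {
		unsigned int slot = xas.xa_offset + (1 << order);

		if (slot >= XA_CHUNK_SIZE)
			break;
		if (!xa_is_sibling(xa_entry(xas.xa, xas.xa_node, slot)))
			break;
		order++;
	}
	order += xas.xa_node->shift;
unlock:
	rcu_read_unlock();
	return order;
}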
Hugh Dickins July 7, 2020, 3:21 a.m. UTC | #5
On Mon, 6 Jul 2020, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> 
> > The original non-THP machine ran the same load for
> > ten hours yesterday, but hit no problem. The only significant
> > difference in what ran successfully, is that I've been surprised
> > by all the non-zero entries I saw in xarray nodes, exceeding
> > total entry "count" (I've also been bothered by non-zero "offset"
> > at root, but imagine that's just noise that never gets used).
> > So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
> > kmem_cache_zalloc()s, as in the diff below: not suggesting that
> > as necessary, just a temporary precaution in case something is
> > not being initialized as intended.
> 
> Umm.  ->count should always be accurate and match the number of non-NULL
> entries in a node.  The zalloc shouldn't be necessary, and will probably
> break the workingset code.  Actually, it should BUG because we have both
> a constructor and an instruction to zero the allocation, and they can't
> both be right.

Those kmem_cache_zalloc()s did not cause BUGs because I did them in
lib/radix-tree.c: it occurred to me yesterday morning that that was
an odd choice to have made! So I then extended them to lib/xarray.c,
immediately hit the BUGs you point out, so also added INIT_LIST_HEADs.

And very interestingly, the little huge tmpfs xfstests script I gave
last time then usually succeeds without any crash; but not always.

To me that suggests that there's somewhere you omit to clear a slot;
and that usually those xfstests don't encounter that stale slot again,
before the node is recycled for use elsewhere (with the zalloc making
it right again); but occasionally the tests do hit the bogus entry
before the node is recycled.

Before the zallocs, I had noticed a node with count 4, but filled
with non-zero entries (most of them looking like huge head pointers).

> 
> You're right that ->offset is never used at root.  I had plans to
> repurpose that to support smaller files more efficiently, but never
> got round to implementing those plans.
> 
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to be
> > find_lock_entry(), circling around with page_mapping(page) != mapping.
> > It seems that find_get_entry() is sometimes supplying the wrong page,
> > and you will work out why much quicker than I shall.  (One tantalizing
> > detail of the bad offset crashes: very often page pgoff is exactly one
> > less than the requested offset.)
> 
> I added this:
> 
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>                 goto repeat;
>         }
>         page = find_subpage(page, offset);
> +       if (page_to_index(page) != offset) {
> +               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);

(It's much easier to spot huge page issues with %x than with %d.)

> +               dump_page(page, "index mismatch");
> +               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +       }
>  out:
>         rcu_read_unlock();
>  
> and I have a good clue now:
> 
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276

(You appear to have more patience with all this ghastly hashing of
kernel addresses in debug messages than I have: it really gets in our way.)

> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
> 
> 0x276 is decimal 630.  So we're looking up a tail page and getting its

Index 630 for offset 631, yes, that's exactly the kind of off-by-one
I saw so frequently.

> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

Very pleasing to see the word "erstwhile" in use (and I'm serious about
that); but I don't get your head for tail point - I can't make heads or
tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Certainly it's finding a bogus page, and that's related to the multi-order
splitting (I saw no problem with your 1-7 series); I think it's from a
stale entry being left behind. (But that does not at all explain the
off-by-one pattern.)

Hugh
Hugh Dickins July 7, 2020, 3:43 a.m. UTC | #6
On Mon, 6 Jul 2020, Matthew Wilcox wrote:
> 
> @@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 nr = thp_nr_pages(page);
>         }
>  
> +       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>         page_ref_add(page, nr);
>         page->mapping = mapping;
>         page->index = offset;
> @@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 goto error;
>         }
>  
> +       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>         trace_mm_filemap_add_to_page_cache(page);
>         return 0;
>  error:
> 
> The second one triggers with generic/051 (running against xfs with the
> rest of my patches).  So I deduce that we have a shadow entry which
> takes up multiple indices, then when we store the page, we now have
> a multi-index entry which refers to a single page.  And this explains
> basically all the accounting problems.

I think you are jumping too far ahead by bringing in xfs and your
later patches.  Don't let me stop you from thinking ahead, but the
problem at hand is with tmpfs.

tmpfs doesn't use shadow entries, or not where we use the term "shadow"
for workingset business: it does save swap entries as xarray values,
but I don't suppose the five xfstests I was running get into swap
(without applying additional pressure); and (since a huge page
gets split before shmem swapout, at present anyway) I don't see
a philosophical problem about their multi-index entries.

Multi-index shadows, sorry, not a subject I can think about at present.

> 
> Now I'm musing how best to fix this.
> 
> 1. When removing a compound page from the cache, we could store only
> a single entry.  That seems bad because we could hit somewhere else in
> the compound page and we'd have the wrong information about workingset
> history (or worse believe that a shmem page isn't in swap!)
> 
> 2. When removing a compound page from the cache, we could split the
> entry and store the same entry N times.  Again, this seems bad for shmem
> because then all the swap entries would be the same, and we'd fetch the
> wrong data from swap.
> 
> 3 & 4. When adding a page to the cache, we delete any shadow entry which was
> previously there, or replicate the shadow entry.  Same drawbacks as the
> above two, I feel.
> 
> 5. Use the size of the shadow entry to allocate a page of the right size.
> We don't currently have an API to find the right size, so that would
> need to be written.  And what do we do if we can't allocate a page of
> sufficient size?
> 
> So that's five ideas with their drawbacks as I see them.  Maybe you have
> a better idea?
>
Matthew Wilcox July 7, 2020, 3:49 a.m. UTC | #7
On Mon, Jul 06, 2020 at 08:21:54PM -0700, Hugh Dickins wrote:
> > and I have a good clue now:
> > 
> > 1322 offset 631 xas index 631 offset 48
> > 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
> 
> (You appear to have more patience with all this ghastly hashing of
> kernel addresses in debug messages than I have: it really gets in our way.)
> 
> > 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> > 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> > 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> > 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> > 1322 page dumped because: index mismatch
> > 1322 xa_load 000000008c9a9bc3
> > 
> > 0x276 is decimal 630.  So we're looking up a tail page and getting its
> 
> Index 630 for offset 631, yes, that's exactly the kind of off-by-one
> I saw so frequently.
> 
> > erstwhile head page.  I'll dig in and figure out exactly how that's
> > happening.
> 
> Very pleasing to see the word "erstwhile" in use (and I'm serious about
> that); but I don't get your head for tail point - I can't make heads or
> tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Ah, I'm working with another 40 or so patches on top of the single patch
I sent you, which includes the patches to make XFS use thp, and make thps
arbitrary order.  In this instance, we've hit what was probably an order-2
or order-4 page, not an order-9 page.

> Certainly it's finding a bogus page, and that's related to the multi-order
> splitting (I saw no problem with your 1-7 series); I think it's from a
> stale entry being left behind. (But that does not at all explain the
> off-by-one pattern.)

I think I understand that.  We create an order-4 page, and then swap
it out.  That leaves us with a shadow entry which stretches across 16
indices (0x270-0x27f).  Then we access one of those 16 indices again
and store that page (whichever index we happened to be looking at) in
the page cache, but the XArray doesn't know that it's supposed to be an
order-0 store, so it helpfully replaces the 'head' page (at 0x270) with
this one for 0x276.  Usually, we'll go on to access the page at 0x277,
and instead of seeing no page there, we find the wrong page and bring
everything to a juddering halt.
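
A minimal XArray sketch, not from the thread, of that sequence (assuming
CONFIG_XARRAY_MULTI): a value entry standing in for the order-4 shadow is
stored over 0x270-0x27f, then a plain order-0 store at 0x276 lands on top
of it, after which the neighbouring indices resolve to the newly stored
page.  DEFINE_XARRAY(demo) and the page pointer are illustrative stand-ins
for mapping->i_pages and the page being added back.

#include <linux/xarray.h>

static DEFINE_XARRAY(demo);

static void demo_shadow_overwrite(void *page)	/* page re-added at 0x276 */
{
	/* The swapped-out order-4 page left a shadow over 0x270-0x27f. */
	XA_STATE_ORDER(xas, &demo, 0x270, 4);

	do {
		xas_lock(&xas);
		xas_store(&xas, xa_mk_value(0x1234));	/* stand-in shadow value */
		xas_unlock(&xas);
	} while (xas_nomem(&xas, GFP_KERNEL));

	/*
	 * An order-0 store at 0x276, which is what adding the page back to
	 * the cache does today, replaces the head of the multi-index entry
	 * while the sibling slots still point at it.
	 */
	xa_store(&demo, 0x276, page, GFP_KERNEL);

	/* Every index in 0x270-0x27f now returns this page: the mismatch. */
	WARN_ON(xa_load(&demo, 0x277) == page);
}

That final lookup is the same observation as the xa_load() VM_BUG_ON added
to __add_to_page_cache_locked() earlier in the thread.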

So seeing the common pattern of page at index n returned for page at
index n+1 is merely an artefact of our normal access patterns.  If we
were accessing random addresses, we'd've seen no pattern at all.

I'm halfway through a dual solution; one that first attempts to use the
information about the shadow entry to bring in the entire thp (if it was
in use as a thp before being paged out, we should probably try to keep
using it as a thp afterwards), but then splits the shadow entry across
the indices if the page being added is smaller than the shadow entry.
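
For what it's worth, a hedged sketch, not the posted series, of how the
"split the shadow entry" half of that dual solution might slot into
__add_to_page_cache_locked() in mm/filemap.c: probe the existing entry,
preallocate nodes, and split the shadow down before the order-0 store so
the store only claims a single index.  The xas_split_alloc()/xas_split()
signatures follow the versions that eventually landed in mainline,
demo_entry_order() is the made-up helper sketched earlier, and error
handling is omitted; the real series may well differ.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

static void demo_split_shadow_before_store(struct address_space *mapping,
					   struct page *page, pgoff_t offset,
					   gfp_t gfp)
{
	XA_STATE(xas, &mapping->i_pages, offset);
	void *old = xa_load(&mapping->i_pages, offset);
	unsigned int order = 0;

	if (xa_is_value(old)) {
		order = demo_entry_order(&mapping->i_pages, offset);
		if (order > compound_order(page))
			/* Preallocate the nodes the split will need. */
			xas_split_alloc(&xas, old, order, gfp);
	}

	xas_lock_irq(&xas);
	old = xas_load(&xas);
	if (xa_is_value(old) && order > compound_order(page)) {
		/* Break the shadow up so the store replaces only one index. */
		xas_split(&xas, old, order);
		xas_reset(&xas);
	}
	xas_store(&xas, page);
	xas_unlock_irq(&xas);
}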