Message ID: 20200629152033.16175-1-willy@infradead.org (mailing list archive)
Series: Use multi-index entries in the page cache
Another nice cleanup.  For the series:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>

> On Jun 29, 2020, at 9:20 AM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
>
> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
>
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
>
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
>
> Matthew Wilcox (Oracle) (2):
>   XArray: Add xas_split
>   mm: Use multi-index entries in the page cache
>
>  Documentation/core-api/xarray.rst |  16 ++--
>  include/linux/xarray.h            |   2 +
>  lib/test_xarray.c                 |  41 ++++++++
>  lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
>  mm/filemap.c                      |  42 ++++----
>  mm/huge_memory.c                  |  21 +++-
>  mm/khugepaged.c                   |  12 ++-
>  mm/shmem.c                        |  11 +--
>  8 files changed, 245 insertions(+), 53 deletions(-)
>
> --
> 2.27.0
On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:
> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
>
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
>
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
>
> Matthew Wilcox (Oracle) (2):
>   XArray: Add xas_split
>   mm: Use multi-index entries in the page cache
>
>  Documentation/core-api/xarray.rst |  16 ++--
>  include/linux/xarray.h            |   2 +
>  lib/test_xarray.c                 |  41 ++++++++
>  lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
>  mm/filemap.c                      |  42 ++++----
>  mm/huge_memory.c                  |  21 +++-
>  mm/khugepaged.c                   |  12 ++-
>  mm/shmem.c                        |  11 +--
>  8 files changed, 245 insertions(+), 53 deletions(-)
>
> --
> 2.27.0

I have been trying it, and it's not good yet.  I had hoped to work
out what's wrong and send a patch, but I'm not making progress,
so better hand back to you with what I've found.

First problem was that it did not quite build on top of 5.8-rc3
plus your 1-7 THP prep patches: was there some other series we
were supposed to add in too?  If so, that might make a big difference,
but I fixed up __add_to_page_cache_locked() as in the diff below
(and I'm not bothering about hugetlbfs, so haven't looked to see if
its page indexing is or isn't still exceptional with your series).

Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
after warning from shmem_evict_inode().  Surprisingly, that first
happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
while doing an "rm -rf".  I've seen it on other machines since, with
THP enabled, doing other things; but have not seen it in the last
couple of days, which is intriguing (though partly a consequence of
narrowing down to xfstests).

The original non-THP machine ran the same load for ten hours
yesterday, but hit no problem.  The only significant difference in
what ran successfully is that I've been surprised by all the non-zero
entries I saw in xarray nodes, exceeding the total entry "count"
(I've also been bothered by non-zero "offset" at root, but imagine
that's just noise that never gets used).  So I've changed the
kmem_cache_alloc()s in lib/radix-tree.c to kmem_cache_zalloc()s,
as in the diff below: not suggesting that as necessary, just a
temporary precaution in case something is not being initialized
as intended.

I've also changed shmem_evict_inode()'s warning to BUG to crash
sooner, as in the diff below; but have not hit it since doing so.

Here's a script to run in the xfstests directory, just running the
tests I've seen frequent problems from: maybe you were not choosing
the "huge=always" option to tmpfs?  An alternative is
"echo force >/sys/kernel/mm/transparent_hugepage/shmem_enabled":
that's good when running random tests that you cannot poke into;
but can cause crashes if the graphics driver e.g. i915 is using
shmem, and then gets hugepages it was not expecting.

sync
export FSTYP=tmpfs
export DISABLE_UDF_TEST=1
export TEST_DEV=tmpfs1:
export TEST_DIR=/xft	# I have that, you'll have somewhere else
export SCRATCH_DEV=tmpfs2:
export SCRATCH_MNT=/mnt
export TMPFS_MOUNT_OPTIONS="-o size=1088M,huge=always"
mount -t $FSTYP $TMPFS_MOUNT_OPTIONS $TEST_DEV $TEST_DIR || exit $?
./check generic/083 generic/224 generic/228 generic/269 generic/476
umount /xft /mnt 2>/dev/null

These problems were either mm/filemap.c:1565 find_lock_entry()
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
(at least the ones that I went on to investigate) turned out also to
be find_lock_entry(), circling around with page_mapping(page) !=
mapping.  It seems that find_get_entry() is sometimes supplying the
wrong page, and you will work out why much quicker than I shall.
(One tantalizing detail of the bad offset crashes: very often page
pgoff is exactly one less than the requested offset.)

I have modified find_lock_entry() as in the diff below, to be
stricter: we all prefer to investigate a crash than a hang, and I
think it's been unnecessarily forgiving of page->mapping.  I might
be wrong (fuse? page migration? splice stealing? shmem swap? THP
collapse? THP splitting?) but I believe they are all okay with a
simple head->mapping check there.

Hugh

--- 5083w/lib/radix-tree.c	2020-06-14 15:13:01.406042750 -0700
+++ 5083wh/lib/radix-tree.c	2020-07-04 11:35:11.433181700 -0700
@@ -249,7 +249,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		 * cache first for the new node to get accounted to the memory
 		 * cgroup.
 		 */
-		ret = kmem_cache_alloc(radix_tree_node_cachep,
+		ret = kmem_cache_zalloc(radix_tree_node_cachep,
 				       gfp_mask | __GFP_NOWARN);
 		if (ret)
 			goto out;
@@ -257,7 +257,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 	/*
 	 * Provided the caller has preloaded here, we will always
 	 * succeed in getting a node here (and never reach
-	 * kmem_cache_alloc)
+	 * kmem_cache_zalloc)
 	 */
 	rtp = this_cpu_ptr(&radix_tree_preloads);
 	if (rtp->nr) {
@@ -272,7 +272,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		kmemleak_update_trace(ret);
 		goto out;
 	}
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 out:
 	BUG_ON(radix_tree_is_internal_node(ret));
 	if (ret) {
@@ -334,7 +334,7 @@ static __must_check int __radix_tree_pre
 	rtp = this_cpu_ptr(&radix_tree_preloads);
 	while (rtp->nr < nr) {
 		local_unlock(&radix_tree_preloads.lock);
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		local_lock(&radix_tree_preloads.lock);
--- 5083w/mm/filemap.c	2020-06-30 11:38:58.094010948 -0700
+++ 5083wh/mm/filemap.c	2020-07-04 11:35:11.437181729 -0700
@@ -824,6 +824,7 @@ static int __add_to_page_cache_locked(st
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	unsigned int nr = thp_nr_pages(page);
 	int huge = PageHuge(page);
 	int error;
 	void *old;
@@ -859,12 +860,13 @@ static int __add_to_page_cache_locked(st
 		xas_store(&xas, page);
 		if (xas_error(&xas))
 			goto unlock;
+		mapping->nrexceptional -= exceptional;
 		mapping->nrpages += nr;
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge)
-			__inc_lruvec_page_state(page, NR_FILE_PAGES);
+			__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -1548,19 +1550,26 @@ out:
  */
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
-
+	struct page *page, *head;
+	pgoff_t pgoff;
 repeat:
 	page = find_get_entry(mapping, offset);
 	if (page && !xa_is_value(page)) {
-		lock_page(page);
+		head = compound_head(page);
+		lock_page(head);
 		/* Has the page been truncated? */
-		if (unlikely(page_mapping(page) != mapping)) {
-			unlock_page(page);
+		if (unlikely(!head->mapping)) {
+			unlock_page(head);
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+		pgoff = head->index + (page - head);
+		if (PageSwapCache(head) ||
+		    head->mapping != mapping || pgoff != offset) {
+			printk("page=%px pgoff=%lx offset=%lx headmapping=%px mapping=%px\n",
+				page, pgoff, offset, head->mapping, mapping);
+			VM_BUG_ON_PAGE(1, page);
+		}
 	}
 	return page;
 }
--- 5083w/mm/shmem.c	2020-06-30 11:38:58.098010985 -0700
+++ 5083wh/mm/shmem.c	2020-07-04 11:35:11.441181757 -0700
@@ -1108,7 +1108,12 @@ static void shmem_evict_inode(struct ino
 	}
 	simple_xattrs_free(&info->xattrs);
-	WARN_ON(inode->i_blocks);
+	if (inode->i_blocks) {
+		printk("mapping=%px i_blocks=%lx nrpages=%lx nrexcept=%lx\n",
+			inode->i_mapping, (unsigned long)inode->i_blocks,
+			inode->i_data.nrpages, inode->i_data.nrexceptional);
+		BUG();
+	}
 	shmem_free_inode(inode->i_sb);
 	clear_inode(inode);
 }
On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:
> > Hugh, I would love it if you could test this.  It didn't introduce any new
> > regressions to the xfstests, but shmem does exercise different paths and
> > of course I don't have a clean xfstests run yet, so there could easily
> > still be bugs.
>
> I have been trying it, and it's not good yet.  I had hoped to work
> out what's wrong and send a patch, but I'm not making progress,
> so better hand back to you with what I've found.

Thank you so much!  I've made some progress.

> First problem was that it did not quite build on top of 5.8-rc3
> plus your 1-7 THP prep patches: was there some other series we
> were supposed to add in too?  If so, that might make a big difference,
> but I fixed up __add_to_page_cache_locked() as in the diff below
> (and I'm not bothering about hugetlbfs, so haven't looked to see if
> its page indexing is or isn't still exceptional with your series).

Oops.  I shifted some patches around and clearly didn't get it quite
right.  I'll fix it up.

> Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
> after warning from shmem_evict_inode().  Surprisingly, that first
> happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
> while doing an "rm -rf".

I've seen that occasionally too.

> The original non-THP machine ran the same load for ten hours
> yesterday, but hit no problem.  The only significant difference in
> what ran successfully is that I've been surprised by all the non-zero
> entries I saw in xarray nodes, exceeding the total entry "count"
> (I've also been bothered by non-zero "offset" at root, but imagine
> that's just noise that never gets used).  So I've changed the
> kmem_cache_alloc()s in lib/radix-tree.c to kmem_cache_zalloc()s,
> as in the diff below: not suggesting that as necessary, just a
> temporary precaution in case something is not being initialized
> as intended.

Umm.  ->count should always be accurate and match the number of
non-NULL entries in a node.  The zalloc shouldn't be necessary, and
will probably break the workingset code.  Actually, it should BUG
because we have both a constructor and an instruction to zero the
allocation, and they can't both be right.

You're right that ->offset is never used at root.  I had plans to
repurpose that to support smaller files more efficiently, but never
got round to implementing those plans.

> These problems were either mm/filemap.c:1565 find_lock_entry()
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> (at least the ones that I went on to investigate) turned out also to
> be find_lock_entry(), circling around with page_mapping(page) !=
> mapping.  It seems that find_get_entry() is sometimes supplying the
> wrong page, and you will work out why much quicker than I shall.
> (One tantalizing detail of the bad offset crashes: very often page
> pgoff is exactly one less than the requested offset.)

I added this:

@@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 		goto repeat;
 	}
 	page = find_subpage(page, offset);
+	if (page_to_index(page) != offset) {
+		printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
+		dump_page(page, "index mismatch");
+		printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
+	}
 out:
 	rcu_read_unlock();

and I have a good clue now:

1322 offset 631 xas index 631 offset 48
1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
1322 page dumped because: index mismatch
1322 xa_load 000000008c9a9bc3

0x276 is decimal 630.  So we're looking up a tail page and getting its
erstwhile head page.  I'll dig in and figure out exactly how that's
happening.
On Mon, Jul 06, 2020 at 03:43:20PM +0100, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to
> > be find_lock_entry(), circling around with page_mapping(page) !=
> > mapping.  It seems that find_get_entry() is sometimes supplying the
> > wrong page, and you will work out why much quicker than I shall.
> > (One tantalizing detail of the bad offset crashes: very often page
> > pgoff is exactly one less than the requested offset.)
>
> I added this:
>
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>  		goto repeat;
>  	}
>  	page = find_subpage(page, offset);
> +	if (page_to_index(page) != offset) {
> +		printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
> +		dump_page(page, "index mismatch");
> +		printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +	}
>  out:
>  	rcu_read_unlock();
>
> and I have a good clue now:
>
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
>
> 0x276 is decimal 630.  So we're looking up a tail page and getting its
> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

@@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		nr = thp_nr_pages(page);
 	}

+	VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
 	page_ref_add(page, nr);
 	page->mapping = mapping;
 	page->index = offset;
@@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		goto error;
 	}

+	VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 error:

The second one triggers with generic/051 (running against xfs with the
rest of my patches).  So I deduce that we have a shadow entry which
takes up multiple indices, then when we store the page, we now have
a multi-index entry which refers to a single page.  And this explains
basically all the accounting problems.

Now I'm musing how best to fix this.

1. When removing a compound page from the cache, we could store only
a single entry.  That seems bad because we could hit somewhere else in
the compound page and we'd have the wrong information about workingset
history (or worse, believe that a shmem page isn't in swap!)

2. When removing a compound page from the cache, we could split the
entry and store the same entry N times.  Again, this seems bad for shmem
because then all the swap entries would be the same, and we'd fetch the
wrong data from swap.

3 & 4. When adding a page to the cache, we delete any shadow entry which
was previously there, or replicate the shadow entry.  Same drawbacks as
the above two, I feel.

5. Use the size of the shadow entry to allocate a page of the right size.
We don't currently have an API to find the right size, so that would
need to be written.  And what do we do if we can't allocate a page of
sufficient size?

So that's five ideas with their drawbacks as I see them.  Maybe you have
a better idea?
On Mon, 6 Jul 2020, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> >
> > The original non-THP machine ran the same load for ten hours
> > yesterday, but hit no problem.  The only significant difference in
> > what ran successfully is that I've been surprised by all the non-zero
> > entries I saw in xarray nodes, exceeding the total entry "count"
> > (I've also been bothered by non-zero "offset" at root, but imagine
> > that's just noise that never gets used).  So I've changed the
> > kmem_cache_alloc()s in lib/radix-tree.c to kmem_cache_zalloc()s,
> > as in the diff below: not suggesting that as necessary, just a
> > temporary precaution in case something is not being initialized
> > as intended.
>
> Umm.  ->count should always be accurate and match the number of
> non-NULL entries in a node.  The zalloc shouldn't be necessary, and
> will probably break the workingset code.  Actually, it should BUG
> because we have both a constructor and an instruction to zero the
> allocation, and they can't both be right.

Those kmem_cache_zalloc()s did not cause BUGs because I did them in
lib/radix-tree.c: it occurred to me yesterday morning that that was an
odd choice to have made!  So I then extended them to lib/xarray.c,
immediately hit the BUGs you point out, so also added INIT_LIST_HEADs.

And very interestingly, the little huge tmpfs xfstests script I gave
last time then usually succeeds without any crash; but not always.
To me that suggests that there's somewhere you omit to clear a slot;
and that usually those xfstests don't encounter that stale slot again
before the node is recycled for use elsewhere (with the zalloc making
it right again); but occasionally the tests do hit the bogus entry
before the node is recycled.  Before the zallocs, I had noticed a node
with count 4, but filled with non-zero entries (most of them looking
like huge head pointers).

> You're right that ->offset is never used at root.  I had plans to
> repurpose that to support smaller files more efficiently, but never
> got round to implementing those plans.
>
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to
> > be find_lock_entry(), circling around with page_mapping(page) !=
> > mapping.  It seems that find_get_entry() is sometimes supplying the
> > wrong page, and you will work out why much quicker than I shall.
> > (One tantalizing detail of the bad offset crashes: very often page
> > pgoff is exactly one less than the requested offset.)
>
> I added this:
>
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>  		goto repeat;
>  	}
>  	page = find_subpage(page, offset);
> +	if (page_to_index(page) != offset) {
> +		printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);

(It's much easier to spot huge page issues with %x than with %d.)

> +		dump_page(page, "index mismatch");
> +		printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +	}
>  out:
>  	rcu_read_unlock();
>
> and I have a good clue now:
>
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276

(You appear to have more patience with all this ghastly hashing of
kernel addresses in debug messages than I have: it really gets in
our way.)

> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
>
> 0x276 is decimal 630.  So we're looking up a tail page and getting its

Index 630 for offset 631, yes, that's exactly the kind of off-by-one
I saw so frequently.

> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

Very pleasing to see the word "erstwhile" in use (and I'm serious about
that); but I don't get your head for tail point - I can't make heads or
tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Certainly it's finding a bogus page, and that's related to the
multi-order splitting (I saw no problem with your 1-7 series); I think
it's from a stale entry being left behind.  (But that does not at all
explain the off-by-one pattern.)

Hugh
On Mon, 6 Jul 2020, Matthew Wilcox wrote:
>
> @@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
>  		nr = thp_nr_pages(page);
>  	}
>
> +	VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>  	page_ref_add(page, nr);
>  	page->mapping = mapping;
>  	page->index = offset;
> @@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
>  		goto error;
>  	}
>
> +	VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>  	trace_mm_filemap_add_to_page_cache(page);
>  	return 0;
>  error:
>
> The second one triggers with generic/051 (running against xfs with the
> rest of my patches).  So I deduce that we have a shadow entry which
> takes up multiple indices, then when we store the page, we now have
> a multi-index entry which refers to a single page.  And this explains
> basically all the accounting problems.

I think you are jumping too far ahead by bringing in xfs and your later
patches.  Don't let me stop you from thinking ahead, but the problem at
hand is with tmpfs.  tmpfs doesn't use shadow entries, or not where we
use the term "shadow" for workingset business: it does save swap entries
as xarray values, but I don't suppose the five xfstests I was running
get into swap (without applying additional pressure); and (since a huge
page gets split before shmem swapout, at present anyway) I don't see a
philosophical problem about their multi-index entries.

Multi-index shadows, sorry, not a subject I can think about at present.

> Now I'm musing how best to fix this.
>
> 1. When removing a compound page from the cache, we could store only
> a single entry.  That seems bad because we could hit somewhere else in
> the compound page and we'd have the wrong information about workingset
> history (or worse, believe that a shmem page isn't in swap!)
>
> 2. When removing a compound page from the cache, we could split the
> entry and store the same entry N times.  Again, this seems bad for
> shmem because then all the swap entries would be the same, and we'd
> fetch the wrong data from swap.
>
> 3 & 4. When adding a page to the cache, we delete any shadow entry
> which was previously there, or replicate the shadow entry.  Same
> drawbacks as the above two, I feel.
>
> 5. Use the size of the shadow entry to allocate a page of the right
> size.  We don't currently have an API to find the right size, so that
> would need to be written.  And what do we do if we can't allocate a
> page of sufficient size?
>
> So that's five ideas with their drawbacks as I see them.  Maybe you
> have a better idea?
On Mon, Jul 06, 2020 at 08:21:54PM -0700, Hugh Dickins wrote:
> > and I have a good clue now:
> >
> > 1322 offset 631 xas index 631 offset 48
> > 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
>
> (You appear to have more patience with all this ghastly hashing of
> kernel addresses in debug messages than I have: it really gets in
> our way.)
>
> > 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> > 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> > 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> > 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> > 1322 page dumped because: index mismatch
> > 1322 xa_load 000000008c9a9bc3
> >
> > 0x276 is decimal 630.  So we're looking up a tail page and getting its
>
> Index 630 for offset 631, yes, that's exactly the kind of off-by-one
> I saw so frequently.
>
> > erstwhile head page.  I'll dig in and figure out exactly how that's
> > happening.
>
> Very pleasing to see the word "erstwhile" in use (and I'm serious about
> that); but I don't get your head for tail point - I can't make heads or
> tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Ah, I'm working with another 40 or so patches on top of the single patch
I sent you, which includes the patches to make XFS use THPs, and to make
THPs arbitrary order.  In this instance, we've hit what was probably an
order-2 or order-4 page, not an order-9 page.

> Certainly it's finding a bogus page, and that's related to the
> multi-order splitting (I saw no problem with your 1-7 series); I think
> it's from a stale entry being left behind.  (But that does not at all
> explain the off-by-one pattern.)

I think I understand that.  We create an order-4 page, and then swap it
out.  That leaves us with a shadow entry which stretches across 16
indices (0x270-0x27f).  Then we access one of those 16 indices again
and store that page (whichever index we happened to be looking at) in
the page cache, but the XArray doesn't know that it's supposed to be an
order-0 store, so it helpfully replaces the 'head' page (at 0x270) with
this one for 0x276.  Usually, we'll go on to access the page at 0x277,
and instead of seeing no page there, we find the wrong page and bring
everything to a juddering halt.  So seeing the common pattern of the
page at index n returned for the page at index n+1 is merely an
artefact of our normal access patterns.  If we were accessing random
addresses, we'd have seen no pattern at all.

I'm halfway through a dual solution; one that first attempts to use the
information about the shadow entry to bring in the entire THP (if it was
in use as a THP before being paged out, we should probably try to keep
using it as a THP afterwards), but then splits the shadow entry across
the indices if the page being added is smaller than the shadow entry.