Message ID | 20250410091020.119116-1-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages | expand |
On Thu, Apr 10, 2025 at 11:10:20AM +0200, David Hildenbrand wrote: > Alison reports an issue with fsdax when large extends end up using > large ZONE_DEVICE folios: > Passes the ndctl/dax unit tests. Tested-by: Alison Schofield <alison.schofield@intel.com> snip
David Hildenbrand wrote: > Alison reports an issue with fsdax when large extends end up using > large ZONE_DEVICE folios: > > [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00 > [ 417.796982] #PF: supervisor read access in kernel mode > [ 417.797540] #PF: error_code(0x0000) - not-present page > [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0 > [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI > [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ... > [ 417.800150] Tainted: [O]=OOT_MODULE > [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250 > [ 417.801948] Code: ... > [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206 > [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002 > [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae > [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000 > [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001 > [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580 > [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000 > [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0 > [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 417.811353] Call Trace: > [ 417.811709] <TASK> > [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230 > [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0 > [ 417.813132] insert_page+0x78/0xf0 > [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0 > [ 417.814088] dax_fault_iter+0x484/0x7b0 > [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620 > [ 417.815055] dax_iomap_fault+0x39/0x40 > [ 417.815499] __xfs_write_fault+0x139/0x380 > [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60 > [ 417.816483] xfs_write_fault+0x41/0x50 > [ 417.816966] xfs_filemap_fault+0x3b/0xe0 > [ 417.817424] __do_fault+0x31/0x180 > [ 417.817859] __handle_mm_fault+0xee1/0x1a60 > [ 417.818325] ? debug_smp_processor_id+0x17/0x20 > [ 417.818844] handle_mm_fault+0xe1/0x2b0 > [...] > > The issue is that when we split a large ZONE_DEVICE folio to order-0 > ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays > page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it > has folio->memcg_data set. And we never manually initialize > folio->memcg_data in fsdax code, because we never expect it to be set at > all. > > When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to > use folio->memcg_data (because it's non-NULL) but it does not actually > point at a memcg, resulting in the problem. > > Alison also observed that these folios sometimes have "locked" > set, which is rather concerning (folios locked from the beginning ...). > The reason is that the order for large folios is stored in page[1]->flags, > which become the folio->flags of a new small folio. > > Let's fix it by adding a folio helper to clear order/_nr_pages for > splitting purposes. > > Maybe we should reinitialize other large folio flags / folio members as > well when splitting, because they might similarly cause harm once > page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should > not be set for fsdax, so at least page[1]->flags might be as expected with > this fix. > > From a quick glimpse, initializing ->mapping, ->pgmap and ->share should > re-initialize most things from a previous page[1] used by large folios > that fsdax cares about. For example folio->private might not get > reinitialized, but maybe that's not relevant -- no traces of it's use in > fsdax code. Needs a closer look. > > Another thing that should be considered in the future is performing similar > checks as we perform in free_tail_page_prepare() -- checking pincount etc. > -- when freeing a large fsdax folio. > > Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page") > Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") > Reported-by: Alison Schofield <alison.schofield@intel.com> > Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: Christian Brauner <brauner@kernel.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Alistair Popple <apopple@nvidia.com> > Cc: Christoph Hellwig <hch@infradead.org> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > fs/dax.c | 1 + > include/linux/mm.h | 17 +++++++++++++++++ > 2 files changed, 18 insertions(+) Explanation excellent, folio_reset_order() looks correct to me and the callsite in fsdax looks correct. Reviewed-by: Dan Williams <dan.j.williams@intel.com> For consistency and clarity what about this incremental change, to make the __split_folio_to_order() path reuse folio_reset_order(), and use typical bitfield helpers for manipulating _flags_1? diff --git a/include/linux/mm.h b/include/linux/mm.h index bf55206935c4..5b614d31f4f6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -33,6 +33,7 @@ #include <linux/slab.h> #include <linux/cacheinfo.h> #include <linux/rcuwait.h> +#include <linux/bitfield.h> struct mempolicy; struct anon_vma; @@ -1171,7 +1172,7 @@ extern void prep_compound_page(struct page *page, unsigned int order); static inline unsigned int folio_large_order(const struct folio *folio) { - return folio->_flags_1 & 0xff; + return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1); } #ifdef NR_PAGES_IN_LARGE_FOLIO @@ -1229,7 +1230,8 @@ static inline void folio_reset_order(struct folio *folio) { if (WARN_ON_ONCE(!folio_test_large(folio))) return; - folio->_flags_1 &= ~0xffUL; + ClearPageCompound(&folio->page); + folio->_flags_1 &= ~FOLIO_ORDER_MASK; #ifdef NR_PAGES_IN_LARGE_FOLIO folio->_nr_pages = 0; #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f9..3dc2d98fde24 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -483,6 +483,8 @@ struct folio { }; }; +#define FOLIO_ORDER_MASK GENMASK(7, 0) + #define FOLIO_MATCH(pg, fl) \ static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl)) FOLIO_MATCH(flags, flags); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a47682d1ab7..301ca9459122 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order, if (new_order) folio_set_order(folio, new_order); else - ClearPageCompound(&folio->page); + folio_reset_order(folio); } /* diff --git a/mm/internal.h b/mm/internal.h index 50c2f590b2d0..41a4d2b66405 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -727,7 +727,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) if (WARN_ON_ONCE(!order || !folio_test_large(folio))) return; - folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; + folio->_flags_1 &= ~FOLIO_ORDER_MASK; + folio->_flags_1 |= FIELD_PREP(FOLIO_ORDER_MASK, order); #ifdef NR_PAGES_IN_LARGE_FOLIO folio->_nr_pages = 1U << order; #endif
On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote: > For consistency and clarity what about this incremental change, to make > the __split_folio_to_order() path reuse folio_reset_order(), and use > typical bitfield helpers for manipulating _flags_1? I dislike this intensely. It obfuscates rather than providing clarity. > static inline unsigned int folio_large_order(const struct folio *folio) > { > - return folio->_flags_1 & 0xff; > + return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1); > } > > #ifdef NR_PAGES_IN_LARGE_FOLIO
Matthew Wilcox wrote: > On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote: > > For consistency and clarity what about this incremental change, to make > > the __split_folio_to_order() path reuse folio_reset_order(), and use > > typical bitfield helpers for manipulating _flags_1? > > I dislike this intensely. It obfuscates rather than providing clarity. I'm used to pushing folks to use bitfield.h in driver land, but will not push it further here. What about this hunk? diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a47682d1ab7..301ca9459122 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order, if (new_order) folio_set_order(folio, new_order); else - ClearPageCompound(&folio->page); + folio_reset_order(folio); } /*
On Thu, Apr 10, 2025 at 01:46:06PM -0700, Dan Williams wrote: > Matthew Wilcox wrote: > > On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote: > > > For consistency and clarity what about this incremental change, to make > > > the __split_folio_to_order() path reuse folio_reset_order(), and use > > > typical bitfield helpers for manipulating _flags_1? > > > > I dislike this intensely. It obfuscates rather than providing clarity. > > I'm used to pushing folks to use bitfield.h in driver land, but will not > push it further here. I think it can make sense in places. Just not here. > What about this hunk? > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 2a47682d1ab7..301ca9459122 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order, > if (new_order) > folio_set_order(folio, new_order); > else > - ClearPageCompound(&folio->page); > + folio_reset_order(folio); > } I think that's wrong. We're splitting this folio into order-0 folios, but folio_reset_order() is going to modify folio->_flags_1 which is in the next page.
On 10.04.25 23:54, Matthew Wilcox wrote: > On Thu, Apr 10, 2025 at 01:46:06PM -0700, Dan Williams wrote: >> Matthew Wilcox wrote: >>> On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote: >>>> For consistency and clarity what about this incremental change, to make >>>> the __split_folio_to_order() path reuse folio_reset_order(), and use >>>> typical bitfield helpers for manipulating _flags_1? >>> >>> I dislike this intensely. It obfuscates rather than providing clarity. >> >> I'm used to pushing folks to use bitfield.h in driver land, but will not >> push it further here. > > I think it can make sense in places. Just not here. > >> What about this hunk? >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 2a47682d1ab7..301ca9459122 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order, >> if (new_order) >> folio_set_order(folio, new_order); >> else >> - ClearPageCompound(&folio->page); >> + folio_reset_order(folio); >> } > > I think that's wrong. We're splitting this folio into order-0 folios, > but folio_reset_order() is going to modify folio->_flags_1 which is in > the next page. Right, clearing in this context might only make sense at the very start. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a47682d1ab77..4cd8b394b83a5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3312,6 +3312,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order, long nr_pages = 1 << old_order; long i; + folio_reset_order(folio); + /* * Skip the first new_nr_pages, since the new folio from them have all * the flags from the original folio. While it looks cleaner, it's in practice not required here, because 1) We handle _nr_pages overlay by setting new_folio->memcg_data new_folio->memcg_data = folio->memcg_data; 2) We handle the order by setting new_folio->flags new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; That should clear all flags (excluding hwpoison), including the order. It's worth noting that in free_pages_prepare(), we handle both using if (compound) { page[1].flags &= ~PAGE_FLAGS_SECOND; #ifdef NR_PAGES_IN_LARGE_FOLIO folio->_nr_pages = 0; #endif
(adding CC list again, because I assume it was dropped by accident) >> diff --git a/fs/dax.c b/fs/dax.c >> index af5045b0f476e..676303419e9e8 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -396,6 +396,7 @@ static inline unsigned long dax_folio_put(struct folio *folio) >> order = folio_order(folio); >> if (!order) >> return 0; >> + folio_reset_order(folio); > > Wouldn't it be better to also move the loop below into this function? The intent > of this loop was to reinitialise the small folios after splitting which is what > I think this helper should be doing. As the function does nothing on small folios (as documented), I think this is good enough for now. Once we decouple folio from page, this code will likely have to change either way ... The first large folio will become a small folio (so resetting kind-of makes sense), but the other small folios would have to allocate a new "struct folio" for small folios. > >> for (i = 0; i < (1UL << order); i++) { >> struct dev_pagemap *pgmap = page_pgmap(&folio->page); >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index b7f13f087954b..bf55206935c46 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1218,6 +1218,23 @@ static inline unsigned int folio_order(const struct folio *folio) >> return folio_large_order(folio); >> } >> >> +/** >> + * folio_reset_order - Reset the folio order and derived _nr_pages >> + * @folio: The folio. >> + * >> + * Reset the order and derived _nr_pages to 0. Must only be used in the >> + * process of splitting large folios. >> + */ >> +static inline void folio_reset_order(struct folio *folio) >> +{ >> + if (WARN_ON_ONCE(!folio_test_large(folio))) >> + return; >> + folio->_flags_1 &= ~0xffUL; >> +#ifdef NR_PAGES_IN_LARGE_FOLIO >> + folio->_nr_pages = 0; >> +#endif >> +} >> + I'm still not sure if this splitting code in fs/dax.c is more similar to THP splitting or to "splitting when freeing in the buddy". I think it's something in between: we want small folios, but the new folios are essentially free. Likely, to be future-proof, we should also look into doing folio->_flags_1 &= ~PAGE_FLAGS_SECOND; Or alternatively (better?) new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; ... but that problem will go away once we decouple page from folio (see above), so I'm not sure if we should really do that at this point unless there is an issue.
On Fri, Apr 11, 2025 at 10:37:17AM +0200, David Hildenbrand wrote: > (adding CC list again, because I assume it was dropped by accident) Whoops. Thanks. > > > diff --git a/fs/dax.c b/fs/dax.c > > > index af5045b0f476e..676303419e9e8 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -396,6 +396,7 @@ static inline unsigned long dax_folio_put(struct folio *folio) > > > order = folio_order(folio); > > > if (!order) > > > return 0; > > > + folio_reset_order(folio); > > > > Wouldn't it be better to also move the loop below into this function? The intent > > of this loop was to reinitialise the small folios after splitting which is what > > I think this helper should be doing. > > As the function does nothing on small folios (as documented), I think this > is good enough for now. > > Once we decouple folio from page, this code will likely have to change > either way ... > > The first large folio will become a small folio (so resetting kind-of makes > sense), but the other small folios would have to allocate a new "struct > folio" for small folios. > > > > > > for (i = 0; i < (1UL << order); i++) { > > > struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index b7f13f087954b..bf55206935c46 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -1218,6 +1218,23 @@ static inline unsigned int folio_order(const struct folio *folio) > > > return folio_large_order(folio); > > > } > > > +/** > > > + * folio_reset_order - Reset the folio order and derived _nr_pages > > > + * @folio: The folio. > > > + * > > > + * Reset the order and derived _nr_pages to 0. Must only be used in the > > > + * process of splitting large folios. > > > + */ > > > +static inline void folio_reset_order(struct folio *folio) > > > +{ > > > + if (WARN_ON_ONCE(!folio_test_large(folio))) > > > + return; > > > + folio->_flags_1 &= ~0xffUL; > > > +#ifdef NR_PAGES_IN_LARGE_FOLIO > > > + folio->_nr_pages = 0; > > > +#endif > > > +} > > > + > > > I'm still not sure if this splitting code in fs/dax.c is more similar to THP > splitting or to "splitting when freeing in the buddy". I think it's > something in between: we want small folios, but the new folios are > essentially free. I'm not too familiar with the code for "splitting when freeing in the buddy" but conceptually that sounds pretty similar to what we're doing here. The large folio (and all pages within it) are free and we need to split it back to small free folios ready to be allocated again in dax_folio_init(). > Likely, to be future-proof, we should also look into doing > > folio->_flags_1 &= ~PAGE_FLAGS_SECOND; > > Or alternatively (better?) > > new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; That seems reasonable. > ... but that problem will go away once we decouple page from folio (see > above), so I'm not sure if we should really do that at this point unless > there is an issue. > > -- > Cheers, > > David / dhildenb >
diff --git a/fs/dax.c b/fs/dax.c index af5045b0f476e..676303419e9e8 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -396,6 +396,7 @@ static inline unsigned long dax_folio_put(struct folio *folio) order = folio_order(folio); if (!order) return 0; + folio_reset_order(folio); for (i = 0; i < (1UL << order); i++) { struct dev_pagemap *pgmap = page_pgmap(&folio->page); diff --git a/include/linux/mm.h b/include/linux/mm.h index b7f13f087954b..bf55206935c46 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1218,6 +1218,23 @@ static inline unsigned int folio_order(const struct folio *folio) return folio_large_order(folio); } +/** + * folio_reset_order - Reset the folio order and derived _nr_pages + * @folio: The folio. + * + * Reset the order and derived _nr_pages to 0. Must only be used in the + * process of splitting large folios. + */ +static inline void folio_reset_order(struct folio *folio) +{ + if (WARN_ON_ONCE(!folio_test_large(folio))) + return; + folio->_flags_1 &= ~0xffUL; +#ifdef NR_PAGES_IN_LARGE_FOLIO + folio->_nr_pages = 0; +#endif +} + #include <linux/huge_mm.h> /*
Alison reports an issue with fsdax when large extends end up using large ZONE_DEVICE folios: [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00 [ 417.796982] #PF: supervisor read access in kernel mode [ 417.797540] #PF: error_code(0x0000) - not-present page [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0 [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ... [ 417.800150] Tainted: [O]=OOT_MODULE [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250 [ 417.801948] Code: ... [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206 [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002 [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000 [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001 [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580 [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000 [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0 [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 417.811353] Call Trace: [ 417.811709] <TASK> [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230 [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0 [ 417.813132] insert_page+0x78/0xf0 [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0 [ 417.814088] dax_fault_iter+0x484/0x7b0 [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620 [ 417.815055] dax_iomap_fault+0x39/0x40 [ 417.815499] __xfs_write_fault+0x139/0x380 [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60 [ 417.816483] xfs_write_fault+0x41/0x50 [ 417.816966] xfs_filemap_fault+0x3b/0xe0 [ 417.817424] __do_fault+0x31/0x180 [ 417.817859] __handle_mm_fault+0xee1/0x1a60 [ 417.818325] ? debug_smp_processor_id+0x17/0x20 [ 417.818844] handle_mm_fault+0xe1/0x2b0 [...] The issue is that when we split a large ZONE_DEVICE folio to order-0 ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it has folio->memcg_data set. And we never manually initialize folio->memcg_data in fsdax code, because we never expect it to be set at all. When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to use folio->memcg_data (because it's non-NULL) but it does not actually point at a memcg, resulting in the problem. Alison also observed that these folios sometimes have "locked" set, which is rather concerning (folios locked from the beginning ...). The reason is that the order for large folios is stored in page[1]->flags, which become the folio->flags of a new small folio. Let's fix it by adding a folio helper to clear order/_nr_pages for splitting purposes. Maybe we should reinitialize other large folio flags / folio members as well when splitting, because they might similarly cause harm once page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should not be set for fsdax, so at least page[1]->flags might be as expected with this fix. From a quick glimpse, initializing ->mapping, ->pgmap and ->share should re-initialize most things from a previous page[1] used by large folios that fsdax cares about. For example folio->private might not get reinitialized, but maybe that's not relevant -- no traces of it's use in fsdax code. Needs a closer look. Another thing that should be considered in the future is performing similar checks as we perform in free_tail_page_prepare() -- checking pincount etc. -- when freeing a large fsdax folio. Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page") Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") Reported-by: Alison Schofield <alison.schofield@intel.com> Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> --- fs/dax.c | 1 + include/linux/mm.h | 17 +++++++++++++++++ 2 files changed, 18 insertions(+)