diff mbox series

[v1] fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages

Message ID 20250410091020.119116-1-david@redhat.com (mailing list archive)
State New
Headers show
Series [v1] fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages | expand

Commit Message

David Hildenbrand April 10, 2025, 9:10 a.m. UTC
Alison reports an issue with fsdax when large extends end up using
large ZONE_DEVICE folios:

[  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[  417.796982] #PF: supervisor read access in kernel mode
[  417.797540] #PF: error_code(0x0000) - not-present page
[  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[  417.800150] Tainted: [O]=OOT_MODULE
[  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[  417.801948] Code: ...
[  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  417.811353] Call Trace:
[  417.811709]  <TASK>
[  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
[  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
[  417.813132]  insert_page+0x78/0xf0
[  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
[  417.814088]  dax_fault_iter+0x484/0x7b0
[  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
[  417.815055]  dax_iomap_fault+0x39/0x40
[  417.815499]  __xfs_write_fault+0x139/0x380
[  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
[  417.816483]  xfs_write_fault+0x41/0x50
[  417.816966]  xfs_filemap_fault+0x3b/0xe0
[  417.817424]  __do_fault+0x31/0x180
[  417.817859]  __handle_mm_fault+0xee1/0x1a60
[  417.818325]  ? debug_smp_processor_id+0x17/0x20
[  417.818844]  handle_mm_fault+0xe1/0x2b0
[...]

The issue is that when we split a large ZONE_DEVICE folio to order-0
ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set. And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.

When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to
use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.

Alison also observed that these folios sometimes have "locked"
set, which is rather concerning (folios locked from the beginning ...).
The reason is that the order for large folios is stored in page[1]->flags,
which become the folio->flags of a new small folio.

Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.

Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once
page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should
not be set for fsdax, so at least page[1]->flags might be as expected with
this fix.

From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about. For example folio->private might not get
reinitialized, but maybe that's not relevant -- no traces of it's use in
fsdax code. Needs a closer look.

Another thing that should be considered in the future is performing similar
checks as we perform in free_tail_page_prepare() -- checking pincount etc.
-- when freeing a large fsdax folio.

Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
Reported-by: Alison Schofield <alison.schofield@intel.com>
Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/dax.c           |  1 +
 include/linux/mm.h | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Comments

Alison Schofield April 10, 2025, 7:12 p.m. UTC | #1
On Thu, Apr 10, 2025 at 11:10:20AM +0200, David Hildenbrand wrote:
> Alison reports an issue with fsdax when large extends end up using
> large ZONE_DEVICE folios:
>

Passes the ndctl/dax unit tests.

Tested-by: Alison Schofield <alison.schofield@intel.com>

snip
Dan Williams April 10, 2025, 8:15 p.m. UTC | #2
David Hildenbrand wrote:
> Alison reports an issue with fsdax when large extends end up using
> large ZONE_DEVICE folios:
> 
> [  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
> [  417.796982] #PF: supervisor read access in kernel mode
> [  417.797540] #PF: error_code(0x0000) - not-present page
> [  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
> [  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
> [  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
> [  417.800150] Tainted: [O]=OOT_MODULE
> [  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> [  417.801948] Code: ...
> [  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
> [  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
> [  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
> [  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
> [  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
> [  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
> [  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
> [  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
> [  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  417.811353] Call Trace:
> [  417.811709]  <TASK>
> [  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
> [  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
> [  417.813132]  insert_page+0x78/0xf0
> [  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
> [  417.814088]  dax_fault_iter+0x484/0x7b0
> [  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
> [  417.815055]  dax_iomap_fault+0x39/0x40
> [  417.815499]  __xfs_write_fault+0x139/0x380
> [  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
> [  417.816483]  xfs_write_fault+0x41/0x50
> [  417.816966]  xfs_filemap_fault+0x3b/0xe0
> [  417.817424]  __do_fault+0x31/0x180
> [  417.817859]  __handle_mm_fault+0xee1/0x1a60
> [  417.818325]  ? debug_smp_processor_id+0x17/0x20
> [  417.818844]  handle_mm_fault+0xe1/0x2b0
> [...]
> 
> The issue is that when we split a large ZONE_DEVICE folio to order-0
> ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays
> page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
> has folio->memcg_data set. And we never manually initialize
> folio->memcg_data in fsdax code, because we never expect it to be set at
> all.
> 
> When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to
> use folio->memcg_data (because it's non-NULL) but it does not actually
> point at a memcg, resulting in the problem.
> 
> Alison also observed that these folios sometimes have "locked"
> set, which is rather concerning (folios locked from the beginning ...).
> The reason is that the order for large folios is stored in page[1]->flags,
> which become the folio->flags of a new small folio.
> 
> Let's fix it by adding a folio helper to clear order/_nr_pages for
> splitting purposes.
> 
> Maybe we should reinitialize other large folio flags / folio members as
> well when splitting, because they might similarly cause harm once
> page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should
> not be set for fsdax, so at least page[1]->flags might be as expected with
> this fix.
> 
> From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
> re-initialize most things from a previous page[1] used by large folios
> that fsdax cares about. For example folio->private might not get
> reinitialized, but maybe that's not relevant -- no traces of it's use in
> fsdax code. Needs a closer look.
> 
> Another thing that should be considered in the future is performing similar
> checks as we perform in free_tail_page_prepare() -- checking pincount etc.
> -- when freeing a large fsdax folio.
> 
> Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
> Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
> Reported-by: Alison Schofield <alison.schofield@intel.com>
> Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/dax.c           |  1 +
>  include/linux/mm.h | 17 +++++++++++++++++
>  2 files changed, 18 insertions(+)

Explanation excellent, folio_reset_order() looks correct to me and the
callsite in fsdax looks correct.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

For consistency and clarity what about this incremental change, to make
the __split_folio_to_order() path reuse folio_reset_order(), and use
typical bitfield helpers for manipulating _flags_1?


diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c4..5b614d31f4f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
 #include <linux/rcuwait.h>
+#include <linux/bitfield.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1171,7 +1172,7 @@ extern void prep_compound_page(struct page *page, unsigned int order);
 
 static inline unsigned int folio_large_order(const struct folio *folio)
 {
-	return folio->_flags_1 & 0xff;
+	return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1);
 }
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
@@ -1229,7 +1230,8 @@ static inline void folio_reset_order(struct folio *folio)
 {
 	if (WARN_ON_ONCE(!folio_test_large(folio)))
 		return;
-	folio->_flags_1 &= ~0xffUL;
+	ClearPageCompound(&folio->page);
+	folio->_flags_1 &= ~FOLIO_ORDER_MASK;
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 	folio->_nr_pages = 0;
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..3dc2d98fde24 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -483,6 +483,8 @@ struct folio {
 	};
 };
 
+#define FOLIO_ORDER_MASK GENMASK(7, 0)
+
 #define FOLIO_MATCH(pg, fl)						\
 	static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
 FOLIO_MATCH(flags, flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..301ca9459122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 	if (new_order)
 		folio_set_order(folio, new_order);
 	else
-		ClearPageCompound(&folio->page);
+		folio_reset_order(folio);
 }
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index 50c2f590b2d0..41a4d2b66405 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -727,7 +727,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
 		return;
 
-	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
+	folio->_flags_1 &= ~FOLIO_ORDER_MASK;
+	folio->_flags_1 |= FIELD_PREP(FOLIO_ORDER_MASK, order);
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 	folio->_nr_pages = 1U << order;
 #endif
Matthew Wilcox April 10, 2025, 8:23 p.m. UTC | #3
On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote:
> For consistency and clarity what about this incremental change, to make
> the __split_folio_to_order() path reuse folio_reset_order(), and use
> typical bitfield helpers for manipulating _flags_1?

I dislike this intensely.  It obfuscates rather than providing clarity.

>  static inline unsigned int folio_large_order(const struct folio *folio)
>  {
> -	return folio->_flags_1 & 0xff;
> +	return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1);
>  }
>  
>  #ifdef NR_PAGES_IN_LARGE_FOLIO
Dan Williams April 10, 2025, 8:46 p.m. UTC | #4
Matthew Wilcox wrote:
> On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote:
> > For consistency and clarity what about this incremental change, to make
> > the __split_folio_to_order() path reuse folio_reset_order(), and use
> > typical bitfield helpers for manipulating _flags_1?
> 
> I dislike this intensely.  It obfuscates rather than providing clarity.

I'm used to pushing folks to use bitfield.h in driver land, but will not
push it further here.

What about this hunk?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..301ca9459122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 	if (new_order)
 		folio_set_order(folio, new_order);
 	else
-		ClearPageCompound(&folio->page);
+		folio_reset_order(folio);
 }
 
 /*
Matthew Wilcox April 10, 2025, 9:54 p.m. UTC | #5
On Thu, Apr 10, 2025 at 01:46:06PM -0700, Dan Williams wrote:
> Matthew Wilcox wrote:
> > On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote:
> > > For consistency and clarity what about this incremental change, to make
> > > the __split_folio_to_order() path reuse folio_reset_order(), and use
> > > typical bitfield helpers for manipulating _flags_1?
> > 
> > I dislike this intensely.  It obfuscates rather than providing clarity.
> 
> I'm used to pushing folks to use bitfield.h in driver land, but will not
> push it further here.

I think it can make sense in places.  Just not here.

> What about this hunk?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47682d1ab7..301ca9459122 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  	if (new_order)
>  		folio_set_order(folio, new_order);
>  	else
> -		ClearPageCompound(&folio->page);
> +		folio_reset_order(folio);
>  }

I think that's wrong.  We're splitting this folio into order-0 folios,
but folio_reset_order() is going to modify folio->_flags_1 which is in
the next page.
David Hildenbrand April 11, 2025, 8:28 a.m. UTC | #6
On 10.04.25 23:54, Matthew Wilcox wrote:
> On Thu, Apr 10, 2025 at 01:46:06PM -0700, Dan Williams wrote:
>> Matthew Wilcox wrote:
>>> On Thu, Apr 10, 2025 at 01:15:07PM -0700, Dan Williams wrote:
>>>> For consistency and clarity what about this incremental change, to make
>>>> the __split_folio_to_order() path reuse folio_reset_order(), and use
>>>> typical bitfield helpers for manipulating _flags_1?
>>>
>>> I dislike this intensely.  It obfuscates rather than providing clarity.
>>
>> I'm used to pushing folks to use bitfield.h in driver land, but will not
>> push it further here.
> 
> I think it can make sense in places.  Just not here.
> 
>> What about this hunk?
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2a47682d1ab7..301ca9459122 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>   	if (new_order)
>>   		folio_set_order(folio, new_order);
>>   	else
>> -		ClearPageCompound(&folio->page);
>> +		folio_reset_order(folio);
>>   }
> 
> I think that's wrong.  We're splitting this folio into order-0 folios,
> but folio_reset_order() is going to modify folio->_flags_1 which is in
> the next page.

Right, clearing in this context might only make sense at the very start.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab77..4cd8b394b83a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,6 +3312,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
         long nr_pages = 1 << old_order;
         long i;
  
+       folio_reset_order(folio);
+
         /*
          * Skip the first new_nr_pages, since the new folio from them have all
          * the flags from the original folio.


While it looks cleaner, it's in practice not required here, because

1) We handle _nr_pages overlay by setting new_folio->memcg_data

	new_folio->memcg_data = folio->memcg_data;

2) We handle the order by setting new_folio->flags

	new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

That should clear all flags (excluding hwpoison), including the order.


It's worth noting that in free_pages_prepare(), we handle both using

if (compound) {
	page[1].flags &= ~PAGE_FLAGS_SECOND;
#ifdef NR_PAGES_IN_LARGE_FOLIO
	folio->_nr_pages = 0;
#endif
David Hildenbrand April 11, 2025, 8:37 a.m. UTC | #7
(adding CC list again, because I assume it was dropped by accident)

>> diff --git a/fs/dax.c b/fs/dax.c
>> index af5045b0f476e..676303419e9e8 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -396,6 +396,7 @@ static inline unsigned long dax_folio_put(struct folio *folio)
>>   	order = folio_order(folio);
>>   	if (!order)
>>   		return 0;
>> +	folio_reset_order(folio);
> 
> Wouldn't it be better to also move the loop below into this function? The intent
> of this loop was to reinitialise the small folios after splitting which is what
> I think this helper should be doing.

As the function does nothing on small folios (as documented), I think 
this is good enough for now.

Once we decouple folio from page, this code will likely have to change 
either way ...

The first large folio will become a small folio (so resetting kind-of 
makes sense), but the other small folios would have to allocate a new 
"struct folio" for small folios.

> 
>>   	for (i = 0; i < (1UL << order); i++) {
>>   		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b7f13f087954b..bf55206935c46 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1218,6 +1218,23 @@ static inline unsigned int folio_order(const struct folio *folio)
>>   	return folio_large_order(folio);
>>   }
>>   
>> +/**
>> + * folio_reset_order - Reset the folio order and derived _nr_pages
>> + * @folio: The folio.
>> + *
>> + * Reset the order and derived _nr_pages to 0. Must only be used in the
>> + * process of splitting large folios.
>> + */
>> +static inline void folio_reset_order(struct folio *folio)
>> +{
>> +	if (WARN_ON_ONCE(!folio_test_large(folio)))
>> +		return;
>> +	folio->_flags_1 &= ~0xffUL;
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> +	folio->_nr_pages = 0;
>> +#endif
>> +}
>> +


I'm still not sure if this splitting code in fs/dax.c is more similar to 
THP splitting or to "splitting when freeing in the buddy". I think it's 
something in between: we want small folios, but the new folios are 
essentially free.

Likely, to be future-proof, we should also look into doing

folio->_flags_1 &= ~PAGE_FLAGS_SECOND;

Or alternatively (better?)

new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;


... but that problem will go away once we decouple page from folio (see 
above), so I'm not sure if we should really do that at this point unless 
there is an issue.
Alistair Popple April 14, 2025, 12:32 a.m. UTC | #8
On Fri, Apr 11, 2025 at 10:37:17AM +0200, David Hildenbrand wrote:
> (adding CC list again, because I assume it was dropped by accident)

Whoops. Thanks.

> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index af5045b0f476e..676303419e9e8 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -396,6 +396,7 @@ static inline unsigned long dax_folio_put(struct folio *folio)
> > >   	order = folio_order(folio);
> > >   	if (!order)
> > >   		return 0;
> > > +	folio_reset_order(folio);
> > 
> > Wouldn't it be better to also move the loop below into this function? The intent
> > of this loop was to reinitialise the small folios after splitting which is what
> > I think this helper should be doing.
> 
> As the function does nothing on small folios (as documented), I think this
> is good enough for now.
> 
> Once we decouple folio from page, this code will likely have to change
> either way ...
> 
> The first large folio will become a small folio (so resetting kind-of makes
> sense), but the other small folios would have to allocate a new "struct
> folio" for small folios.
> 
> > 
> > >   	for (i = 0; i < (1UL << order); i++) {
> > >   		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index b7f13f087954b..bf55206935c46 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1218,6 +1218,23 @@ static inline unsigned int folio_order(const struct folio *folio)
> > >   	return folio_large_order(folio);
> > >   }
> > > +/**
> > > + * folio_reset_order - Reset the folio order and derived _nr_pages
> > > + * @folio: The folio.
> > > + *
> > > + * Reset the order and derived _nr_pages to 0. Must only be used in the
> > > + * process of splitting large folios.
> > > + */
> > > +static inline void folio_reset_order(struct folio *folio)
> > > +{
> > > +	if (WARN_ON_ONCE(!folio_test_large(folio)))
> > > +		return;
> > > +	folio->_flags_1 &= ~0xffUL;
> > > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > > +	folio->_nr_pages = 0;
> > > +#endif
> > > +}
> > > +
> 
> 
> I'm still not sure if this splitting code in fs/dax.c is more similar to THP
> splitting or to "splitting when freeing in the buddy". I think it's
> something in between: we want small folios, but the new folios are
> essentially free.

I'm not too familiar with the code for "splitting when freeing in the buddy"
but conceptually that sounds pretty similar to what we're doing here. The large
folio (and all pages within it) are free and we need to split it back to small
free folios ready to be allocated again in dax_folio_init().

> Likely, to be future-proof, we should also look into doing
> 
> folio->_flags_1 &= ~PAGE_FLAGS_SECOND;
> 
> Or alternatively (better?)
> 
> new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

That seems reasonable.

> ... but that problem will go away once we decouple page from folio (see
> above), so I'm not sure if we should really do that at this point unless
> there is an issue.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
diff mbox series

Patch

diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..676303419e9e8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -396,6 +396,7 @@  static inline unsigned long dax_folio_put(struct folio *folio)
 	order = folio_order(folio);
 	if (!order)
 		return 0;
+	folio_reset_order(folio);
 
 	for (i = 0; i < (1UL << order); i++) {
 		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b7f13f087954b..bf55206935c46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1218,6 +1218,23 @@  static inline unsigned int folio_order(const struct folio *folio)
 	return folio_large_order(folio);
 }
 
+/**
+ * folio_reset_order - Reset the folio order and derived _nr_pages
+ * @folio: The folio.
+ *
+ * Reset the order and derived _nr_pages to 0. Must only be used in the
+ * process of splitting large folios.
+ */
+static inline void folio_reset_order(struct folio *folio)
+{
+	if (WARN_ON_ONCE(!folio_test_large(folio)))
+		return;
+	folio->_flags_1 &= ~0xffUL;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	folio->_nr_pages = 0;
+#endif
+}
+
 #include <linux/huge_mm.h>
 
 /*