diff mbox series

[RFC] mm: fix refcount check in mapping_evict_folio

Message ID f1f6909c39ffac4c24ba7feed4a561a61cecd742.1723573450.git.boris@bur.io (mailing list archive)
State New
Headers show
Series [RFC] mm: fix refcount check in mapping_evict_folio

Commit Message

Boris Burkov Aug. 13, 2024, 6:25 p.m. UTC
The commit e41c81d0d30e ("mm/truncate: Replace page_mapped() call in
invalidate_inode_page()") replaced the page_mapped(page) check with a
refcount check. However, this refcount check does not work as expected
with drop_caches, at least for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
from folio_attach_private. We would expect such pages to get dropped by
drop_caches. However, drop_caches calls into mapping_evict_folio via
mapping_try_invalidate which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
and if I change this refcount check to `if (folio_mapped(folio))`, they
are dropped by drop_caches as expected.

The following script produces such pages and uses drgn to further
analyze the state of the folios:
https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh
It should at least outline the basic idea for producing some btrfs
metadata via creating inlined-extent files.

My proposed fix for the issue is to add one more hardcoded refcount to
this check to account for the caller having a refcount on the page.
However, I am less familiar with the other caller into
mapping_evict_folio in the page fault path, so I am concerned this fix
will not work properly there, and would appreciate extra scrutiny there.

Fixes: e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
Signed-off-by: Boris Burkov <boris@bur.io>
---
 mm/truncate.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Shakeel Butt Aug. 13, 2024, 7:58 p.m. UTC | #1
CCing Miaohe Lin

On Tue, Aug 13, 2024 at 11:25:56AM GMT, Boris Burkov wrote:
> The commit e41c81d0d30e ("mm/truncate: Replace page_mapped() call in
> invalidate_inode_page()") replaced the page_mapped(page) check with a
> refcount check. However, this refcount check does not work as expected
> with drop_caches, at least for btrfs's metadata pages.
> 
> Btrfs has a per-sb metadata inode with cached pages, and when not in
> active use by btrfs, they have a refcount of 3. One from the initial
> call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
> from folio_attach_private. We would expect such pages to get dropped by
> drop_caches. However, drop_caches calls into mapping_evict_folio via
> mapping_try_invalidate which gets a reference on the folio with
> find_lock_entries(). As a result, these pages have a refcount of 4, and
> fail this check.
> 
> For what it's worth, such pages do get reclaimed under memory pressure,
> and if I change this refcount check to `if (folio_mapped(folio))`, they
> are dropped by drop_caches as expected.
> 
> The following script produces such pages and uses drgn to further
> analyze the state of the folios:
> https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh
> It should at least outline the basic idea for producing some btrfs
> metadata via creating inlined-extent files.
> 
> My proposed fix for the issue is to add one more hardcoded refcount to
> this check to account for the caller having a refcount on the page.
> However, I am less familiar with the other caller into
> mapping_evict_folio in the page fault path, so I am concerned this fix
> will not work properly there, and would appreciate extra scrutiny there.
> 
> Fixes: e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  mm/truncate.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4d61fbdd4b2f..c710c84710b4 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -267,9 +267,17 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
>  		return 0;
>  	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>  		return 0;
> -	/* The refcount will be elevated if any page in the folio is mapped */
> +	/*
> +	 * The refcount will be elevated if any page in the folio is mapped.
> +	 *
> +	 * The refcounts break down as follows:
> +	 * 1 per mapped page
> +	 * 1 from folio_attach_private, if private is set
> +	 * 1 from allocating the page in the first place
> +	 * 1 from the caller
> +	 */

I think the above explanation is correct at least from my code
inspection. Most of the callers are related to memory failure. I would
reword the "1 per mapped page" to "1 per page in page cache" or
something as mapped here might mean mapped in page tables.

>  	if (folio_ref_count(folio) >
> -			folio_nr_pages(folio) + folio_has_private(folio) + 1)
> +			folio_nr_pages(folio) + folio_has_private(folio) + 1 + 1)
>  		return 0;
>  	if (!filemap_release_folio(folio, 0))
>  		return 0;
> -- 
> 2.46.0
> 
>
Matthew Wilcox Aug. 14, 2024, 3:15 a.m. UTC | #2
On Tue, Aug 13, 2024 at 12:58:09PM -0700, Shakeel Butt wrote:
> > +	/*
> > +	 * The refcount will be elevated if any page in the folio is mapped.
> > +	 *
> > +	 * The refcounts break down as follows:
> > +	 * 1 per mapped page
> > +	 * 1 from folio_attach_private, if private is set
> > +	 * 1 from allocating the page in the first place
> > +	 * 1 from the caller
> > +	 */
> 
> I think the above explanation is correct at least from my code
> inspection. Most of the callers are related to memory failure. I would
> reword the "1 per mapped page" to "1 per page in page cache" or
> something as mapped here might mean mapped in page tables.

It's not though.  The "1 from allocating the page in the first place"
is donated to the page cache.  It's late here and I don't have the
ability to work through what's really going on here.
Boris Burkov Aug. 14, 2024, 3:27 a.m. UTC | #3
On Wed, Aug 14, 2024 at 04:15:25AM +0100, Matthew Wilcox wrote:
> On Tue, Aug 13, 2024 at 12:58:09PM -0700, Shakeel Butt wrote:
> > > +	/*
> > > +	 * The refcount will be elevated if any page in the folio is mapped.
> > > +	 *
> > > +	 * The refcounts break down as follows:
> > > +	 * 1 per mapped page
> > > +	 * 1 from folio_attach_private, if private is set
> > > +	 * 1 from allocating the page in the first place
> > > +	 * 1 from the caller
> > > +	 */
> > 
> > I think the above explanation is correct at least from my code
> > inspection. Most of the callers are related to memory failure. I would
> > reword the "1 per mapped page" to "1 per page in page cache" or
> > something as mapped here might mean mapped in page tables.
> 
> It's not though.  The "1 from allocating the page in the first place"
> is donated to the page cache.  It's late here and I don't have the
> ability to work through what's really going on here.

Can you explain what you mean by "donated to the page cache" more
precisely?

Perhaps there is something better btrfs can do with its refcounting
as it calls alloc_pages_bulk_array, then filemap_add_folio, and finally
folio_attach_private. But I am not sure which of those refcounts we can
(or should?) drop.

Thanks,
Boris
Matthew Wilcox Aug. 14, 2024, 3:46 a.m. UTC | #4
On Tue, Aug 13, 2024 at 08:27:15PM -0700, Boris Burkov wrote:
> On Wed, Aug 14, 2024 at 04:15:25AM +0100, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2024 at 12:58:09PM -0700, Shakeel Butt wrote:
> > > > +	/*
> > > > +	 * The refcount will be elevated if any page in the folio is mapped.
> > > > +	 *
> > > > +	 * The refcounts break down as follows:
> > > > +	 * 1 per mapped page
> > > > +	 * 1 from folio_attach_private, if private is set
> > > > +	 * 1 from allocating the page in the first place
> > > > +	 * 1 from the caller
> > > > +	 */
> > > 
> > > I think the above explanation is correct at least from my code
> > > inspection. Most of the callers are related to memory failure. I would
> > > reword the "1 per mapped page" to "1 per page in page cache" or
> > > something as mapped here might mean mapped in page tables.
> > 
> > It's not though.  The "1 from allocating the page in the first place"
> > is donated to the page cache.  It's late here and I don't have the
> > ability to work through what's really going on here.
> 
> Can you explain what you mean by "donated to the page cache" more
> precisely?
> 
> Perhaps there is something better btrfs can do with its refcounting
> as it calls alloc_pages_bulk_array, then filemap_add_folio, and finally
> folio_attach_private. But I am not sure which of those refcounts we can
> (or should?) drop.

Look at how readahead works for normal files; ignore what btrfs is doing
because it's probably wrong.  I'm going to use the term "expected
refcount" because there may also be temporary speculative refcounts
from stale references (either GUP or pagecache).

                folio = filemap_alloc_folio(gfp_mask, 0);
(expected refcount 1)
                ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
(expected refcount 1 + nr_pages)
        read_pages(ractl);
                aops->readahead(rac);
... calls readahead_folio() which calls folio_put()
(expected refcount nr_pages)

if filesystem calls folio_attach_private(), add one to the expected
refcount.

That's it.  Folios in the pagecache should have a refcount of nr_pages +
1 if private data exists.  Every caller who has called filemap_get_folio()
has an extra refcount.  Every user mapping of a page adds one to the
refcount (and to the mapcount).

If btrfs superblocks have an extra refcount, they're wrong and should
have it put somewhere.


At some point, I intend to reduce the number of atomic operations we do
by having filemap_add_folio() increment by one fewer than it currently
does, and removing the folio_put() in readahead_folio().  I haven't been
brave enough to do that yet.

I also think we should not increment the refcount by nr_pages when we
add it to the page cache.  Incrementing by one should be sufficient.
And that would mean that we can just delete the "folio_ref_add()"
in __filemap_add_folio().
Boris Burkov Aug. 14, 2024, 4:23 a.m. UTC | #5
On Wed, Aug 14, 2024 at 04:46:13AM +0100, Matthew Wilcox wrote:
> On Tue, Aug 13, 2024 at 08:27:15PM -0700, Boris Burkov wrote:
> > On Wed, Aug 14, 2024 at 04:15:25AM +0100, Matthew Wilcox wrote:
> > > On Tue, Aug 13, 2024 at 12:58:09PM -0700, Shakeel Butt wrote:
> > > > > +	/*
> > > > > +	 * The refcount will be elevated if any page in the folio is mapped.
> > > > > +	 *
> > > > > +	 * The refcounts break down as follows:
> > > > > +	 * 1 per mapped page
> > > > > +	 * 1 from folio_attach_private, if private is set
> > > > > +	 * 1 from allocating the page in the first place
> > > > > +	 * 1 from the caller
> > > > > +	 */
> > > > 
> > > > I think the above explanation is correct at least from my code
> > > > inspection. Most of the callers are related to memory failure. I would
> > > > reword the "1 per mapped page" to "1 per page in page cache" or
> > > > something as mapped here might mean mapped in page tables.
> > > 
> > > It's not though.  The "1 from allocating the page in the first place"
> > > is donated to the page cache.  It's late here and I don't have the
> > > ability to work through what's really going on here.
> > 
> > Can you explain what you mean by "donated to the page cache" more
> > precisely?
> > 
> > Perhaps there is something better btrfs can do with its refcounting
> > as it calls alloc_pages_bulk_array, then filemap_add_folio, and finally
> > folio_attach_private. But I am not sure which of those refcounts we can
> > (or should?) drop.
> 
> Look at how readahead works for normal files; ignore what btrfs is doing
> because it's probably wrong.  I'm going to use the term "expected
> refcount" because there may also be temporary speculative refcounts
> from stale references (either GUP or pagecache).
> 
>                 folio = filemap_alloc_folio(gfp_mask, 0);
> (expected refcount 1)
>                 ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> (expected refcount 1 + nr_pages)
>         read_pages(ractl);
>                 aops->readahead(rac);
> ... calls readahead_folio() which calls folio_put()
> (expected refcount nr_pages)
> 
> if filesystem calls folio_attach_private(), add one to the expected
> refcount.
> 
> That's it.  Folios in the pagecache should have a refcount of nr_pages +
> 1 if private data exists.  Every caller who has called filemap_get_folio()
> has an extra refcount.  Every user mapping of a page adds one to the
> refcount (and to the mapcount).

Thank you for the extra explanation, that is very helpful.

> 
> If btrfs superblocks have an extra refcount, they're wrong and should
> have it put somewhere.

I suppose by analogy btrfs should do a put sometime after
filemap_add_folio of the metadata page.

I'll look into making that change instead of this, since it seems like
the expected refcount was correct after all and btrfs had an extra one.

> 
> 
> At some point, I intend to reduce the number of atomic operations we do
> by having filemap_add_folio() increment by one fewer than it currently
> does, and removing the folio_put() in readahead_folio().  I haven't been
> brave enough to do that yet.
> 
> I also think we should not increment the refcount by nr_pages when we
> add it to the page cache.  Incrementing by one should be sufficient.
> And that would mean that we can just delete the "folio_ref_add()"
> in __filemap_add_folio().
David Hildenbrand Aug. 20, 2024, 8 a.m. UTC | #6
On 13.08.24 20:25, Boris Burkov wrote:
> The commit e41c81d0d30e ("mm/truncate: Replace page_mapped() call in
> invalidate_inode_page()") replaced the page_mapped(page) check with a
> refcount check. However, this refcount check does not work as expected
> with drop_caches, at least for btrfs's metadata pages.
> 
> Btrfs has a per-sb metadata inode with cached pages, and when not in
> active use by btrfs, they have a refcount of 3. One from the initial
> call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
> from folio_attach_private. We would expect such pages to get dropped by
> drop_caches. However, drop_caches calls into mapping_evict_folio via
> mapping_try_invalidate which gets a reference on the folio with
> find_lock_entries(). As a result, these pages have a refcount of 4, and
> fail this check.
> 
> For what it's worth, such pages do get reclaimed under memory pressure,
> and if I change this refcount check to `if (folio_mapped(folio))`, they
> are dropped by drop_caches as expected.
> 
> The following script produces such pages and uses drgn to further
> analyze the state of the folios:
> https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh
> It should at least outline the basic idea for producing some btrfs
> metadata via creating inlined-extent files.
> 
> My proposed fix for the issue is to add one more hardcoded refcount to
> this check to account for the caller having a refcount on the page.
> However, I am less familiar with the other caller into
> mapping_evict_folio in the page fault path, so I am concerned this fix
> will not work properly there, and would appreciate extra scrutiny there.
> 
> Fixes: e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   mm/truncate.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4d61fbdd4b2f..c710c84710b4 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -267,9 +267,17 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
>   		return 0;
>   	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>   		return 0;
> -	/* The refcount will be elevated if any page in the folio is mapped */
> +	/*
> +	 * The refcount will be elevated if any page in the folio is mapped.
> +	 *
> +	 * The refcounts break down as follows:
> +	 * 1 per mapped page

Using "mapped" is confusing -- I would have expected a folio_mapcount() 
here.

Did you mean "1 reference from the pagecache to each page"?

> +	 * 1 from folio_attach_private, if private is set
> +	 * 1 from allocating the page in the first place
> +	 * 1 from the caller
> +	 */
>   	if (folio_ref_count(folio) >
> -			folio_nr_pages(folio) + folio_has_private(folio) + 1)
> +			folio_nr_pages(folio) + folio_has_private(folio) + 1 + 1)
>   		return 0;
>   	if (!filemap_release_folio(folio, 0))
>   		return 0;
David Hildenbrand Aug. 20, 2024, 2 p.m. UTC | #7
On 20.08.24 10:00, David Hildenbrand wrote:
> On 13.08.24 20:25, Boris Burkov wrote:
>> The commit e41c81d0d30e ("mm/truncate: Replace page_mapped() call in
>> invalidate_inode_page()") replaced the page_mapped(page) check with a
>> refcount check. However, this refcount check does not work as expected
>> with drop_caches, at least for btrfs's metadata pages.
>>
>> Btrfs has a per-sb metadata inode with cached pages, and when not in
>> active use by btrfs, they have a refcount of 3. One from the initial
>> call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
>> from folio_attach_private. We would expect such pages to get dropped by
>> drop_caches. However, drop_caches calls into mapping_evict_folio via
>> mapping_try_invalidate which gets a reference on the folio with
>> find_lock_entries(). As a result, these pages have a refcount of 4, and
>> fail this check.
>>
>> For what it's worth, such pages do get reclaimed under memory pressure,
>> and if I change this refcount check to `if (folio_mapped(folio))`, they
>> are dropped by drop_caches as expected.
>>
>> The following script produces such pages and uses drgn to further
>> analyze the state of the folios:
>> https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh
>> It should at least outline the basic idea for producing some btrfs
>> metadata via creating inlined-extent files.
>>
>> My proposed fix for the issue is to add one more hardcoded refcount to
>> this check to account for the caller having a refcount on the page.
>> However, I am less familiar with the other caller into
>> mapping_evict_folio in the page fault path, so I am concerned this fix
>> will not work properly there, and would appreciate extra scrutiny there.
>>
>> Fixes: e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
>> Signed-off-by: Boris Burkov <boris@bur.io>
>> ---
>>    mm/truncate.c | 12 ++++++++++--
>>    1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 4d61fbdd4b2f..c710c84710b4 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -267,9 +267,17 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
>>    		return 0;
>>    	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>>    		return 0;
>> -	/* The refcount will be elevated if any page in the folio is mapped */
>> +	/*
>> +	 * The refcount will be elevated if any page in the folio is mapped.
>> +	 *
>> +	 * The refcounts break down as follows:
>> +	 * 1 per mapped page
> 
> Using "mapped" is confusing -- I would have expected a folio_mapcount()
> here.
> 
> Did you mean "1 reference from the pagecache to each page"?

... and now I spotted that Willy had the same comment already, good :)

Patch

diff --git a/mm/truncate.c b/mm/truncate.c
index 4d61fbdd4b2f..c710c84710b4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -267,9 +267,17 @@  long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
 		return 0;
 	if (folio_test_dirty(folio) || folio_test_writeback(folio))
 		return 0;
-	/* The refcount will be elevated if any page in the folio is mapped */
+	/*
+	 * The refcount will be elevated if any page in the folio is mapped.
+	 *
+	 * The refcounts break down as follows:
+	 * 1 per mapped page
+	 * 1 from folio_attach_private, if private is set
+	 * 1 from allocating the page in the first place
+	 * 1 from the caller
+	 */
 	if (folio_ref_count(folio) >
-			folio_nr_pages(folio) + folio_has_private(folio) + 1)
+			folio_nr_pages(folio) + folio_has_private(folio) + 1 + 1)
 		return 0;
 	if (!filemap_release_folio(folio, 0))
 		return 0;