
btrfs: refactor alloc_extent_buffer() to allocate-then-attach method

Message ID ffeb6b667a9ff0cf161f7dcd82899114782c0834.1700609426.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: refactor alloc_extent_buffer() to allocate-then-attach method

Commit Message

Qu Wenruo Nov. 21, 2023, 11:35 p.m. UTC
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page at a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
  With the new folio infrastructure, find_or_create_page() is just
  redirected to filemap_get_folio().

- No way to support higher order (order >= 1) folios
  We cannot yet ask the filemap to give us a higher order folio.

This patch changes the workflow in the following way:

		Old		   |		new
-----------------------------------+-------------------------------------
                                   | ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
    p = find_or_create_page();     |     ret = filemap_add_folio();
    /* Attach page private */      |     /* Reuse page cache if needed */
    /* Reuse eb if needed */       |
				   |     /* Attach page private and
				   |        reuse eb if needed */
				   | }

This splits the page allocation and private attaching into two parts,
allowing future updates to each part more easily, and prepares for a
migration to the folio interfaces (especially for possible higher order
folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 173 +++++++++++++++++++++++++++++++------------
 1 file changed, 126 insertions(+), 47 deletions(-)

Comments

Josef Bacik Nov. 22, 2023, 2:14 p.m. UTC | #1
On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
> [...]
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 99cc16aed9d7..0ea65f248c15 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3084,6 +3084,14 @@ static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page)
>  static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *page)
>  {
>  	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	/*
> +	 * We can no longer use page->mapping reliably, as some extent buffer
> +	 * pages may not be mapped to btree_inode yet.
> +	 * Furthermore we have to handle dummy ebs during selftests, where
> +	 * btree_inode is not yet initialized.
> +	 */
> +	struct address_space *mapping = fs_info->btree_inode ?
> +					fs_info->btree_inode->i_mapping : NULL;

I don't understand this, this should only happen if we managed to get
PagePrivate set on the page, and in that case page->mapping is definitely
reliable.  We shouldn't be getting in here with a page that hasn't actually been
attached to the extent buffer, and if we are that needs to be fixed, we don't
need to be dealing with that case in this way.  Thanks,

Josef
David Sterba Nov. 22, 2023, 2:38 p.m. UTC | #2
On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
> -	for (i = 0; i < num_pages; i++, index++) {
> -		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
> -		if (!p) {
> -			exists = ERR_PTR(-ENOMEM);
> -			btrfs_free_subpage(prealloc);
> -			goto free_eb;
> +	/* Alloc all pages. */
> +	ret = btrfs_alloc_page_array(num_pages, eb->pages);

This looks promising.  Assuming everything else works, this can be
changed to do:

- alloc_pages(order), the optimistic fast path, contig pages right from
  the allocator, can fail
- fall back to btrfs_alloc_page_array(), which fills the array with
  order-0 pages, similar to what we have now; they could still be
  contiguous

I wonder if we can still keep the __GFP_NOFAIL for the fallback
allocation; it's there right now, seems to work on systems under
stress, and does not cause random failures due to ENOMEM.
Qu Wenruo Nov. 22, 2023, 8 p.m. UTC | #3
On 2023/11/23 00:44, Josef Bacik wrote:
> On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
>> [...]
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 99cc16aed9d7..0ea65f248c15 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3084,6 +3084,14 @@ static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page)
>>   static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *page)
>>   {
>>   	struct btrfs_fs_info *fs_info = eb->fs_info;
>> +	/*
>> +	 * We can no longer use page->mapping reliably, as some extent buffer
>> +	 * pages may not be mapped to btree_inode yet.
>> +	 * Furthermore we have to handle dummy ebs during selftests, where
>> +	 * btree_inode is not yet initialized.
>> +	 */
>> +	struct address_space *mapping = fs_info->btree_inode ?
>> +					fs_info->btree_inode->i_mapping : NULL;
>
> I don't understand this, this should only happen if we managed to get
> PagePrivate set on the page, and in that case page->mapping is definitely
> reliable.  We shouldn't be getting in here with a page that hasn't actually been
> attached to the extent buffer, and if we are that needs to be fixed, we don't
> need to be dealing with that case in this way.  Thanks,

The problem is, with the new way of allocating pages first, then attaching
them to the filemap, we can have a case where the extent buffer has 4
pages, but only the first page is attached to the filemap of the btree
inode.

In that case, we still want to take the btree inode's private_lock, but
page->mapping is still NULL for the remaining 3 pages.

Thanks,
Qu
>
> Josef
>
Qu Wenruo Nov. 22, 2023, 8:03 p.m. UTC | #4
On 2023/11/23 01:08, David Sterba wrote:
> On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
>> -	for (i = 0; i < num_pages; i++, index++) {
>> -		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>> -		if (!p) {
>> -			exists = ERR_PTR(-ENOMEM);
>> -			btrfs_free_subpage(prealloc);
>> -			goto free_eb;
>> +	/* Alloc all pages. */
>> +	ret = btrfs_alloc_page_array(num_pages, eb->pages);
>
> This looks promising.  Assuming everything else works, this can be
> changed to do:
>
> - alloc_pages(order), the optimistic fast path, contig pages right from
>    the allocator, can fail
> - fall back to btrfs_alloc_page_array(), which fills the array with
>    order-0 pages, similar to what we have now; they could still be
>    contiguous

Yep, that's the next step planned.

Although I'd prefer to fully migrate to the folio interfaces, e.g.
eb::pages[] -> eb::folios[].

But that's not really a big deal considering that in most cases
folios/pages are interchangeable.

>
> I wonder if we can still keep the __GFP_NOFAIL for the fallback
> allocation; it's there right now, seems to work on systems under
> stress, and does not cause random failures due to ENOMEM.
>
Oh, I forgot the __NOFAIL gfp flags, that's not hard to fix, just
re-introduce the gfp flags to btrfs_alloc_page_array().
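
Something like this, roughly (an untested sketch on top of the current
helper; the extra_gfp parameter name is made up here):

	int btrfs_alloc_page_array(unsigned int nr_pages,
				   struct page **page_array, gfp_t extra_gfp)
	{
		unsigned int allocated;

		for (allocated = 0; allocated < nr_pages;) {
			unsigned int last = allocated;

			allocated = alloc_pages_bulk_array(GFP_NOFS | extra_gfp,
							   nr_pages, page_array);
			if (allocated == nr_pages)
				return 0;

			/* No progress at all means a genuine -ENOMEM. */
			if (allocated == last) {
				for (unsigned int i = 0; i < allocated; i++) {
					__free_page(page_array[i]);
					page_array[i] = NULL;
				}
				return -ENOMEM;
			}
			memalloc_retry_wait(GFP_NOFS);
		}
		return 0;
	}

Callers that need the old behavior would then pass __GFP_NOFAIL, and
everyone else 0.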

Thanks,
Qu
Qu Wenruo Nov. 27, 2023, 5:10 a.m. UTC | #5
On 2023/11/23 06:33, Qu Wenruo wrote:
[...]
>> I wonder if we can still keep the __GFP_NOFAIL for the fallback
>> allocation; it's there right now, seems to work on systems under
>> stress, and does not cause random failures due to ENOMEM.
>>
> Oh, I forgot the __NOFAIL gfp flags, that's not hard to fix, just
> re-introduce the gfp flags to btrfs_alloc_page_array().

BTW, I think it's a good time to start a new discussion on whether we
should go __GFP_NOFAIL.
(Although I have updated the patch to keep the GFP_NOFAIL behavior)

I totally understand that we need some memory for tree blocks during
transaction commit and other critical sections.

And it's not that uncommon to see __GFP_NOFAIL usage in other mainstream
filesystems.

But my concern is, we also have a lot of other memory allocations which
can lead to a lot of problems, like btrfs_csum_one_bio() or even
join_transaction().

I doubt btrfs (or any other filesystem) would be to blame if we're
really running out of memory.
Shouldn't the memory hungry user space programs be killed first, long
before we fail to allocate memory?


Furthermore, at least for btrfs, I don't think we would hit a situation
where memory allocation failure for metadata would lead to any data
corruption.
The worst case is we hit transaction abort, and the fs flips RO.

Thus I'm wondering if we really need __NOFAIL for btrfs?

Thanks,
Qu
>
> Thanks,
> Qu
>
Josef Bacik Nov. 27, 2023, 4:19 p.m. UTC | #6
On Mon, Nov 27, 2023 at 03:40:41PM +1030, Qu Wenruo wrote:
> On 2023/11/23 06:33, Qu Wenruo wrote:
> [...]
> 
> Thus I'm wondering if we really need __NOFAIL for btrfs?

It's my preference that we don't use GFP_NOFAIL at all, anywhere.  We don't
really have this option for some things, mostly lock_extent/unlock_extent, but
for extent buffers we check for errors here and generally can safely handle
ENOMEM in these cases.  I would prefer to drop it.  Thanks,

Josef
Josef Bacik Nov. 27, 2023, 4:28 p.m. UTC | #7
On Thu, Nov 23, 2023 at 06:30:05AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/23 00:44, Josef Bacik wrote:
> > On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
> > > [...]
> > > +	struct address_space *mapping = fs_info->btree_inode ?
> > > +					fs_info->btree_inode->i_mapping : NULL;
> > 
> > I don't understand this, this should only happen if we managed to get
> > PagePrivate set on the page, and in that case page->mapping is definitely
> > reliable.  We shouldn't be getting in here with a page that hasn't actually been
> > attached to the extent buffer, and if we are that needs to be fixed, we don't
> > need to be dealing with that case in this way.  Thanks,
> 
> The problem is, with the new way of allocating pages first, then attaching
> them to the filemap, we can have a case where the extent buffer has 4
> pages, but only the first page is attached to the filemap of the btree
> inode.
> 
> In that case, we still want to take the btree inode's private_lock, but
> page->mapping is still NULL for the remaining 3 pages.
> 

In this case I think we just change the error handling at the end, as we already
have to do something special anyways, so we now would do

for (int i = 0; i < attached; i++) {
	unlock_page(eb->pages[i]);
	detach_extent_buffer_page(eb, eb->pages[i]);
}

for (int i = 0; i < num_pages; i++) {
	if (!eb->pages[i])
		break;
	put_page(eb->pages[i]);
}

and then we can leave detach_extent_buffer_page on its own.  Thanks,

Josef
Qu Wenruo Nov. 27, 2023, 10:17 p.m. UTC | #8
On 2023/11/28 02:58, Josef Bacik wrote:
> On Thu, Nov 23, 2023 at 06:30:05AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2023/11/23 00:44, Josef Bacik wrote:
>>> On Wed, Nov 22, 2023 at 10:05:04AM +1030, Qu Wenruo wrote:
>>>> [...]
>>>> +	struct address_space *mapping = fs_info->btree_inode ?
>>>> +					fs_info->btree_inode->i_mapping : NULL;
>>>
>>> I don't understand this, this should only happen if we managed to get
>>> PagePrivate set on the page, and in that case page->mapping is definitely
>>> reliable.  We shouldn't be getting in here with a page that hasn't actually been
>>> attached to the extent buffer, and if we are that needs to be fixed, we don't
>>> need to be dealing with that case in this way.  Thanks,
>>
>> The problem is, with the new way of allocating pages first, then attaching
>> them to the filemap, we can have a case where the extent buffer has 4
>> pages, but only the first page is attached to the filemap of the btree
>> inode.
>>
>> In that case, we still want to take the btree inode's private_lock, but
>> page->mapping is still NULL for the remaining 3 pages.
>>
>
> In this case I think we just change the error handling at the end, as we already
> have to do something special anyways, so we now would do
>
> for (int i = 0; i < attached; i++) {
> 	unlock_page(eb->pages[i]);
> 	detach_extent_buffer_page(eb, eb->pages[i]);
> }
>
> for (int i = 0; i < num_pages; i++) {
> 	if (!eb->pages[i])
> 		break;
> 	put_page(eb->pages[i]);
> }
>
> and then we can leave detach_extent_buffer_page on its own.  Thanks,

This sounds pretty reasonable; I'll do more tests to make sure this works.

Thanks,
Qu
>
> Josef
>
David Sterba Nov. 28, 2023, 4:26 p.m. UTC | #9
On Mon, Nov 27, 2023 at 03:40:41PM +1030, Qu Wenruo wrote:
> On 2023/11/23 06:33, Qu Wenruo wrote:
> [...]
> >> I wonder if we can still keep the __GFP_NOFAIL for the fallback
> >> allocation; it's there right now, seems to work on systems under
> >> stress, and does not cause random failures due to ENOMEM.
> >>
> > Oh, I forgot the __NOFAIL gfp flags, that's not hard to fix, just
> > re-introduce the gfp flags to btrfs_alloc_page_array().
> 
> BTW, I think it's a good time to start a new discussion on whether we
> should go __GFP_NOFAIL.
> (Although I have updated the patch to keep the GFP_NOFAIL behavior)
> 
> I totally understand that we need some memory for tree blocks during
> transaction commit and other critical sections.
> 
> And it's not that uncommon to see __GFP_NOFAIL usage in other mainstream
> filesystems.

The use of NOFAIL is either carefully evaluated or it's there for
historical reasons. The comment for the flag says as much,
https://elixir.bootlin.com/linux/latest/source/include/linux/gfp_types.h#L198
and I know MM people see the flag as problematic and that it should not
be used if possible.

> But my concern is, we also have a lot of other memory allocations which
> can lead to a lot of problems, like btrfs_csum_one_bio() or even
> join_transaction().

While I agree that there are many places that can fail due to memory
allocations, the extent buffer requires a whole 4 pages, while other
structures can be taken from the generic slabs or our named caches. The
latter have a lower chance of failing.

> I doubt btrfs (or any other filesystem) would be to blame if we're
> really running out of memory.

Well, people blame btrfs for everything.

> Shouldn't the memory hungry user space programs be killed first, long
> before we fail to allocate memory?

That's up to the allocator, and I think it does a good job of providing
the memory to the kernel rather than to user space programs.

We do the critical allocations as GFP_NOFS, which so far provides the "do
not fail" guarantees. It's a long-running discussion,
https://lwn.net/Articles/653573/ (2015). We can let many allocations
fail with a fallback, but still a lot of them would lead to a transaction
abort. And as Josef said, there are some that can't fail because they're
too deep or there's no clear exit path.

> Furthermore, at least for btrfs, I don't think we would hit a situation
> where memory allocation failure for metadata would lead to any data
> corruption.
> The worst case is we hit transaction abort, and the fs flips RO.

Yeah, corruption can't happen as long as we have all the error handling
in place and the transaction abort as the ultimate fallback.

> Thus I'm wondering if we really need __NOFAIL for btrfs?

It's hard to say if or when the NOFAIL semantics actually apply. Let's
say there are applications doing metadata operations, the system is
under load, memory is freed slowly by writing data, etc. An application
that waits inside the eb allocation will continue eventually; without
the NOFAIL it would exit early.

As a middle ground, we may want something like "try hard" that would not
fail too soon but could eventually. That's __GFP_RETRY_MAYFAIL.

Right now there are several changes around the extent buffers, so I'd like
to do the conversion first and then replace/drop the NOFAIL flag, so we
don't mix too many changes in one release. The extent buffers are
critical, so one step at a time, with lots of testing.
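
With a flags argument on btrfs_alloc_page_array() (as sketched earlier
in the thread; parameter name assumed), the two behaviors would differ
only at the call site:

	/* Today: block until the allocation succeeds. */
	ret = btrfs_alloc_page_array(num_pages, eb->pages, __GFP_NOFAIL);

	/*
	 * "Try hard" middle ground: retry harder than the default, but
	 * eventually give up and return -ENOMEM.
	 */
	ret = btrfs_alloc_page_array(num_pages, eb->pages, __GFP_RETRY_MAYFAIL);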
Qu Wenruo Nov. 28, 2023, 8:06 p.m. UTC | #10
On 2023/11/29 02:56, David Sterba wrote:
> On Mon, Nov 27, 2023 at 03:40:41PM +1030, Qu Wenruo wrote:
>> On 2023/11/23 06:33, Qu Wenruo wrote:
>> [...]
>>>> I wonder if we can still keep the __GFP_NOFAIL for the fallback
>>>> allocation; it's there right now, seems to work on systems under
>>>> stress, and does not cause random failures due to ENOMEM.
>>>>
>>> Oh, I forgot the __NOFAIL gfp flags, that's not hard to fix, just
>>> re-introduce the gfp flags to btrfs_alloc_page_array().
>>
>> BTW, I think it's a good time to start a new discussion on whether we
>> should go __GFP_NOFAIL.
>> (Although I have updated the patch to keep the GFP_NOFAIL behavior)
>>
>> I totally understand that we need some memory for tree blocks during
>> transaction commit and other critical sections.
>>
>> And it's not that uncommon to see __GFP_NOFAIL usage in other mainstream
>> filesystems.
> 
> The use of NOFAIL is either carefully evaluated or it's there for
> historical reasons. The comment for the flag says as much,
> https://elixir.bootlin.com/linux/latest/source/include/linux/gfp_types.h#L198
> and I know MM people see the flag as problematic and that it should not
> be used if possible.
> 
>> But my concern is, we also have a lot of other memory allocations which
>> can lead to a lot of problems, like btrfs_csum_one_bio() or even
>> join_transaction().
> 
> While I agree that there are many places that can fail due to memory
> allocations, the extent buffer requires a whole 4 pages, while other
> structures can be taken from the generic slabs or our named caches. The
> latter have a lower chance of failing.
> 
>> I doubt btrfs (or any other filesystem) would be to blame if we're
>> really running out of memory.
> 
> Well, people blame btrfs for everything.
> 
>> Shouldn't the memory hungry user space programs be killed first, long
>> before we fail to allocate memory?
> 
> That's up to the allocator, and I think it does a good job of providing
> the memory to the kernel rather than to user space programs.
> 
> We do the critical allocations as GFP_NOFS, which so far provides the "do
> not fail" guarantees. It's a long-running discussion,
> https://lwn.net/Articles/653573/ (2015). We can let many allocations
> fail with a fallback, but still a lot of them would lead to a transaction
> abort. And as Josef said, there are some that can't fail because they're
> too deep or there's no clear exit path.

Yep, for those call sites (e.g. the extent io tree) we still need NOFAIL
until we add error handling for all the call sites.

> 
>> Furthermore, at least for btrfs, I don't think we would hit a situation
>> where memory allocation failure for metadata would lead to any data
>> corruption.
>> The worst case is we hit transaction abort, and the fs flips RO.
> 
> Yeah, corruption can't happen as long as we have all the error handling
> in place and the transaction abort as the ultimate fallback.
> 
>> Thus I'm wondering if we really need __NOFAIL for btrfs?
> 
> It's hard to say if or when the NOFAIL semantics actually apply. Let's
> say there are applications doing metadata operations, the system is
> under load, memory is freed slowly by writing data, etc. An application
> that waits inside the eb allocation will continue eventually; without
> the NOFAIL it would exit early.
> 
> As a middle ground, we may want something like "try hard" that would not
> fail too soon but could eventually. That's __GFP_RETRY_MAYFAIL.

This sounds good, although I'd say the MM is already doing too good a
job, thus I'm not sure if we even need the extra retry.

> 
> Right now there are several changes around the extent buffers, so I'd like
> to do the conversion first and then replace/drop the NOFAIL flag, so we
> don't mix too many changes in one release. The extent buffers are
> critical, so one step at a time, with lots of testing.

This sounds very reasonable.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 99cc16aed9d7..0ea65f248c15 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3084,6 +3084,14 @@  static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page)
 static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *page)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
+	/*
+	 * We can no longer use page->mapping reliably, as some extent buffer
+	 * pages may not be mapped to btree_inode yet.
+	 * Furthermore we have to handle dummy ebs during selftests, where
+	 * btree_inode is not yet initialized.
+	 */
+	struct address_space *mapping = fs_info->btree_inode ?
+					fs_info->btree_inode->i_mapping : NULL;
 	const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
 	struct folio *folio = page_folio(page);
 
@@ -3092,11 +3100,11 @@  static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *pag
 	 * be done under the private_lock.
 	 */
 	if (mapped)
-		spin_lock(&page->mapping->private_lock);
+		spin_lock(&mapping->private_lock);
 
 	if (!folio_test_private(folio)) {
 		if (mapped)
-			spin_unlock(&page->mapping->private_lock);
+			spin_unlock(&mapping->private_lock);
 		return;
 	}
 
@@ -3120,7 +3128,7 @@  static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *pag
 			folio_detach_private(folio);
 		}
 		if (mapped)
-			spin_unlock(&page->mapping->private_lock);
+			spin_unlock(&mapping->private_lock);
 		return;
 	}
 
@@ -3143,7 +3151,7 @@  static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *pag
 	if (!page_range_has_eb(fs_info, page))
 		btrfs_detach_subpage(fs_info, page);
 
-	spin_unlock(&page->mapping->private_lock);
+	spin_unlock(&mapping->private_lock);
 }
 
 /* Release all pages attached to the extent buffer */
@@ -3474,16 +3482,80 @@  static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
 	return 0;
 }
 
+/*
+ * Return 0 if eb->pages[i] is attached to the btree inode successfully.
+ * Return >0 if there is already another extent buffer for the range,
+ * and @found_eb_ret will be updated.
+ */
+static int attach_eb_page_to_filemap(struct extent_buffer *eb, int i,
+				     struct extent_buffer **found_eb_ret)
+{
+
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct address_space *mapping = fs_info->btree_inode->i_mapping;
+	const unsigned long index = eb->start >> PAGE_SHIFT;
+	struct folio *existing_folio;
+	int ret;
+
+	ASSERT(found_eb_ret);
+
+	/* Caller should ensure the page exists. */
+	ASSERT(eb->pages[i]);
+
+retry:
+	ret = filemap_add_folio(mapping, page_folio(eb->pages[i]), index + i,
+			GFP_NOFS | __GFP_NOFAIL);
+	if (!ret)
+		return 0;
+
+	existing_folio = filemap_lock_folio(mapping, index + i);
+	/* The page cache only exists for a very short time, just retry. */
+	if (IS_ERR(existing_folio))
+		goto retry;
+
+	/*
+	 * For now, we should only have single-page folios for btree
+	 * inode.
+	 */
+	ASSERT(folio_nr_pages(existing_folio) == 1);
+
+	if (fs_info->nodesize < PAGE_SIZE) {
+		/*
+		 * We're going to reuse the existing page, can
+		 * drop our page and subpage structure now.
+		 */
+		__free_page(eb->pages[i]);
+		eb->pages[i] = folio_page(existing_folio, 0);
+	} else {
+		struct extent_buffer *existing_eb;
+
+		existing_eb = grab_extent_buffer(fs_info,
+					folio_page(existing_folio, 0));
+		if (existing_eb) {
+			/*
+			 * The extent buffer still exists, we can use
+			 * it directly.
+			 */
+			*found_eb_ret = existing_eb;
+			folio_unlock(existing_folio);
+			folio_put(existing_folio);
+			return 1;
+		}
+		/* The extent buffer no longer exists, we can reuse the folio. */
+		__free_page(eb->pages[i]);
+		eb->pages[i] = folio_page(existing_folio, 0);
+	}
+	return 0;
+}
+
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start, u64 owner_root, int level)
 {
 	unsigned long len = fs_info->nodesize;
 	int num_pages;
-	int i;
-	unsigned long index = start >> PAGE_SHIFT;
+	int attached = 0;
 	struct extent_buffer *eb;
-	struct extent_buffer *exists = NULL;
-	struct page *p;
+	struct extent_buffer *existing_eb = NULL;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct btrfs_subpage *prealloc = NULL;
 	u64 lockdep_owner = owner_root;
@@ -3534,31 +3606,39 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	if (fs_info->nodesize < PAGE_SIZE) {
 		prealloc = btrfs_alloc_subpage(fs_info, BTRFS_SUBPAGE_METADATA);
 		if (IS_ERR(prealloc)) {
-			exists = ERR_CAST(prealloc);
-			goto free_eb;
+			ret = PTR_ERR(prealloc);
+			goto out;
 		}
 	}
 
-	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
-		if (!p) {
-			exists = ERR_PTR(-ENOMEM);
-			btrfs_free_subpage(prealloc);
-			goto free_eb;
+	/* Alloc all pages. */
+	ret = btrfs_alloc_page_array(num_pages, eb->pages);
+	if (ret < 0) {
+		ret = -ENOMEM;
+		btrfs_free_subpage(prealloc);
+		goto out;
+	}
+
+	/* Attach all pages to the filemap. */
+	for (int i = 0; i < num_pages; i++) {
+		struct page *eb_page;
+
+		ret = attach_eb_page_to_filemap(eb, i, &existing_eb);
+		if (ret > 0) {
+			ASSERT(existing_eb);
+			goto out;
 		}
 
+		/*
+		 * eb->pages[] is only reliable after attach_eb_page_to_filemap(),
+		 * as we may choose to reuse the existing page cache and free
+		 * the allocated page.
+		 */
+		eb_page = eb->pages[i];
+
 		spin_lock(&mapping->private_lock);
-		exists = grab_extent_buffer(fs_info, p);
-		if (exists) {
-			spin_unlock(&mapping->private_lock);
-			unlock_page(p);
-			put_page(p);
-			mark_extent_buffer_accessed(exists, p);
-			btrfs_free_subpage(prealloc);
-			goto free_eb;
-		}
 		/* Should not fail, as we have preallocated the memory */
-		ret = attach_extent_buffer_page(eb, p, prealloc);
+		ret = attach_extent_buffer_page(eb, eb_page, prealloc);
 		ASSERT(!ret);
 		/*
 		 * To inform we have extra eb under allocation, so that
@@ -3569,22 +3649,18 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * detach_extent_buffer_page().
 		 * Thus needs no special handling in error path.
 		 */
-		btrfs_page_inc_eb_refs(fs_info, p);
+		btrfs_page_inc_eb_refs(fs_info, eb_page);
 		spin_unlock(&mapping->private_lock);
 
-		WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len));
-		eb->pages[i] = p;
-
+		WARN_ON(btrfs_page_test_dirty(fs_info, eb_page, eb->start, eb->len));
 		/*
 		 * Check if the current page is physically contiguous with previous eb
 		 * page.
 		 */
-		if (i && eb->pages[i - 1] + 1 != p)
+		if (i && eb->pages[i - 1] + 1 != eb_page)
 			page_contig = false;
-
-		if (!btrfs_page_test_uptodate(fs_info, p, eb->start, eb->len))
+		if (!btrfs_page_test_uptodate(fs_info, eb_page, eb->start, eb->len))
 			uptodate = 0;
-
 		/*
 		 * We can't unlock the pages just yet since the extent buffer
 		 * hasn't been properly inserted in the radix tree, this
@@ -3595,15 +3671,15 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	}
 	if (uptodate)
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+
 	/* All pages are physically contiguous, can skip cross page handling. */
 	if (page_contig)
 		eb->addr = page_address(eb->pages[0]) + offset_in_page(eb->start);
+
 again:
 	ret = radix_tree_preload(GFP_NOFS);
-	if (ret) {
-		exists = ERR_PTR(ret);
-		goto free_eb;
-	}
+	if (ret)
+		goto out;
 
 	spin_lock(&fs_info->buffer_lock);
 	ret = radix_tree_insert(&fs_info->buffer_radix,
@@ -3611,9 +3687,9 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	spin_unlock(&fs_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
-		exists = find_extent_buffer(fs_info, start);
-		if (exists)
-			goto free_eb;
+		existing_eb = find_extent_buffer(fs_info, start);
+		if (existing_eb)
+			goto out;
 		else
 			goto again;
 	}
@@ -3626,19 +3702,22 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	 * btree_release_folio will correctly detect that a page belongs to a
 	 * live buffer and won't free them prematurely.
 	 */
-	for (i = 0; i < num_pages; i++)
+	for (int i = 0; i < num_pages; i++)
 		unlock_page(eb->pages[i]);
 	return eb;
 
-free_eb:
+out:
 	WARN_ON(!atomic_dec_and_test(&eb->refs));
-	for (i = 0; i < num_pages; i++) {
-		if (eb->pages[i])
-			unlock_page(eb->pages[i]);
+	for (int i = 0; i < attached; i++) {
+		ASSERT(eb->pages[i]);
+		unlock_page(eb->pages[i]);
 	}
 
 	btrfs_release_extent_buffer(eb);
-	return exists;
+	if (ret < 0)
+		return ERR_PTR(ret);
+	ASSERT(existing_eb);
+	return existing_eb;
 }
 
 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)