
[v2] btrfs: fix a possible race window when allocating new extent buffers

Message ID 0f003d96bcb54c9c1afd5512739645bbdddb701b.1717637062.git.wqu@suse.com (mailing list archive)
State New, archived
Series [v2] btrfs: fix a possible race window when allocating new extent buffers

Commit Message

Qu Wenruo June 6, 2024, 1:31 a.m. UTC
[BUG]
Since v6.8 there have been rare kernel crashes reported by various users,
and most of the reports share the same bad page state error messages, like
this:

 BUG: Bad page state in process kswapd0  pfn:d6e840
 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
 pfn:0xd6e840
 aops:btree_aops ino:1
 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
 page_type: 0xffffffff()
 raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: non-NULL mapping

[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changed the sequence when allocating a new
extent buffer.

Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure safe modification of folio::private
(which, for regular sectorsize, is a pointer to the extent buffer).
After that refactor, grab_extent_buffer() is no longer called under the
lock, which can lead to the following race:

Thread A is trying to allocate an extent buffer at bytenr X, backed by
four 4K pages, while thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).

           Thread A                |                 Thread B
-----------------------------------+-------------------------------------
                                   | btree_release_folio()
                                   | | This is for the page at X + 4K,
                                   | | not page X.
                                   | |
alloc_extent_buffer()              | |- release_extent_buffer()
|- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
|  page at bytenr X (the first     | |  |
|  page).                          | |  |
|  Which returned -EEXIST.         | |  |
|                                  | |  |
|- filemap_lock_folio()            | |  |
|  Returned the first page locked. | |  |
|                                  | |  |
|- grab_extent_buffer()            | |  |
|  |- atomic_inc_not_zero()        | |  |
|  |  Returned false               | |  |
|  |- folio_detach_private()       | |  |- folio_detach_private() for X
|     |- folio_test_private()      | |     |- folio_test_private()
|     |  Returned true             | |     |  Returned true
|     |- folio_put()               | |     |- folio_put()

Now we have a double folio_put() on the same folio (the one at bytenr
X), leading to a refcount underflow of that folio and eventually
triggering the bad page state report on page->mapping.

The condition is not that easy to hit:

- The release must be triggered for a middle page of an eb
  If the release is on the first page of an eb, the page lock would kick
  in and prevent the race.

- folio_detach_private() has a very small race window
  It's only between folio_test_private() and folio_clear_private(), as
  the sketch below shows.
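
  For reference, folio_detach_private() looks roughly like the
  following (paraphrased from include/linux/pagemap.h, simplified for
  illustration):

	static inline void *folio_detach_private(struct folio *folio)
	{
		void *data = folio_get_private(folio);

		/* Both racing threads can pass this check... */
		if (!folio_test_private(folio))
			return NULL;
		/* ...before either one clears the flag here. */
		folio_clear_private(folio);
		folio->private = NULL;
		/* So both drop a reference: refcount underflow. */
		folio_put(folio);

		return data;
	}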

That is exactly the race mapping->i_private_lock is meant to prevent,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") totally screwed this up.

At that time, I thought the page lock would kick in, as
filemap_release_folio() also requires the page to be locked, but forgot
that filemap_release_folio() only locks one page, not all pages of an
extent buffer.

[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.

Furthermore, to prevent future problems, add an extra
lockdep_assert_held() in grab_extent_buffer() to ensure we're holding
the proper lock.
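
To illustrate the pattern (a minimal sketch with a made-up helper name,
not the literal patch): every folio::private transition now happens with
the lock held, so the test and the detach form one critical section with
respect to btree_release_folio():

	/* Hypothetical helper, for illustration only. */
	static void detach_btree_folio_private(struct address_space *mapping,
					       struct folio *folio)
	{
		spin_lock(&mapping->i_private_lock);
		/*
		 * No other thread can detach folio::private between the
		 * test and the detach while we hold i_private_lock.
		 */
		if (folio_test_private(folio))
			folio_detach_private(folio);
		spin_unlock(&mapping->i_private_lock);
	}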

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- With the help of Chris Mason, it's now clear where and how the race
  happened

- Fix a bug where @existing_folio was not properly reset to NULL,
  which could easily crash the kernel if filemap_lock_folio() returned
  an error.

v1:
- Remove the analysis of the race window
  It appeared that both the allocation part (filemap_lock_folio() in
  alloc_extent_buffer()) and the folio release part
  (filemap_release_folio()) require the folio to be locked, making a
  race between eb allocation and release look impossible.

- Add an extra lockdep_assert_held() for grab_extent_buffer()

The lenses alone cost 500...
-By comparison, the eye exam and the frames are practically free-
---
 fs/btrfs/extent_io.c | 59 ++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

Comments

Filipe Manana June 6, 2024, 4:22 p.m. UTC | #1
On Thu, Jun 6, 2024 at 2:34 AM Qu Wenruo <wqu@suse.com> wrote:
> [...]

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

Josef Bacik June 6, 2024, 4:52 p.m. UTC | #2
On Thu, Jun 06, 2024 at 11:01:51AM +0930, Qu Wenruo wrote:
> [...]

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks everybody who helped with this,

Josef
David Sterba June 6, 2024, 7:27 p.m. UTC | #3
On Thu, Jun 06, 2024 at 11:01:51AM +0930, Qu Wenruo wrote:
> [...]

Thanks. I'll pick the patch into a branch for the next pull request; the
fix has survived enough testing and we should get it to stable without
further delays. I've edited the subject and changelog a bit: the problem
is really the folio private protection. It is a race window fix, but that
alone does not say much about the cause. I've also added the reproducer
script from Chris.
Qu Wenruo June 7, 2024, 4:27 a.m. UTC | #4
On 2024/6/7 04:57, David Sterba wrote:
[...]
>
> Thanks. I'll pick the patch into a branch for the next pull request; the
> fix has survived enough testing and we should get it to stable without
> further delays. I've edited the subject and changelog a bit: the problem
> is really the folio private protection. It is a race window fix, but that
> alone does not say much about the cause. I've also added the reproducer
> script from Chris.
>

Mind pushing the updated version to for-next?

I'd like to restart the testing of larger metadata folios.
Ironically, if we force larger metadata folios (one folio per metadata
block), the race would also be solved, as the folio lock would then kick
in to prevent it.

Thanks,
Qu
David Sterba June 7, 2024, 1:56 p.m. UTC | #5
On Fri, Jun 07, 2024 at 01:57:38PM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/6/7 04:57, David Sterba wrote:
> [...]
> 
> Mind pushing the updated version to for-next?

Now updated. Otherwise, if you need to find it, the branch is
https://github.com/kdave/btrfs-devel/tree/misc-6.10, or it's in the
linux-next source branch:
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=next-fixes

> I'd like to restart the testing of larger metadata folios.
> Ironically, if we force larger metadata folios (one folio per metadata
> block), the race would also be solved, as the folio lock would then kick
> in to prevent it.

Yeah, but this would be an accidental fix. If we didn't understand that
the bug was in our implementation, I'm guessing large folios would be
blamed, and this wouldn't make finding the problem easier. We need to be
sure that the easy page->folio conversion is solid. You can start
testing it, but the actual switch may take a release or two.
Chris Mason June 7, 2024, 3:37 p.m. UTC | #6
On 6/7/24 9:56 AM, David Sterba wrote:
> On Fri, Jun 07, 2024 at 01:57:38PM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/6/7 04:57, David Sterba wrote:
>> [...]
>>
>> Mind pushing the updated version to for-next?
> 
> Now updated. Otherwise, if you need to find it, the branch is
> https://github.com/kdave/btrfs-devel/tree/misc-6.10, or it's in the
> linux-next source branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=next-fixes
> 
>> I'd like to restart the testing of larger metadata folios.
>> Ironically, if we force larger metadata folios (one folio per metadata
>> block), the race would also be solved, as the folio lock would then kick
>> in to prevent it.
> 
> Yeah, but this would be an accidental fix. If we didn't understand that
> the bug was in our implementation, I'm guessing large folios would be
> blamed, and this wouldn't make finding the problem easier. We need to be
> sure that the easy page->folio conversion is solid. You can start
> testing it, but the actual switch may take a release or two.

I was going to suggest that we wait a bit as well.  The new code is
holding up well to testing, and we're going to roll it to prod w/our 6.9
kernel, so it'll end up with lots of validation over the next 10 weeks
or so.

I'd definitely start on the large folios now, but also give the code we
have today a chance to soak before making the switch.

-chris

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c74f7df2e8b..af5b7fa7da99 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2913,6 +2913,8 @@  static struct extent_buffer *grab_extent_buffer(
 	struct folio *folio = page_folio(page);
 	struct extent_buffer *exists;
 
+	lockdep_assert_held(&page->mapping->i_private_lock);
+
 	/*
 	 * For subpage case, we completely rely on radix tree to ensure we
 	 * don't try to insert two ebs for the same bytenr.  So here we always
@@ -2980,13 +2982,14 @@  static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
  * The caller needs to free the existing folios and retry using the same order.
  */
 static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
+				      struct btrfs_subpage *prealloc,
 				      struct extent_buffer **found_eb_ret)
 {
 
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	const unsigned long index = eb->start >> PAGE_SHIFT;
-	struct folio *existing_folio;
+	struct folio *existing_folio = NULL;
 	int ret;
 
 	ASSERT(found_eb_ret);
@@ -2998,12 +3001,14 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
 				GFP_NOFS | __GFP_NOFAIL);
 	if (!ret)
-		return 0;
+		goto finish;
 
 	existing_folio = filemap_lock_folio(mapping, index + i);
 	/* The page cache only exists for a very short time, just retry. */
-	if (IS_ERR(existing_folio))
+	if (IS_ERR(existing_folio)) {
+		existing_folio = NULL;
 		goto retry;
+	}
 
 	/* For now, we should only have single-page folios for btree inode. */
 	ASSERT(folio_nr_pages(existing_folio) == 1);
@@ -3014,14 +3019,16 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		return -EAGAIN;
 	}
 
-	if (fs_info->nodesize < PAGE_SIZE) {
+finish:
+	spin_lock(&mapping->i_private_lock);
+	if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
 		/*
-		 * We're going to reuse the existing page, can drop our page
-		 * and subpage structure now.
+		 * We're going to reuse the existing page, can drop our folio
+		 * now.
 		 */
 		__free_page(folio_page(eb->folios[i], 0));
 		eb->folios[i] = existing_folio;
-	} else {
+	} else if (existing_folio) {
 		struct extent_buffer *existing_eb;
 
 		existing_eb = grab_extent_buffer(fs_info,
@@ -3029,6 +3036,7 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		if (existing_eb) {
 			/* The extent buffer still exists, we can use it directly. */
 			*found_eb_ret = existing_eb;
+			spin_unlock(&mapping->i_private_lock);
 			folio_unlock(existing_folio);
 			folio_put(existing_folio);
 			return 1;
@@ -3037,6 +3045,22 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		__free_page(folio_page(eb->folios[i], 0));
 		eb->folios[i] = existing_folio;
 	}
+	eb->folio_size = folio_size(eb->folios[i]);
+	eb->folio_shift = folio_shift(eb->folios[i]);
+	/* Should not fail, as we have preallocated the memory */
+	ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
+	ASSERT(!ret);
+	/*
+	 * To inform we have extra eb under allocation, so that
+	 * detach_extent_buffer_page() won't release the folio private
+	 * when the eb hasn't yet been inserted into radix tree.
+	 *
+	 * The ref will be decreased when the eb released the page, in
+	 * detach_extent_buffer_page().
+	 * Thus needs no special handling in error path.
+	 */
+	btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
+	spin_unlock(&mapping->i_private_lock);
 	return 0;
 }
 
@@ -3048,7 +3072,6 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	int attached = 0;
 	struct extent_buffer *eb;
 	struct extent_buffer *existing_eb = NULL;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct btrfs_subpage *prealloc = NULL;
 	u64 lockdep_owner = owner_root;
 	bool page_contig = true;
@@ -3114,7 +3137,7 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	for (int i = 0; i < num_folios; i++) {
 		struct folio *folio;
 
-		ret = attach_eb_folio_to_filemap(eb, i, &existing_eb);
+		ret = attach_eb_folio_to_filemap(eb, i, prealloc, &existing_eb);
 		if (ret > 0) {
 			ASSERT(existing_eb);
 			goto out;
@@ -3151,24 +3174,6 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * and free the allocated page.
 		 */
 		folio = eb->folios[i];
-		eb->folio_size = folio_size(folio);
-		eb->folio_shift = folio_shift(folio);
-		spin_lock(&mapping->i_private_lock);
-		/* Should not fail, as we have preallocated the memory */
-		ret = attach_extent_buffer_folio(eb, folio, prealloc);
-		ASSERT(!ret);
-		/*
-		 * To inform we have extra eb under allocation, so that
-		 * detach_extent_buffer_page() won't release the folio private
-		 * when the eb hasn't yet been inserted into radix tree.
-		 *
-		 * The ref will be decreased when the eb released the page, in
-		 * detach_extent_buffer_page().
-		 * Thus needs no special handling in error path.
-		 */
-		btrfs_folio_inc_eb_refs(fs_info, folio);
-		spin_unlock(&mapping->i_private_lock);
-
 		WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
 
 		/*