Message ID: cover.1689143654.git.wqu@suse.com
Series: btrfs: preparation patches for the incoming metadata folio conversion
On 7/12/23 02:37, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Define write_extent_buffer_fsid/chunk_tree_uuid() as inline helpers
>
> [BACKGROUND]
>
> Recently I have been checking the feasibility of converting the metadata
> handling to a folio based solution.
>
> The best part of using a single folio for metadata is that we can get
> rid of the complexity of cross-page handling; everything becomes a
> single memory operation on a contiguous memory range.
>
> [PITFALLS]
>
> One of the biggest problems for the metadata folio conversion is that
> we still need the current page based solution (or folios with order 0)
> as a fallback when we can not get a high order folio.
>
> In that case it would be hell to handle the four different combinations
> (folio/folio, folio/page, page/folio, page/page) for extent buffer
> helpers involving two extent buffers.
>
> Although there are some new ideas on how to handle metadata memory
> (e.g. going fully vmalloc-ed memory), reducing the open-coded memory
> handling for metadata should always be a good starting point.
>
> [OBJECTIVE]
>
> So this patchset is the preparation to reduce direct page operations
> for metadata.
>
> The patchset does this mostly by concentrating the operations into the
> common helpers, write_extent_buffer() and read_extent_buffer().
>
> For bitmap operations it's much more complex, thus this patchset
> refactors them completely into a 3 part solution:
>
> - Handle the first byte
> - Handle the byte aligned ranges
> - Handle the last byte
>
> This needs more thorough testing (which I failed several times during
> development) to prevent regressions.
>
> Finally there is only one function which can not be properly migrated,
> memmove_extent_buffer(), which has to use memmove() calls and thus must
> keep per-page mapping handling.
>
> Thankfully, if we go folio in the end, the folio based handling would
> just be a single memmove(), so it won't be too much of a burden.
>
> Qu Wenruo (6):
>   btrfs: tests: enhance extent buffer bitmap tests
>   btrfs: refactor extent buffer bitmaps operations
>   btrfs: use write_extent_buffer() to implement write_extent_buffer_*id()
>   btrfs: refactor memcpy_extent_buffer()
>   btrfs: refactor copy_extent_buffer_full()
>   btrfs: call copy_extent_buffer_full() inside btrfs_clone_extent_buffer()
>
>  fs/btrfs/extent_io.c             | 224 +++++++++++++------------------
>  fs/btrfs/extent_io.h             |  19 ++-
>  fs/btrfs/tests/extent-io-tests.c | 161 ++++++++++++++--------
>  3 files changed, 215 insertions(+), 189 deletions(-)

For the series:

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
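The 3 part split described in the cover letter (partial first byte, byte-aligned middle handled with a plain memory operation, partial last byte) can be sketched in plain userspace C. The helper name `bitmap_set_range` and the layout are illustrative only, not the btrfs helpers themselves:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch of the 3-part bitmap approach: handle the
 * (possibly partial) first byte, then whole bytes via memset(),
 * then the (possibly partial) last byte.
 */
static void bitmap_set_range(uint8_t *map, unsigned long start,
			     unsigned long nbits)
{
	unsigned long end = start + nbits;	/* exclusive */
	unsigned long first_byte = start / 8;
	unsigned long last_byte = (end - 1) / 8;

	if (first_byte == last_byte) {
		/* The whole range fits inside one byte. */
		uint8_t mask = (uint8_t)(((1u << nbits) - 1) << (start % 8));

		map[first_byte] |= mask;
		return;
	}

	/* 1) Partial first byte: bits (start % 8) .. 7. */
	map[first_byte] |= (uint8_t)(0xffu << (start % 8));

	/* 2) Byte-aligned middle: whole bytes can use plain memset(). */
	if (last_byte - first_byte > 1)
		memset(map + first_byte + 1, 0xff, last_byte - first_byte - 1);

	/* 3) Partial last byte: bits 0 .. (end - 1) % 8. */
	map[last_byte] |= (uint8_t)(0xffu >> (7 - ((end - 1) % 8)));
}
```

The same split applies to clearing and to the set/clear helpers operating on extent buffer pages; only the middle part changes from memset() to a cross-page write helper.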
On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> One of the biggest problems for the metadata folio conversion is that
> we still need the current page based solution (or folios with order 0)
> as a fallback when we can not get a high order folio.

Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with a
maximum of 64k (order 4).  IIRC we should be able to get them pretty
reliably.

If not, the best thing is to just use a virtually contiguous allocation
as the fallback, i.e. use vm_map_ram.  That's what XFS uses in its
buffer cache, and it already did so before it stopped using the page
cache to back its buffer cache, something I plan to do for the btrfs
buffer cache as well, as the page cache algorithms tend to not work very
well for buffer based metadata, never mind that there is an incredible
amount of complex code just working around the interactions.
On 2023/7/13 00:41, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
>> One of the biggest problems for the metadata folio conversion is that
>> we still need the current page based solution (or folios with order 0)
>> as a fallback when we can not get a high order folio.
>
> Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with a
> maximum of 64k (order 4).  IIRC we should be able to get them pretty
> reliably.

If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
with that.

> If not, the best thing is to just use a virtually contiguous allocation
> as the fallback, i.e. use vm_map_ram.

That's also what Sweet Tea Dorminy mentioned, and I believe it's the
correct way to go (as the fallback).

Although my concern is my lack of experience with MM code, and whether
those pages can still be attached to an address space (with PagePrivate
set).

> That's what XFS uses in its buffer cache, and it already did so before
> it stopped using the page cache to back its buffer cache, something I
> plan to do for the btrfs buffer cache as well, as the page cache
> algorithms tend to not work very well for buffer based metadata, never
> mind that there is an incredible amount of complex code just working
> around the interactions.

Thus we have this preparation patchset as the first step.  It should
help no matter which next step we take.

Thanks,
Qu
On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
>> Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with a
>> maximum of 64k (order 4).  IIRC we should be able to get them pretty
>> reliably.
>
> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
> with that.

I think that is the aim.  I'm not entirely sure if we are entirely there
yet, thus the Ccs.

>> If not, the best thing is to just use a virtually contiguous allocation
>> as the fallback, i.e. use vm_map_ram.
>
> That's also what Sweet Tea Dorminy mentioned, and I believe it's the
> correct way to go (as the fallback).
>
> Although my concern is my lack of experience with MM code, and whether
> those pages can still be attached to an address space (with PagePrivate
> set).

At least they could back in the day when XFS did exactly that.  In fact
that was the use case for which I originally added vmap back in 2002.
On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> On 2023/7/13 00:41, Christoph Hellwig wrote:
>> Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with a
>> maximum of 64k (order 4).  IIRC we should be able to get them pretty
>> reliably.
>
> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
> with that.

I have mentioned my concerns about the allocation problems with orders
higher than 0 in the past.  The allocator gives some guarantees about
not failing for certain levels, currently order 1 (mm/fail_page_alloc.c,
fail_page_alloc.min_order = 1).

Per the comment in page_alloc.c:rmqueue():

2814         /*
2815          * We most definitely don't want callers attempting to
2816          * allocate greater than order-1 page units with __GFP_NOFAIL.
2817          */
2818         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

For allocations with higher order, eg. 4 to match the default 16K nodes,
this increases memory pressure and can trigger compaction (the logic
around PAGE_ALLOC_COSTLY_ORDER, which is 3).

>> If not, the best thing is to just use a virtually contiguous allocation
>> as the fallback, i.e. use vm_map_ram.

So we can allocate order-0 pages and then map them to virtual addresses,
which needs manipulation of the PTEs (page table entries) and requires
additional memory.  This is what xfs does in
fs/xfs/xfs_buf.c:_xfs_buf_map_pages(); it needs some care with aliased
memory, so vm_unmap_aliases() is required and brings some overhead, and
at the end vm_unmap_ram() needs to be called, another overhead but
probably bearable.

With all that in place there would be a contiguous memory range
representing the metadata, so a simple memcpy() can be done.  Sure, with
higher overhead and decreased reliability due to potentially failing
memory allocations - for metadata operations.

Compare that to what we have:

Pages are allocated as order 0, so there's a much higher chance to get
them under pressure, without increasing the pressure otherwise.  We
don't need any virtual mappings.  The cost is that we have to iterate
the pages and do the partial copying ourselves, but this is hidden in
helpers.

We have a different usage pattern for the metadata buffers than xfs, so
what it does with vmapped contiguous buffers may not be easily
transferable to btrfs and could bring us new problems.

The conversion to folios will happen eventually, though I don't want to
sacrifice reliability just for API convenience.  First the conversion
should be done 1:1 with pages and folios both order 0, before switching
to some higher order allocations hidden behind API calls.
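The "iterate the pages and do the partial copying ourselves" cost mentioned above can be sketched in userspace C. This is a hypothetical minimal analogue of what helpers like read_extent_buffer() hide (names and `PG_SIZE` are stand-ins, not the btrfs code): a buffer backed by discontiguous pages forces every copy to be split at page boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG_SIZE 4096u	/* stand-in for PAGE_SIZE */

/*
 * Hypothetical sketch of a cross-page read: the extent buffer is backed
 * by an array of discontiguous pages, so the copy must be split at
 * every page boundary instead of being a single memcpy().
 */
static void eb_read(char *const *pages, unsigned long start,
		    void *dst, unsigned long len)
{
	char *out = dst;

	while (len > 0) {
		unsigned long idx = start / PG_SIZE;
		unsigned long off = start % PG_SIZE;
		/* Copy at most up to the end of the current page. */
		unsigned long cur = PG_SIZE - off < len ? PG_SIZE - off : len;

		memcpy(out, pages[idx] + off, cur);
		out += cur;
		start += cur;
		len -= cur;
	}
}
```

With a single high order folio (or a vmapped range) the loop collapses to one memcpy(), which is exactly the simplification both approaches are after.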
On 2023/7/13 19:26, David Sterba wrote:
> On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
>> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
>> with that.
[...]
> The conversion to folios will happen eventually, though I don't want to
> sacrifice reliability just for API convenience.  First the conversion
> should be done 1:1 with pages and folios both order 0, before switching
> to some higher order allocations hidden behind API calls.

In fact, I have another solution as a middle ground before bringing
folios into the situation.

Check if the pages are already physically contiguous.  If so, everything
can go without any cross-page handling.

If not, we can either keep the current cross-page handling, or migrate
to virtually contiguous mapped pages.

Currently around 50~66% of eb pages are already allocated physically
contiguous.  If we can get rid of the cross-page handling for more than
half of the ebs, it's already a win.

For the vmapped pages, I'm not sure about the overhead, but I can try to
go that path and check the result.

Thanks,
Qu
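The contiguity check Qu proposes can be sketched as follows. This is a hypothetical userspace stand-in: in the kernel the comparison would be on `page_to_pfn()` of consecutive pages, here the pfns are passed in as a plain array.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the middle-ground check: if every page frame
 * number follows the previous one, the backing memory is physically
 * contiguous and the whole range can be handled with a single
 * memcpy()/memset() instead of cross-page handling.
 */
static bool pages_physically_contiguous(const unsigned long *pfns,
					unsigned int nr_pages)
{
	for (unsigned int i = 1; i < nr_pages; i++)
		if (pfns[i] != pfns[i - 1] + 1)
			return false;
	return true;
}
```

A buffer that passes this check can take the fast single-operation path; the cross-page loop (or a vmapped range) remains the fallback for the rest.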
On Thu, Jul 13, 2023 at 07:41:53PM +0800, Qu Wenruo wrote:
> In fact, I have another solution as a middle ground before bringing
> folios into the situation.
>
> Check if the pages are already physically contiguous.  If so, everything
> can go without any cross-page handling.
>
> If not, we can either keep the current cross-page handling, or migrate
> to virtually contiguous mapped pages.
>
> Currently around 50~66% of eb pages are already allocated physically
> contiguous.

Memory fragmentation becomes a problem over time on systems running for
weeks/months, then the contiguous ranges will become scarce.  So if you
measure that on a system with a lot of memory and for a short time, then
of course this will reach a high rate of contiguous pages.
On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Define write_extent_buffer_fsid/chunk_tree_uuid() as inline helpers
[...]
> Qu Wenruo (6):
>   btrfs: tests: enhance extent buffer bitmap tests
>   btrfs: refactor extent buffer bitmaps operations
>   btrfs: use write_extent_buffer() to implement write_extent_buffer_*id()
>   btrfs: refactor memcpy_extent_buffer()
>   btrfs: refactor copy_extent_buffer_full()
>   btrfs: call copy_extent_buffer_full() inside btrfs_clone_extent_buffer()

Added to misc-next, with some fixups, thanks.  How far we'll get with
the folio conversions or other page contiguity improvements depends, but
this patchset is fairly independent.
On Thu, Jul 13, 2023 at 02:09:35PM +0200, David Sterba wrote:
> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> Added to misc-next

And removed again, it explodes right before the first test:

BTRFS: device fsid 4e9cf0f7-cdc4-4e38-9e59-de4d88122ee9 devid 1 transid 6 /dev/vdb scanned by mkfs.btrfs (13714)
BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device vdb): using free space tree
BTRFS info (device vdb): auto enabling async discard
BTRFS info (device vdb): checking UUID tree
------------[ cut here ]------------
WARNING: CPU: 3 PID: 13739 at fs/btrfs/extent-tree.c:3026 __btrfs_free_extent+0x9ac/0x1280 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash zstd_common loop
CPU: 3 PID: 13739 Comm: umount Not tainted 6.5.0-rc1-default+ #2126
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
RIP: 0010:__btrfs_free_extent+0x9ac/0x1280 [btrfs]
RSP: 0018:ffff8880031c78a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88802ec71708 RCX: ffffffffc065e9ba
RDX: dffffc0000000000 RSI: ffffffffc063f610 RDI: ffff888026511130
RBP: ffff888002734000 R08: 0000000000000000 R09: ffffed1000638eff
R10: ffff8880031c77ff R11: 0000000000000001 R12: 0000000000000001
R13: ffff8880058522b8 R14: ffff8880265110e0 R15: 0000000001d24000
FS: 00007fb5c22e9800(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2b194fcfc4 CR3: 000000002f32a000 CR4: 00000000000006a0
Call Trace:
 <TASK>
 ? __warn+0xa1/0x200
 ? __btrfs_free_extent+0x9ac/0x1280 [btrfs]
 ? report_bug+0x207/0x270
 ? handle_bug+0x65/0x90
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __btrfs_free_extent+0x39a/0x1280 [btrfs]
 ? unlock_up+0x160/0x370 [btrfs]
 ? __btrfs_free_extent+0x9ac/0x1280 [btrfs]
 ? __btrfs_free_extent+0x39a/0x1280 [btrfs]
 ? lookup_extent_backref+0xd0/0xd0 [btrfs]
 ? __lock_release.isra.0+0x14e/0x510
 ? reacquire_held_locks+0x280/0x280
 run_delayed_tree_ref+0x10b/0x2d0 [btrfs]
 btrfs_run_delayed_refs_for_head+0x630/0x960 [btrfs]
 __btrfs_run_delayed_refs+0xce/0x160 [btrfs]
 btrfs_run_delayed_refs+0xe7/0x2a0 [btrfs]
 commit_cowonly_roots+0x3f1/0x4c0 [btrfs]
 ? trace_btrfs_transaction_commit+0xd0/0xd0 [btrfs]
 ? btrfs_commit_transaction+0xbbe/0x17e0 [btrfs]
 btrfs_commit_transaction+0xc13/0x17e0 [btrfs]
 ? cleanup_transaction+0x640/0x640 [btrfs]
 ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
 sync_filesystem+0xd3/0x100
 generic_shutdown_super+0x44/0x1f0
 kill_anon_super+0x1e/0x40
 btrfs_kill_super+0x25/0x30 [btrfs]
 deactivate_locked_super+0x4c/0xc0
 cleanup_mnt+0x13a/0x1f0
 task_work_run+0xf2/0x170
 ? task_work_cancel+0x20/0x20
 ? mark_held_locks+0x1a/0x80
 exit_to_user_mode_prepare+0x16c/0x170
 syscall_exit_to_user_mode+0x19/0x50
 do_syscall_64+0x49/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fb5c250f4bb
RSP: 002b:00007ffeee578518 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 000055bf227429f0 RCX: 00007fb5c250f4bb
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf22742c20
RBP: 000055bf22742b08 R08: 0000000000000073 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000055bf22742c20 R14: 0000000000000000 R15: 00007ffeee57b084
 </TASK>
irq event stamp: 11109
hardirqs last enabled at (11119): [<ffffffff841678a2>] __up_console_sem+0x52/0x60
hardirqs last disabled at (11130): [<ffffffff84167887>] __up_console_sem+0x37/0x60
softirqs last enabled at (11084): [<ffffffff84cc910b>] __do_softirq+0x31b/0x5ae
softirqs last disabled at (11079): [<ffffffff840b5b09>] irq_exit_rcu+0xa9/0x100
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
BTRFS: Transaction aborted (error -117)
WARNING: CPU: 3 PID: 13739 at fs/btrfs/extent-tree.c:3027 __btrfs_free_extent+0x10ff/0x1280 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash zstd_common loop
CPU: 3 PID: 13739 Comm: umount Tainted: G        W          6.5.0-rc1-default+ #2126
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
RIP: 0010:__btrfs_free_extent+0x10ff/0x1280 [btrfs]
RSP: 0018:ffff8880031c78a8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88802ec71708 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffffff841007a8 RDI: ffffffff87c9e0e0
RBP: ffff888002734000 R08: 0000000000000001 R09: ffffed1000638eba
R10: ffff8880031c75d7 R11: 0000000000000001 R12: 0000000000000001
R13: ffff8880058522b8 R14: ffff8880265110e0 R15: 0000000001d24000
FS: 00007fb5c22e9800(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2b194fcfc4 CR3: 000000002f32a000 CR4: 00000000000006a0
Call Trace:
 <TASK>
 ? __warn+0xa1/0x200
 ? __btrfs_free_extent+0x10ff/0x1280 [btrfs]
 ? report_bug+0x207/0x270
 ? handle_bug+0x65/0x90
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? preempt_count_sub+0x18/0xc0
 ? __btrfs_free_extent+0x10ff/0x1280 [btrfs]
 ? __btrfs_free_extent+0x10ff/0x1280 [btrfs]
 ? lookup_extent_backref+0xd0/0xd0 [btrfs]
 ? __lock_release.isra.0+0x14e/0x510
 ? reacquire_held_locks+0x280/0x280
 run_delayed_tree_ref+0x10b/0x2d0 [btrfs]
 btrfs_run_delayed_refs_for_head+0x630/0x960 [btrfs]
 __btrfs_run_delayed_refs+0xce/0x160 [btrfs]
 btrfs_run_delayed_refs+0xe7/0x2a0 [btrfs]
 commit_cowonly_roots+0x3f1/0x4c0 [btrfs]
 ? trace_btrfs_transaction_commit+0xd0/0xd0 [btrfs]
 ? btrfs_commit_transaction+0xbbe/0x17e0 [btrfs]
 btrfs_commit_transaction+0xc13/0x17e0 [btrfs]
 ? cleanup_transaction+0x640/0x640 [btrfs]
 ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
 sync_filesystem+0xd3/0x100
 generic_shutdown_super+0x44/0x1f0
 kill_anon_super+0x1e/0x40
 btrfs_kill_super+0x25/0x30 [btrfs]
 deactivate_locked_super+0x4c/0xc0
 cleanup_mnt+0x13a/0x1f0
 task_work_run+0xf2/0x170
 ? task_work_cancel+0x20/0x20
 ? mark_held_locks+0x1a/0x80
 exit_to_user_mode_prepare+0x16c/0x170
 syscall_exit_to_user_mode+0x19/0x50
 do_syscall_64+0x49/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fb5c250f4bb
RSP: 002b:00007ffeee578518 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 000055bf227429f0 RCX: 00007fb5c250f4bb
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf22742c20
RBP: 000055bf22742b08 R08: 0000000000000073 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000055bf22742c20 R14: 0000000000000000 R15: 00007ffeee57b084
 </TASK>
irq event stamp: 11925
hardirqs last enabled at (11935): [<ffffffff841678a2>] __up_console_sem+0x52/0x60
hardirqs last disabled at (11946): [<ffffffff84167887>] __up_console_sem+0x37/0x60
softirqs last enabled at (11084): [<ffffffff84cc910b>] __do_softirq+0x31b/0x5ae
softirqs last disabled at (11079): [<ffffffff840b5b09>] irq_exit_rcu+0xa9/0x100
---[ end trace 0000000000000000 ]---
BTRFS: error (device vdb: state A) in __btrfs_free_extent:3027: errno=-117 Filesystem corrupted
BTRFS info (device vdb: state EA): forced readonly
BTRFS info (device vdb: state EA): leaf 30474240 gen 7 total ptrs 16 free space 15382 owner 2
BTRFS info (device vdb: state EA): refs 3 lock_owner 13739 current 13739
	item 0 key (13631488 192 8388608) itemoff 16259 itemsize 24
		block group used 0 chunk_objectid 256 flags 1
	item 1 key (22020096 192 8388608) itemoff 16235 itemsize 24
		block group used 16384 chunk_objectid 256 flags 34
	item 2 key (22036480 169 0) itemoff 16202 itemsize 33
		extent refs 1 gen 6 flags 2
		ref#0: tree block backref root 3
	item 3 key (30408704 169 0) itemoff 16169 itemsize 33
		extent refs 1 gen 6 flags 2
		ref#0: tree block backref root 2
	item 4 key (30408704 192 268435456) itemoff 16145 itemsize 24
		block group used 131072 chunk_objectid 256 flags 36
	item 5 key (30425088 169 0) itemoff 16112 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 5
	item 6 key (30441472 169 0) itemoff 16079 itemsize 33
		extent refs 1 gen 7 flags 2
		ref#0: tree block backref root 1
	item 7 key (30457856 169 0) itemoff 16046 itemsize 33
		extent refs 1 gen 7 flags 2
		ref#0: tree block backref root 4
	item 8 key (30474240 169 0) itemoff 16013 itemsize 33
		extent refs 1 gen 7 flags 2
		ref#0: tree block backref root 2
	item 9 key (30490624 169 0) itemoff 15980 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
	item 10 key (30507008 169 0) itemoff 15947 itemsize 33
		extent refs 1 gen 7 flags 2
		ref#0: tree block backref root 10
	item 11 key (30523392 169 0) itemoff 15914 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
	item 12 key (30539776 169 0) itemoff 15881 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
	item 13 key (30556160 169 0) itemoff 15848 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
	item 14 key (30572544 169 0) itemoff 15815 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
	item 15 key (30588928 169 0) itemoff 15782 itemsize 33
		extent refs 1 gen 5 flags 2
		ref#0: tree block backref root 7
BTRFS critical (device vdb: state EA): unable to find ref byte nr 30556160 parent 0 root 4 owner 0 offset 0 slot 14
BTRFS error (device vdb: state EA): failed to run delayed ref for logical 30556160 num_bytes 16384 type 176 action 2 ref_mod 1: -2
BTRFS: error (device vdb: state EA) in btrfs_run_delayed_refs:2102: errno=-2 No such entry
BTRFS warning (device vdb: state EA): Skipping commit of aborted transaction.
BTRFS: error (device vdb: state EA) in cleanup_transaction:1977: errno=-2 No such entry
On 2023/7/14 00:39, David Sterba wrote:
> On Thu, Jul 13, 2023 at 02:09:35PM +0200, David Sterba wrote:
>> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
>> Added to misc-next
>
> And removed again, it explodes right before the first test:

Weird, it passed my local btrfs/* tests.

> BTRFS: device fsid 4e9cf0f7-cdc4-4e38-9e59-de4d88122ee9 devid 1 transid 6 /dev/vdb scanned by mkfs.btrfs (13714)
> ------------[ cut here ]------------
> WARNING: CPU: 3 PID: 13739 at fs/btrfs/extent-tree.c:3026 __btrfs_free_extent+0x9ac/0x1280 [btrfs]
[...]
2 > ref#0: tree block backref root 3 > item 3 key (30408704 169 0) itemoff 16169 itemsize 33 > extent refs 1 gen 6 flags 2 > ref#0: tree block backref root 2 > item 4 key (30408704 192 268435456) itemoff 16145 itemsize 24 > block group used 131072 chunk_objectid 256 flags 36 > item 5 key (30425088 169 0) itemoff 16112 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 5 > item 6 key (30441472 169 0) itemoff 16079 itemsize 33 > extent refs 1 gen 7 flags 2 > ref#0: tree block backref root 1 > item 7 key (30457856 169 0) itemoff 16046 itemsize 33 > extent refs 1 gen 7 flags 2 > ref#0: tree block backref root 4 > item 8 key (30474240 169 0) itemoff 16013 itemsize 33 > extent refs 1 gen 7 flags 2 > ref#0: tree block backref root 2 > item 9 key (30490624 169 0) itemoff 15980 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 > item 10 key (30507008 169 0) itemoff 15947 itemsize 33 > extent refs 1 gen 7 flags 2 > ref#0: tree block backref root 10 > item 11 key (30523392 169 0) itemoff 15914 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 > item 12 key (30539776 169 0) itemoff 15881 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 > item 13 key (30556160 169 0) itemoff 15848 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 > item 14 key (30572544 169 0) itemoff 15815 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 > item 15 key (30588928 169 0) itemoff 15782 itemsize 33 > extent refs 1 gen 5 flags 2 > ref#0: tree block backref root 7 This looks like an error in memmove_extent_buffer() which I intentionally didn't touch. Anyway I'll try rebase and more tests. Can you put your modified commits in an external branch so I can inherit all your modifications? 
Thanks, Qu > BTRFS critical (device vdb: state EA): unable to find ref byte nr 30556160 parent 0 root 4 owner 0 offset 0 slot 14 > BTRFS error (device vdb: state EA): failed to run delayed ref for logical 30556160 num_bytes 16384 type 176 action 2 ref_mod 1: -2 > BTRFS: error (device vdb: state EA) in btrfs_run_delayed_refs:2102: errno=-2 No such entry > BTRFS warning (device vdb: state EA): Skipping commit of aborted transaction. > BTRFS: error (device vdb: state EA) in cleanup_transaction:1977: errno=-2 No such entry
On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: > On 2023/7/14 00:39, David Sterba wrote: > > ref#0: tree block backref root 7 > > item 14 key (30572544 169 0) itemoff 15815 itemsize 33 > > extent refs 1 gen 5 flags 2 > > ref#0: tree block backref root 7 > > item 15 key (30588928 169 0) itemoff 15782 itemsize 33 > > extent refs 1 gen 5 flags 2 > > ref#0: tree block backref root 7 > > This looks like an error in memmove_extent_buffer() which I > intentionally didn't touch. > > Anyway I'll try rebase and more tests. > > Can you put your modified commits in an external branch so I can inherit > all your modifications? First I saw the crashes with the modified patches but the report is from what you sent to the mailinglist so I can eliminate error on my side.
On 2023/7/14 06:03, David Sterba wrote: > On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: >> On 2023/7/14 00:39, David Sterba wrote: >>> ref#0: tree block backref root 7 >>> item 14 key (30572544 169 0) itemoff 15815 itemsize 33 >>> extent refs 1 gen 5 flags 2 >>> ref#0: tree block backref root 7 >>> item 15 key (30588928 169 0) itemoff 15782 itemsize 33 >>> extent refs 1 gen 5 flags 2 >>> ref#0: tree block backref root 7 >> >> This looks like an error in memmove_extent_buffer() which I >> intentionally didn't touch. >> >> Anyway I'll try rebase and more tests. >> >> Can you put your modified commits in an external branch so I can inherit >> all your modifications? > > First I saw the crashes with the modified patches but the report is from > what you sent to the mailinglist so I can eliminate error on my side. Still, a branch would help a lot, since you won't want to re-do the usual modifications (grammar, comments, etc.). Thanks, Qu
On Fri, Jul 14, 2023 at 08:09:16AM +0800, Qu Wenruo wrote: > > > On 2023/7/14 06:03, David Sterba wrote: > > On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: > >> On 2023/7/14 00:39, David Sterba wrote: > >>> ref#0: tree block backref root 7 > >>> item 14 key (30572544 169 0) itemoff 15815 itemsize 33 > >>> extent refs 1 gen 5 flags 2 > >>> ref#0: tree block backref root 7 > >>> item 15 key (30588928 169 0) itemoff 15782 itemsize 33 > >>> extent refs 1 gen 5 flags 2 > >>> ref#0: tree block backref root 7 > >> > >> This looks like an error in memmove_extent_buffer() which I > >> intentionally didn't touch. > >> > >> Anyway I'll try rebase and more tests. > >> > >> Can you put your modified commits in an external branch so I can inherit > >> all your modifications? > > > > First I saw the crashes with the modified patches but the report is from > > what you sent to the mailinglist so I can eliminate error on my side. > > Still a branch would help a lot, as you won't want to re-do the usual > modification (like grammar, comments etc). Branch ext/qu-eb-page-clanups-updated-broken at github.
On 2023/7/14 08:26, David Sterba wrote: > On Fri, Jul 14, 2023 at 08:09:16AM +0800, Qu Wenruo wrote: >> >> >> On 2023/7/14 06:03, David Sterba wrote: >>> On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: >>>> On 2023/7/14 00:39, David Sterba wrote: >>>>> ref#0: tree block backref root 7 >>>>> item 14 key (30572544 169 0) itemoff 15815 itemsize 33 >>>>> extent refs 1 gen 5 flags 2 >>>>> ref#0: tree block backref root 7 >>>>> item 15 key (30588928 169 0) itemoff 15782 itemsize 33 >>>>> extent refs 1 gen 5 flags 2 >>>>> ref#0: tree block backref root 7 >>>> >>>> This looks like an error in memmove_extent_buffer() which I >>>> intentionally didn't touch. >>>> >>>> Anyway I'll try rebase and more tests. >>>> >>>> Can you put your modified commits in an external branch so I can inherit >>>> all your modifications? >>> >>> First I saw the crashes with the modified patches but the report is from >>> what you sent to the mailinglist so I can eliminate error on my side. >> >> Still a branch would help a lot, as you won't want to re-do the usual >> modification (like grammar, comments etc). > > Branch ext/qu-eb-page-clanups-updated-broken at github. Already running the auto group with that branch, and no explosion so far (btrfs/004 failed to mount with -o atime though). Any extra setup needed to trigger the failure? Thanks, Qu
On Fri, Jul 14, 2023 at 09:58:00AM +0800, Qu Wenruo wrote: > > > On 2023/7/14 08:26, David Sterba wrote: > > On Fri, Jul 14, 2023 at 08:09:16AM +0800, Qu Wenruo wrote: > >> > >> > >> On 2023/7/14 06:03, David Sterba wrote: > >>> On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: > >>>> On 2023/7/14 00:39, David Sterba wrote: > >>>>> ref#0: tree block backref root 7 > >>>>> item 14 key (30572544 169 0) itemoff 15815 itemsize 33 > >>>>> extent refs 1 gen 5 flags 2 > >>>>> ref#0: tree block backref root 7 > >>>>> item 15 key (30588928 169 0) itemoff 15782 itemsize 33 > >>>>> extent refs 1 gen 5 flags 2 > >>>>> ref#0: tree block backref root 7 > >>>> > >>>> This looks like an error in memmove_extent_buffer() which I > >>>> intentionally didn't touch. > >>>> > >>>> Anyway I'll try rebase and more tests. > >>>> > >>>> Can you put your modified commits in an external branch so I can inherit > >>>> all your modifications? > >>> > >>> First I saw the crashes with the modified patches but the report is from > >>> what you sent to the mailinglist so I can eliminate error on my side. > >> > >> Still a branch would help a lot, as you won't want to re-do the usual > >> modification (like grammar, comments etc). > > > > Branch ext/qu-eb-page-clanups-updated-broken at github. > > Already running the auto group with that branch, and no explosion so far > (btrfs/004 failed to mount with -o atime though). > > Any extra setup needed to trigger the failure? I'm not aware of anything different than usual. Patches applied to git, built, updated VM and started. I had another branch built and tested and it finished the fstests. I can at least bisect which patch does it.
On 2023/7/14 18:03, David Sterba wrote: > On Fri, Jul 14, 2023 at 09:58:00AM +0800, Qu Wenruo wrote: >> >> >> On 2023/7/14 08:26, David Sterba wrote: >>> On Fri, Jul 14, 2023 at 08:09:16AM +0800, Qu Wenruo wrote: >>>> >>>> >>>> On 2023/7/14 06:03, David Sterba wrote: >>>>> On Fri, Jul 14, 2023 at 05:30:33AM +0800, Qu Wenruo wrote: >>>>>> On 2023/7/14 00:39, David Sterba wrote: >>>>>>> ref#0: tree block backref root 7 >>>>>>> item 14 key (30572544 169 0) itemoff 15815 itemsize 33 >>>>>>> extent refs 1 gen 5 flags 2 >>>>>>> ref#0: tree block backref root 7 >>>>>>> item 15 key (30588928 169 0) itemoff 15782 itemsize 33 >>>>>>> extent refs 1 gen 5 flags 2 >>>>>>> ref#0: tree block backref root 7 >>>>>> >>>>>> This looks like an error in memmove_extent_buffer() which I >>>>>> intentionally didn't touch. >>>>>> >>>>>> Anyway I'll try rebase and more tests. >>>>>> >>>>>> Can you put your modified commits in an external branch so I can inherit >>>>>> all your modifications? >>>>> >>>>> First I saw the crashes with the modified patches but the report is from >>>>> what you sent to the mailinglist so I can eliminate error on my side. >>>> >>>> Still a branch would help a lot, as you won't want to re-do the usual >>>> modification (like grammar, comments etc). >>> >>> Branch ext/qu-eb-page-clanups-updated-broken at github. >> >> Already running the auto group with that branch, and no explosion so far >> (btrfs/004 failed to mount with -o atime though). >> >> Any extra setup needed to trigger the failure? > > I'm not aware of anything different than usual. Patches applied to git, > built, updated VM and started. I had another branch built and tested and > it finished the fstests. I can at least bisect which patch does it. A bisection would be much appreciated. I guess it's the memcpy_extent_buffer() patch, although I don't see anything obvious right now... Thanks, Qu
On Fri, Jul 14, 2023 at 06:32:27PM +0800, Qu Wenruo wrote: > >> Already running the auto group with that branch, and no explosion so far > >> (btrfs/004 failed to mount with -o atime though). > >> > >> Any extra setup needed to trigger the failure? > > > > I'm not aware of anything different than usual. Patches applied to git, > > built, updated VM and started. I had another branch built and tested and > > it finished the fstests. I can at least bisect which patch does it. > > A bisection would be very appreciated. > > Although I guess it should be the memcpy_extent_buffer() patch, I didn't > see something obvious right now... 5ebf7593abb81ec1993f31e90a7573b75aff4db4 is the first bad commit btrfs: refactor main loop in memcpy_extent_buffer() $ git bisect log # bad: [5c6c140622dd7107acb13da404f0c682f1f954a6] btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer() # good: [72c15cf7e64769ca9273a825fff8495d99975c9c] btrfs: deprecate integrity checker feature git bisect start 'ext/qu-eb-page-clanups-updated-broken' '72c15cf7e64769ca9273a825fff8495d99975c9c' # good: [85ab525a6a63c477b92099835d6b05eaebd4ad4b] btrfs: use write_extent_buffer() to implement write_extent_buffer_*id() git bisect good 85ab525a6a63c477b92099835d6b05eaebd4ad4b # bad: [cd6668ef43a224b3f8130b78f4e3b922a7175a05] btrfs: refactor main loop in copy_extent_buffer_full() git bisect bad cd6668ef43a224b3f8130b78f4e3b922a7175a05 # bad: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: refactor main loop in memcpy_extent_buffer() git bisect bad 5ebf7593abb81ec1993f31e90a7573b75aff4db4 # first bad commit: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: refactor main loop in memcpy_extent_buffer()
On 2023/7/14 18:41, David Sterba wrote: > On Fri, Jul 14, 2023 at 06:32:27PM +0800, Qu Wenruo wrote: >>>> Already running the auto group with that branch, and no explosion so far >>>> (btrfs/004 failed to mount with -o atime though). >>>> >>>> Any extra setup needed to trigger the failure? >>> >>> I'm not aware of anything different than usual. Patches applied to git, >>> built, updated VM and started. I had another branch built and tested and >>> it finished the fstests. I can at least bisect which patch does it. >> >> A bisection would be very appreciated. >> >> Although I guess it should be the memcpy_extent_buffer() patch, I didn't >> see something obvious right now... > > 5ebf7593abb81ec1993f31e90a7573b75aff4db4 is the first bad commit > btrfs: refactor main loop in memcpy_extent_buffer() Is there anything special about the system where you can reproduce the bug? I checked the overall code; it behaves a little differently from the original. The original code puts a double limit on the cross-page case, while the new code only handles crossing a page boundary on the source side, and lets write_extent_buffer() handle the cross-page situation on the destination. Considering memcpy() is also called for the memmove() case, that could explain the corrupted tree block in your report. Although I cannot see an obvious problem, I guess there may be some hidden corner cases that would finally be exposed if we eventually move to folio/vmalloc-ed memory. If I can reproduce it locally, the turnaround time would be greatly reduced. 
Thanks, Qu > > $ git bisect log > # bad: [5c6c140622dd7107acb13da404f0c682f1f954a6] btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer() > # good: [72c15cf7e64769ca9273a825fff8495d99975c9c] btrfs: deprecate integrity checker feature > git bisect start 'ext/qu-eb-page-clanups-updated-broken' '72c15cf7e64769ca9273a825fff8495d99975c9c' > # good: [85ab525a6a63c477b92099835d6b05eaebd4ad4b] btrfs: use write_extent_buffer() to implement write_extent_buffer_*id() > git bisect good 85ab525a6a63c477b92099835d6b05eaebd4ad4b > # bad: [cd6668ef43a224b3f8130b78f4e3b922a7175a05] btrfs: refactor main loop in copy_extent_buffer_full() > git bisect bad cd6668ef43a224b3f8130b78f4e3b922a7175a05 > # bad: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: refactor main loop in memcpy_extent_buffer() > git bisect bad 5ebf7593abb81ec1993f31e90a7573b75aff4db4 > # first bad commit: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: refactor main loop in memcpy_extent_buffer()
On 2023/7/15 08:39, Qu Wenruo wrote: > > > On 2023/7/14 18:41, David Sterba wrote: >> On Fri, Jul 14, 2023 at 06:32:27PM +0800, Qu Wenruo wrote: >>>>> Already running the auto group with that branch, and no explosion >>>>> so far >>>>> (btrfs/004 failed to mount with -o atime though). >>>>> >>>>> Any extra setup needed to trigger the failure? >>>> >>>> I'm not aware of anything different than usual. Patches applied to git, >>>> built, updated VM and started. I had another branch built and tested >>>> and >>>> it finished the fstests. I can at least bisect which patch does it. >>> >>> A bisection would be very appreciated. >>> >>> Although I guess it should be the memcpy_extent_buffer() patch, I didn't >>> see something obvious right now... >> >> 5ebf7593abb81ec1993f31e90a7573b75aff4db4 is the first bad commit >> btrfs: refactor main loop in memcpy_extent_buffer() > > Anything special on the system that you can reproduce the bug? > > I checked the overall code, it's a little different than the original > behavior. > > The original behavior has double limits on the cross-page case, while > the new code only handles the cross-page on the source, and let > write_extent_buffer() to handle the cross-page situation on the > destination. OK, I found the cause. It is indeed the memcpy_extent_buffer() rework. memcpy() itself is not safe if the ranges overlap, and the old code did proper overlap checks for both memcpy and memmove through the copy_pages() helper. Unfortunately I didn't go through that copy_pages() helper and thus triggered the problem. Let me find a better solution for this case. Thanks, Qu > > Considering memcpy() is called for memmove() case, it can explain the > corrupted tree block we see in your report. > > Although I can not see the obvious problem, I guess there may be some > hidden corner cases that would be finally exposed if we move to > folio/vmallocated memory eventually. > > If I can reproduce it locally the turnover time can be reduced greatly. 
> > Thanks, > Qu >> >> $ git bisect log >> # bad: [5c6c140622dd7107acb13da404f0c682f1f954a6] btrfs: copy all >> pages at once at the end of btrfs_clone_extent_buffer() >> # good: [72c15cf7e64769ca9273a825fff8495d99975c9c] btrfs: deprecate >> integrity checker feature >> git bisect start 'ext/qu-eb-page-clanups-updated-broken' >> '72c15cf7e64769ca9273a825fff8495d99975c9c' >> # good: [85ab525a6a63c477b92099835d6b05eaebd4ad4b] btrfs: use >> write_extent_buffer() to implement write_extent_buffer_*id() >> git bisect good 85ab525a6a63c477b92099835d6b05eaebd4ad4b >> # bad: [cd6668ef43a224b3f8130b78f4e3b922a7175a05] btrfs: refactor main >> loop in copy_extent_buffer_full() >> git bisect bad cd6668ef43a224b3f8130b78f4e3b922a7175a05 >> # bad: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: refactor main >> loop in memcpy_extent_buffer() >> git bisect bad 5ebf7593abb81ec1993f31e90a7573b75aff4db4 >> # first bad commit: [5ebf7593abb81ec1993f31e90a7573b75aff4db4] btrfs: >> refactor main loop in memcpy_extent_buffer()