
[v2,00/17] btrfs: split bio at btrfs_map_bio() time

Message ID 20211206022937.26465-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Dec. 6, 2021, 2:29 a.m. UTC
This patchset can be fetched from this branch:

https://github.com/adam900710/linux/tree/refactor_chunk_map

[BACKGROUND]

Currently btrfs never uses bio_split() to split its bio against RAID
stripe boundaries.

Instead, inside btrfs we check our stripe boundary every time we allocate
a new bio, and ensure the new bio never crosses stripe boundaries.

[PROBLEMS]

Although this behavior works fine, it goes against the common practice used
in stacked drivers, and makes the effort to convert to iomap harder.

There is also a hidden burden: every time we allocate a new bio, we use
BIO_MAX_BVECS, but since we know the boundaries, for RAID0/RAID10 we can
only fit at most 16 pages (fixed 64K stripe size and 4K page size),
wasting most of the 256 slots we allocated.
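
To make the waste concrete, here is a tiny, self-contained illustration of
the arithmetic (userspace C with illustrative constants mirroring the
description above, not actual btrfs code):

/* 64K stripe, 4K page, 256 bio_vec slots, all taken from the text above. */
#include <stdio.h>

#define STRIPE_LEN	(64 * 1024)
#define PAGE_LEN	(4 * 1024)
#define BIO_SLOTS	256

int main(void)
{
	unsigned int usable = STRIPE_LEN / PAGE_LEN;	/* 16 pages at most */

	printf("usable slots: %u, wasted slots: %u\n",
	       usable, BIO_SLOTS - usable);		/* prints 16 and 240 */
	return 0;
}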

[CHALLENGES]

To change this, the patchset moves the bio split to btrfs_map_bio() time,
so the upper layers no longer need to bother splitting bios against RAID
stripe or even chunk boundaries.
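
As a rough sketch of the intended flow (not the actual patches:
stripe_boundary_sectors() and submit_to_stripe() are placeholders for the
chunk-mapping logic inside btrfs_map_bio(), while bio_split(), bio_sectors()
and fs_bio_set are existing block layer interfaces), the split could look
like this:

#include <linux/bio.h>

/* Placeholders only, so the sketch reads as a complete unit. */
static unsigned int stripe_boundary_sectors(struct bio *bio);
static void submit_to_stripe(struct bio *bio);

static void map_and_submit(struct bio *bio)
{
	unsigned int max_sectors;

	while (bio_sectors(bio) > (max_sectors = stripe_boundary_sectors(bio))) {
		struct bio *split;

		/* Carve off the front part that fits in the current stripe. */
		split = bio_split(bio, max_sectors, GFP_NOFS, &fs_bio_set);
		submit_to_stripe(split);
		/* @bio has been advanced and now starts at the next stripe. */
	}
	/* The remainder fits entirely within one stripe. */
	submit_to_stripe(bio);
}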

But there are several challenges:

- Conflicts in various endio functions
  We want to keep the existing granularity instead of chained endios, thus
  we must make the involved endio functions handle split bios.

  Although most endio functions already do their work independently of
  the bio size, they do not yet fully handle split bios.

  This patchset converts them to use the saved bi_iter and iterate only
  the split range instead of the whole bio.
  This change involves 3 types of IO:

  * Buffered IO
    Including both data and metadata
  * Direct IO
  * Compressed IO

  Their endio functions need different levels of updates to handle split
  bios.

  Furthermore, there is another endio, end_workqueue_bio(), which can't
  handle split bios at all, thus we change the timing so that
  btrfs_bio_wq_end_io() is only called after the bio has been split.

- Checksum verification
  Currently we rely on btrfs_bio::csum to contain the checksums for the
  whole bio.
  If a bio gets split, csum no longer points to the correct location for
  the split bio.

  This can be solved by introducing btrfs_bio::offset_to_original and
  using that new member to calculate where the csum for the split range
  should be read from (see the sketch after this list).

  The parent bio still keeps btrfs_bio::csum for the whole bio, thus the
  csum array can still be freed correctly.

- Independent endio for each split bio
  Unlike stacked drivers, for RAID10 btrfs needs to try its best to read
  every sector, to handle the following case (X means bad, either unable
  to read or failing checksum verification, V means good):

  Dev 1 (missing) | D1 (X) |
  Dev 2 (OK)      | D1 (V) |
  Dev 3 (OK)      | D2 (V) |
  Dev 4 (OK)      | D2 (X) |

  In the above RAID10 case, dev1 is missing, and although dev4 itself is
  fine, its copy of D2 is corrupted (by bit rot or whatever).

  If we used bio_chain(), the read bio covering both D1 and D2 would be
  split, and since D1 is missing, the whole D1 + D2 read would be marked
  as an error, so we would retry the read from dev2 and dev4.

  But D2 on dev4 has a csum mismatch, so we can only rebuild D1 and D2
  correctly from dev2:D1 and dev3:D2.

  This patchset resolves this by saving bi_iter into btrfs_bio::iter and
  using it at endio time to iterate only the split part of a bio (see the
  sketch after this list).
  Other than this, the existing read/write page endio functions can
  handle split bios properly without problem.

- Bad RAID56 naming/functionality
  Quite a few RAID56 call sites rely on specific behavior of
  __btrfs_map_block(), like returning @map_length as stripe_len rather
  than the real mapped length.

  This is handled by some small cleanups specific to RAID56.
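
The two points above (the saved iter and the per-split csum lookup) can be
illustrated with a short sketch. The member and helper names
(btrfs_bio::iter, btrfs_bio::offset_to_original, the save-iter helper)
follow this cover letter, but the struct layout and function bodies below
are illustrative only, not the actual patches:

#include <linux/bio.h>

struct btrfs_bio_sketch {
	struct bvec_iter iter;		/* bi_iter saved before submission */
	u32 offset_to_original;		/* byte offset inside the original bio */
	u8 *csum;			/* csum array owned by the original bio */
	struct bio bio;			/* must stay last */
};

/* Called before submission, so endio can replay the split range later. */
static void sketch_save_iter(struct btrfs_bio_sketch *bbio)
{
	bbio->iter = bbio->bio.bi_iter;
}

/* Endio side: iterate only the split range, not the whole original bio. */
static void sketch_check_split_range(struct btrfs_bio_sketch *bbio,
				     u32 sectorsize, u32 csum_size)
{
	struct bio_vec bvec;
	struct bvec_iter iter;
	u32 bio_offset = 0;

	/* bi_iter is already consumed at endio time, start from the copy. */
	__bio_for_each_segment(bvec, &bbio->bio, iter, bbio->iter) {
		/*
		 * The csum for this range still lives in the original bio's
		 * array; offset_to_original locates the right entry.
		 */
		u8 *csum = bbio->csum +
			(bbio->offset_to_original + bio_offset) /
			sectorsize * csum_size;

		/* ... verify each sector of @bvec against @csum ... */
		bio_offset += bvec.bv_len;
	}
}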

[CHANGELOG]
RFC->v1:
- Better patch split
  Now patches 01~06 are refactors/cleanups/preparations, patches 07~13 do
  the conversion while handling both the old and new bio split timings,
  and patches 14~16 convert the bio split call sites one by one to the
  new facility.
  The final patch is just a small cleanup.

- Various bug fixes
  During a full fstests run, various stupid bugs were exposed and
  fixed.

v2:
- Fix the error paths for allocated but never submitted bios
  There are tons of error paths where we allocate a bio but it goes to
  bio_endio() directly without going through btrfs_map_bio().
  New ASSERT()s in the endio functions require a populated
  btrfs_bio::iter, thus for such bios we still need to call
  btrfs_bio_save_iter() to populate btrfs_bio::iter and prevent those
  ASSERT()s from triggering.

- Fix scrub_stripe_index_and_offset() which abuses stripe_len and
  mapped_length

Qu Wenruo (17):
  btrfs: update an stale comment on btrfs_submit_bio_hook()
  btrfs: save bio::bi_iter into btrfs_bio::iter before any endio
  btrfs: use correct bio size for error message in btrfs_end_dio_bio()
  btrfs: refactor btrfs_map_bio()
  btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio()
  btrfs: replace btrfs_dio_private::refs with
    btrfs_dio_private::pending_bytes
  btrfs: introduce btrfs_bio_split() helper
  btrfs: make data buffered read path to handle split bio properly
  btrfs: make data buffered write endio function to be split bio
    compatible
  btrfs: make metadata write endio functions to be split bio compatible
  btrfs: make dec_and_test_compressed_bio() to be split bio compatible
  btrfs: return proper mapped length for RAID56 profiles in
    __btrfs_map_block()
  btrfs: allow btrfs_map_bio() to split bio according to chunk stripe
    boundaries
  btrfs: remove buffered IO stripe boundary calculation
  btrfs: remove stripe boundary calculation for compressed IO
  btrfs: remove the stripe boundary calculation for direct IO
  btrfs: unexport btrfs_get_io_geometry()

 fs/btrfs/btrfs_inode.h |  10 +-
 fs/btrfs/compression.c |  70 +++-----------
 fs/btrfs/disk-io.c     |  11 +--
 fs/btrfs/extent_io.c   | 196 ++++++++++++++++++++++++++------------
 fs/btrfs/extent_io.h   |   2 +
 fs/btrfs/inode.c       | 210 ++++++++++++++++-------------------------
 fs/btrfs/raid56.c      |  14 ++-
 fs/btrfs/raid56.h      |   2 +-
 fs/btrfs/scrub.c       |  14 +--
 fs/btrfs/volumes.c     | 144 ++++++++++++++++++++--------
 fs/btrfs/volumes.h     |  74 +++++++++++++--
 11 files changed, 436 insertions(+), 311 deletions(-)

Comments

Johannes Thumshirn Dec. 9, 2021, 10:06 a.m. UTC | #1
On 06/12/2021 03:30, Qu Wenruo wrote:
> This patchset be fetched from this branch:
> 
> https://github.com/adam900710/linux/tree/refactor_chunk_map
> 
> [BACKGROUND]
> 
> Currently btrfs never uses bio_split() to split its bio against RAID
> stripe boundaries.
> 
> Instead inside btrfs we check our stripe boundary everytime we allocate
> a new bio, and ensure the new bio never cross stripe boundaries.
> 
> [PROBLEMS]
> 
> Although this behavior works fine, it's against the common practice used in
> stacked drivers, and is making the effort to convert to iomap harder.
> 
> There is also an hidden burden, every time we allocate a new bio, we uses
> BIO_MAX_BVECS, but since we know the boundaries, for RAID0/RAID10 we can
> only fit at most 16 pages (fixed 64K stripe size, and 4K page size),
> wasting the 256 slots we allocated.
> 
> [CHALLENGES]
> 
> To change the situation, this patchset attempts to improve the situation
> by moving the bio split into btrfs_map_bio() time, so upper layer should
> no longer bother the bio split against RAID stripes or even chunk
> boundaries.
> 
> But there are several challenges:
> 
> - Conflicts in various endio functions
>   We want the existing granularity, instead of chained endio, thus we
>   must make the involved endio functions to handle split bios.
> 
>   Although most endio functions are already doing their works
>   independent of the bio size, they are not yet fully handling split
>   bio.
> 
>   This patch will convert them to use saved bi_iter and only iterate
>   the split range instead of the whole bio.
>   This change involved 3 types of IOs:
> 
>   * Buffered IO
>     Including both data and metadata
>   * Direct IO
>   * Compressed IO
> 
>   Their endio functions needs different level of updates to handle split
>   bios.
> 
>   Furthermore, there is another endio, end_workqueue_bio(), it can't
>   handle split bios at all, thus we change the timing so that
>   btrfs_bio_wq_end_io() is only called after the bio being split.
> 
> - Checksum verification
>   Currently we rely on btrfs_bio::csum to contain the checksum for the
>   whole bio.
>   If one bio get split, csum will no longer points to the correct
>   location for the split bio.
> 
>   This can be solved by introducing btrfs_bio::offset_to_original, and
>   use that new member to calculate where we should read csum from.
> 
>   For the parent bio, it still has btrfs_bio::csum for the whole bio,
>   thus it can still free it correctly.
> 
> - Independent endio for each split bio
>   Unlike stack drivers, for RAID10 btrfs needs to try its best effort to
>   read every sectors, to handle the following case: (X means bad, either
>   unable to read or failed to pass checksum verification, V means good)
> 
>   Dev 1	(missing) | D1 (X) |
>   Dev 2 (OK)	  | D1 (V) |
>   Dev 3 (OK)	  | D2 (V) |
>   Dev 4 (OK)	  | D2 (X) |
> 
>   In the above RAID10 case, dev1 is missing, and although dev4 is fine,
>   its D2 sector is corrupted (by bit rot or whatever).
> 
>   If we use bio_chain(), read bio for both D1 and D2 will be split, and
>   since D1 is missing, the whole D1 and D2 read will be marked as error,
>   thus we will try to read from dev2 and dev4.
> 
>   But D2 in dev4 has csum mismatch, we can only rebuild D1 and D2
>   correctly from dev2:D1 and dev3:D2.
> 
>   This patchset resolve this by saving bi_iter into btrfs_bio::iter, and
>   uses that at endio to iterate only the split part of an bio.
>   Other than this, existing read/write page endio functions can handle
>   them properly without problem.
> 
> - Bad RAID56 naming/functionality
>   There are quite some RAID56 call sites relies on specific behavior on
>   __btrfs_map_block(), like returning @map_length as stripe_len other
>   than real mapped length.
> 
>   This is handled by some small cleanups specific for RAID56.
> 
> [CHANGELOG]
> RFC->v1:
> - Better patch split
>   Now patch 01~06 are refactors/cleanups/preparations.
>   While 07~13 are the patches that doing the conversion while can handle
>   both old and new bio split timings.
>   Finally patch 14~16 convert the bio split call sites one by one to
>   newer facility.
>   The final patch is just a small clean up.
> 
> - Various bug fixes
>   During the full fstests run, various stupid bugs are exposed and
>   fixed.
> 
> v2:
> - Fix the error paths for allocated but never submitted bios
>   There are tons of error path that we allocate a bio but it goes
>   bio_endio() directly without going through btrfs_map_bio().
>   New ASSERTS() in endio functions require a populated btrfs_bio::iter,
>   thus for such bios we still need to call btrfs_bio_save_iter() to
>   populate btrfs_bio::iter to prevent such ASSERT()s get triggered.
> 
> - Fix scrub_stripe_index_and_offset() which abuses stripe_len and
>   mapped_length
> 


FYI the patchset doesn't apply cleanly to misc-next anymore. I've
pulled your branch from github and queued it for testing on zoned
devices.

I'll report any findings.
Johannes Thumshirn Dec. 9, 2021, 10:52 a.m. UTC | #2
On 09/12/2021 11:07, Johannes Thumshirn wrote:
> 
> 
> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
> pulled your branch form github and queued it for testing on zoned
> devices.
> 
> I'll report any findings.
> 

Unfortunately I do have something to report:

generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
[ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
[ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
[ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure                                          
[ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!                                        
[ 2020.942344] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI                                                                                                                            
[ 2020.942802] CPU: 1 PID: 26201 Comm: fstest Tainted: G        W         5.16.0-rc3-qu-bio-split #30
[ 2020.943043] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)                          
[ 2020.943576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
[ 2020.944424] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092884480, 5092888576)
[ 2020.945191] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]                                                                                                                     
[ 2020.946076] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092888576, 5092892672)                          
[ 2020.945833] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
[ 2020.947374] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092892672, 5092896768)
[ 2020.945833] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296                                                                                                                            
[ 2020.949774] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092896768, 5092900864)
[ 2020.945833] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000                                                                                                      
[ 2020.945833] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff                                                                                                      
[ 2020.945833] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74                                                                                                      
[ 2020.945833] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00                                                                                                      
[ 2020.945833] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004                                                                                                      
[ 2020.945833] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000                                                                                           
[ 2020.945833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                                                                      
[ 2020.945833] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0                                                                                                      
[ 2020.945833] Call Trace:
[ 2020.945833]  <TASK>
[ 2020.945833]  btrfs_lookup_bio_sums.cold+0x3f/0x61 [btrfs]
[ 2020.951120]  ? kmem_cache_alloc+0x100/0x1d0
[ 2020.951120]  ? mempool_alloc+0x4d/0x150
[ 2020.951120]  btrfs_submit_data_bio+0xeb/0x200 [btrfs]
[ 2020.951120]  ? bio_alloc_bioset+0x228/0x300
[ 2020.951120]  submit_one_bio+0x60/0x90 [btrfs]
[ 2020.951120]  submit_extent_page+0x175/0x460 [btrfs]
[ 2020.951120]  btrfs_do_readpage+0x263/0x800 [btrfs]
[ 2020.951120]  ? btrfs_repair_one_sector+0x450/0x450 [btrfs]
[ 2020.951120]  extent_readahead+0x296/0x380 [btrfs]
[ 2020.951120]  ? __mod_node_page_state+0x77/0xb0
[ 2020.951120]  ? __filemap_add_folio+0x115/0x190
[ 2020.951120]  read_pages+0x57/0x1a0
[ 2020.951120]  page_cache_ra_unbounded+0x15c/0x1e0
[ 2020.951120]  filemap_get_pages+0xcf/0x640
[ 2020.951120]  ? terminate_walk+0x5c/0xf0
[ 2020.951120]  filemap_read+0xb9/0x2a0
[ 2020.951120]  ? arch_stack_walk+0x77/0xb0
[ 2020.951120]  ? do_filp_open+0x9a/0x120
[ 2020.951120]  new_sync_read+0x103/0x170
[ 2020.951120]  ? 0xffffffff81000000
[ 2020.951120]  vfs_read+0x121/0x1a0
[ 2020.951120]  __x64_sys_pread64+0x69/0xa0
[ 2020.951120]  do_syscall_64+0x43/0x90
[ 2020.951120]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2020.951120] RIP: 0033:0x7f04d5cce1aa
[ 2020.951120] Code: d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 2020.951120] RSP: 002b:00007fff31411768 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[ 2020.951120] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04d5cce1aa
[ 2020.951120] RDX: 0000000000000400 RSI: 0000000000fab2b0 RDI: 0000000000000003
[ 2020.951120] RBP: 0000000000fab2b0 R08: 0000000000000000 R09: 00007fff31411507
[ 2020.951120] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[ 2020.951120] R13: 0000000000000000 R14: 0000000000000400 R15: 0000000000fab6c0
[ 2020.951120]  </TASK>
[ 2020.951120] Modules linked in: dm_flakey loop btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq zstd_decompress zstd_compress xxhash null_blk
[ 2020.972836] ---[ end trace ceb9e45abcff5d95 ]---
[ 2020.973244] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[ 2020.973773] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
[ 2020.973854] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.975411] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
[ 2020.976383] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 1, gen 0
[ 2020.976841] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
[ 2020.977656] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 774144 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.978272] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
[ 2020.979247] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0
[ 2020.979853] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
[ 2020.980629] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 778240 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.981224] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
[ 2020.982211] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 3, gen 0
[ 2020.982835] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
[ 2020.983624] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 782336 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.984226] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
[ 2020.985216] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 4, gen 0
[ 2020.985915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2020.986683] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 786432 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.987157] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
[ 2020.988150] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
[ 2020.989019] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)
[ 2020.990401] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.991403] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 6, gen 0
[ 2020.992443] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090144256, 5090148352)
[ 2020.993277] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090148352, 5090152448)
[ 2020.994148] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090152448, 5090156544)
[ 2020.994989] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090156544, 5090160640)
[ 2020.996065] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2179072 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.996156] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2183168 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.997107] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 7, gen 0
[ 2020.998140] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 8, gen 0
[ 2020.999004] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2187264 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.999793] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2191360 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2021.000815] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 9, gen 0

Resolving btrfs_lookup_bio_sums.cold+0x3f doesn't make any sense though:
(gdb) l *btrfs_lookup_bio_sums+0x3f
0x1767f is in btrfs_lookup_bio_sums (fs/btrfs/file-item.c:372).
367     blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
368     {
369             struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
370             struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
371             struct btrfs_path *path;
372             const u32 sectorsize = fs_info->sectorsize;
373             const u32 csum_size = fs_info->csum_size;
374             u32 orig_len = bio->bi_iter.bi_size;
375             u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
376             u64 cur_disk_bytenr;
(gdb)
Qu Wenruo Dec. 9, 2021, 11:08 a.m. UTC | #3
On 2021/12/9 18:52, Johannes Thumshirn wrote:
> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>
>>
>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>> pulled your branch form github and queued it for testing on zoned
>> devices.
>>
>> I'll report any findings.
>>
>
> Unfortunately I do have something to report:
>
> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096

No more error messages after this line?

I thought it should be either a write-time or a read-time tree-checker
error, but I can't see the message indicating the timing.

And yes, that disk_bytenr is indeed not aligned.

> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!

And this is the weirdest part: it's from assertfail(), but there is no
line showing the line number.

Mind providing the full dmesg?
I guess some important lines are not included.

> [ 2020.942344] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 2020.942802] CPU: 1 PID: 26201 Comm: fstest Tainted: G        W         5.16.0-rc3-qu-bio-split #30
> [ 2020.943043] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)

This is from btrfs_lookup_bio_sums(), meaning we're submitting a read
while the range doesn't have a csum, and it's not a NODATASUM inode.

Not sure if it's related to the transaction commit error.

Thanks,
Qu
> [ 2020.943576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
> [ 2020.944424] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092884480, 5092888576)
> [ 2020.945191] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
> [ 2020.946076] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092888576, 5092892672)
> [ 2020.945833] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
> [ 2020.947374] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092892672, 5092896768)
> [ 2020.945833] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
> [ 2020.949774] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092896768, 5092900864)
> [ 2020.945833] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
> [ 2020.945833] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
> [ 2020.945833] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
> [ 2020.945833] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
> [ 2020.945833] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
> [ 2020.945833] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
> [ 2020.945833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2020.945833] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
> [ 2020.945833] Call Trace:
> [ 2020.945833]  <TASK>
> [ 2020.945833]  btrfs_lookup_bio_sums.cold+0x3f/0x61 [btrfs]
> [ 2020.951120]  ? kmem_cache_alloc+0x100/0x1d0
> [ 2020.951120]  ? mempool_alloc+0x4d/0x150
> [ 2020.951120]  btrfs_submit_data_bio+0xeb/0x200 [btrfs]
> [ 2020.951120]  ? bio_alloc_bioset+0x228/0x300
> [ 2020.951120]  submit_one_bio+0x60/0x90 [btrfs]
> [ 2020.951120]  submit_extent_page+0x175/0x460 [btrfs]
> [ 2020.951120]  btrfs_do_readpage+0x263/0x800 [btrfs]
> [ 2020.951120]  ? btrfs_repair_one_sector+0x450/0x450 [btrfs]
> [ 2020.951120]  extent_readahead+0x296/0x380 [btrfs]
> [ 2020.951120]  ? __mod_node_page_state+0x77/0xb0
> [ 2020.951120]  ? __filemap_add_folio+0x115/0x190
> [ 2020.951120]  read_pages+0x57/0x1a0
> [ 2020.951120]  page_cache_ra_unbounded+0x15c/0x1e0
> [ 2020.951120]  filemap_get_pages+0xcf/0x640
> [ 2020.951120]  ? terminate_walk+0x5c/0xf0
> [ 2020.951120]  filemap_read+0xb9/0x2a0
> [ 2020.951120]  ? arch_stack_walk+0x77/0xb0
> [ 2020.951120]  ? do_filp_open+0x9a/0x120
> [ 2020.951120]  new_sync_read+0x103/0x170
> [ 2020.951120]  ? 0xffffffff81000000
> [ 2020.951120]  vfs_read+0x121/0x1a0
> [ 2020.951120]  __x64_sys_pread64+0x69/0xa0
> [ 2020.951120]  do_syscall_64+0x43/0x90
> [ 2020.951120]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 2020.951120] RIP: 0033:0x7f04d5cce1aa
> [ 2020.951120] Code: d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 2020.951120] RSP: 002b:00007fff31411768 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [ 2020.951120] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04d5cce1aa
> [ 2020.951120] RDX: 0000000000000400 RSI: 0000000000fab2b0 RDI: 0000000000000003
> [ 2020.951120] RBP: 0000000000fab2b0 R08: 0000000000000000 R09: 00007fff31411507
> [ 2020.951120] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
> [ 2020.951120] R13: 0000000000000000 R14: 0000000000000400 R15: 0000000000fab6c0
> [ 2020.951120]  </TASK>
> [ 2020.951120] Modules linked in: dm_flakey loop btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq zstd_decompress zstd_compress xxhash null_blk
> [ 2020.972836] ---[ end trace ceb9e45abcff5d95 ]---
> [ 2020.973244] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
> [ 2020.973773] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
> [ 2020.973854] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.975411] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
> [ 2020.976383] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 1, gen 0
> [ 2020.976841] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
> [ 2020.977656] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 774144 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.978272] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
> [ 2020.979247] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0
> [ 2020.979853] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
> [ 2020.980629] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 778240 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.981224] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
> [ 2020.982211] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 3, gen 0
> [ 2020.982835] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
> [ 2020.983624] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 782336 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.984226] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
> [ 2020.985216] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 4, gen 0
> [ 2020.985915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2020.986683] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 786432 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.987157] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
> [ 2020.988150] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
> [ 2020.989019] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)
> [ 2020.990401] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.991403] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 6, gen 0
> [ 2020.992443] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090144256, 5090148352)
> [ 2020.993277] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090148352, 5090152448)
> [ 2020.994148] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090152448, 5090156544)
> [ 2020.994989] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090156544, 5090160640)
> [ 2020.996065] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2179072 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.996156] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2183168 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.997107] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 7, gen 0
> [ 2020.998140] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 8, gen 0
> [ 2020.999004] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2187264 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.999793] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2191360 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2021.000815] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 9, gen 0
>
> Resolving btrfs_lookup_bio_sums.cold+0x3f doesn't make any sense though:
> (gdb) l *btrfs_lookup_bio_sums+0x3f
> 0x1767f is in btrfs_lookup_bio_sums (fs/btrfs/file-item.c:372).
> 367     blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
> 368     {
> 369             struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> 370             struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> 371             struct btrfs_path *path;
> 372             const u32 sectorsize = fs_info->sectorsize;
> 373             const u32 csum_size = fs_info->csum_size;
> 374             u32 orig_len = bio->bi_iter.bi_size;
> 375             u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> 376             u64 cur_disk_bytenr;
> (gdb)
>
Johannes Thumshirn Dec. 9, 2021, 11:13 a.m. UTC | #4
On 09/12/2021 12:08, Qu Wenruo wrote:
> 
> 
> On 2021/12/9 18:52, Johannes Thumshirn wrote:
>> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>>
>>>
>>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>>> pulled your branch form github and queued it for testing on zoned
>>> devices.
>>>
>>> I'll report any findings.
>>>
>>
>> Unfortunately I do have something to report:
>>
>> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
> 
> No more error message after this line?
> 
> I thought it should be either write time or read time tree-checker
> error, but I can't see the message indicating the timing.
> 
> And yes, that disk_bytenr is indeed not aligned.
> 
>> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
>> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
>> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
>> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!
> 
> And this is the most weird part, it's from assertfail(), but no line
> showing the line number.
> 
> Mind to provide the full dmesg?
> I guess some important lines are not included.
> 

Let me see if I can reproduce it with a higher logging level.
Qu Wenruo Jan. 12, 2022, 12:33 a.m. UTC | #5
On 2021/12/9 19:13, Johannes Thumshirn wrote:
> On 09/12/2021 12:08, Qu Wenruo wrote:
>>
>>
>> On 2021/12/9 18:52, Johannes Thumshirn wrote:
>>> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>>>
>>>>
>>>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>>>> pulled your branch form github and queued it for testing on zoned
>>>> devices.
>>>>
>>>> I'll report any findings.
>>>>
>>>
>>> Unfortunately I do have something to report:
>>>
>>> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
>>
>> No more error message after this line?
>>
>> I thought it should be either write time or read time tree-checker
>> error, but I can't see the message indicating the timing.
>>
>> And yes, that disk_bytenr is indeed not aligned.
>>
>>> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
>>> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
>>> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
>>> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!
>>
>> And this is the most weird part, it's from assertfail(), but no line
>> showing the line number.
>>
>> Mind to provide the full dmesg?
>> I guess some important lines are not included.
>>
>
> Let me see if I can reproduce it with a higher logging level

Any follow-up on the crash?

I doubt it's related to the patchset, but I'm also curious why the
tree-checker is triggered for a test without any dm-flakey/dm-rust/etc.

Thanks,
Qu