Message ID | 1220142568f5b5f0d06fbc3ee28a08060afc0a53.1621351444.git.johannes.thumshirn@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: zoned: fix writes on a compressed zoned filesystem | expand |
On 5/18/21 11:40 AM, Johannes Thumshirn wrote: > When multiple processes write data to the same block group on a compressed > zoned filesystem, the underlying device could report I/O errors and data > corruption is possible. > > This happens because on a zoned file system, compressed data writes where > sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND > operation. But with REQ_OP_WRITE and parallel submission it cannot be > guaranteed that the data is always submitted aligned to the underlying > zone's write pointer. > > The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned > filesystem is non intrusive on a regular file system or when submitting to > a conventional zone on a zoned filesystem, as it is guarded by > btrfs_use_zone_append. > > Reported-by: David Sterba <dsterba@suse.com> > Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> This one is causing panics with btrfs/027 with -o compress. I bisected it to something else earlier, but it was still happening today and I bisected again and this is what popped out. I also went the extra step to revert the patch as I have already fucked this up once, and the problem didn't re-occur with this patch reverted. The panic looks like this May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping failed logical 22429696 bio len 53248 len 49152 May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------ May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643! May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4 Not tainted 5.13.0-rc2+ #240 May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc btrfs_work_helper May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff 48 83 c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db 62 96 e8 47 83 ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49 8b 7a 50 44 89 f2 May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS: 00010282 May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX: 0000000000000000 RCX: 0000000000000000 May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI: ffff9b9a7bd18e10 RDI: ffff9b9a7bd18e10 May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08: 0000000000000000 R09: 0000000000000000 May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11: ffffffff96973748 R12: 0000000000000000 May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14: 000000000000d000 R15: 0000000001564000 May 22 00:33:16 xfstests2 kernel: FS: 0000000000000000(0000) GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000 May 22 00:33:16 xfstests2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3: 000000013943a005 CR4: 0000000000370ee0 May 22 00:33:16 xfstests2 kernel: Call Trace: May 22 00:33:16 xfstests2 kernel: btrfs_submit_compressed_write+0x2d7/0x470 May 22 00:33:16 xfstests2 kernel: submit_compressed_extents+0x364/0x420 May 22 00:33:16 xfstests2 kernel: ? lock_acquire+0x15d/0x380 May 22 00:33:16 xfstests2 kernel: ? lock_release+0x1cd/0x2a0 May 22 00:33:16 xfstests2 kernel: ? submit_compressed_extents+0x420/0x420 May 22 00:33:16 xfstests2 kernel: btrfs_work_helper+0x133/0x520 May 22 00:33:16 xfstests2 kernel: process_one_work+0x26b/0x570 May 22 00:33:16 xfstests2 kernel: worker_thread+0x55/0x3c0 May 22 00:33:16 xfstests2 kernel: ? process_one_work+0x570/0x570 May 22 00:33:16 xfstests2 kernel: kthread+0x134/0x150 May 22 00:33:16 xfstests2 kernel: ? __kthread_bind_mask+0x60/0x60 May 22 00:33:16 xfstests2 kernel: ret_from_fork+0x1f/0x30 Thanks, Josef
On 2021/5/23 下午10:13, Josef Bacik wrote: > On 5/18/21 11:40 AM, Johannes Thumshirn wrote: >> When multiple processes write data to the same block group on a >> compressed >> zoned filesystem, the underlying device could report I/O errors and data >> corruption is possible. >> >> This happens because on a zoned file system, compressed data writes where >> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >> operation. But with REQ_OP_WRITE and parallel submission it cannot be >> guaranteed that the data is always submitted aligned to the underlying >> zone's write pointer. >> >> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >> filesystem is non intrusive on a regular file system or when >> submitting to >> a conventional zone on a zoned filesystem, as it is guarded by >> btrfs_use_zone_append. >> >> Reported-by: David Sterba <dsterba@suse.com> >> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > This one is causing panics with btrfs/027 with -o compress. I bisected > it to something else earlier, but it was still happening today and I > bisected again and this is what popped out. I also went the extra step > to revert the patch as I have already fucked this up once, and the > problem didn't re-occur with this patch reverted. The panic looks like > this > > May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping > failed logical 22429696 bio len 53248 len 49152 This is the interesting part, it means we are just one sector beyond the stripe boundary. Definitely a sign of changed bio submission timing. Just like the code: + if (pg_index == 0 && use_append) + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); + else + len = bio_add_page(bio, page, PAGE_SIZE, 0); + page->mapping = NULL; - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < - PAGE_SIZE) { + if (submit || len < PAGE_SIZE) { The code has changed the timing of bio_add_page(). Previously, if we have submit == true, we won't even try to call bio_add_page(). But now, we will add the page even we're already at the stripe boundary, thus it causes the extra sector being added to bio, and crosses stripe boundary. This part is already super tricky, thus I refactored submit_extent_page() to do a better job at stripe boundary calculation. We definitely need to make other bio_add_page() callers to use a better helper, not only for later subpage, but also to make our lives easier. Thanks, Qu > May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------ > May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643! > May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI > May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4 > Not tainted 5.13.0-rc2+ #240 > May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 + > ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 > May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc > btrfs_work_helper > May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a > May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff > 48 83 c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db > 62 96 e8 47 83 ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49 > 8b 7a 50 44 89 f2 > May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS: > 00010282 > May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX: > 0000000000000000 RCX: 0000000000000000 > May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI: > ffff9b9a7bd18e10 RDI: ffff9b9a7bd18e10 > May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08: > 0000000000000000 R09: 0000000000000000 > May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11: > ffffffff96973748 R12: 0000000000000000 > May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14: > 000000000000d000 R15: 0000000001564000 > May 22 00:33:16 xfstests2 kernel: FS: 0000000000000000(0000) > GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000 > May 22 00:33:16 xfstests2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 0000000080050033 > May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3: > 000000013943a005 CR4: 0000000000370ee0 > May 22 00:33:16 xfstests2 kernel: Call Trace: > May 22 00:33:16 xfstests2 kernel: > btrfs_submit_compressed_write+0x2d7/0x470 > May 22 00:33:16 xfstests2 kernel: submit_compressed_extents+0x364/0x420 > May 22 00:33:16 xfstests2 kernel: ? lock_acquire+0x15d/0x380 > May 22 00:33:16 xfstests2 kernel: ? lock_release+0x1cd/0x2a0 > May 22 00:33:16 xfstests2 kernel: ? submit_compressed_extents+0x420/0x420 > May 22 00:33:16 xfstests2 kernel: btrfs_work_helper+0x133/0x520 > May 22 00:33:16 xfstests2 kernel: process_one_work+0x26b/0x570 > May 22 00:33:16 xfstests2 kernel: worker_thread+0x55/0x3c0 > May 22 00:33:16 xfstests2 kernel: ? process_one_work+0x570/0x570 > May 22 00:33:16 xfstests2 kernel: kthread+0x134/0x150 > May 22 00:33:16 xfstests2 kernel: ? __kthread_bind_mask+0x60/0x60 > May 22 00:33:16 xfstests2 kernel: ret_from_fork+0x1f/0x30 > > Thanks, > > Josef
On 2021/5/24 上午7:09, Qu Wenruo wrote: > > > On 2021/5/23 下午10:13, Josef Bacik wrote: >> On 5/18/21 11:40 AM, Johannes Thumshirn wrote: >>> When multiple processes write data to the same block group on a >>> compressed >>> zoned filesystem, the underlying device could report I/O errors and data >>> corruption is possible. >>> >>> This happens because on a zoned file system, compressed data writes >>> where >>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >>> operation. But with REQ_OP_WRITE and parallel submission it cannot be >>> guaranteed that the data is always submitted aligned to the underlying >>> zone's write pointer. >>> >>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a >>> zoned >>> filesystem is non intrusive on a regular file system or when >>> submitting to >>> a conventional zone on a zoned filesystem, as it is guarded by >>> btrfs_use_zone_append. >>> >>> Reported-by: David Sterba <dsterba@suse.com> >>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat >>> flag") >>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> This one is causing panics with btrfs/027 with -o compress. I bisected >> it to something else earlier, but it was still happening today and I >> bisected again and this is what popped out. I also went the extra step >> to revert the patch as I have already fucked this up once, and the >> problem didn't re-occur with this patch reverted. The panic looks like >> this >> >> May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping >> failed logical 22429696 bio len 53248 len 49152 > > This is the interesting part, it means we are just one sector beyond the > stripe boundary. > Definitely a sign of changed bio submission timing. > > Just like the code: > > + if (pg_index == 0 && use_append) > + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); > + else > + len = bio_add_page(bio, page, PAGE_SIZE, 0); > + > page->mapping = NULL; > - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < > - PAGE_SIZE) { > + if (submit || len < PAGE_SIZE) { > > The code has changed the timing of bio_add_page(). > > Previously, if we have submit == true, we won't even try to call > bio_add_page(). > > But now, we will add the page even we're already at the stripe boundary, > thus it causes the extra sector being added to bio, and crosses stripe > boundary. > > This part is already super tricky, thus I refactored > submit_extent_page() to do a better job at stripe boundary calculation. BTW, I can also reproduce the problem in btrfs/027 using the latest misc-next branch. Thus to workaround the problem, I'm using the following diff, feel free to fold in to the offending patch. Thanks, Qu diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 831e6ae92940..26e2dceda1fc 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); - if (pg_index == 0 && use_append) - len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); - else - len = bio_add_page(bio, page, PAGE_SIZE, 0); + if (!submit) { + if (pg_index == 0 && use_append) + len = bio_add_zone_append_page(bio, page, + PAGE_SIZE, 0); + else + len = bio_add_page(bio, page, PAGE_SIZE, 0); + } page->mapping = NULL; if (submit || len < PAGE_SIZE) { > > We definitely need to make other bio_add_page() callers to use a better > helper, not only for later subpage, but also to make our lives easier. > > Thanks, > Qu
On Mon, May 24, 2021 at 09:04:40PM +0800, Qu Wenruo wrote: > > This is the interesting part, it means we are just one sector beyond the > > stripe boundary. > > Definitely a sign of changed bio submission timing. > > > > Just like the code: > > > > + if (pg_index == 0 && use_append) > > + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); > > + else > > + len = bio_add_page(bio, page, PAGE_SIZE, 0); > > + > > page->mapping = NULL; > > - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < > > - PAGE_SIZE) { > > + if (submit || len < PAGE_SIZE) { > > > > The code has changed the timing of bio_add_page(). > > > > Previously, if we have submit == true, we won't even try to call > > bio_add_page(). > > > > But now, we will add the page even we're already at the stripe boundary, > > thus it causes the extra sector being added to bio, and crosses stripe > > boundary. > > > > This part is already super tricky, thus I refactored > > submit_extent_page() to do a better job at stripe boundary calculation. > > BTW, I can also reproduce the problem in btrfs/027 using the latest > misc-next branch. > > Thus to workaround the problem, I'm using the following diff, feel free > to fold in to the offending patch. The patch is now in master so we'll need a proper fix.
On 23/05/2021 16:13, Josef Bacik wrote: > On 5/18/21 11:40 AM, Johannes Thumshirn wrote: >> When multiple processes write data to the same block group on a compressed >> zoned filesystem, the underlying device could report I/O errors and data >> corruption is possible. >> >> This happens because on a zoned file system, compressed data writes where >> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >> operation. But with REQ_OP_WRITE and parallel submission it cannot be >> guaranteed that the data is always submitted aligned to the underlying >> zone's write pointer. >> >> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >> filesystem is non intrusive on a regular file system or when submitting to >> a conventional zone on a zoned filesystem, as it is guarded by >> btrfs_use_zone_append. >> >> Reported-by: David Sterba <dsterba@suse.com> >> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > This one is causing panics with btrfs/027 with -o compress. I bisected it to > something else earlier, but it was still happening today and I bisected again > and this is what popped out. I also went the extra step to revert the patch as > I have already fucked this up once, and the problem didn't re-occur with this > patch reverted. The panic looks like this > This is on a regular block device or on a zoned block device? I guess it's a regular, but I can't see yet why.
On 24/05/2021 15:05, Qu Wenruo wrote: > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 831e6ae92940..26e2dceda1fc 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct > btrfs_inode *inode, u64 start, > submit = btrfs_bio_fits_in_stripe(page, > PAGE_SIZE, bio, > 0); > > - if (pg_index == 0 && use_append) > - len = bio_add_zone_append_page(bio, page, > PAGE_SIZE, 0); > - else > - len = bio_add_page(bio, page, PAGE_SIZE, 0); > + if (!submit) { > + if (pg_index == 0 && use_append) > + len = bio_add_zone_append_page(bio, page, > + > PAGE_SIZE, 0); > + else > + len = bio_add_page(bio, page, PAGE_SIZE, 0); > + } > > page->mapping = NULL; > if (submit || len < PAGE_SIZE) { This looks good, thanks for fixing it. I indeed messed up the bio_add_page() call in the new version when untangling that if (submit || bio_add_page()) construct. Can you send an official patch? Thanks a lot, Johannes
On 2021/5/18 下午11:40, Johannes Thumshirn wrote: > When multiple processes write data to the same block group on a compressed > zoned filesystem, the underlying device could report I/O errors and data > corruption is possible. > > This happens because on a zoned file system, compressed data writes where > sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND > operation. But with REQ_OP_WRITE and parallel submission it cannot be > guaranteed that the data is always submitted aligned to the underlying > zone's write pointer. > > The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned > filesystem is non intrusive on a regular file system or when submitting to > a conventional zone on a zoned filesystem, as it is guarded by > btrfs_use_zone_append. > > Reported-by: David Sterba <dsterba@suse.com> > Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Now working on compression support for subpage, just noticed some strange code behavior, I'm not sure if it's designed or just a typo. So please correct me if possible. [...] > > bio = btrfs_bio_alloc(first_byte); > - bio->bi_opf = REQ_OP_WRITE | write_flags; > + bio->bi_opf = bio_op | write_flags; > bio->bi_private = cb; > bio->bi_end_io = end_compressed_bio_write; > > + if (use_append) { > + struct extent_map *em; > + struct map_lookup *map; > + struct block_device *bdev; > + > + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); > + if (IS_ERR(em)) { > + kfree(cb); > + bio_put(bio); > + return BLK_STS_NOTSUPP; > + } > + > + map = em->map_lookup; > + /* We only support single profile for now */ > + ASSERT(map->num_stripes == 1); > + bdev = map->stripes[0].dev->bdev; > + > + bio_set_dev(bio, bdev); > + free_extent_map(em); > + } > + Here for the newly created bio, we will try to call bio_set_dev() for it. (although later patch refactor this part a little) So far so good. > if (blkcg_css) { > bio->bi_opf |= REQ_CGROUP_PUNT; > kthread_associate_blkcg(blkcg_css); > @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, > bytes_left = compressed_len; > for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { > int submit = 0; > + int len; > > page = compressed_pages[pg_index]; > page->mapping = inode->vfs_inode.i_mapping; > @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, > submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, > 0); > > + if (pg_index == 0 && use_append) > + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); > + else > + len = bio_add_page(bio, page, PAGE_SIZE, 0); > + > page->mapping = NULL; > - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < > - PAGE_SIZE) { > + if (submit || len < PAGE_SIZE) { > /* > * inc the count before we submit the bio so > * we know the end IO handler won't happen before > @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, > } > > bio = btrfs_bio_alloc(first_byte); > - bio->bi_opf = REQ_OP_WRITE | write_flags; > + bio->bi_opf = bio_op | write_flags; But here, for the newly allocated bio, we didn't call bio_set_dev() at all. Shouldn't all zoned write bio need that bio_set_dev() call? I guess since most compressed extents are pretty small, it's really hard to hit a case where we need to split the bio due to stripe boundary, thus very hard to hit anything wrong. Anyway, since I'm working on compression code to make compressed write to follow the same boundary check in extent_io.c, I can definitely refactor the bio allocation code to add the zoned needed calls. Thanks, Qu > bio->bi_private = cb; > bio->bi_end_io = end_compressed_bio_write; > if (blkcg_css) > bio->bi_opf |= REQ_CGROUP_PUNT; > + /* > + * Use bio_add_page() to ensure the bio has at least one > + * page. > + */ > bio_add_page(bio, page, PAGE_SIZE, 0); > } > if (bytes_left < PAGE_SIZE) { >
On 2021/06/10 16:28, Qu Wenruo wrote: > > > On 2021/5/18 下午11:40, Johannes Thumshirn wrote: >> When multiple processes write data to the same block group on a compressed >> zoned filesystem, the underlying device could report I/O errors and data >> corruption is possible. >> >> This happens because on a zoned file system, compressed data writes where >> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >> operation. But with REQ_OP_WRITE and parallel submission it cannot be >> guaranteed that the data is always submitted aligned to the underlying >> zone's write pointer. >> >> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >> filesystem is non intrusive on a regular file system or when submitting to >> a conventional zone on a zoned filesystem, as it is guarded by >> btrfs_use_zone_append. >> >> Reported-by: David Sterba <dsterba@suse.com> >> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Now working on compression support for subpage, just noticed some > strange code behavior, I'm not sure if it's designed or just a typo. > > So please correct me if possible. > > [...] >> >> bio = btrfs_bio_alloc(first_byte); >> - bio->bi_opf = REQ_OP_WRITE | write_flags; >> + bio->bi_opf = bio_op | write_flags; >> bio->bi_private = cb; >> bio->bi_end_io = end_compressed_bio_write; >> >> + if (use_append) { >> + struct extent_map *em; >> + struct map_lookup *map; >> + struct block_device *bdev; >> + >> + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); >> + if (IS_ERR(em)) { >> + kfree(cb); >> + bio_put(bio); >> + return BLK_STS_NOTSUPP; >> + } >> + >> + map = em->map_lookup; >> + /* We only support single profile for now */ >> + ASSERT(map->num_stripes == 1); >> + bdev = map->stripes[0].dev->bdev; This variable seems rather useless... >> + >> + bio_set_dev(bio, bdev); >> + free_extent_map(em); >> + } >> + > > Here for the newly created bio, we will try to call bio_set_dev() for > it. (although later patch refactor this part a little) > > So far so good. > >> if (blkcg_css) { >> bio->bi_opf |= REQ_CGROUP_PUNT; >> kthread_associate_blkcg(blkcg_css); >> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >> bytes_left = compressed_len; >> for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { >> int submit = 0; >> + int len; >> >> page = compressed_pages[pg_index]; >> page->mapping = inode->vfs_inode.i_mapping; >> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >> submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, >> 0); >> >> + if (pg_index == 0 && use_append) >> + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); >> + else >> + len = bio_add_page(bio, page, PAGE_SIZE, 0); >> + >> page->mapping = NULL; >> - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < >> - PAGE_SIZE) { >> + if (submit || len < PAGE_SIZE) { >> /* >> * inc the count before we submit the bio so >> * we know the end IO handler won't happen before >> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >> } >> >> bio = btrfs_bio_alloc(first_byte); >> - bio->bi_opf = REQ_OP_WRITE | write_flags; >> + bio->bi_opf = bio_op | write_flags; > > But here, for the newly allocated bio, we didn't call bio_set_dev() at all. > > Shouldn't all zoned write bio need that bio_set_dev() call? Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer oops here... Johannes ? > > I guess since most compressed extents are pretty small, it's really hard > to hit a case where we need to split the bio due to stripe boundary, > thus very hard to hit anything wrong. > > Anyway, since I'm working on compression code to make compressed write > to follow the same boundary check in extent_io.c, I can definitely > refactor the bio allocation code to add the zoned needed calls. > > Thanks, > Qu > >> bio->bi_private = cb; >> bio->bi_end_io = end_compressed_bio_write; >> if (blkcg_css) >> bio->bi_opf |= REQ_CGROUP_PUNT; >> + /* >> + * Use bio_add_page() to ensure the bio has at least one >> + * page. >> + */ >> bio_add_page(bio, page, PAGE_SIZE, 0); >> } >> if (bytes_left < PAGE_SIZE) { >> >
On 2021/6/10 下午3:36, Damien Le Moal wrote: > On 2021/06/10 16:28, Qu Wenruo wrote: >> >> >> On 2021/5/18 下午11:40, Johannes Thumshirn wrote: >>> When multiple processes write data to the same block group on a compressed >>> zoned filesystem, the underlying device could report I/O errors and data >>> corruption is possible. >>> >>> This happens because on a zoned file system, compressed data writes where >>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >>> operation. But with REQ_OP_WRITE and parallel submission it cannot be >>> guaranteed that the data is always submitted aligned to the underlying >>> zone's write pointer. >>> >>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >>> filesystem is non intrusive on a regular file system or when submitting to >>> a conventional zone on a zoned filesystem, as it is guarded by >>> btrfs_use_zone_append. >>> >>> Reported-by: David Sterba <dsterba@suse.com> >>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> Now working on compression support for subpage, just noticed some >> strange code behavior, I'm not sure if it's designed or just a typo. >> >> So please correct me if possible. >> >> [...] >>> >>> bio = btrfs_bio_alloc(first_byte); >>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>> + bio->bi_opf = bio_op | write_flags; >>> bio->bi_private = cb; >>> bio->bi_end_io = end_compressed_bio_write; >>> >>> + if (use_append) { >>> + struct extent_map *em; >>> + struct map_lookup *map; >>> + struct block_device *bdev; >>> + >>> + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); >>> + if (IS_ERR(em)) { >>> + kfree(cb); >>> + bio_put(bio); >>> + return BLK_STS_NOTSUPP; >>> + } >>> + >>> + map = em->map_lookup; >>> + /* We only support single profile for now */ >>> + ASSERT(map->num_stripes == 1); >>> + bdev = map->stripes[0].dev->bdev; > > This variable seems rather useless... No need to bother that, that has already been removed by later refactor. > >>> + >>> + bio_set_dev(bio, bdev); >>> + free_extent_map(em); >>> + } >>> + >> >> Here for the newly created bio, we will try to call bio_set_dev() for >> it. (although later patch refactor this part a little) >> >> So far so good. >> >>> if (blkcg_css) { >>> bio->bi_opf |= REQ_CGROUP_PUNT; >>> kthread_associate_blkcg(blkcg_css); >>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>> bytes_left = compressed_len; >>> for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { >>> int submit = 0; >>> + int len; >>> >>> page = compressed_pages[pg_index]; >>> page->mapping = inode->vfs_inode.i_mapping; >>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>> submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, >>> 0); >>> >>> + if (pg_index == 0 && use_append) >>> + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); >>> + else >>> + len = bio_add_page(bio, page, PAGE_SIZE, 0); >>> + >>> page->mapping = NULL; >>> - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < >>> - PAGE_SIZE) { >>> + if (submit || len < PAGE_SIZE) { >>> /* >>> * inc the count before we submit the bio so >>> * we know the end IO handler won't happen before >>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>> } >>> >>> bio = btrfs_bio_alloc(first_byte); >>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>> + bio->bi_opf = bio_op | write_flags; >> >> But here, for the newly allocated bio, we didn't call bio_set_dev() at all. >> >> Shouldn't all zoned write bio need that bio_set_dev() call? > > Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. > Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets > the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer > oops here... Johannes ? That's because it's really really rare/hard to have a compressed extent just lies at the stripe boundary. For most cases, the data we provide for compression tests is either: - Too compressible Thus the whole range can be compressed into just one sector, thus it will never cross stripe boundary. - Not compressible at all We fall back to regular buffered write, which will do their proper stripe boundary check correctly. Thus it's really near impossible to hit it in various tests. Thanks, Qu > >> >> I guess since most compressed extents are pretty small, it's really hard >> to hit a case where we need to split the bio due to stripe boundary, >> thus very hard to hit anything wrong. >> >> Anyway, since I'm working on compression code to make compressed write >> to follow the same boundary check in extent_io.c, I can definitely >> refactor the bio allocation code to add the zoned needed calls. >> >> Thanks, >> Qu >> >>> bio->bi_private = cb; >>> bio->bi_end_io = end_compressed_bio_write; >>> if (blkcg_css) >>> bio->bi_opf |= REQ_CGROUP_PUNT; >>> + /* >>> + * Use bio_add_page() to ensure the bio has at least one >>> + * page. >>> + */ >>> bio_add_page(bio, page, PAGE_SIZE, 0); >>> } >>> if (bytes_left < PAGE_SIZE) { >>> >> > >
On 2021/06/10 16:41, Qu Wenruo wrote: > > > On 2021/6/10 下午3:36, Damien Le Moal wrote: >> On 2021/06/10 16:28, Qu Wenruo wrote: >>> >>> >>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote: >>>> When multiple processes write data to the same block group on a compressed >>>> zoned filesystem, the underlying device could report I/O errors and data >>>> corruption is possible. >>>> >>>> This happens because on a zoned file system, compressed data writes where >>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be >>>> guaranteed that the data is always submitted aligned to the underlying >>>> zone's write pointer. >>>> >>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >>>> filesystem is non intrusive on a regular file system or when submitting to >>>> a conventional zone on a zoned filesystem, as it is guarded by >>>> btrfs_use_zone_append. >>>> >>>> Reported-by: David Sterba <dsterba@suse.com> >>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>> >>> Now working on compression support for subpage, just noticed some >>> strange code behavior, I'm not sure if it's designed or just a typo. >>> >>> So please correct me if possible. >>> >>> [...] >>>> >>>> bio = btrfs_bio_alloc(first_byte); >>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>> + bio->bi_opf = bio_op | write_flags; >>>> bio->bi_private = cb; >>>> bio->bi_end_io = end_compressed_bio_write; >>>> >>>> + if (use_append) { >>>> + struct extent_map *em; >>>> + struct map_lookup *map; >>>> + struct block_device *bdev; >>>> + >>>> + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); >>>> + if (IS_ERR(em)) { >>>> + kfree(cb); >>>> + bio_put(bio); >>>> + return BLK_STS_NOTSUPP; >>>> + } >>>> + >>>> + map = em->map_lookup; >>>> + /* We only support single profile for now */ >>>> + ASSERT(map->num_stripes == 1); >>>> + bdev = map->stripes[0].dev->bdev; >> >> This variable seems rather useless... > > No need to bother that, that has already been removed by later refactor. > >> >>>> + >>>> + bio_set_dev(bio, bdev); >>>> + free_extent_map(em); >>>> + } >>>> + >>> >>> Here for the newly created bio, we will try to call bio_set_dev() for >>> it. (although later patch refactor this part a little) >>> >>> So far so good. >>> >>>> if (blkcg_css) { >>>> bio->bi_opf |= REQ_CGROUP_PUNT; >>>> kthread_associate_blkcg(blkcg_css); >>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>> bytes_left = compressed_len; >>>> for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { >>>> int submit = 0; >>>> + int len; >>>> >>>> page = compressed_pages[pg_index]; >>>> page->mapping = inode->vfs_inode.i_mapping; >>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>> submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, >>>> 0); >>>> >>>> + if (pg_index == 0 && use_append) >>>> + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); >>>> + else >>>> + len = bio_add_page(bio, page, PAGE_SIZE, 0); >>>> + >>>> page->mapping = NULL; >>>> - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < >>>> - PAGE_SIZE) { >>>> + if (submit || len < PAGE_SIZE) { >>>> /* >>>> * inc the count before we submit the bio so >>>> * we know the end IO handler won't happen before >>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>> } >>>> >>>> bio = btrfs_bio_alloc(first_byte); >>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>> + bio->bi_opf = bio_op | write_flags; >>> >>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all. >>> >>> Shouldn't all zoned write bio need that bio_set_dev() call? >> >> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. >> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets >> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer >> oops here... Johannes ? > > That's because it's really really rare/hard to have a compressed extent > just lies at the stripe boundary. > > For most cases, the data we provide for compression tests is either: > - Too compressible > Thus the whole range can be compressed into just one sector, thus > it will never cross stripe boundary. > > - Not compressible at all > We fall back to regular buffered write, which will do their proper > stripe boundary check correctly. > > Thus it's really near impossible to hit it in various tests. But this is a data write, isn't it ? So in the zoned case, it means a zone append write. And adding a page for even a single sector using bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of the bio size... Am I misunderstanding something here about this IO path ? > > Thanks, > Qu > >> >>> >>> I guess since most compressed extents are pretty small, it's really hard >>> to hit a case where we need to split the bio due to stripe boundary, >>> thus very hard to hit anything wrong. >>> >>> Anyway, since I'm working on compression code to make compressed write >>> to follow the same boundary check in extent_io.c, I can definitely >>> refactor the bio allocation code to add the zoned needed calls. >>> >>> Thanks, >>> Qu >>> >>>> bio->bi_private = cb; >>>> bio->bi_end_io = end_compressed_bio_write; >>>> if (blkcg_css) >>>> bio->bi_opf |= REQ_CGROUP_PUNT; >>>> + /* >>>> + * Use bio_add_page() to ensure the bio has at least one >>>> + * page. >>>> + */ >>>> bio_add_page(bio, page, PAGE_SIZE, 0); >>>> } >>>> if (bytes_left < PAGE_SIZE) { >>>> >>> >> >> >
On 2021/6/10 下午3:45, Damien Le Moal wrote: > On 2021/06/10 16:41, Qu Wenruo wrote: >> >> >> On 2021/6/10 下午3:36, Damien Le Moal wrote: >>> On 2021/06/10 16:28, Qu Wenruo wrote: >>>> >>>> >>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote: >>>>> When multiple processes write data to the same block group on a compressed >>>>> zoned filesystem, the underlying device could report I/O errors and data >>>>> corruption is possible. >>>>> >>>>> This happens because on a zoned file system, compressed data writes where >>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be >>>>> guaranteed that the data is always submitted aligned to the underlying >>>>> zone's write pointer. >>>>> >>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >>>>> filesystem is non intrusive on a regular file system or when submitting to >>>>> a conventional zone on a zoned filesystem, as it is guarded by >>>>> btrfs_use_zone_append. >>>>> >>>>> Reported-by: David Sterba <dsterba@suse.com> >>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>> >>>> Now working on compression support for subpage, just noticed some >>>> strange code behavior, I'm not sure if it's designed or just a typo. >>>> >>>> So please correct me if possible. >>>> >>>> [...] >>>>> >>>>> bio = btrfs_bio_alloc(first_byte); >>>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>>> + bio->bi_opf = bio_op | write_flags; >>>>> bio->bi_private = cb; >>>>> bio->bi_end_io = end_compressed_bio_write; >>>>> >>>>> + if (use_append) { >>>>> + struct extent_map *em; >>>>> + struct map_lookup *map; >>>>> + struct block_device *bdev; >>>>> + >>>>> + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); >>>>> + if (IS_ERR(em)) { >>>>> + kfree(cb); >>>>> + bio_put(bio); >>>>> + return BLK_STS_NOTSUPP; >>>>> + } >>>>> + >>>>> + map = em->map_lookup; >>>>> + /* We only support single profile for now */ >>>>> + ASSERT(map->num_stripes == 1); >>>>> + bdev = map->stripes[0].dev->bdev; >>> >>> This variable seems rather useless... >> >> No need to bother that, that has already been removed by later refactor. >> >>> >>>>> + >>>>> + bio_set_dev(bio, bdev); >>>>> + free_extent_map(em); >>>>> + } >>>>> + >>>> >>>> Here for the newly created bio, we will try to call bio_set_dev() for >>>> it. (although later patch refactor this part a little) >>>> >>>> So far so good. >>>> >>>>> if (blkcg_css) { >>>>> bio->bi_opf |= REQ_CGROUP_PUNT; >>>>> kthread_associate_blkcg(blkcg_css); >>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> bytes_left = compressed_len; >>>>> for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { >>>>> int submit = 0; >>>>> + int len; >>>>> >>>>> page = compressed_pages[pg_index]; >>>>> page->mapping = inode->vfs_inode.i_mapping; >>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, >>>>> 0); >>>>> >>>>> + if (pg_index == 0 && use_append) >>>>> + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); >>>>> + else >>>>> + len = bio_add_page(bio, page, PAGE_SIZE, 0); >>>>> + >>>>> page->mapping = NULL; >>>>> - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < >>>>> - PAGE_SIZE) { >>>>> + if (submit || len < PAGE_SIZE) { >>>>> /* >>>>> * inc the count before we submit the bio so >>>>> * we know the end IO handler won't happen before >>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> } >>>>> >>>>> bio = btrfs_bio_alloc(first_byte); >>>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>>> + bio->bi_opf = bio_op | write_flags; >>>> >>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all. >>>> >>>> Shouldn't all zoned write bio need that bio_set_dev() call? >>> >>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. >>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets >>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer >>> oops here... Johannes ? >> >> That's because it's really really rare/hard to have a compressed extent >> just lies at the stripe boundary. >> >> For most cases, the data we provide for compression tests is either: >> - Too compressible >> Thus the whole range can be compressed into just one sector, thus >> it will never cross stripe boundary. >> >> - Not compressible at all >> We fall back to regular buffered write, which will do their proper >> stripe boundary check correctly. >> >> Thus it's really near impossible to hit it in various tests. > > But this is a data write, isn't it ? So in the zoned case, it means a zone > append write. And adding a page for even a single sector using > bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of > the bio size... Am I misunderstanding something here about this IO path ? The bio can only be allocated when we have a write range crossing device stripe boundary. Which is already a rare case for compressed write. And the initial bio always has the correct setup, thus we have to hit the corner case to get into this situation. Finally I find that since zoned support doesn't support multi-device profile yet, there is no limitation for stripe length, thus we won't get into this branch at all. So my bad, false alert... Thanks, Qu > >> >> Thanks, >> Qu >> >>> >>>> >>>> I guess since most compressed extents are pretty small, it's really hard >>>> to hit a case where we need to split the bio due to stripe boundary, >>>> thus very hard to hit anything wrong. >>>> >>>> Anyway, since I'm working on compression code to make compressed write >>>> to follow the same boundary check in extent_io.c, I can definitely >>>> refactor the bio allocation code to add the zoned needed calls. >>>> >>>> Thanks, >>>> Qu >>>> >>>>> bio->bi_private = cb; >>>>> bio->bi_end_io = end_compressed_bio_write; >>>>> if (blkcg_css) >>>>> bio->bi_opf |= REQ_CGROUP_PUNT; >>>>> + /* >>>>> + * Use bio_add_page() to ensure the bio has at least one >>>>> + * page. >>>>> + */ >>>>> bio_add_page(bio, page, PAGE_SIZE, 0); >>>>> } >>>>> if (bytes_left < PAGE_SIZE) { >>>>> >>>> >>> >>> >> > >
On 10/06/2021 09:45, Damien Le Moal wrote: > On 2021/06/10 16:41, Qu Wenruo wrote: >> >> >> On 2021/6/10 下午3:36, Damien Le Moal wrote: >>> On 2021/06/10 16:28, Qu Wenruo wrote: >>>> >>>> >>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote: >>>>> When multiple processes write data to the same block group on a compressed >>>>> zoned filesystem, the underlying device could report I/O errors and data >>>>> corruption is possible. >>>>> >>>>> This happens because on a zoned file system, compressed data writes where >>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND >>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be >>>>> guaranteed that the data is always submitted aligned to the underlying >>>>> zone's write pointer. >>>>> >>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned >>>>> filesystem is non intrusive on a regular file system or when submitting to >>>>> a conventional zone on a zoned filesystem, as it is guarded by >>>>> btrfs_use_zone_append. >>>>> >>>>> Reported-by: David Sterba <dsterba@suse.com> >>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") >>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>> >>>> Now working on compression support for subpage, just noticed some >>>> strange code behavior, I'm not sure if it's designed or just a typo. >>>> >>>> So please correct me if possible. >>>> >>>> [...] >>>>> >>>>> bio = btrfs_bio_alloc(first_byte); >>>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>>> + bio->bi_opf = bio_op | write_flags; >>>>> bio->bi_private = cb; >>>>> bio->bi_end_io = end_compressed_bio_write; >>>>> >>>>> + if (use_append) { >>>>> + struct extent_map *em; >>>>> + struct map_lookup *map; >>>>> + struct block_device *bdev; >>>>> + >>>>> + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); >>>>> + if (IS_ERR(em)) { >>>>> + kfree(cb); >>>>> + bio_put(bio); >>>>> + return BLK_STS_NOTSUPP; >>>>> + } >>>>> + >>>>> + map = em->map_lookup; >>>>> + /* We only support single profile for now */ >>>>> + ASSERT(map->num_stripes == 1); >>>>> + bdev = map->stripes[0].dev->bdev; >>> >>> This variable seems rather useless... >> >> No need to bother that, that has already been removed by later refactor. >> >>> >>>>> + >>>>> + bio_set_dev(bio, bdev); >>>>> + free_extent_map(em); >>>>> + } >>>>> + >>>> >>>> Here for the newly created bio, we will try to call bio_set_dev() for >>>> it. (although later patch refactor this part a little) >>>> >>>> So far so good. >>>> >>>>> if (blkcg_css) { >>>>> bio->bi_opf |= REQ_CGROUP_PUNT; >>>>> kthread_associate_blkcg(blkcg_css); >>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> bytes_left = compressed_len; >>>>> for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { >>>>> int submit = 0; >>>>> + int len; >>>>> >>>>> page = compressed_pages[pg_index]; >>>>> page->mapping = inode->vfs_inode.i_mapping; >>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, >>>>> 0); >>>>> >>>>> + if (pg_index == 0 && use_append) >>>>> + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); >>>>> + else >>>>> + len = bio_add_page(bio, page, PAGE_SIZE, 0); >>>>> + >>>>> page->mapping = NULL; >>>>> - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < >>>>> - PAGE_SIZE) { >>>>> + if (submit || len < PAGE_SIZE) { >>>>> /* >>>>> * inc the count before we submit the bio so >>>>> * we know the end IO handler won't happen before >>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, >>>>> } >>>>> >>>>> bio = btrfs_bio_alloc(first_byte); >>>>> - bio->bi_opf = REQ_OP_WRITE | write_flags; >>>>> + bio->bi_opf = bio_op | write_flags; >>>> >>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all. >>>> >>>> Shouldn't all zoned write bio need that bio_set_dev() call? >>> >>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. >>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets >>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer >>> oops here... Johannes ? >> >> That's because it's really really rare/hard to have a compressed extent >> just lies at the stripe boundary. >> >> For most cases, the data we provide for compression tests is either: >> - Too compressible >> Thus the whole range can be compressed into just one sector, thus >> it will never cross stripe boundary. >> >> - Not compressible at all >> We fall back to regular buffered write, which will do their proper >> stripe boundary check correctly. >> >> Thus it's really near impossible to hit it in various tests. > > But this is a data write, isn't it ? So in the zoned case, it means a zone > append write. And adding a page for even a single sector using > bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of > the bio size... Am I misunderstanding something here about this IO path ? Because we're not using bio_add_zone_append_page() but bio_add_page() as we only add one page to a newly allocated bio. It's a bit complicated. >>>>> + /* >>>>> + * Use bio_add_page() to ensure the bio has at least one >>>>> + * page. >>>>> + */ >>>>> bio_add_page(bio, page, PAGE_SIZE, 0); >>>>> } >>>>> if (bytes_left < PAGE_SIZE) { >>>>> >>>> >>> >>> >> > >
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 2bea01d23a5b..f224c35de5d8 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -28,6 +28,7 @@ #include "compression.h" #include "extent_io.h" #include "extent_map.h" +#include "zoned.h" static const char* const btrfs_compress_types[] = { "", "zlib", "lzo", "zstd" }; @@ -349,6 +350,7 @@ static void end_compressed_bio_write(struct bio *bio) */ inode = cb->inode; cb->compressed_pages[0]->mapping = cb->inode->i_mapping; + btrfs_record_physical_zoned(inode, cb->start, bio); btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0], cb->start, cb->start + cb->len - 1, bio->bi_status == BLK_STS_OK); @@ -401,6 +403,9 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, u64 first_byte = disk_start; blk_status_t ret; int skip_sum = inode->flags & BTRFS_INODE_NODATASUM; + const bool use_append = btrfs_use_zone_append(inode, disk_start); + const unsigned int bio_op = + use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE; WARN_ON(!PAGE_ALIGNED(start)); cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS); @@ -418,10 +423,31 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, cb->nr_pages = nr_pages; bio = btrfs_bio_alloc(first_byte); - bio->bi_opf = REQ_OP_WRITE | write_flags; + bio->bi_opf = bio_op | write_flags; bio->bi_private = cb; bio->bi_end_io = end_compressed_bio_write; + if (use_append) { + struct extent_map *em; + struct map_lookup *map; + struct block_device *bdev; + + em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE); + if (IS_ERR(em)) { + kfree(cb); + bio_put(bio); + return BLK_STS_NOTSUPP; + } + + map = em->map_lookup; + /* We only support single profile for now */ + ASSERT(map->num_stripes == 1); + bdev = map->stripes[0].dev->bdev; + + bio_set_dev(bio, bdev); + free_extent_map(em); + } + if (blkcg_css) { bio->bi_opf |= REQ_CGROUP_PUNT; kthread_associate_blkcg(blkcg_css); @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, bytes_left = compressed_len; for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { int submit = 0; + int len; page = compressed_pages[pg_index]; page->mapping = inode->vfs_inode.i_mapping; @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); + if (pg_index == 0 && use_append) + len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); + else + len = bio_add_page(bio, page, PAGE_SIZE, 0); + page->mapping = NULL; - if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < - PAGE_SIZE) { + if (submit || len < PAGE_SIZE) { /* * inc the count before we submit the bio so * we know the end IO handler won't happen before @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, } bio = btrfs_bio_alloc(first_byte); - bio->bi_opf = REQ_OP_WRITE | write_flags; + bio->bi_opf = bio_op | write_flags; bio->bi_private = cb; bio->bi_end_io = end_compressed_bio_write; if (blkcg_css) bio->bi_opf |= REQ_CGROUP_PUNT; + /* + * Use bio_add_page() to ensure the bio has at least one + * page. + */ bio_add_page(bio, page, PAGE_SIZE, 0); } if (bytes_left < PAGE_SIZE) {
When multiple processes write data to the same block group on a compressed zoned filesystem, the underlying device could report I/O errors and data corruption is possible. This happens because on a zoned file system, compressed data writes where sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel submission it cannot be guaranteed that the data is always submitted aligned to the underlying zone's write pointer. The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned filesystem is non intrusive on a regular file system or when submitting to a conventional zone on a zoned filesystem, as it is guarded by btrfs_use_zone_append. Reported-by: David Sterba <dsterba@suse.com> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> --- fs/btrfs/compression.c | 43 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 39 insertions(+), 4 deletions(-)