[v2,2/3] btrfs: zoned: fix compressed writes

Message ID	1220142568f5b5f0d06fbc3ee28a08060afc0a53.1621351444.git.johannes.thumshirn@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> IronPort-SDR: iDP8oq4r48gcEZr/h8sZvr8zcToLZy7W3anrnrAI7NRi22qXpz9F++DYw3OhqykjDti4h9iL41 Xqp0Lt/PzzkOYqdoOrOSKca8IVSiUvKjUp2JMFXN32WS4W+APFh67ZoJP0cZ0wTYRt2hGgKSM9 LvvZ6idFhWumwUzqfHruWTRTIU3K9kV0fWdGfrokE5bCyCDDVsXNkPWtTtsX888SK0VrFZMM+H gH0tkr5319UtObrl5fCUEH2364j45Phsf4RNghsE3DgfEyCV5KglJ3Fg+CKgys/7QXLPgs1/tS fHc= IronPort-SDR: jaRc2Zvuy+KqqIRZ0uX4hxTGPFjBW+IMqUJaj8aJqH9ZlSjb4IQ+isW2CqyKpxkKwwRjA5mQHf W+J1nsR1D5f+JndTBIOvZAvbdyekNAwiXVmcfCZa7bJqfwPVxmre9mlU/IMmR98bt1dHls4M3J 6YH6oDLRjZNl04sjJqeVsC3KPkxUROZbQ51ZKrEB6eW3dWXOJkSF+/rsx4Xnb6EzXp+YzlKq3a Lc+6mm3Gl7CmYve/2NA2NqhRNbneeAhFoDgAJjPUi3npKiBjCht4yl66gOgXGY1n1RujtmIh1X 8gVn3RYYxVxA+zshvfyl6nEO IronPort-SDR: yIJxCe7BO8FuD3BnkCs0wTA+tSgW9Lj4VXUq/ifhfxkmbodSgzm+SZNWlhmZ0VXO50uB2VrYun G83ReweSVKNfsg6r1RmqBN6hVSJSXcVbgt8DXuXi5XWWG461gYO2AcrpVzIVOc2+pc2iCX7q3w JaF9FWMWm0GjqknwO3jUZRwH6ngYLwXSnDeQ+6A0MVcPhtCe7HOUrepkZ4OPpLv6oxPN5rUe0F D1tWNHHVlOyDiELAf+HOad88c5yyhTeXv33WQFkktybK7le4bhbb7g7FOHKae2yKmK/MIv7PdS zIs= WDCIronportException: Internal From: Johannes Thumshirn <johannes.thumshirn@wdc.com> To: David Sterba <dsterba@suse.com>, linux-btrfs@vger.kernel.org Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Subject: [PATCH v2 2/3] btrfs: zoned: fix compressed writes Date: Wed, 19 May 2021 00:40:28 +0900 Message-Id: <1220142568f5b5f0d06fbc3ee28a08060afc0a53.1621351444.git.johannes.thumshirn@wdc.com> In-Reply-To: <cover.1621351444.git.johannes.thumshirn@wdc.com> References: <cover.1621351444.git.johannes.thumshirn@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: zoned: fix writes on a compressed zoned filesystem \| expand [v2,0/3] btrfs: zoned: fix writes on a compressed zoned filesystem [v2,1/3] btrfs: zoned: pass start block to btrfs_use_zone_append [v2,2/3] btrfs: zoned: fix compressed writes [v2,3/3] btrfs: zoned: factor out zoned device lookup

Johannes Thumshirn May 18, 2021, 3:40 p.m. UTC

When multiple processes write data to the same block group on a compressed
zoned filesystem, the underlying device could report I/O errors and data
corruption is possible.

This happens because on a zoned file system, compressed data writes where
sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
operation. But with REQ_OP_WRITE and parallel submission it cannot be
guaranteed that the data is always submitted aligned to the underlying
zone's write pointer.

The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
filesystem is non intrusive on a regular file system or when submitting to
a conventional zone on a zoned filesystem, as it is guarded by
btrfs_use_zone_append.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/compression.c | 43 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

Josef Bacik May 23, 2021, 2:13 p.m. UTC | #1

On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
> When multiple processes write data to the same block group on a compressed
> zoned filesystem, the underlying device could report I/O errors and data
> corruption is possible.
> 
> This happens because on a zoned file system, compressed data writes where
> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
> operation. But with REQ_OP_WRITE and parallel submission it cannot be
> guaranteed that the data is always submitted aligned to the underlying
> zone's write pointer.
> 
> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
> filesystem is non intrusive on a regular file system or when submitting to
> a conventional zone on a zoned filesystem, as it is guarded by
> btrfs_use_zone_append.
> 
> Reported-by: David Sterba <dsterba@suse.com>
> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

This one is causing panics with btrfs/027 with -o compress.  I bisected it to 
something else earlier, but it was still happening today and I bisected again 
and this is what popped out.  I also went the extra step to revert the patch as 
I have already fucked this up once, and the problem didn't re-occur with this 
patch reverted.  The panic looks like this

May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping failed 
logical 22429696 bio len 53248 len 49152
May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------
May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643!
May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI
May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4 Not 
tainted 5.13.0-rc2+ #240
May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 
2009), BIOS 1.13.0-2.fc32 04/01/2014
May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc btrfs_work_helper
May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a
May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff 48 83 
c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db 62 96 e8 47 83 
ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49 8b 7a 50 44 89 f2
May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS: 00010282
May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX: 0000000000000000 
RCX: 0000000000000000
May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI: ffff9b9a7bd18e10 
RDI: ffff9b9a7bd18e10
May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08: 0000000000000000 
R09: 0000000000000000
May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11: ffffffff96973748 
R12: 0000000000000000
May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14: 000000000000d000 
R15: 0000000001564000
May 22 00:33:16 xfstests2 kernel: FS:  0000000000000000(0000) 
GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000
May 22 00:33:16 xfstests2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3: 000000013943a005 
CR4: 0000000000370ee0
May 22 00:33:16 xfstests2 kernel: Call Trace:
May 22 00:33:16 xfstests2 kernel:  btrfs_submit_compressed_write+0x2d7/0x470
May 22 00:33:16 xfstests2 kernel:  submit_compressed_extents+0x364/0x420
May 22 00:33:16 xfstests2 kernel:  ? lock_acquire+0x15d/0x380
May 22 00:33:16 xfstests2 kernel:  ? lock_release+0x1cd/0x2a0
May 22 00:33:16 xfstests2 kernel:  ? submit_compressed_extents+0x420/0x420
May 22 00:33:16 xfstests2 kernel:  btrfs_work_helper+0x133/0x520
May 22 00:33:16 xfstests2 kernel:  process_one_work+0x26b/0x570
May 22 00:33:16 xfstests2 kernel:  worker_thread+0x55/0x3c0
May 22 00:33:16 xfstests2 kernel:  ? process_one_work+0x570/0x570
May 22 00:33:16 xfstests2 kernel:  kthread+0x134/0x150
May 22 00:33:16 xfstests2 kernel:  ? __kthread_bind_mask+0x60/0x60
May 22 00:33:16 xfstests2 kernel:  ret_from_fork+0x1f/0x30

Thanks,

Josef

Qu Wenruo May 23, 2021, 11:09 p.m. UTC | #2

On 2021/5/23 下午10:13, Josef Bacik wrote:
> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a
>> compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when
>> submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> This one is causing panics with btrfs/027 with -o compress.  I bisected
> it to something else earlier, but it was still happening today and I
> bisected again and this is what popped out.  I also went the extra step
> to revert the patch as I have already fucked this up once, and the
> problem didn't re-occur with this patch reverted.  The panic looks like
> this
>
> May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping
> failed logical 22429696 bio len 53248 len 49152

This is the interesting part, it means we are just one sector beyond the
stripe boundary.
Definitely a sign of changed bio submission timing.

Just like the code:

+		if (pg_index == 0 && use_append)
+			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
+		else
+			len = bio_add_page(bio, page, PAGE_SIZE, 0);
+
  		page->mapping = NULL;
-		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
-		    PAGE_SIZE) {
+		if (submit || len < PAGE_SIZE) {

The code has changed the timing of bio_add_page().

Previously, if we have submit == true, we won't even try to call
bio_add_page().

But now, we will add the page even we're already at the stripe boundary,
thus it causes the extra sector being added to bio, and crosses stripe
boundary.

This part is already super tricky, thus I refactored
submit_extent_page() to do a better job at stripe boundary calculation.

We definitely need to make other bio_add_page() callers to use a better
helper, not only for later subpage, but also to make our lives easier.

Thanks,
Qu
> May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------
> May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643!
> May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI
> May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4
> Not tainted 5.13.0-rc2+ #240
> May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 +
> ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
> May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc
> btrfs_work_helper
> May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a
> May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff
> 48 83 c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db
> 62 96 e8 47 83 ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49
> 8b 7a 50 44 89 f2
> May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS:
> 00010282
> May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX:
> 0000000000000000 RCX: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI:
> ffff9b9a7bd18e10 RDI: ffff9b9a7bd18e10
> May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08:
> 0000000000000000 R09: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11:
> ffffffff96973748 R12: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14:
> 000000000000d000 R15: 0000000001564000
> May 22 00:33:16 xfstests2 kernel: FS:  0000000000000000(0000)
> GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000
> May 22 00:33:16 xfstests2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3:
> 000000013943a005 CR4: 0000000000370ee0
> May 22 00:33:16 xfstests2 kernel: Call Trace:
> May 22 00:33:16 xfstests2 kernel:
> btrfs_submit_compressed_write+0x2d7/0x470
> May 22 00:33:16 xfstests2 kernel:  submit_compressed_extents+0x364/0x420
> May 22 00:33:16 xfstests2 kernel:  ? lock_acquire+0x15d/0x380
> May 22 00:33:16 xfstests2 kernel:  ? lock_release+0x1cd/0x2a0
> May 22 00:33:16 xfstests2 kernel:  ? submit_compressed_extents+0x420/0x420
> May 22 00:33:16 xfstests2 kernel:  btrfs_work_helper+0x133/0x520
> May 22 00:33:16 xfstests2 kernel:  process_one_work+0x26b/0x570
> May 22 00:33:16 xfstests2 kernel:  worker_thread+0x55/0x3c0
> May 22 00:33:16 xfstests2 kernel:  ? process_one_work+0x570/0x570
> May 22 00:33:16 xfstests2 kernel:  kthread+0x134/0x150
> May 22 00:33:16 xfstests2 kernel:  ? __kthread_bind_mask+0x60/0x60
> May 22 00:33:16 xfstests2 kernel:  ret_from_fork+0x1f/0x30
>
> Thanks,
>
> Josef

Qu Wenruo May 24, 2021, 1:04 p.m. UTC | #3

On 2021/5/24 上午7:09, Qu Wenruo wrote:
>
>
> On 2021/5/23 下午10:13, Josef Bacik wrote:
>> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>>> When multiple processes write data to the same block group on a
>>> compressed
>>> zoned filesystem, the underlying device could report I/O errors and data
>>> corruption is possible.
>>>
>>> This happens because on a zoned file system, compressed data writes
>>> where
>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>> guaranteed that the data is always submitted aligned to the underlying
>>> zone's write pointer.
>>>
>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
>>> zoned
>>> filesystem is non intrusive on a regular file system or when
>>> submitting to
>>> a conventional zone on a zoned filesystem, as it is guarded by
>>> btrfs_use_zone_append.
>>>
>>> Reported-by: David Sterba <dsterba@suse.com>
>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat
>>> flag")
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> This one is causing panics with btrfs/027 with -o compress.  I bisected
>> it to something else earlier, but it was still happening today and I
>> bisected again and this is what popped out.  I also went the extra step
>> to revert the patch as I have already fucked this up once, and the
>> problem didn't re-occur with this patch reverted.  The panic looks like
>> this
>>
>> May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping
>> failed logical 22429696 bio len 53248 len 49152
>
> This is the interesting part, it means we are just one sector beyond the
> stripe boundary.
> Definitely a sign of changed bio submission timing.
>
> Just like the code:
>
> +        if (pg_index == 0 && use_append)
> +            len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> +        else
> +            len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +
>           page->mapping = NULL;
> -        if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> -            PAGE_SIZE) {
> +        if (submit || len < PAGE_SIZE) {
>
> The code has changed the timing of bio_add_page().
>
> Previously, if we have submit == true, we won't even try to call
> bio_add_page().
>
> But now, we will add the page even we're already at the stripe boundary,
> thus it causes the extra sector being added to bio, and crosses stripe
> boundary.
>
> This part is already super tricky, thus I refactored
> submit_extent_page() to do a better job at stripe boundary calculation.

BTW, I can also reproduce the problem in btrfs/027 using the latest
misc-next branch.

Thus to workaround the problem, I'm using the following diff, feel free
to fold in to the offending patch.

Thanks,
Qu

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 831e6ae92940..26e2dceda1fc 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct
btrfs_inode *inode, u64 start,
                         submit = btrfs_bio_fits_in_stripe(page,
PAGE_SIZE, bio,
                                                           0);

-               if (pg_index == 0 && use_append)
-                       len = bio_add_zone_append_page(bio, page,
PAGE_SIZE, 0);
-               else
-                       len = bio_add_page(bio, page, PAGE_SIZE, 0);
+               if (!submit) {
+                       if (pg_index == 0 && use_append)
+                               len = bio_add_zone_append_page(bio, page,
+
PAGE_SIZE, 0);
+                       else
+                               len = bio_add_page(bio, page, PAGE_SIZE, 0);
+               }

                 page->mapping = NULL;
                 if (submit || len < PAGE_SIZE) {

>
> We definitely need to make other bio_add_page() callers to use a better
> helper, not only for later subpage, but also to make our lives easier.
>
> Thanks,
> Qu

David Sterba May 24, 2021, 1:30 p.m. UTC | #4

On Mon, May 24, 2021 at 09:04:40PM +0800, Qu Wenruo wrote:
> > This is the interesting part, it means we are just one sector beyond the
> > stripe boundary.
> > Definitely a sign of changed bio submission timing.
> >
> > Just like the code:
> >
> > +        if (pg_index == 0 && use_append)
> > +            len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> > +        else
> > +            len = bio_add_page(bio, page, PAGE_SIZE, 0);
> > +
> >           page->mapping = NULL;
> > -        if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> > -            PAGE_SIZE) {
> > +        if (submit || len < PAGE_SIZE) {
> >
> > The code has changed the timing of bio_add_page().
> >
> > Previously, if we have submit == true, we won't even try to call
> > bio_add_page().
> >
> > But now, we will add the page even we're already at the stripe boundary,
> > thus it causes the extra sector being added to bio, and crosses stripe
> > boundary.
> >
> > This part is already super tricky, thus I refactored
> > submit_extent_page() to do a better job at stripe boundary calculation.
> 
> BTW, I can also reproduce the problem in btrfs/027 using the latest
> misc-next branch.
> 
> Thus to workaround the problem, I'm using the following diff, feel free
> to fold in to the offending patch.

The patch is now in master so we'll need a proper fix.

Johannes Thumshirn May 25, 2021, 5:46 a.m. UTC | #5

On 23/05/2021 16:13, Josef Bacik wrote:
> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> This one is causing panics with btrfs/027 with -o compress.  I bisected it to 
> something else earlier, but it was still happening today and I bisected again 
> and this is what popped out.  I also went the extra step to revert the patch as 
> I have already fucked this up once, and the problem didn't re-occur with this 
> patch reverted.  The panic looks like this
> 
This is on a regular block device or on a zoned block device? I guess it's a 
regular, but I can't see yet why.

Johannes Thumshirn May 25, 2021, 6:31 a.m. UTC | #6

On 24/05/2021 15:05, Qu Wenruo wrote:
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 831e6ae92940..26e2dceda1fc 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct
> btrfs_inode *inode, u64 start,
>                          submit = btrfs_bio_fits_in_stripe(page,
> PAGE_SIZE, bio,
>                                                            0);
> 
> -               if (pg_index == 0 && use_append)
> -                       len = bio_add_zone_append_page(bio, page,
> PAGE_SIZE, 0);
> -               else
> -                       len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +               if (!submit) {
> +                       if (pg_index == 0 && use_append)
> +                               len = bio_add_zone_append_page(bio, page,
> +
> PAGE_SIZE, 0);
> +                       else
> +                               len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +               }
> 
>                  page->mapping = NULL;
>                  if (submit || len < PAGE_SIZE) {

This looks good, thanks for fixing it. I indeed messed up the bio_add_page() 
call in the new version when untangling that if (submit || bio_add_page())
construct. 

Can you send an official patch?

Thanks a lot,
	Johannes

Qu Wenruo June 10, 2021, 7:27 a.m. UTC | #7

On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
> When multiple processes write data to the same block group on a compressed
> zoned filesystem, the underlying device could report I/O errors and data
> corruption is possible.
>
> This happens because on a zoned file system, compressed data writes where
> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
> operation. But with REQ_OP_WRITE and parallel submission it cannot be
> guaranteed that the data is always submitted aligned to the underlying
> zone's write pointer.
>
> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
> filesystem is non intrusive on a regular file system or when submitting to
> a conventional zone on a zoned filesystem, as it is guarded by
> btrfs_use_zone_append.
>
> Reported-by: David Sterba <dsterba@suse.com>
> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Now working on compression support for subpage, just noticed some
strange code behavior, I'm not sure if it's designed or just a typo.

So please correct me if possible.

[...]
>
>   	bio = btrfs_bio_alloc(first_byte);
> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
> +	bio->bi_opf = bio_op | write_flags;
>   	bio->bi_private = cb;
>   	bio->bi_end_io = end_compressed_bio_write;
>
> +	if (use_append) {
> +		struct extent_map *em;
> +		struct map_lookup *map;
> +		struct block_device *bdev;
> +
> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
> +		if (IS_ERR(em)) {
> +			kfree(cb);
> +			bio_put(bio);
> +			return BLK_STS_NOTSUPP;
> +		}
> +
> +		map = em->map_lookup;
> +		/* We only support single profile for now */
> +		ASSERT(map->num_stripes == 1);
> +		bdev = map->stripes[0].dev->bdev;
> +
> +		bio_set_dev(bio, bdev);
> +		free_extent_map(em);
> +	}
> +

Here for the newly created bio, we will try to call bio_set_dev() for
it. (although later patch refactor this part a little)

So far so good.

>   	if (blkcg_css) {
>   		bio->bi_opf |= REQ_CGROUP_PUNT;
>   		kthread_associate_blkcg(blkcg_css);
> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   	bytes_left = compressed_len;
>   	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>   		int submit = 0;
> +		int len;
>
>   		page = compressed_pages[pg_index];
>   		page->mapping = inode->vfs_inode.i_mapping;
> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>   							  0);
>
> +		if (pg_index == 0 && use_append)
> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> +		else
> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +
>   		page->mapping = NULL;
> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> -		    PAGE_SIZE) {
> +		if (submit || len < PAGE_SIZE) {
>   			/*
>   			 * inc the count before we submit the bio so
>   			 * we know the end IO handler won't happen before
> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   			}
>
>   			bio = btrfs_bio_alloc(first_byte);
> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
> +			bio->bi_opf = bio_op | write_flags;

But here, for the newly allocated bio, we didn't call bio_set_dev() at all.

Shouldn't all zoned write bio need that bio_set_dev() call?

I guess since most compressed extents are pretty small, it's really hard
to hit a case where we need to split the bio due to stripe boundary,
thus very hard to hit anything wrong.

Anyway, since I'm working on compression code to make compressed write
to follow the same boundary check in extent_io.c, I can definitely
refactor the bio allocation code to add the zoned needed calls.

Thanks,
Qu

>   			bio->bi_private = cb;
>   			bio->bi_end_io = end_compressed_bio_write;
>   			if (blkcg_css)
>   				bio->bi_opf |= REQ_CGROUP_PUNT;
> +			/*
> +			 * Use bio_add_page() to ensure the bio has at least one
> +			 * page.
> +			 */
>   			bio_add_page(bio, page, PAGE_SIZE, 0);
>   		}
>   		if (bytes_left < PAGE_SIZE) {
>

Damien Le Moal June 10, 2021, 7:36 a.m. UTC | #8

On 2021/06/10 16:28, Qu Wenruo wrote:
> 
> 
> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Now working on compression support for subpage, just noticed some
> strange code behavior, I'm not sure if it's designed or just a typo.
> 
> So please correct me if possible.
> 
> [...]
>>
>>   	bio = btrfs_bio_alloc(first_byte);
>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>> +	bio->bi_opf = bio_op | write_flags;
>>   	bio->bi_private = cb;
>>   	bio->bi_end_io = end_compressed_bio_write;
>>
>> +	if (use_append) {
>> +		struct extent_map *em;
>> +		struct map_lookup *map;
>> +		struct block_device *bdev;
>> +
>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>> +		if (IS_ERR(em)) {
>> +			kfree(cb);
>> +			bio_put(bio);
>> +			return BLK_STS_NOTSUPP;
>> +		}
>> +
>> +		map = em->map_lookup;
>> +		/* We only support single profile for now */
>> +		ASSERT(map->num_stripes == 1);
>> +		bdev = map->stripes[0].dev->bdev;

This variable seems rather useless...

>> +
>> +		bio_set_dev(bio, bdev);
>> +		free_extent_map(em);
>> +	}
>> +
> 
> Here for the newly created bio, we will try to call bio_set_dev() for
> it. (although later patch refactor this part a little)
> 
> So far so good.
> 
>>   	if (blkcg_css) {
>>   		bio->bi_opf |= REQ_CGROUP_PUNT;
>>   		kthread_associate_blkcg(blkcg_css);
>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   	bytes_left = compressed_len;
>>   	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>   		int submit = 0;
>> +		int len;
>>
>>   		page = compressed_pages[pg_index];
>>   		page->mapping = inode->vfs_inode.i_mapping;
>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>   							  0);
>>
>> +		if (pg_index == 0 && use_append)
>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>> +		else
>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>> +
>>   		page->mapping = NULL;
>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>> -		    PAGE_SIZE) {
>> +		if (submit || len < PAGE_SIZE) {
>>   			/*
>>   			 * inc the count before we submit the bio so
>>   			 * we know the end IO handler won't happen before
>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   			}
>>
>>   			bio = btrfs_bio_alloc(first_byte);
>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>> +			bio->bi_opf = bio_op | write_flags;
> 
> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
> 
> Shouldn't all zoned write bio need that bio_set_dev() call?

Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
oops here... Johannes ?

> 
> I guess since most compressed extents are pretty small, it's really hard
> to hit a case where we need to split the bio due to stripe boundary,
> thus very hard to hit anything wrong.
> 
> Anyway, since I'm working on compression code to make compressed write
> to follow the same boundary check in extent_io.c, I can definitely
> refactor the bio allocation code to add the zoned needed calls.
> 
> Thanks,
> Qu
> 
>>   			bio->bi_private = cb;
>>   			bio->bi_end_io = end_compressed_bio_write;
>>   			if (blkcg_css)
>>   				bio->bi_opf |= REQ_CGROUP_PUNT;
>> +			/*
>> +			 * Use bio_add_page() to ensure the bio has at least one
>> +			 * page.
>> +			 */
>>   			bio_add_page(bio, page, PAGE_SIZE, 0);
>>   		}
>>   		if (bytes_left < PAGE_SIZE) {
>>
>

Qu Wenruo June 10, 2021, 7:41 a.m. UTC | #9

On 2021/6/10 下午3:36, Damien Le Moal wrote:
> On 2021/06/10 16:28, Qu Wenruo wrote:
>>
>>
>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>> When multiple processes write data to the same block group on a compressed
>>> zoned filesystem, the underlying device could report I/O errors and data
>>> corruption is possible.
>>>
>>> This happens because on a zoned file system, compressed data writes where
>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>> guaranteed that the data is always submitted aligned to the underlying
>>> zone's write pointer.
>>>
>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>> filesystem is non intrusive on a regular file system or when submitting to
>>> a conventional zone on a zoned filesystem, as it is guarded by
>>> btrfs_use_zone_append.
>>>
>>> Reported-by: David Sterba <dsterba@suse.com>
>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Now working on compression support for subpage, just noticed some
>> strange code behavior, I'm not sure if it's designed or just a typo.
>>
>> So please correct me if possible.
>>
>> [...]
>>>
>>>    	bio = btrfs_bio_alloc(first_byte);
>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>> +	bio->bi_opf = bio_op | write_flags;
>>>    	bio->bi_private = cb;
>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>
>>> +	if (use_append) {
>>> +		struct extent_map *em;
>>> +		struct map_lookup *map;
>>> +		struct block_device *bdev;
>>> +
>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>> +		if (IS_ERR(em)) {
>>> +			kfree(cb);
>>> +			bio_put(bio);
>>> +			return BLK_STS_NOTSUPP;
>>> +		}
>>> +
>>> +		map = em->map_lookup;
>>> +		/* We only support single profile for now */
>>> +		ASSERT(map->num_stripes == 1);
>>> +		bdev = map->stripes[0].dev->bdev;
>
> This variable seems rather useless...

No need to bother that, that has already been removed by later refactor.

>
>>> +
>>> +		bio_set_dev(bio, bdev);
>>> +		free_extent_map(em);
>>> +	}
>>> +
>>
>> Here for the newly created bio, we will try to call bio_set_dev() for
>> it. (although later patch refactor this part a little)
>>
>> So far so good.
>>
>>>    	if (blkcg_css) {
>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>    		kthread_associate_blkcg(blkcg_css);
>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    	bytes_left = compressed_len;
>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>    		int submit = 0;
>>> +		int len;
>>>
>>>    		page = compressed_pages[pg_index];
>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>    							  0);
>>>
>>> +		if (pg_index == 0 && use_append)
>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>> +		else
>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>> +
>>>    		page->mapping = NULL;
>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>> -		    PAGE_SIZE) {
>>> +		if (submit || len < PAGE_SIZE) {
>>>    			/*
>>>    			 * inc the count before we submit the bio so
>>>    			 * we know the end IO handler won't happen before
>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    			}
>>>
>>>    			bio = btrfs_bio_alloc(first_byte);
>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>> +			bio->bi_opf = bio_op | write_flags;
>>
>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>
>> Shouldn't all zoned write bio need that bio_set_dev() call?
>
> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
> oops here... Johannes ?

That's because it's really really rare/hard to have a compressed extent
just lies at the stripe boundary.

For most cases, the data we provide for compression tests is either:
- Too compressible
   Thus the whole range can be compressed into just one sector, thus
   it will never cross stripe boundary.

- Not compressible at all
   We fall back to regular buffered write, which will do their proper
   stripe boundary check correctly.

Thus it's really near impossible to hit it in various tests.

Thanks,
Qu

>
>>
>> I guess since most compressed extents are pretty small, it's really hard
>> to hit a case where we need to split the bio due to stripe boundary,
>> thus very hard to hit anything wrong.
>>
>> Anyway, since I'm working on compression code to make compressed write
>> to follow the same boundary check in extent_io.c, I can definitely
>> refactor the bio allocation code to add the zoned needed calls.
>>
>> Thanks,
>> Qu
>>
>>>    			bio->bi_private = cb;
>>>    			bio->bi_end_io = end_compressed_bio_write;
>>>    			if (blkcg_css)
>>>    				bio->bi_opf |= REQ_CGROUP_PUNT;
>>> +			/*
>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>> +			 * page.
>>> +			 */
>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>    		}
>>>    		if (bytes_left < PAGE_SIZE) {
>>>
>>
>
>

Damien Le Moal June 10, 2021, 7:45 a.m. UTC | #10

On 2021/06/10 16:41, Qu Wenruo wrote:
> 
> 
> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>> When multiple processes write data to the same block group on a compressed
>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>> corruption is possible.
>>>>
>>>> This happens because on a zoned file system, compressed data writes where
>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>> guaranteed that the data is always submitted aligned to the underlying
>>>> zone's write pointer.
>>>>
>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>> btrfs_use_zone_append.
>>>>
>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> Now working on compression support for subpage, just noticed some
>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>
>>> So please correct me if possible.
>>>
>>> [...]
>>>>
>>>>    	bio = btrfs_bio_alloc(first_byte);
>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>    	bio->bi_private = cb;
>>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>>
>>>> +	if (use_append) {
>>>> +		struct extent_map *em;
>>>> +		struct map_lookup *map;
>>>> +		struct block_device *bdev;
>>>> +
>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>> +		if (IS_ERR(em)) {
>>>> +			kfree(cb);
>>>> +			bio_put(bio);
>>>> +			return BLK_STS_NOTSUPP;
>>>> +		}
>>>> +
>>>> +		map = em->map_lookup;
>>>> +		/* We only support single profile for now */
>>>> +		ASSERT(map->num_stripes == 1);
>>>> +		bdev = map->stripes[0].dev->bdev;
>>
>> This variable seems rather useless...
> 
> No need to bother that, that has already been removed by later refactor.
> 
>>
>>>> +
>>>> +		bio_set_dev(bio, bdev);
>>>> +		free_extent_map(em);
>>>> +	}
>>>> +
>>>
>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>> it. (although later patch refactor this part a little)
>>>
>>> So far so good.
>>>
>>>>    	if (blkcg_css) {
>>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>    		kthread_associate_blkcg(blkcg_css);
>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    	bytes_left = compressed_len;
>>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>    		int submit = 0;
>>>> +		int len;
>>>>
>>>>    		page = compressed_pages[pg_index];
>>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>    							  0);
>>>>
>>>> +		if (pg_index == 0 && use_append)
>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>> +		else
>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>> +
>>>>    		page->mapping = NULL;
>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>> -		    PAGE_SIZE) {
>>>> +		if (submit || len < PAGE_SIZE) {
>>>>    			/*
>>>>    			 * inc the count before we submit the bio so
>>>>    			 * we know the end IO handler won't happen before
>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    			}
>>>>
>>>>    			bio = btrfs_bio_alloc(first_byte);
>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>> +			bio->bi_opf = bio_op | write_flags;
>>>
>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>
>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>
>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>> oops here... Johannes ?
> 
> That's because it's really really rare/hard to have a compressed extent
> just lies at the stripe boundary.
> 
> For most cases, the data we provide for compression tests is either:
> - Too compressible
>    Thus the whole range can be compressed into just one sector, thus
>    it will never cross stripe boundary.
> 
> - Not compressible at all
>    We fall back to regular buffered write, which will do their proper
>    stripe boundary check correctly.
> 
> Thus it's really near impossible to hit it in various tests.

But this is a data write, isn't it ? So in the zoned case, it means a zone
append write. And adding a page for even a single sector using
bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
the bio size... Am I misunderstanding something here about this IO path ?

> 
> Thanks,
> Qu
> 
>>
>>>
>>> I guess since most compressed extents are pretty small, it's really hard
>>> to hit a case where we need to split the bio due to stripe boundary,
>>> thus very hard to hit anything wrong.
>>>
>>> Anyway, since I'm working on compression code to make compressed write
>>> to follow the same boundary check in extent_io.c, I can definitely
>>> refactor the bio allocation code to add the zoned needed calls.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>    			bio->bi_private = cb;
>>>>    			bio->bi_end_io = end_compressed_bio_write;
>>>>    			if (blkcg_css)
>>>>    				bio->bi_opf |= REQ_CGROUP_PUNT;
>>>> +			/*
>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>> +			 * page.
>>>> +			 */
>>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>    		}
>>>>    		if (bytes_left < PAGE_SIZE) {
>>>>
>>>
>>
>>
>

Qu Wenruo June 10, 2021, 7:51 a.m. UTC | #11

On 2021/6/10 下午3:45, Damien Le Moal wrote:
> On 2021/06/10 16:41, Qu Wenruo wrote:
>>
>>
>> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>>> When multiple processes write data to the same block group on a compressed
>>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>>> corruption is possible.
>>>>>
>>>>> This happens because on a zoned file system, compressed data writes where
>>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>>> guaranteed that the data is always submitted aligned to the underlying
>>>>> zone's write pointer.
>>>>>
>>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>>> btrfs_use_zone_append.
>>>>>
>>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> Now working on compression support for subpage, just noticed some
>>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>>
>>>> So please correct me if possible.
>>>>
>>>> [...]
>>>>>
>>>>>     	bio = btrfs_bio_alloc(first_byte);
>>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>>     	bio->bi_private = cb;
>>>>>     	bio->bi_end_io = end_compressed_bio_write;
>>>>>
>>>>> +	if (use_append) {
>>>>> +		struct extent_map *em;
>>>>> +		struct map_lookup *map;
>>>>> +		struct block_device *bdev;
>>>>> +
>>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>>> +		if (IS_ERR(em)) {
>>>>> +			kfree(cb);
>>>>> +			bio_put(bio);
>>>>> +			return BLK_STS_NOTSUPP;
>>>>> +		}
>>>>> +
>>>>> +		map = em->map_lookup;
>>>>> +		/* We only support single profile for now */
>>>>> +		ASSERT(map->num_stripes == 1);
>>>>> +		bdev = map->stripes[0].dev->bdev;
>>>
>>> This variable seems rather useless...
>>
>> No need to bother that, that has already been removed by later refactor.
>>
>>>
>>>>> +
>>>>> +		bio_set_dev(bio, bdev);
>>>>> +		free_extent_map(em);
>>>>> +	}
>>>>> +
>>>>
>>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>>> it. (although later patch refactor this part a little)
>>>>
>>>> So far so good.
>>>>
>>>>>     	if (blkcg_css) {
>>>>>     		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>>     		kthread_associate_blkcg(blkcg_css);
>>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     	bytes_left = compressed_len;
>>>>>     	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>>     		int submit = 0;
>>>>> +		int len;
>>>>>
>>>>>     		page = compressed_pages[pg_index];
>>>>>     		page->mapping = inode->vfs_inode.i_mapping;
>>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>>     							  0);
>>>>>
>>>>> +		if (pg_index == 0 && use_append)
>>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>>> +		else
>>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>> +
>>>>>     		page->mapping = NULL;
>>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>>> -		    PAGE_SIZE) {
>>>>> +		if (submit || len < PAGE_SIZE) {
>>>>>     			/*
>>>>>     			 * inc the count before we submit the bio so
>>>>>     			 * we know the end IO handler won't happen before
>>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     			}
>>>>>
>>>>>     			bio = btrfs_bio_alloc(first_byte);
>>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +			bio->bi_opf = bio_op | write_flags;
>>>>
>>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>>
>>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>>
>>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>>> oops here... Johannes ?
>>
>> That's because it's really really rare/hard to have a compressed extent
>> just lies at the stripe boundary.
>>
>> For most cases, the data we provide for compression tests is either:
>> - Too compressible
>>     Thus the whole range can be compressed into just one sector, thus
>>     it will never cross stripe boundary.
>>
>> - Not compressible at all
>>     We fall back to regular buffered write, which will do their proper
>>     stripe boundary check correctly.
>>
>> Thus it's really near impossible to hit it in various tests.
>
> But this is a data write, isn't it ? So in the zoned case, it means a zone
> append write. And adding a page for even a single sector using
> bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
> the bio size... Am I misunderstanding something here about this IO path ?

The bio can only be allocated when we have a write range crossing device
stripe boundary.
Which is already a rare case for compressed write.


And the initial bio always has the correct setup, thus we have to hit
the corner case to get into this situation.


Finally I find that since zoned support doesn't support multi-device
profile yet, there is no limitation for stripe length, thus we won't get
into this branch at all.

So my bad, false alert...

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> I guess since most compressed extents are pretty small, it's really hard
>>>> to hit a case where we need to split the bio due to stripe boundary,
>>>> thus very hard to hit anything wrong.
>>>>
>>>> Anyway, since I'm working on compression code to make compressed write
>>>> to follow the same boundary check in extent_io.c, I can definitely
>>>> refactor the bio allocation code to add the zoned needed calls.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>     			bio->bi_private = cb;
>>>>>     			bio->bi_end_io = end_compressed_bio_write;
>>>>>     			if (blkcg_css)
>>>>>     				bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>> +			/*
>>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>>> +			 * page.
>>>>> +			 */
>>>>>     			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>>     		}
>>>>>     		if (bytes_left < PAGE_SIZE) {
>>>>>
>>>>
>>>
>>>
>>
>
>

Johannes Thumshirn June 10, 2021, 8:28 a.m. UTC | #12

On 10/06/2021 09:45, Damien Le Moal wrote:
> On 2021/06/10 16:41, Qu Wenruo wrote:
>>
>>
>> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>>> When multiple processes write data to the same block group on a compressed
>>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>>> corruption is possible.
>>>>>
>>>>> This happens because on a zoned file system, compressed data writes where
>>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>>> guaranteed that the data is always submitted aligned to the underlying
>>>>> zone's write pointer.
>>>>>
>>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>>> btrfs_use_zone_append.
>>>>>
>>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> Now working on compression support for subpage, just noticed some
>>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>>
>>>> So please correct me if possible.
>>>>
>>>> [...]
>>>>>
>>>>>    	bio = btrfs_bio_alloc(first_byte);
>>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>>    	bio->bi_private = cb;
>>>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>>>
>>>>> +	if (use_append) {
>>>>> +		struct extent_map *em;
>>>>> +		struct map_lookup *map;
>>>>> +		struct block_device *bdev;
>>>>> +
>>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>>> +		if (IS_ERR(em)) {
>>>>> +			kfree(cb);
>>>>> +			bio_put(bio);
>>>>> +			return BLK_STS_NOTSUPP;
>>>>> +		}
>>>>> +
>>>>> +		map = em->map_lookup;
>>>>> +		/* We only support single profile for now */
>>>>> +		ASSERT(map->num_stripes == 1);
>>>>> +		bdev = map->stripes[0].dev->bdev;
>>>
>>> This variable seems rather useless...
>>
>> No need to bother that, that has already been removed by later refactor.
>>
>>>
>>>>> +
>>>>> +		bio_set_dev(bio, bdev);
>>>>> +		free_extent_map(em);
>>>>> +	}
>>>>> +
>>>>
>>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>>> it. (although later patch refactor this part a little)
>>>>
>>>> So far so good.
>>>>
>>>>>    	if (blkcg_css) {
>>>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>>    		kthread_associate_blkcg(blkcg_css);
>>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    	bytes_left = compressed_len;
>>>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>>    		int submit = 0;
>>>>> +		int len;
>>>>>
>>>>>    		page = compressed_pages[pg_index];
>>>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>>    							  0);
>>>>>
>>>>> +		if (pg_index == 0 && use_append)
>>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>>> +		else
>>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>> +
>>>>>    		page->mapping = NULL;
>>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>>> -		    PAGE_SIZE) {
>>>>> +		if (submit || len < PAGE_SIZE) {
>>>>>    			/*
>>>>>    			 * inc the count before we submit the bio so
>>>>>    			 * we know the end IO handler won't happen before
>>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    			}
>>>>>
>>>>>    			bio = btrfs_bio_alloc(first_byte);
>>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +			bio->bi_opf = bio_op | write_flags;
>>>>
>>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>>
>>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>>
>>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>>> oops here... Johannes ?
>>
>> That's because it's really really rare/hard to have a compressed extent
>> just lies at the stripe boundary.
>>
>> For most cases, the data we provide for compression tests is either:
>> - Too compressible
>>    Thus the whole range can be compressed into just one sector, thus
>>    it will never cross stripe boundary.
>>
>> - Not compressible at all
>>    We fall back to regular buffered write, which will do their proper
>>    stripe boundary check correctly.
>>
>> Thus it's really near impossible to hit it in various tests.
> 
> But this is a data write, isn't it ? So in the zoned case, it means a zone
> append write. And adding a page for even a single sector using
> bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
> the bio size... Am I misunderstanding something here about this IO path ?

Because we're not using bio_add_zone_append_page() but bio_add_page() as we only add
one page to a newly allocated bio. It's a bit complicated.  

>>>>> +			/*
>>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>>> +			 * page.
>>>>> +			 */
>>>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>>    		}
>>>>>    		if (bytes_left < PAGE_SIZE) {
>>>>>
>>>>
>>>
>>>
>>
> 
>

[v2,2/3] btrfs: zoned: fix compressed writes

Commit Message

Comments

Patch