
[v2] block: Fix a WRITE SAME BUG_ON

Message ID 20190125021107.4595-1-zhangxiaoxu5@huawei.com (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer
Series: [v2] block: Fix a WRITE SAME BUG_ON

Commit Message

Zhang Xiaoxu Jan. 25, 2019, 2:11 a.m. UTC
If an LVM volume is stacked on disks with different logical_block_size
values, issuing WRITE SAME to it triggers a BUG_ON:

kernel BUG at drivers/scsi/sd.c:968!
invalid opcode: 0000 [#1] SMP PTI
CPU: 11 PID: 525 Comm: kworker/11:1H Tainted: G           O      5.0.0-rc3+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:sd_init_command+0x7aa/0xdb0
Code: 30 75 00 00 c7 85 44 01 00 00 18 00 00 00 0f 85 fa 04 00 00 48 83 c4 40 48
      89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b6 ca fe ff <0f> 0b 41 bc 09
RSP: 0018:ffffb55f80ddbca0 EFLAGS: 00010206
RAX: 0000000000001000 RBX: ffff9ed23fb927a0 RCX: 000000000000f000
RDX: ffff9ed23f0a8400 RSI: ffff9ed27bc79800 RDI: 0000000000000000
RBP: ffff9ed23fb92680 R08: ffff9ed27c8c0000 R09: ffff9ed23fb927d8
R10: 0000000000000000 R11: fefefefefefefeff R12: ffff9ed27bc79800
R13: ffff9ed2787a0000 R14: ffff9ed27bdf3400 R15: ffff9ed23fb927a0
FS:  0000000000000000(0000) GS:ffff9ed27c8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b14cf9341 CR3: 0000000069058000 CR4: 00000000000006e0
Call Trace:
 ? vp_notify+0x12/0x20
 scsi_queue_rq+0x525/0xa30
 blk_mq_dispatch_rq_list+0x8d/0x580
 ? syscall_return_via_sysret+0x10/0x7f
 ? elv_rb_del+0x1f/0x30
 ? deadline_remove_request+0x55/0xc0
 blk_mq_do_dispatch_sched+0x76/0x110
 blk_mq_sched_dispatch_requests+0xf9/0x170
 __blk_mq_run_hw_queue+0x51/0xd0
 process_one_work+0x195/0x380
 worker_thread+0x30/0x390
 ? process_one_work+0x380/0x380
 kthread+0x113/0x130
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
Modules linked in: alloc(O+)
---[ end trace dc92ddeb2e6d1fe5 ]---

The logical_block_size of the stacked LVM device is the maximum of the
component disks' values, so it may differ from that of an individual
component. A WRITE SAME issued to the stacked device then hits the
BUG_ON when the command is set up for that component.

Disable the WRITE SAME feature on the stacked device if its components
have different logical_block_size values.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
---
 block/blk-settings.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

John Dorminy Jan. 26, 2019, 11:17 a.m. UTC | #1
Hi. I have read a bit of DM code and spent an hour reviewing this... I
didn't get to the point of knowing what the right fix for the problem
is, and I may have a wrong understanding, but I have two thoughts
about the patch:

I don't think this is the right solution for two reasons:

In the first place, if it's an LVM-only issue, we should fix it only
for device-mapper devices. If this is the right way to fix it,
possibly the way to do that would be to change DM calls to
blk_queue_max_write_same_sectors() to only set the max sectors to more
than 0 if and only if the logical block sizes match.

In the second place, I don't think this is the problem. The line of
code that's calling BUG_ON is, I think,
BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

This is because write_same bios are supposed to have a single sector
of data at the beginning of a page in their bio_iovec(bio) [I think,
based on a commit message I've read] -- in other words, bio_iovec(bio)
is supposed to say something like { .page = something, .offset = 0,
.len = 1 native sector }. But clearly, since the BUG_ON is being
called, one of these is not true. Have you added a log statement right
before the BUG_ON() to print out bio_offset(bio) and
bio_iovec(bio).bv_len to see which value defies expectations?

I would be happy to help you trace through this or attempt to
reproduce it myself -- what stack of devices can you reproduce this
on? Is this a dm-linear device on top of a disk? Does it have a
filesystem on top, and if so, what filesystem?

Thank you!

John Dorminy


On Fri, Jan 25, 2019 at 9:53 AM Zhang Xiaoxu <zhangxiaoxu5@huawei.com> wrote:
>
> If the lvm is stacked by different logical_block_size disks,
> when WRITE SAME on it, will bug_on:
> [...]

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Zhang Xiaoxu Jan. 28, 2019, 5:48 a.m. UTC | #2
Hi, thanks for your reply.

The BUG_ON fires because 'bio_iovec(bio).bv_len' does not match 'sdp->sector_size'.

In the reproducer below, if 'golden' is stacked on 'sda' (logical_block_size=512)
and 'sdb' (logical_block_size=4096), calling 'blkdev_issue_write_same' on it hits
the BUG_ON because
	bio->bi_io_vec->bv_len = bdev_logical_block_size('golden'), and
	bdev_logical_block_size('golden') = max_logical_block_size(sda, sdb).

There are two possible solutions:
1. Disable WRITE SAME in this situation, as this patch does.
2. Make WRITE SAME handle this situation: reorganize the WRITE SAME bio for
   component disks whose logical_block_size is smaller than the logical volume's.

For now, I think we should disable WRITE SAME in this situation.


Reproducer:
1. Start qemu with the following parameters:
     -drive file=/home/qemu/disk/512k.img,if=none,format=raw,cache=none,id=data.1,discard=on \
     -device scsi-hd,drive=data.1,id=vdata.1 \
     -drive file=/home/qemu/disk/4096k.img,if=none,format=raw,cache=none,id=data.2,discard=on \
     -device scsi-hd,drive=data.2,id=vdata.2,logical_block_size=4096,physical_block_size=4096 \

2. Create LV:
     vgremove golden -y
     vgcreate golden /dev/sda /dev/sdb
     lvcreate -L1G -n testvol -i 2 golden -y

3. Call 'blkdev_issue_write_same' on the device:
	#include <linux/module.h>
	#include <linux/kernel.h>
	#include <linux/init.h>
	#include <linux/fs.h>
	#include <linux/slab.h>
	#include <linux/blkdev.h>

	static int __init hello_module_init(void)
	{
		struct block_device *dev;
		int err;

		printk(KERN_ERR "/dev/golden/testvol\n");

		dev = lookup_bdev("/dev/golden/testvol");
		if (IS_ERR(dev)) {
			pr_warn("lookup blk error!\n");
			return 0;
		}

		err = blkdev_get(dev, FMODE_WRITE | FMODE_READ | FMODE_EXCL, (void *)&err);

		if (err) {
			pr_warn("get blk error!\n");
			return 0;
		}

		err = blkdev_issue_write_same(dev, 8, 0xf8, GFP_NOIO, ZERO_PAGE(0));

		printk(KERN_ERR "blkdev_issue_write_same return %d\n", err);

		return 0;
	}

	static void __exit hello_module_exit(void)
	{
		pr_warn("\n");
	}

	module_init(hello_module_init);
	module_exit(hello_module_exit);

	MODULE_DESCRIPTION("hello module");
	MODULE_LICENSE("GPL");
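For completeness, the module above can be built out of tree roughly as follows; `writesame_repro.c` is a placeholder name for the source above, and the insmod step (left commented out, since it needs root and will oops an affected kernel) is what actually triggers the BUG_ON:

```shell
# Generate a minimal out-of-tree kbuild Makefile for the module.
cat > Makefile <<'EOF'
obj-m += writesame_repro.o

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
EOF

# With kernel headers installed and the LV from step 2 present:
#   make
#   sudo insmod writesame_repro.ko   # calls blkdev_issue_write_same()
#   dmesg | tail                     # return value, or the BUG_ON trace
```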

On 1/26/2019 7:17 PM, John Dorminy wrote:
> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
> 
> [...]

Mike Snitzer Jan. 28, 2019, 10:14 p.m. UTC | #3
On Sat, Jan 26 2019 at  6:17am -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
> 
> I don't think this is the right solution for two reasons:
> 
> In the first place, if it's an LVM-only issue, we should fix it only
> for device-mapper devices. If this is the right way to fix it,
> possibly the way to do that would be to change DM calls to
> blk_queue_max_write_same_sectors() to only set the max sectors to more
> than 0 if and only if the logical block sizes match.

There is no way this is specific to lvm (or DM).  It may _seem_ that way
because lvm/dm are in the business of creating stacked devices --
whereby exposing users to blk_stack_limits().

I'll have a closer look at this issue, hopefully tomorrow, but Zhang
Xiaoxu's proposed fix looks bogus to me.  Not disputing there is an
issue, just feels like a different fix is needed.

Mike

Martin K. Petersen Jan. 29, 2019, 4:54 a.m. UTC | #4
Mike,

>> In the first place, if it's an LVM-only issue, we should fix it only
>> for device-mapper devices. If this is the right way to fix it,
>> possibly the way to do that would be to change DM calls to
>> blk_queue_max_write_same_sectors() to only set the max sectors to
>> more than 0 if and only if the logical block sizes match.
>
> There is no way this is specific to lvm (or DM).  It may _seem_ that way
> because lvm/dm are in the business of creating stacked devices --
> whereby exposing users to blk_stack_limits().
>
> I'll have a closer look at this issue, hopefully tomorrow, but Zhang
> Xiaoxu's proposed fix looks bogus to me.  Not disputing there is an
> issue, just feels like a different fix is needed.

It's caused by a remnant of the old bio payload hack in sd.c:

	BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

We rounded up LBS when we created the DM device. And therefore the
bv_len coming down is 4K. But one of the component devices has a LBS of
512 and fails this check.

At first glance one could argue we should just nuke the BUG_ON since the
sd code no longer relies on bv_len. However, the semantics for WRITE
SAME are particularly challenging in this scenario. Say the filesystem
wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
component device only has a 512-byte LBS, we would end up writing zeroes
to the entire 4K block on that component device instead of the correct
pattern. Not good.

So disallowing WRITE SAME unless all component devices have the same LBS
is the correct fix.

That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
non-zero payload. The target code ends up manually iterating, if I
remember correctly...
Christoph Hellwig Jan. 29, 2019, 8:52 a.m. UTC | #5
On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
> non-zero payload. The target code ends up manually iterating, if I
> remember correctly...

I had a series to remove it a while ago, but you wanted to keep it at
that point.  I can resurrect it.

Zhang Xiaoxu Jan. 30, 2019, 6:50 a.m. UTC | #6
On 1/29/2019 4:52 PM, Christoph Hellwig wrote:
> On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
>> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
>> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
>> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
>> non-zero payload. The target code ends up manually iterating, if I
>> remember correctly...
> 
> I had a series to remove it a while ago, but you wanted to keep it at
> that point.  I can resurrect it.
> 
Do you mean to remove the `blkdev_issue_write_same` interface?
I think we should fix the problem first because other users may also encounter
the same problem.

> .
> 

John Dorminy Jan. 30, 2019, 2:08 p.m. UTC | #7
On Mon, Jan 28, 2019 at 11:54 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> We rounded up LBS when we created the DM device. And therefore the
> bv_len coming down is 4K. But one of the component devices has a LBS of
> 512 and fails this check.
>
> At first glance one could argue we should just nuke the BUG_ON since the
> sd code no longer relies on bv_len. However, the semantics for WRITE
> SAME are particularly challenging in this scenario. Say the filesystem
> wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
> followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
> component device only has a 512-byte LBS, we would end up writing zeroes
> to the entire 4K block on that component device instead of the correct
> pattern. Not good.
>
> So disallowing WRITE SAME unless all component devices have the same LBS
> is the correct fix.

Alternately, could WRITE_SAME bios be accepted with the minimum
sector size of the stack rather than the max, e.g. 512 in this
example rather than 4k? They'd need to have a granularity of the
larger sector size, though, presumably necessitating new queue limits
write_same_{granularity,block_size}, which might be too much work. For
devices with bigger sectors, the block layer or DM would need to
expand the small-sector payload to an appropriate larger-sector
payload, but it would preserve the ability to use WRITE_SAME with
non-zero payloads.

(I use WRITE_SAME to fill devices with a particular pattern in order
to catch failures to initialize disk structures appropriately,
personally, but it's just for convenience/speed.)

Zhang Xiaoxu Jan. 31, 2019, 12:58 a.m. UTC | #8
On 1/30/2019 10:08 PM, John Dorminy wrote:
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumably necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work. For
> devices with bigger sectors, the block layer or DM would need to
> expand the small-sector payload to an appropriate larger-sector
> payload, but it would preserve the ability to use WRITE_SAME with
> non-zero payloads.
> 
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)
I think allowing two logical block sizes would be ambiguous.
From the spec:
	Information technology -
	SCSI Block Commands – 4 (SBC-4)
	ISO/IEC 14776-324:201x
	BSR INCITS 506:201x
5.50 WRITE SAME (10) command
The WRITE SAME (10) command (see table 145) requests that the device server
**transfer a single logical block** from the Data-Out Buffer and for each LBA in
the specified range of LBAs:
Martin K. Petersen Jan. 31, 2019, 2:23 a.m. UTC | #9
John,

>> So disallowing WRITE SAME unless all component devices have the same LBS
>> is the correct fix.
>
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumabily necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work.

I don't have a problem restricting the buffer contents to be consistent
within a page. Or even change the upper layer semantics to specify the
buffer contents using a single byte (0x00..0xff).

But the issue of head and tail remains if there is a block size mismatch
so it's important that we keep scaling the logical block size up when
stacking and reject any bio that can't be honored on a 4Kn device.

> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

The intent was for stuff like MD to use it to initialize parity disks,
etc. But adoption has been pretty slow.

I don't have any problems keeping WRITE_SAME around if people are
actually using it. It just seemed like most active users only cared
about writing zeroes.
Christoph Hellwig Jan. 31, 2019, 10:39 a.m. UTC | #10
On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

How do you use it?  We don't have a user interface to generate
REQ_OP_WRITE_SAME requests.

John Dorminy Jan. 31, 2019, 7:41 p.m. UTC | #11
On Thu, Jan 31, 2019 at 5:39 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > (I use WRITE_SAME to fill devices with a particular pattern in order
> > to catch failures to initialize disk structures appropriately,
> > personally, but it's just for convenience/speed.)
>
> How do you use it?  We don't have a user interface to generate
> REQ_OP_WRITE_SAME requests.

A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
exposing a sysfs node to trigger filling a block device with a test
pattern.

Christoph Hellwig Feb. 1, 2019, 7:35 a.m. UTC | #12
On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
> > On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > > (I use WRITE_SAME to fill devices with a particular pattern in order
> > > to catch failures to initialize disk structures appropriately,
> > > personally, but it's just for convenience/speed.)
> >
> > How do you use it?  We don't have a user interface to generate
> > REQ_OP_WRITE_SAME requests.
> 
> A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
> exposing a sysfs node to trigger filling a block device with a test
> pattern.

Any reason you don't just use SCSI/NVMe passthrough directly from
userspace for that?

John Dorminy Feb. 1, 2019, 2:09 p.m. UTC | #13
I didn't know such a thing existed... does it work on any block
device? Where do I read more about this?

On Fri, Feb 1, 2019 at 2:35 AM Christoph Hellwig <hch@infradead.org> wrote:
> [...]
> Any reason you don't just use SCSI/NVMe passthrough directly from
> userspace for that?

Heinz Mauelshagen Feb. 1, 2019, 4:03 p.m. UTC | #14
On 2/1/19 3:09 PM, John Dorminy wrote:
> I didn't know such a thing existed... does it work on any block
> device? Where do I read more about this?


Use sg_write_same(8) from package sg3_utils.

For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000 
--xferlen=512 /dev/sdwhatever'

will read the pattern to write same from file 'foobarfile', with the length
explicitly set to 512 bytes (rather than derived from foobarfile's size),
and write it 20000 times starting at LBA 0 to /dev/sdwhatever.
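As a concrete (and destructive — scratch disks only) example of the command above, the pattern file can be produced with standard tools; /dev/sdX is a placeholder:

```shell
# Build a 512-byte pattern file filled with 0xAB (octal 253).
dd if=/dev/zero bs=512 count=1 2>/dev/null | tr '\0' '\253' > foobarfile
wc -c < foobarfile    # should report 512 bytes

# Destructive; run only against a scratch SCSI disk:
#   sg_write_same --in=foobarfile --lba=0 --num=20000 --xferlen=512 /dev/sdX
```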


Christoph Hellwig Feb. 1, 2019, 4:18 p.m. UTC | #15
On Fri, Feb 01, 2019 at 05:03:40PM +0100, Heinz Mauelshagen wrote:
> On 2/1/19 3:09 PM, John Dorminy wrote:
> > I didn't know such a thing existed... does it work on any block
> > device? Where do I read more about this?
> 
> 
> Use sg_write_same(8) from package sg3_utils.
> 
> For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000
> --xferlen=512 /dev/sdwhatever'
> 
> will read the pattern to write same from file 'foobarfile' with length
> explicitly set to 512 bytes
> (rather than derived from foobarfile's size) and write it 20000 times
> starting at LBA 0 to /dev/sdwhatever.

Yeah.  Note that this will only work on SCSI disks (and maybe ATA
for a very specific corner case).  But for actual devices and not
remappers the same is true of REQ_OP_WRITE_SAME.

Zhang Xiaoxu Feb. 12, 2019, 3:11 a.m. UTC | #16
Any progress on this problem?
Should we disable WRITE SAME when stacking disks with different logical block sizes?

On 2/2/2019 12:18 AM, Christoph Hellwig wrote:
> [...]

Martin K. Petersen Feb. 14, 2019, 2:31 a.m. UTC | #17
zhangxiaoxu,

> Any progress on this problem?
> Should we disable WRITE SAME when stacking disks with different logical block sizes?

Yes, please.
Zhang Xiaoxu Feb. 14, 2019, 9:36 a.m. UTC | #18
Any comments about the patch?

On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
> 
> zhangxiaoxu,
> 
>> Any progress on this problem?
>> Should we disable WRITE SAME when stacking disks with different logical block sizes?
> 
> Yes, please.
> 

Zhang Xiaoxu Feb. 18, 2019, 2:10 p.m. UTC | #19
Ping.

Can anyone help merge this patch? Or is there another solution?

On 2/14/2019 5:36 PM, zhangxiaoxu (A) wrote:
> Any comments about the patch?
> 
> On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
>>
>> zhangxiaoxu,
>>
>>> Any progress on this problem?
>>> Should we disable WRITE SAME when stacking disks with different logical block sizes?
>>
>> Yes, please.
>>

Martin K. Petersen Feb. 19, 2019, 11:10 p.m. UTC | #20
Hi Zhang,

> ping.
>
> Anyone can help merge this patch? or any other solutions?

+	/* If the logical block size is different, forbid write same */
+	if (t->logical_block_size != b->logical_block_size &&
+	    t->max_write_same_sectors != UINT_MAX)
+		t->max_write_same_sectors = 0;
+	else
+		t->max_write_same_sectors = min(t->max_write_same_sectors,
+						b->max_write_same_sectors);
+

I am not particularly keen on this UINT_MAX magic. I would prefer to
have the stacking driver default for lbs be set to 0 so the stacking
function could avoid special-casing the first iteration. But I am not
sure whether that would break any assumptions in DM/MD wrt. the logical
block size being non-zero prior to calling the stacking function.

Mike? Any comments?

If we stick with the UINT_MAX check, the comment should at least point
out why it's there.

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3e7038e475ee..e4664280edb4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -497,8 +497,6 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
-	t->max_write_same_sectors = min(t->max_write_same_sectors,
-					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
@@ -537,6 +535,14 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		}
 	}
 
+	/* If the logical block size is different, forbid write same */
+	if (t->logical_block_size != b->logical_block_size &&
+	    t->max_write_same_sectors != UINT_MAX)
+		t->max_write_same_sectors = 0;
+	else
+		t->max_write_same_sectors = min(t->max_write_same_sectors,
+						b->max_write_same_sectors);
+
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);