diff mbox series

block: Improve io_opt limit stacking

Message ID 20200514065819.1113949-1-damien.lemoal@wdc.com (mailing list archive)
State New, archived
Headers show
Series block: Improve io_opt limit stacking | expand

Commit Message

Damien Le Moal May 14, 2020, 6:58 a.m. UTC
When devices with different physical sector sizes are stacked, the
largest value is used as the stacked device physical sector size. For
the optimal IO size, the lowest common multiple (lcm) of the underlying
devices is used for the stacked device. In this scenario, if only one of
the underlying device reports an optimal IO size, that value is used as
is for the stacked device but that value may not be a multiple of the
stacked device physical sector size. In this case, blk_stack_limits()
returns an error resulting in warnings being printed on device mapper
startup (observed with dm-zoned dual drive setup combining a 512B
sector SSD with a 4K sector HDD).

To fix this, rather than returning an error, the optimal IO size limit
for the stacked device can be adjusted to the lowest common multiple
(lcm) of the stacked physical sector size and optimal IO size, resulting
in a value that is a multiple of the physical sector size while still
being an optimal size for the underlying devices.

This patch is complementary to the patch "nvme: Fix io_opt limit
setting" which prevents the nvme driver from reporting an optimal IO
size equal to a namespace sector size for a device that does not report
an optimal IO size.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-settings.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

Damien Le Moal May 22, 2020, 7:27 a.m. UTC | #1
On 2020/05/14 15:58, Damien Le Moal wrote:
> When devices with different physical sector sizes are stacked, the
> largest value is used as the stacked device physical sector size. For
> the optimal IO size, the lowest common multiple (lcm) of the underlying
> devices is used for the stacked device. In this scenario, if only one of
> the underlying device reports an optimal IO size, that value is used as
> is for the stacked device but that value may not be a multiple of the
> stacked device physical sector size. In this case, blk_stack_limits()
> returns an error resulting in warnings being printed on device mapper
> startup (observed with dm-zoned dual drive setup combining a 512B
> sector SSD with a 4K sector HDD).
> 
> To fix this, rather than returning an error, the optimal IO size limit
> for the stacked device can be adjusted to the lowest common multiple
> (lcm) of the stacked physical sector size and optimal IO size, resulting
> in a value that is a multiple of the physical sector size while still
> being an optimal size for the underlying devices.
> 
> This patch is complementary to the patch "nvme: Fix io_opt limit
> setting" which prevents the nvme driver from reporting an optimal IO
> size equal to a namespace sector size for a device that does not report
> an optimal IO size.
> 
> Suggested-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  block/blk-settings.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 9a2c23cd9700..9a2b017ff681 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -561,11 +561,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	}
>  
>  	/* Optimal I/O a multiple of the physical block size? */
> -	if (t->io_opt & (t->physical_block_size - 1)) {
> -		t->io_opt = 0;
> -		t->misaligned = 1;
> -		ret = -1;
> -	}
> +	if (t->io_opt & (t->physical_block_size - 1))
> +		t->io_opt = lcm(t->io_opt, t->physical_block_size);
>  
>  	t->raid_partial_stripes_expensive =
>  		max(t->raid_partial_stripes_expensive,
> 

Jens,

Any comment on this patch ?
Note: the patch the patch "nvme: Fix io_opt limit setting" is already queued for
5.8.
Martin K. Petersen May 22, 2020, 1:28 p.m. UTC | #2
Damien,

>> +	if (t->io_opt & (t->physical_block_size - 1))
>> +		t->io_opt = lcm(t->io_opt, t->physical_block_size);

> Any comment on this patch ?  Note: the patch the patch "nvme: Fix
> io_opt limit setting" is already queued for 5.8.

Setting io_opt to the physical block size is not correct.
Martin K. Petersen May 22, 2020, 1:36 p.m. UTC | #3
>>> +	if (t->io_opt & (t->physical_block_size - 1))
>>> +		t->io_opt = lcm(t->io_opt, t->physical_block_size);
>
>> Any comment on this patch ?  Note: the patch the patch "nvme: Fix
>> io_opt limit setting" is already queued for 5.8.
>
> Setting io_opt to the physical block size is not correct.

Oh, missed the lcm(). But I'm still concerned about twiddling io_opt to
a value different than the one reported by an underlying device.

Setting io_opt to something that's less than a full stripe width in a
RAID, for instance, doesn't produce the expected result. So I think I'd
prefer not to set io_opt at all if it isn't consistent across all the
stacked devices.

Let me chew on it for a bit...
Keith Busch May 22, 2020, 2:14 p.m. UTC | #4
On Fri, May 22, 2020 at 09:36:18AM -0400, Martin K. Petersen wrote:
> 
> >>> +	if (t->io_opt & (t->physical_block_size - 1))
> >>> +		t->io_opt = lcm(t->io_opt, t->physical_block_size);
> >
> >> Any comment on this patch ?  Note: the patch the patch "nvme: Fix
> >> io_opt limit setting" is already queued for 5.8.
> >
> > Setting io_opt to the physical block size is not correct.
> 
> Oh, missed the lcm(). But I'm still concerned about twiddling io_opt to
> a value different than the one reported by an underlying device.
>
> Setting io_opt to something that's less than a full stripe width in a
> RAID, for instance, doesn't produce the expected result. So I think I'd
> prefer not to set io_opt at all if it isn't consistent across all the
> stacked devices.

We already modify the stacked io_opt value if the two request_queues
report different io_opt's. If, however, only one reports an io_opt value,
and it happens to not align with the other's physical block size, the
code currently rejects stacking those devices. Damien's patch should
just set a larger io_opt value that aligns with both, so if one io_opt
is a RAID stripe size, the stacked result will be multiple stripes.

I think that makes sense, but please do let us know if you think such
mismatched devices just shouldn't stack.
diff mbox series

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 9a2c23cd9700..9a2b017ff681 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -561,11 +561,8 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	}
 
 	/* Optimal I/O a multiple of the physical block size? */
-	if (t->io_opt & (t->physical_block_size - 1)) {
-		t->io_opt = 0;
-		t->misaligned = 1;
-		ret = -1;
-	}
+	if (t->io_opt & (t->physical_block_size - 1))
+		t->io_opt = lcm(t->io_opt, t->physical_block_size);
 
 	t->raid_partial_stripes_expensive =
 		max(t->raid_partial_stripes_expensive,