
scsi_lib: Align max_sectors to kb

Message ID: 20240418070015.27781-1-hare@kernel.org
State: Changes Requested
Series: scsi_lib: Align max_sectors to kb

Commit Message

Hannes Reinecke April 18, 2024, 7 a.m. UTC
max_sectors can be modified via sysfs, but only in kb units.
Which leads to a misalignment on stacked devices if the original
max_sector size is an odd number. So align the max_sectors setting
to kb to avoid this issue.

Reported-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/scsi/scsi_lib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig April 18, 2024, 7:03 a.m. UTC | #1
On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
> max_sectors can be modified via sysfs, but only in kb units.

Yes.

> Which leads to a misalignment on stacked devices if the original
> max_sector size is an odd number.

How?

Note that we really should not stack max_sectors anyway, as it's only
used for splitting in the lower device to start with.
Hannes Reinecke April 18, 2024, 7:42 a.m. UTC | #2
On 4/18/24 09:03, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
>> max_sectors can be modified via sysfs, but only in kb units.
> 
> Yes.
> 
>> Which leads to a misalignment on stacked devices if the original
>> max_sector size is an odd number.
> 
> How?
> 

That's an issue we've been seeing during testing:
https://lore.kernel.org/dm-devel/7742003e19b5a49398067dc0c59dfa8ddeffc3d7.camel@suse.com/

While this can be fixed in userspace (Martin Wilck provided a patchset
to multipath-tools), what I find irritating is that we will always
display the max_sectors setting in kb, even if the actual value is not
kb aligned.
_And_ we allow modifying that value (again in kb units), which means
that we are _never_ able to reset it to its original value.

I would vastly prefer to have the actual values displayed in sysfs,
i.e. either have a max_sectors_kb value for the block limits or a
max_sectors setting in sysfs, but we really should avoid this 'shift by
1' when displaying the value.
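
To illustrate the round trip described here, below is a small user-space
model of the max_sectors_kb conversion. The shift-by-one arithmetic is a
paraphrase of what the sysfs show/store helpers are assumed to do (512-byte
sectors vs. KiB); treat it as a sketch, not verbatim kernel code.

#include <stdio.h>

/*
 * Toy model of the max_sectors_kb sysfs round trip (paraphrased, not
 * kernel code): the kernel keeps max_sectors in 512-byte sectors, while
 * sysfs shows and accepts the value in KiB, i.e. shifted by one bit.
 */
static unsigned int show_kb(unsigned int max_sectors)  { return max_sectors >> 1; }
static unsigned int store_sectors(unsigned int kb)     { return kb << 1; }

int main(void)
{
	unsigned int orig = 2047;              /* odd max_sectors, e.g. a host limit */
	unsigned int kb = show_kb(orig);       /* displayed as 1023 */
	unsigned int back = store_sectors(kb); /* writing 1023 back gives 2046 */

	printf("original=%u shown=%u written-back=%u\n", orig, kb, back);
	/* The odd sector is silently lost; sysfs alone cannot restore 2047. */
	return 0;
}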

> Note that we really should not stack max_sectors anyway, as it's only
> used for splitting in the lower device to start with.

If that's the case, why don't we inhibit the modification for 
max_sectors on the lower devices?

Cheers,

Hannes
Martin Wilck April 18, 2024, 9 a.m. UTC | #3
On Thu, 2024-04-18 at 09:42 +0200, Hannes Reinecke wrote:
> On 4/18/24 09:03, Christoph Hellwig wrote:
> > On Thu, Apr 18, 2024 at 09:00:15AM +0200, Hannes Reinecke wrote:
> > > max_sectors can be modified via sysfs, but only in kb units.
> > 
> > Yes.
> > 
> > > Which leads to a misalignment on stacked devices if the original
> > > max_sector size is an odd number.
> > 
> > How?
> > 
> 
> That's an issue we've been seeing during testing:
> https://lore.kernel.org/dm-devel/7742003e19b5a49398067dc0c59dfa8ddeffc3d7.camel@suse.com/
> 
> While this can be fixed in userspace (Martin Wilck provided a
> patchset
> to multipath-tools), what I find irritating is that we will always
> display the max_sectors setting in kb, even if the actual value is
> not
> kb aligned.
> _And_ we allow modifying that value (again in kb units), which means
> that we are _never_ able to reset it to its original value.

User space has no way to determine whether the actual max_sectors value
in the kernel is even or odd. By reading max_sectors_kb and writing it
back, one may actually change the kernel-internal value by rounding it
down to the next even number. This can cause breakage if the device
being changed is a multipath path device.
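
To make the failure mode concrete, here is a small sketch with made-up
numbers: the dm device inherits the path's odd limit when the table is
loaded, a later read-modify-write of max_sectors_kb rounds the path's
limit down, and a maximal request cloned by dm no longer fits on the
path. The values are illustrative only.

#include <stdio.h>

/*
 * Illustrative numbers only: a request-based dm-multipath device that
 * inherited an odd max_sectors from its SCSI path, after user space has
 * done a read-modify-write of the path's max_sectors_kb.
 */
int main(void)
{
	unsigned int path_max = 2047;                  /* SCSI path limit, in sectors */
	unsigned int dm_max = path_max;                /* inherited at table load     */
	unsigned int rewritten = (path_max >> 1) << 1; /* kb round trip: 2046         */
	unsigned int request = dm_max;                 /* largest request dm may issue */

	printf("cloned request of %u sectors vs. path limit %u: %s\n",
	       request, rewritten,
	       request > rewritten ? "rejected (over max size limit)" : "ok");
	return 0;
}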

Wrt Hannes' patch: It would fix the issue on the kernel side, but user
space would still have no means to determine whether this patch has been
applied or not, except for checking the kernel version, which is unreliable.
For user space, it would be more helpful to add a "max_sectors" sysfs
attribute that exposes the actual value in blocks.
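
A minimal sketch of what such an attribute could look like, purely
hypothetical: the name and the helper signature are assumptions rather
than anything taken from a posted patch, and the sysfs registration
plumbing is omitted.

/*
 * Hypothetical sketch: expose the raw max_sectors value (in 512-byte
 * sectors) next to max_sectors_kb so that user space can read and later
 * restore the exact kernel-internal limit. Attribute registration and
 * the exact helper signature depend on the kernel version.
 */
static ssize_t queue_max_sectors_raw_show(struct request_queue *q, char *page)
{
	return sysfs_emit(page, "%u\n", queue_max_sectors(q));
}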

> > Note that we really should not stack max_sectors anyway, as it's
> > only
> > used for splitting in the lower device to start with.
> 
> If that's the case, why don't we inhibit the modification for 
> max_sectors on the lower devices?

I vote for allowing changes to max_sectors_kb only for devices that
don't have anything stacked on top, even though my latest testing
indicates that it's only a real problem with request-based dm aka
multipath. After all, max_sectors only needs to be manipulated in rare
situations, and it is generally recommended to do this in a udev
rule early during boot.

Regards,
Martin
Christoph Hellwig April 18, 2024, 2:51 p.m. UTC | #4
On Thu, Apr 18, 2024 at 09:42:06AM +0200, Hannes Reinecke wrote:
> While this can be fixed in userspace (Martin Wilck provided a patchset
> to multipath-tools), what I find irritating is that we will always
> display the max_sectors setting in kb, even if the actual value is not
> kb aligned.

The problem you are running into is exactly a problem of pointlessly
inheriting the value when we should not.

Other secondary issues are that:

 a) we should allow the modification in the granularity that we actually
    use internally, that is sectors
 b) some drivers apparently still use the silly BLK_SAFE_MAX_SECTORS
    limit, which is too low for just about any modern hardware.

>> Note that we really should not stack max_sectors anyway, as it's only
>> used for splitting in the lower device to start with.
>
> If that's the case, why don't we inhibit the modification for max_sectors 
> on the lower devices?

Why would we?  It makes absolutely no sense to inherit these limits;
the lower device will split anyway, which is very much the point of
the immutable bio_vec work.
Martin Wilck April 18, 2024, 4:46 p.m. UTC | #5
On Thu, 2024-04-18 at 16:51 +0200, Christoph Hellwig wrote:
> 
> > > Note that we really should not stack max_sectors anyway, as it's
> > > only
> > > used for splitting in the lower device to start with.
> > 
> > If that's the case, why don't we inhibit the modification for
> > max_sectors 
> > on the lower devices?
> 
> Why would we?  It makes absolutely no sense to inherit these limits;
> the lower device will split anyway, which is very much the point of
> the immutable bio_vec work.
> 

Sorry, I don't follow. With (request-based) dm-multipath on top of
SCSI, we hit the "over max size limit" condition in
blk_insert_cloned_request() [1], which will cause IO to fail at the dm
level. So at least in this configuration, it's crucial that the upper
device inherit the lower device's limits.

Regards,
Martin

[1] https://elixir.bootlin.com/linux/latest/source/block/blk-mq.c#L3048
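
For reference, the check behind the "over max size limit" message is
roughly the following (paraphrased and simplified from the linked
blk-mq.c; the real code derives an op-specific maximum rather than
using queue_max_sectors() directly):

	/*
	 * Paraphrased from blk_insert_cloned_request(): a request cloned by
	 * request-based dm must already fit the lower queue's limits, since
	 * it is inserted without being split again.
	 */
	if (blk_rq_sectors(rq) > queue_max_sectors(q)) {
		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
		       __func__, blk_rq_sectors(rq), queue_max_sectors(q));
		return BLK_STS_IOERR;
	}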
Christoph Hellwig April 19, 2024, 6:20 a.m. UTC | #6
On Thu, Apr 18, 2024 at 06:46:06PM +0200, Martin Wilck wrote:
> > Why would we?  It makes absolutely no sense to inherit these limits;
> > the lower device will split anyway, which is very much the point of
> > the immutable bio_vec work.
> > 
> 
> Sorry, I don't follow. With (request-based) dm-multipath on top of
> SCSI, we hit the "over max size limit" condition in
> blk_insert_cloned_request() [1], which will cause IO to fail at the dm
> level. So at least in this configuration, it's crucial that the upper
> device inherit the lower device's limits.

Oh, indeed.  Request based multipath is different from everyone else.
I really wish we could spend the effort to convert it to bio based
and remove this special case..
Martin Wilck April 19, 2024, 8:27 a.m. UTC | #7
(added Mike, Ben, and dm-devel)

On Fri, 2024-04-19 at 08:20 +0200, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 06:46:06PM +0200, Martin Wilck wrote:
> > > Why would we?  It makes absolutly no sense to inherit these
> > > limits,
> > > the lower device will split anyway which is very much the point
> > > of
> > > the immutable bio_vec work.
> > > 
> > 
> > Sorry, I don't follow. With (request-based) dm-multipath on top of
> > SCSI, we hit the "over max size limit" condition in
> > blk_insert_cloned_request() [1], which will cause IO to fail at the
> > dm
> > level. So at least in this configuration, it's crucial that the
> > upper
> > device inherit the lower device's limits.
> 
> Oh, indeed.  Request based multipath is different from everyone else.
> I really wish we could spend the effort to convert it to bio based
> and remove this special case..
> 

Well, it's just a matter of setting "queue_mode" in the multipath
configuration. But request-based has been the default since kernel v3.0
(f40c67f0f7e2 ("dm mpath: change to be request based")). While we
got bio-based dm-multipath back in v4.8 (76e33fe4e2c4 ("dm mpath:
reinstate bio-based support")), making it the default or even
removing request-based dm (in order to free the block layer from limit-
stacking requirements) will obviously require a broad discussion.

Mike has pointed out in 76e33fe4e2c4 that many of the original reasons
for introducing request-based dm-multipath have been obsoleted by
faster hardware and improvements in the generic block layer. 

I am open to discussion on this subject, but with my distribution
maintainer hat on, I don't expect the default to be changed for end
customers soon. Actually, with NVMe rising, I wonder if a major
technology switch for dm-multipath (which is effectively SCSI multipath
these days) makes sense at this point in time.

Thanks,
Martin
Christoph Hellwig April 22, 2024, 7:28 a.m. UTC | #8
On Fri, Apr 19, 2024 at 10:27:53AM +0200, Martin Wilck wrote:
> Well, it's just a matter of setting "queue_mode" in the multipath
> configuration.

Yes, and no.  We'd probably at the very least want a blk_plug callback
to allow making decisions for entire large I/Os instead of on a per-bio
basis.

> Mike has pointed out in 76e33fe4e2c4 that many of the original reasons
> for introducing request-based dm-multipath have been obsoleted by
> faster hardware and improvements in the generic block layer. 

Yes.

> I am open to discussion on this subject, but with my distribution
> maintainer hat on, I don't expect the default to be changed for end
> customers soon. Actually, with NVMe rising, I wonder if a major
> technology switch for dm-multipath (which is effectively SCSI multipath
> these days) makes sense at this point in time.

Well, it'll take ages until anything upstream gets into the enterprise
distros, which array users tend to use anyway.  But I'd really love
to get the ball rolling upstream sooner rather than later even if there
is no need for immediate action.  The reason is that request-based
dm multipath causes a lot of special casing in the block core and
dm, and that has become a significant maintenance burden.

So if you have a chance to test the current bio-based path, or help
with testing a plug callback (I can try to write a patch for that,
but I don't really have a good test setup), we can start to see where
we are in practice in terms of still needing the request-based stacking,
and if there is a feasible path to move away from it.

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2e28e2360c85..aad2ac1353d1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1983,7 +1983,8 @@  void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	blk_queue_max_hw_sectors(q, shost->max_sectors);
+	/* Align to kb to avoid conflicts with Sysfs settings */
+	blk_queue_max_hw_sectors(q, shost->max_sectors & ~0x1);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
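
As a sanity check on the patch, the same toy model as above shows that
clearing the low bit makes the sectors-to-KiB-and-back conversion
lossless (again a sketch of the sysfs arithmetic, not kernel code):

#include <stdio.h>

/* With the low bit cleared, the KiB round trip preserves the value. */
int main(void)
{
	unsigned int host_max = 2047;
	unsigned int aligned = host_max & ~0x1u;      /* 2046, as in the patch   */
	unsigned int roundtrip = (aligned >> 1) << 1; /* show in KiB, store back */

	printf("aligned=%u roundtrip=%u stable=%s\n",
	       aligned, roundtrip, aligned == roundtrip ? "yes" : "no");
	return 0;
}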