[v4,1/3] block: Improve checks on zone resource limits

Message ID	20240605075144.153141-2-dlemoal@kernel.org (mailing list archive)
State	New
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E1B3918C350; Wed, 5 Jun 2024 07:51:49 +0000 (UTC) From: Damien Le Moal <dlemoal@kernel.org> To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org, dm-devel@lists.linux.dev, Mike Snitzer <snitzer@kernel.org>, Mikulas Patocka <mpatocka@redhat.com> Cc: Christoph Hellwig <hch@lst.de>, Benjamin Marzinski <bmarzins@redhat.com> Subject: [PATCH v4 1/3] block: Improve checks on zone resource limits Date: Wed, 5 Jun 2024 16:51:42 +0900 Message-ID: <20240605075144.153141-2-dlemoal@kernel.org> In-Reply-To: <20240605075144.153141-1-dlemoal@kernel.org> References: <20240605075144.153141-1-dlemoal@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Fix DM zone resource limits stacking \| expand [v4,0/3] Fix DM zone resource limits stacking [v4,1/3] block: Improve checks on zone resource limits [v4,2/3] dm: Improve zone resource limits handling [v4,3/3] dm: Remove unused macro DM_ZONE_INVALID_WP_OFST

Damien Le Moal June 5, 2024, 7:51 a.m. UTC

Make sure that the zone resource limits of a zoned block device are
correct by checking that:
(a) If the device has a max active zones limit, make sure that the max
    open zones limit is lower than the max active zones limit.
(b) If the device has zone resource limits, check that the limits
    values are lower than the number of sequential zones of the device.
    If it is not, assume that the zoned device has no limits by setting
    the limits to 0.

For (a), a check is added to blk_validate_zoned_limits() and an error
returned if the max open zones limit exceeds the value of the max active
zone limit (if there is one).

For (b), given that we need the number of sequential zones of the zoned
device, this check is added to disk_update_zone_resources(). This is
safe to do as that function is executed with the disk queue frozen and
the check executed after queue_limits_start_update() which takes the
queue limits lock. Of note is that the early return in this function
for zoned devices that do not use zone write plugging (e.g. DM devices
using native zone append) is moved to after the new check and adjustment
of the zone resource limits so that the check applies to any zoned
device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-settings.c |  8 ++++++++
 block/blk-zoned.c    | 20 ++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

Christoph Hellwig June 5, 2024, 7:54 a.m. UTC | #1

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Niklas Cassel June 5, 2024, 5:25 p.m. UTC | #2

On Wed, Jun 05, 2024 at 04:51:42PM +0900, Damien Le Moal wrote:
> Make sure that the zone resource limits of a zoned block device are
> correct by checking that:
> (a) If the device has a max active zones limit, make sure that the max
>     open zones limit is lower than the max active zones limit.
> (b) If the device has zone resource limits, check that the limits
>     values are lower than the number of sequential zones of the device.
>     If it is not, assume that the zoned device has no limits by setting
>     the limits to 0.
> 
> For (a), a check is added to blk_validate_zoned_limits() and an error
> returned if the max open zones limit exceeds the value of the max active
> zone limit (if there is one).
> 
> For (b), given that we need the number of sequential zones of the zoned
> device, this check is added to disk_update_zone_resources(). This is
> safe to do as that function is executed with the disk queue frozen and
> the check executed after queue_limits_start_update() which takes the
> queue limits lock. Of note is that the early return in this function
> for zoned devices that do not use zone write plugging (e.g. DM devices
> using native zone append) is moved to after the new check and adjustment
> of the zone resource limits so that the check applies to any zoned
> device.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  block/blk-settings.c |  8 ++++++++
>  block/blk-zoned.c    | 20 ++++++++++++++++----
>  2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index effeb9a639bb..474c709ea85b 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -80,6 +80,14 @@ static int blk_validate_zoned_limits(struct queue_limits *lim)
>  	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED)))
>  		return -EINVAL;
>  
> +	/*
> +	 * Given that active zones include open zones, the maximum number of
> +	 * open zones cannot be larger than the maximum numbber of active zones.

s/numbber/number/


> +	 */
> +	if (lim->max_active_zones &&
> +	    lim->max_open_zones > lim->max_active_zones)
> +		return -EINVAL;
> +
>  	if (lim->zone_write_granularity < lim->logical_block_size)
>  		lim->zone_write_granularity = lim->logical_block_size;
>  
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 52abebf56027..8f89705f5e1c 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1647,8 +1647,22 @@ static int disk_update_zone_resources(struct gendisk *disk,
>  		return -ENODEV;
>  	}
>  
> +	lim = queue_limits_start_update(q);
> +
> +	/*
> +	 * Some devices can advertize zone resource limits that are larger than
> +	 * the number of sequential zones of the zoned block device, e.g. a
> +	 * small ZNS namespace. For such case, assume that the zoned device has
> +	 * no zone resource limits.
> +	 */
> +	nr_seq_zones = disk->nr_zones - nr_conv_zones;
> +	if (lim.max_open_zones >= nr_seq_zones)
> +		lim.max_open_zones = 0;
> +	if (lim.max_active_zones >= nr_seq_zones)
> +		lim.max_active_zones = 0;
> +

Is this really correct to transform to no limits?

The MAR and MOR limits are defined in the I/O Command Set Specific Identify
Namespace Data Structure for the Zoned Namespace Command Set.

However, the user has no ability to control these limits themselves
during a namespace management create ns, or for the format command
(and this still seems to be the case in the latest ZNS spec 1.1d).

Which means that the controller has no way of knowing the number of
resources to allocate to each namespace.

Some (all?) controllers will right now simply report the same MAR/MOR
for all namespaces.


So if I use the namespace management command to create two small
zoned namespaces, the number of sequential zones might be smaller
than the limits in both namespaces, but could together be exceeding
the limit.

How is ignoring the limit that we got from the device better than
actually exposing the limit which we got from the device?

Since AFAICT, this also means that we will expose 0 to sysfs
instead of the value that the device reported.



Perhaps we should only do this optimization if:
- the device is not ZNS, or
- the device is ZNS and does not support NS management, or
- the device is ZNS and supports NS management and implements TP4115
  (Zoned Namespace Resource Management supported bit is set, even if
   that TP does not seem to be part of a Ratified ZNS version yet...)


Kind regards,
Niklas

Damien Le Moal June 6, 2024, 12:06 a.m. UTC | #3

On 6/6/24 2:25 AM, Niklas Cassel wrote:
> On Wed, Jun 05, 2024 at 04:51:42PM +0900, Damien Le Moal wrote:
>> Make sure that the zone resource limits of a zoned block device are
>> correct by checking that:
>> (a) If the device has a max active zones limit, make sure that the max
>>     open zones limit is lower than the max active zones limit.
>> (b) If the device has zone resource limits, check that the limits
>>     values are lower than the number of sequential zones of the device.
>>     If it is not, assume that the zoned device has no limits by setting
>>     the limits to 0.
>>
>> For (a), a check is added to blk_validate_zoned_limits() and an error
>> returned if the max open zones limit exceeds the value of the max active
>> zone limit (if there is one).
>>
>> For (b), given that we need the number of sequential zones of the zoned
>> device, this check is added to disk_update_zone_resources(). This is
>> safe to do as that function is executed with the disk queue frozen and
>> the check executed after queue_limits_start_update() which takes the
>> queue limits lock. Of note is that the early return in this function
>> for zoned devices that do not use zone write plugging (e.g. DM devices
>> using native zone append) is moved to after the new check and adjustment
>> of the zone resource limits so that the check applies to any zoned
>> device.
>>
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>  block/blk-settings.c |  8 ++++++++
>>  block/blk-zoned.c    | 20 ++++++++++++++++----
>>  2 files changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index effeb9a639bb..474c709ea85b 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -80,6 +80,14 @@ static int blk_validate_zoned_limits(struct queue_limits *lim)
>>  	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED)))
>>  		return -EINVAL;
>>  
>> +	/*
>> +	 * Given that active zones include open zones, the maximum number of
>> +	 * open zones cannot be larger than the maximum numbber of active zones.
> 
> s/numbber/number/
> 
> 
>> +	 */
>> +	if (lim->max_active_zones &&
>> +	    lim->max_open_zones > lim->max_active_zones)
>> +		return -EINVAL;
>> +
>>  	if (lim->zone_write_granularity < lim->logical_block_size)
>>  		lim->zone_write_granularity = lim->logical_block_size;
>>  
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 52abebf56027..8f89705f5e1c 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -1647,8 +1647,22 @@ static int disk_update_zone_resources(struct gendisk *disk,
>>  		return -ENODEV;
>>  	}
>>  
>> +	lim = queue_limits_start_update(q);
>> +
>> +	/*
>> +	 * Some devices can advertize zone resource limits that are larger than
>> +	 * the number of sequential zones of the zoned block device, e.g. a
>> +	 * small ZNS namespace. For such case, assume that the zoned device has
>> +	 * no zone resource limits.
>> +	 */
>> +	nr_seq_zones = disk->nr_zones - nr_conv_zones;
>> +	if (lim.max_open_zones >= nr_seq_zones)
>> +		lim.max_open_zones = 0;
>> +	if (lim.max_active_zones >= nr_seq_zones)
>> +		lim.max_active_zones = 0;
>> +
> 
> Is this really correct to transform to no limits?
> 
> The MAR and MOR limits are defined in the I/O Command Set Specific Identify
> Namespace Data Structure for the Zoned Namespace Command Set.
> 
> However, the user has no ability to control these limits themselves
> during a namespace management create ns, or for the format command
> (and this still seems to be the case in the latest ZNS spec 1.1d).
> 
> Which means that the controller has no way of knowing the number of
> resources to allocate to each namespace.
> 
> Some (all?) controllers will right now simply report the same MAR/MOR
> for all namespaces.
> 
> 
> So if I use the namespace management command to create two small
> zoned namespaces, the number of sequential zones might be smaller
> than the limits in both namespaces, but could together be exceeding
> the limit.
> 
> How is ignoring the limit that we got from the device better than
> actually exposing the limit which we got from the device?

If the limits are larger than the number of zones in the namespace, there is no
limits at all. This is what this change does. No question that this is correct.
Even if we do not change the values, any application/fs looking at these limits
being larger than the number of zones will conclude the same.

The problem you are raising is the reliability of the limits themselves, and
for NVMe ZNS, given that MOR/MAR are not defined per namespace, we are in the
same situation as with DM devices sharing the same zoned block dev through
different targets: even if the user respects the limits, write errors may
happen due to the backing dev limits (or controller limits for ZNS) being
exceeded. Nothing much we can do to easily deal with this right now. We would
need to constantly track zone states and implement a software driven zone state
machine checking the limits all the time to actually provide guarantees.

> Since AFAICT, this also means that we will expose 0 to sysfs
> instead of the value that the device reported.

Yes. But the value reported by the device is for the whole controller. The
sysfs attributes are for the block device == namespace.

> Perhaps we should only do this optimization if:
> - the device is not ZNS, or
> - the device is ZNS and does not support NS management, or
> - the device is ZNS and supports NS management and implements TP4115
>   (Zoned Namespace Resource Management supported bit is set, even if
>    that TP does not seem to be part of a Ratified ZNS version yet...)

Right now, this all works the same way for DM and nvme zns, so I think this is
all good. If anything, we should probably add a warning in the nvme driver
about the potentially unreliable moz/moz limits if we see a ZNS device with
multiple zoned namespaces.

Niklas Cassel June 6, 2024, 1:21 a.m. UTC | #4

On Thu, Jun 06, 2024 at 09:06:58AM +0900, Damien Le Moal wrote:
> On 6/6/24 2:25 AM, Niklas Cassel wrote:
> > On Wed, Jun 05, 2024 at 04:51:42PM +0900, Damien Le Moal wrote:
> 
> The problem you are raising is the reliability of the limits themselves, and
> for NVMe ZNS, given that MOR/MAR are not defined per namespace, we are in the
> same situation as with DM devices sharing the same zoned block dev through
> different targets: even if the user respects the limits, write errors may
> happen due to the backing dev limits (or controller limits for ZNS) being
> exceeded. Nothing much we can do to easily deal with this right now. We would
> need to constantly track zone states and implement a software driven zone state
> machine checking the limits all the time to actually provide guarantees.
> 
> > Since AFAICT, this also means that we will expose 0 to sysfs
> > instead of the value that the device reported.
> 
> Yes. But the value reported by the device is for the whole controller. The
> sysfs attributes are for the block device == namespace.

The limits are defined in the I/O Command Set Specific Identify Namespace
Data Structure for the Zoned Namespace Command Set, so they are per NS,
otherwise they would have been defined in the I/O Command Set Specific
Identify Controller Data Structure for the Zoned Namespace Command Set.

> 
> > Perhaps we should only do this optimization if:
> > - the device is not ZNS, or
> > - the device is ZNS and does not support NS management, or
> > - the device is ZNS and supports NS management and implements TP4115
> >   (Zoned Namespace Resource Management supported bit is set, even if
> >    that TP does not seem to be part of a Ratified ZNS version yet...)
> 
> Right now, this all works the same way for DM and nvme zns, so I think this is
> all good. If anything, we should probably add a warning in the nvme driver
> about the potentially unreliable moz/moz limits if we see a ZNS device with
> multiple zoned namespaces.

Well, it is only a problem for ZNS devices with NS management.

If there are two ZNS namespaces on the device, and the device does not
support NS management, the device vendor would have been seriously silly
to not allocate and set the limits in the I/O Command Set Specific Identify
Namespace Data Structure for the Zoned Namespace Command Set correctly.

But yes, this concern cannot be solved in disk_update_zone_resources(),
which operates on per gendisk (and there is one gendisk per namespace),
so not much this function can do. If we were to do something, it would
have to be done in the nvme driver.

Perhaps if the device is ZNS, and does support NS management, but does
not have the Zoned Namespace Resource Management supported bit is set,
divide the MAR/MOR values reported by each namespace by the number of
ZNS namespaces?

Kind regards,
Niklas

Damien Le Moal June 6, 2024, 2:12 a.m. UTC | #5

On 6/6/24 10:21 AM, Niklas Cassel wrote:
>> Right now, this all works the same way for DM and nvme zns, so I think this is
>> all good. If anything, we should probably add a warning in the nvme driver
>> about the potentially unreliable moz/moz limits if we see a ZNS device with
>> multiple zoned namespaces.
> 
> Well, it is only a problem for ZNS devices with NS management.
> 
> If there are two ZNS namespaces on the device, and the device does not
> support NS management, the device vendor would have been seriously silly
> to not allocate and set the limits in the I/O Command Set Specific Identify
> Namespace Data Structure for the Zoned Namespace Command Set correctly.
> 
> But yes, this concern cannot be solved in disk_update_zone_resources(),
> which operates on per gendisk (and there is one gendisk per namespace),
> so not much this function can do. If we were to do something, it would
> have to be done in the nvme driver.
> 
> 
> Perhaps if the device is ZNS, and does support NS management, but does
> not have the Zoned Namespace Resource Management supported bit is set,
> divide the MAR/MOR values reported by each namespace by the number of
> ZNS namespaces?

Maybe. But that would still not provide any guarantee: a buggy application not
respecting the limits would be able to steal resources from the other namespace.

In any case, I think this is a discussion to have on the nvme list.

Christoph Hellwig June 6, 2024, 4:41 a.m. UTC | #6

On Thu, Jun 06, 2024 at 11:12:08AM +0900, Damien Le Moal wrote:
> Maybe. But that would still not provide any guarantee: a buggy application not
> respecting the limits would be able to steal resources from the other namespace.
> 
> In any case, I think this is a discussion to have on the nvme list.

Yes, the per-controller limits in NVMe are a big mistake that was
already mentioned during ZNS standardization.  We don't really have
any good way to deal with it except strongly recommending users to
not use multiple ZNS namespaes per controller.

[v4,1/3] block: Improve checks on zone resource limits

Commit Message

Comments

Patch