diff mbox series

[v6,3/4] dm: Improve zone resource limits handling

Message ID 20240606082147.96422-4-dlemoal@kernel.org (mailing list archive)
State New
Headers show
Series Fix DM zone resource limits stacking | expand

Commit Message

Damien Le Moal June 6, 2024, 8:21 a.m. UTC
The generic stacking of limits implemented in the block layer cannot
correctly handle stacking of zone resource limits (max open zones and
max active zones) because these limits are for an entire device but the
stacking may be for a portion of that device (e.g. a dm-linear target
that does not cover an entire block device). As a result, when DM
devices are created on top of zoned block devices, the DM device never
has any zone resource limits advertized, which is only correct if all
underlying target devices also have no zone resource limits.
If at least one target device has resource limits, the user may see
either performance issues (if the max open zone limit of the device is
exceeded) or write I/O errors if the max active zone limit of one of
the underlying target devices is exceeded.

While it is very difficult to correctly and reliably stack zone resource
limits in general, cases where targets are not sharing zone resources of
the same device can be dealt with relatively easily. Such situation
happens when a target maps all sequential zones of a zoned block device:
for such mapping, other targets mapping other parts of the same zoned
block device can only contain conventional zones and thus will not
require any zone resource to correctly handle write operations.

For a mapped device constructed with such targets, which includes mapped
devices constructed with targets mapping entire zoned block devices, the
zone resource limits can be reliably determined using the non-zero
minimum of the zone resource limits of all targets.

For mapped devices that include targets partially mapping the set of
sequential write required zones of zoned block devices, instead of
advertizing no zone resource limits, it is also better to set the mapped
device limits to the non-zero minimum of the limits of all targets. In
this case the limits for a target depend on the number of sequential
zones being mapped: if this number of zone is larger than the limits,
then the limits of the device apply and can be used. If on the other
hand the target maps a number of zones smaller than the limits, then no
limits is needed and we can assume that the target has no limits (limits
set to 0).

This commit improves zone resource limits handling as described above
by modifying dm_set_zones_restrictions() to iterate the targets of a
mapped device to evaluate the max open and max active zone limits. This
relies on an internal "stacking" of the limits of the target devices
combined with a direct counting of the number of sequential zones
mapped by the targets.
1) For a target mapping an entire zoned block device, the limits for the
   target are set to the limits of the device.
2) For a target partially mapping a zoned block device, the number of
   mapped sequential zones is used to determine the limits: if the
   target maps more sequential write required zones than the device
   limits, then the limits of the device are used as-is. If the number
   of mapped sequential zones is lower than the limits, then we assume
   that the target has no limits (limits set to 0).
As this evaluation is done for each target, the zone resource limits
for the mapped device are evaluated as the non-zero minimum of the
limits of all the targets.

For configurations resulting in unreliable limits, i.e. a table
containing a target partially mapping a zoned device, a warning message
is issued.

The counting of mapped sequential zones for the target is done using the
new function dm_device_count_zones() which performs a report zones on
the entire block device with the callback dm_device_count_zones_cb().
This count of mapped sequential zones is also used to determine if the
mapped device contains only conventional zones. This allows simplifying
dm_set_zones_restrictions() to not do a report zones just for this.
For mapped devices mapping only conventional zones, as before, the
mapped device is changed to a regular device by setting its zoned limit
to false and clearing all its zone related limits.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/dm-zone.c | 175 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 145 insertions(+), 30 deletions(-)

Comments

Benjamin Marzinski June 6, 2024, 7:41 p.m. UTC | #1
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Niklas Cassel June 10, 2024, 7:18 a.m. UTC | #2
On Thu, Jun 06, 2024 at 05:21:46PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
> 
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
> 
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
> 
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertizing no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets. In
> this case the limits for a target depend on the number of sequential
> zones being mapped: if this number of zone is larger than the limits,
> then the limits of the device apply and can be used. If on the other
> hand the target maps a number of zones smaller than the limits, then no
> limits is needed and we can assume that the target has no limits (limits
> set to 0).
> 
> This commit improves zone resource limits handling as described above
> by modifying dm_set_zones_restrictions() to iterate the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the targets.
> 1) For a target mapping an entire zoned block device, the limits for the
>    target are set to the limits of the device.
> 2) For a target partially mapping a zoned block device, the number of
>    mapped sequential zones is used to determine the limits: if the
>    target maps more sequential write required zones than the device
>    limits, then the limits of the device are used as-is. If the number
>    of mapped sequential zones is lower than the limits, then we assume
>    that the target has no limits (limits set to 0).
> As this evaluation is done for each target, the zone resource limits
> for the mapped device are evaluated as the non-zero minimum of the
> limits of all the targets.
> 
> For configurations resulting in unreliable limits, i.e. a table
> containing a target partially mapping a zoned device, a warning message
> is issued.
> 
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is also used to determine if the
> mapped device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones just for this.
> For mapped devices mapping only conventional zones, as before, the
> mapped device is changed to a regular device by setting its zoned limit
> to false and clearing all its zone related limits.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/md/dm-zone.c | 175 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 145 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 75d0019a0649..d9f8b7c0957a 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,21 +145,6 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
>  	}
>  }
>  
> -/*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> - */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> -			     void *data)
> -{
> -	unsigned int *nr_conv_zones = data;
> -
> -	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> -		(*nr_conv_zones)++;
> -
> -	return 0;
> -}
> -
>  /*
>   * Revalidate the zones of a mapped device to initialize resource necessary
>   * for zone append emulation. Note that we cannot simply use the block layer
> @@ -228,13 +213,122 @@ static bool dm_table_supports_zone_append(struct dm_table *t)
>  	return true;
>  }
>  
> +struct dm_device_zone_count {
> +	sector_t start;
> +	sector_t len;
> +	unsigned int total_nr_seq_zones;
> +	unsigned int target_nr_seq_zones;
> +};
> +
> +/*
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
> + */
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> +				    unsigned int idx, void *data)
> +{
> +	struct dm_device_zone_count *zc = data;
> +
> +	if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> +		zc->total_nr_seq_zones++;
> +		if (zone->start >= zc->start &&
> +		    zone->start < zc->start + zc->len)
> +			zc->target_nr_seq_zones++;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> +				 struct dm_device_zone_count *zc)
> +{
> +	int ret;
> +
> +	ret = blkdev_report_zones(dev->bdev, 0, BLK_ALL_ZONES,
> +				  dm_device_count_zones_cb, zc);
> +	if (ret < 0)
> +		return ret;
> +	if (!ret)
> +		return -EIO;
> +	return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> +	unsigned int mapped_nr_seq_zones;
> +	struct queue_limits *lim;
> +	bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> +					   struct dm_dev *dev, sector_t start,
> +					   sector_t len, void *data)
> +{
> +	struct dm_zone_resource_limits *zlim = data;
> +	struct gendisk *disk = dev->bdev->bd_disk;
> +	unsigned int max_open_zones, max_active_zones;
> +	int ret;
> +	struct dm_device_zone_count zc = {
> +		.start = start,
> +		.len = len,
> +	};
> +
> +	/*
> +	 * If the target is not the whole device, the device zone resources may
> +	 * be shared between different targets. Check this by counting the
> +	 * number of mapped sequential zones: if this number is smaller than the
> +	 * total number of sequential zones of the target device, then resource
> +	 * sharing may happen and the zone limits will not be reliable.
> +	 */
> +	ret = dm_device_count_zones(dev, &zc);
> +	if (ret) {
> +		DMERR("Count %s zones failed %d", disk->disk_name, ret);
> +		return ret;
> +	}
> +
> +	zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> +	/*
> +	 * If the target does not map any sequential zones, then we do not need
> +	 * any zone resource limits.
> +	 */
> +	if (!zc.target_nr_seq_zones)
> +		return 0;

Nit: To me, it would be more logical to do the:

if (!zc.target_nr_seq_zones)
	return 0;

before doing:
zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;

Sure, if zero, += will not modify zlim->mapped_nr_seq_zones,
but since we do the check anyway, why not just do in before
the addition?

Personally, I would respin, but I leave it to the patch author.

Regardless:
Reviewed-by: Niklas Cassel <cassel@kernel.org>
diff mbox series

Patch

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 75d0019a0649..d9f8b7c0957a 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -145,21 +145,6 @@  bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 	}
 }
 
-/*
- * Count conventional zones of a mapped zoned device. If the device
- * only has conventional zones, do not expose it as zoned.
- */
-static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
-			     void *data)
-{
-	unsigned int *nr_conv_zones = data;
-
-	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
-		(*nr_conv_zones)++;
-
-	return 0;
-}
-
 /*
  * Revalidate the zones of a mapped device to initialize resource necessary
  * for zone append emulation. Note that we cannot simply use the block layer
@@ -228,13 +213,122 @@  static bool dm_table_supports_zone_append(struct dm_table *t)
 	return true;
 }
 
+struct dm_device_zone_count {
+	sector_t start;
+	sector_t len;
+	unsigned int total_nr_seq_zones;
+	unsigned int target_nr_seq_zones;
+};
+
+/*
+ * Count the total number of and the number of mapped sequential zones of a
+ * target zoned device.
+ */
+static int dm_device_count_zones_cb(struct blk_zone *zone,
+				    unsigned int idx, void *data)
+{
+	struct dm_device_zone_count *zc = data;
+
+	if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
+		zc->total_nr_seq_zones++;
+		if (zone->start >= zc->start &&
+		    zone->start < zc->start + zc->len)
+			zc->target_nr_seq_zones++;
+	}
+
+	return 0;
+}
+
+static int dm_device_count_zones(struct dm_dev *dev,
+				 struct dm_device_zone_count *zc)
+{
+	int ret;
+
+	ret = blkdev_report_zones(dev->bdev, 0, BLK_ALL_ZONES,
+				  dm_device_count_zones_cb, zc);
+	if (ret < 0)
+		return ret;
+	if (!ret)
+		return -EIO;
+	return 0;
+}
+
+struct dm_zone_resource_limits {
+	unsigned int mapped_nr_seq_zones;
+	struct queue_limits *lim;
+	bool reliable_limits;
+};
+
+static int device_get_zone_resource_limits(struct dm_target *ti,
+					   struct dm_dev *dev, sector_t start,
+					   sector_t len, void *data)
+{
+	struct dm_zone_resource_limits *zlim = data;
+	struct gendisk *disk = dev->bdev->bd_disk;
+	unsigned int max_open_zones, max_active_zones;
+	int ret;
+	struct dm_device_zone_count zc = {
+		.start = start,
+		.len = len,
+	};
+
+	/*
+	 * If the target is not the whole device, the device zone resources may
+	 * be shared between different targets. Check this by counting the
+	 * number of mapped sequential zones: if this number is smaller than the
+	 * total number of sequential zones of the target device, then resource
+	 * sharing may happen and the zone limits will not be reliable.
+	 */
+	ret = dm_device_count_zones(dev, &zc);
+	if (ret) {
+		DMERR("Count %s zones failed %d", disk->disk_name, ret);
+		return ret;
+	}
+
+	zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
+
+	/*
+	 * If the target does not map any sequential zones, then we do not need
+	 * any zone resource limits.
+	 */
+	if (!zc.target_nr_seq_zones)
+		return 0;
+
+	/*
+	 * If the target does not map all sequential zones, the limits
+	 * will not be reliable.
+	 */
+	if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
+		zlim->reliable_limits = false;
+
+	/*
+	 * If the target maps less sequential zones than the limit values, then
+	 * we do not have limits for this target.
+	 */
+	max_active_zones = disk->queue->limits.max_active_zones;
+	if (max_active_zones >= zc.target_nr_seq_zones)
+		max_active_zones = 0;
+	zlim->lim->max_active_zones =
+		min_not_zero(max_active_zones, zlim->lim->max_active_zones);
+
+	max_open_zones = disk->queue->limits.max_open_zones;
+	if (max_open_zones >= zc.target_nr_seq_zones)
+		max_open_zones = 0;
+	zlim->lim->max_open_zones =
+		min_not_zero(max_open_zones, zlim->lim->max_open_zones);
+
+	return 0;
+}
+
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		struct queue_limits *lim)
 {
 	struct mapped_device *md = t->md;
 	struct gendisk *disk = md->disk;
-	unsigned int nr_conv_zones = 0;
-	int ret;
+	struct dm_zone_resource_limits zlim = {
+		.reliable_limits = true,
+		.lim = lim,
+	};
 
 	/*
 	 * Check if zone append is natively supported, and if not, set the
@@ -249,32 +343,53 @@  int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 	}
 
 	/*
-	 * Count conventional zones to check that the mapped device will indeed 
-	 * have sequential write required zones.
+	 * Determine the max open and max active zone limits for the mapped
+	 * device by inspecting the zone resource limits and the zones mapped
+	 * by each target.
 	 */
-	md->zone_revalidate_map = t;
-	ret = dm_blk_report_zones(disk, 0, UINT_MAX,
-				  dm_check_zoned_cb, &nr_conv_zones);
-	md->zone_revalidate_map = NULL;
-	if (ret < 0) {
-		DMERR("Check zoned failed %d", ret);
-		return ret;
+	for (unsigned int i = 0; i < t->num_targets; i++) {
+		struct dm_target *ti = dm_table_get_target(t, i);
+
+		if (!ti->type->iterate_devices ||
+		    ti->type->iterate_devices(ti,
+				device_get_zone_resource_limits, &zlim)) {
+			DMERR("Could not determine %s zone resource limits",
+			      disk->disk_name);
+			return -ENODEV;
+		}
 	}
 
 	/*
-	 * If we only have conventional zones, expose the mapped device as
-	 * a regular device.
+	 * If we only have conventional zones mapped, expose the mapped device
+	 + as a regular device.
 	 */
-	if (nr_conv_zones >= ret) {
+	if (!zlim.mapped_nr_seq_zones) {
 		lim->max_open_zones = 0;
 		lim->max_active_zones = 0;
+		lim->max_zone_append_sectors = 0;
+		lim->zone_write_granularity = 0;
+		lim->chunk_sectors = 0;
 		lim->zoned = false;
 		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+		md->nr_zones = 0;
 		disk->nr_zones = 0;
 		return 0;
 	}
 
-	if (!static_key_enabled(&zoned_enabled.key))
+	/*
+	 * Warn once (when the capacity is not yet set) if the mapped device is
+	 * partially using zone resources of the target devices as that leads to
+	 * unreliable limits, i.e. if another mapped device uses the same
+	 * underlying devices, we cannot enforce zone limits to guarantee that
+	 * writing will not lead to errors. Note that we really should return
+	 * an error for such case but there is no easy way to find out if
+	 * another mapped device uses the same underlying zoned devices.
+	 */
+	if (!get_capacity(disk) && !zlim.reliable_limits)
+		DMWARN("%s zone resource limits may be unreliable",
+		       disk->disk_name);
+
+	if (lim->zoned && !static_key_enabled(&zoned_enabled.key))
 		static_branch_enable(&zoned_enabled);
 	return 0;
 }