
[15/17] btrfs: calculate file system wide queue limit for zoned mode

Message ID 20220901074216.1849941-16-hch@lst.de
State New, archived
Series [01/17] btrfs: export bio_split_rw

Commit Message

Christoph Hellwig Sept. 1, 2022, 7:42 a.m. UTC
To be able to split a write into properly sized zone append commands,
we need a queue_limits structure that contains the least common
denominator suitable for all devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h |  4 +++-
 fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
 fs/btrfs/zoned.h |  1 -
 3 files changed, 21 insertions(+), 20 deletions(-)

Comments

Johannes Thumshirn Sept. 1, 2022, 11:28 a.m. UTC | #1
On 01.09.22 09:43, Christoph Hellwig wrote:
> To be able to split a write into properly sized zone append commands,
> we need a queue_limits structure that contains the least common
> denominator suitable for all devices.
> 

This patch conflicts with Shinichiro's patch restoring functionality 
of the zone emulation mode.
Damien Le Moal Sept. 2, 2022, 1:56 a.m. UTC | #2
On 9/1/22 16:42, Christoph Hellwig wrote:
> To be able to split a write into properly sized zone append commands,
> we need a queue_limits structure that contains the least common
> denominator suitable for all devices.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/ctree.h |  4 +++-
>  fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
>  fs/btrfs/zoned.h |  1 -
>  3 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5e57e3c6a1fd6..a37129363e184 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1071,8 +1071,10 @@ struct btrfs_fs_info {
>  	 */
>  	u64 zone_size;
>  
> -	/* Max size to emit ZONE_APPEND write command */
> +	/* Constraints for ZONE_APPEND commands: */
> +	struct queue_limits limits;
>  	u64 max_zone_append_size;

Can't we get rid of this one and have the code directly use
fs_info->limits.max_zone_append_sectors through a little helper doing a
conversion to bytes (a 9-bit shift)?
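
Such a helper could be as small as this (sketch only, name hypothetical;
fs_info->limits is the structure added by this patch):

	static inline u64 btrfs_max_zone_append_size(const struct btrfs_fs_info *fs_info)
	{
		return (u64)fs_info->limits.max_zone_append_sectors << SECTOR_SHIFT;
	}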

[...]
>  	/* Count zoned devices */
>  	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>  		enum blk_zoned_model model;
> @@ -685,11 +677,9 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>  				ret = -EINVAL;
>  				goto out;
>  			}
> -			if (!max_zone_append_size ||
> -			    (zone_info->max_zone_append_size &&
> -			     zone_info->max_zone_append_size < max_zone_append_size))
> -				max_zone_append_size =
> -					zone_info->max_zone_append_size;
> +			blk_stack_limits(lim,
> +					 &bdev_get_queue(device->bdev)->limits,
> +					 0);

This does:

	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
                                        b->max_zone_append_sectors);

So if we are mixing zoned and non-zoned devices in a multi-dev volume,
we'll end up with max_zone_append_sectors being 0. The previous code
prevented that.
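
Concretely (values made up, not code from the patch): with a zoned device
reporting 1024 and a conventional device reporting 0, the two behave like
this:

	u64 zoned_limit = 1024, conventional_limit = 0;

	/* What blk_stack_limits() computes for the stacked limits: */
	u64 stacked = min(zoned_limit, conventional_limit);		/* 0 */

	/* What the old btrfs code effectively computed: */
	u64 previous = min_not_zero(zoned_limit, conventional_limit);	/* 1024 */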

Note that I am not sure if it is allowed to mix zoned and non-zoned drives
in the same volume. Given that we have a fake zone emulation for non-zoned
drives with zoned btrfs, I do not see why it would not work. But I may be
wrong.
Damien Le Moal Sept. 2, 2022, 1:59 a.m. UTC | #3
On 9/2/22 10:56, Damien Le Moal wrote:
> On 9/1/22 16:42, Christoph Hellwig wrote:
>> To be able to split a write into properly sized zone append commands,
>> we need a queue_limits structure that contains the least common
>> denominator suitable for all devices.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/btrfs/ctree.h |  4 +++-
>>  fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
>>  fs/btrfs/zoned.h |  1 -
>>  3 files changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 5e57e3c6a1fd6..a37129363e184 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1071,8 +1071,10 @@ struct btrfs_fs_info {
>>  	 */
>>  	u64 zone_size;
>>  
>> -	/* Max size to emit ZONE_APPEND write command */
>> +	/* Constraints for ZONE_APPEND commands: */
>> +	struct queue_limits limits;
>>  	u64 max_zone_append_size;
> 
> Can't we get rid of this one and have the code directly use
> fs_info->limits.max_zone_append_sectors through a little helper doing a
> conversion to bytes (a 9-bit shift)?

Note: Only a suggestion, not sure that would be much of a cleanup.

> 
> [...]
>>  	/* Count zoned devices */
>>  	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>  		enum blk_zoned_model model;
>> @@ -685,11 +677,9 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>>  				ret = -EINVAL;
>>  				goto out;
>>  			}
>> -			if (!max_zone_append_size ||
>> -			    (zone_info->max_zone_append_size &&
>> -			     zone_info->max_zone_append_size < max_zone_append_size))
>> -				max_zone_append_size =
>> -					zone_info->max_zone_append_size;
>> +			blk_stack_limits(lim,
>> +					 &bdev_get_queue(device->bdev)->limits,
>> +					 0);
> 
> This does:
> 
> 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
>                                         b->max_zone_append_sectors);
> 
> So if we are mixing zoned and non-zoned devices in a multi-dev volume,
> we'll end up with max_zone_append_sectors being 0. The previous code
> prevented that.
> 
> Note that I am not sure if it is allowed to mix zoned and non-zoned drives
> in the same volume. Given that we have a fake zone emulation for non-zoned
> drives with zoned btrfs, I do not see why it would not work. But I may be
> wrong.
>
Christoph Hellwig Sept. 5, 2022, 6:50 a.m. UTC | #4
On Thu, Sep 01, 2022 at 11:28:02AM +0000, Johannes Thumshirn wrote:
> On 01.09.22 09:43, Christoph Hellwig wrote:
> > To be able to split a write into properly sized zone append commands,
> > we need a queue_limits structure that contains the least common
> > denominator suitable for all devices.
> > 
> 
> This patch conflicts with Shinichiro's patch restoring functionality 
> of the zone emulation mode.

Looks like that got into misc-next in the meantime, so I'll rebase on
top of that.
Christoph Hellwig Sept. 5, 2022, 6:54 a.m. UTC | #5
On Fri, Sep 02, 2022 at 10:56:40AM +0900, Damien Le Moal wrote:
> > -	/* Max size to emit ZONE_APPEND write command */
> > +	/* Constraints for ZONE_APPEND commands: */
> > +	struct queue_limits limits;
> >  	u64 max_zone_append_size;
> 
> Can't we get rid of this one and have the code directly use
> fs_info->limits.max_zone_append_sectors through a little helper doing a
> conversion to bytes (a 9-bit shift)?

Well, the helper would be a little more complicated, doing three
different shifts, a min3 and an ALIGN_DOWN.  That's why I thought
I'd rather cache the value than recalculate it on every write.  But
either way would be entirely feasible.
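
Based on the calculation this patch does at mount time, such an uncached
helper would look roughly like this (sketch only, name hypothetical):

	static inline u64 btrfs_max_zone_append_size(const struct btrfs_fs_info *fs_info)
	{
		const struct queue_limits *lim = &fs_info->limits;

		return ALIGN_DOWN(min3((u64)lim->max_zone_append_sectors << SECTOR_SHIFT,
				       (u64)lim->max_sectors << SECTOR_SHIFT,
				       (u64)lim->max_segments << PAGE_SHIFT),
				  fs_info->sectorsize);
	}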

> This does:
> 
> 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
>                                         b->max_zone_append_sectors);
> 
> So if we are mixing zoned and non-zoned devices in a multi-dev volume,
> we'll end up with max_zone_append_sectors being 0. The previous code
> prevented that.
> 
> Note that I am not sure if it is allowed to mix zoned and non-zoned drives
> in the same volume. Given that we have a fake zone emulation for non-zoned
> drives with zoned btrfs, I do not see why it would not work. But I may be
> wrong.

Yes, this could be problematic.  If we want to support that, I wonder
if the block layer needs to initialize max_zone_append_sectors to
max_hw_sectors by default and use a separate flag to indicate whether
zone append is actually supported.
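
A rough sketch of the defaulting part of that idea (helper name
hypothetical, the separate "actually supported" flag left out since it
does not exist today):

	/*
	 * Sketch only: give max_zone_append_sectors a non-zero default
	 * so that blk_stack_limits()'s min() does not collapse the
	 * stacked limit to zero when a conventional device is part of
	 * the volume.
	 */
	static void default_zone_append_limit(struct queue_limits *lim)
	{
		if (!lim->max_zone_append_sectors)
			lim->max_zone_append_sectors = lim->max_hw_sectors;
	}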

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5e57e3c6a1fd6..a37129363e184 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1071,8 +1071,10 @@  struct btrfs_fs_info {
 	 */
 	u64 zone_size;
 
-	/* Max size to emit ZONE_APPEND write command */
+	/* Constraints for ZONE_APPEND commands: */
+	struct queue_limits limits;
 	u64 max_zone_append_size;
+
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2638f71eec4b6..6e04fbbd76b92 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -415,16 +415,6 @@  int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
-	/*
-	 * We limit max_zone_append_size also by max_segments *
-	 * PAGE_SIZE. Technically, we can have multiple pages per segment. But,
-	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
-	 * increase the metadata reservation even if it increases the number of
-	 * extents, it is safe to stick with the limit.
-	 */
-	zone_info->max_zone_append_size =
-		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
-		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
@@ -646,14 +636,16 @@  int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct queue_limits *lim = &fs_info->limits;
 	struct btrfs_device *device;
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
-	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
+	blk_set_stacking_limits(lim);
+
 	/* Count zoned devices */
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		enum blk_zoned_model model;
@@ -685,11 +677,9 @@  int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
-			if (!max_zone_append_size ||
-			    (zone_info->max_zone_append_size &&
-			     zone_info->max_zone_append_size < max_zone_append_size))
-				max_zone_append_size =
-					zone_info->max_zone_append_size;
+			blk_stack_limits(lim,
+					 &bdev_get_queue(device->bdev)->limits,
+					 0);
 		}
 		nr_devices++;
 	}
@@ -739,8 +729,18 @@  int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
-	fs_info->max_zone_append_size = ALIGN_DOWN(max_zone_append_size,
-						   fs_info->sectorsize);
+	/*
+	 * Also limit max_zone_append_size by max_segments * PAGE_SIZE.
+	 * Technically, we can have multiple pages per segment. But,
+	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
+	 * increase the metadata reservation even if it increases the number of
+	 * extents, it is safe to stick with the limit.
+	 */
+	fs_info->max_zone_append_size = ALIGN_DOWN(
+		min3((u64)lim->max_zone_append_sectors << SECTOR_SHIFT,
+		     (u64)lim->max_sectors << SECTOR_SHIFT,
+		     (u64)lim->max_segments << PAGE_SHIFT),
+		fs_info->sectorsize);
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 	if (fs_info->max_zone_append_size < fs_info->max_extent_size)
 		fs_info->max_extent_size = fs_info->max_zone_append_size;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index cafa639927050..0f22b22fe359f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -19,7 +19,6 @@  struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
-	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;