[v2,10/11] block: Add support for the zone capacity concept

Message ID: 20230418224002.1195163-11-bvanassche@acm.org
State: New, archived
Series: mq-deadline: Improve support for zoned block devices

Commit Message

Bart Van Assche April 18, 2023, 10:40 p.m. UTC
Make the zone capacity available in struct queue_limits for those
drivers that need it.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 Documentation/ABI/stable/sysfs-block |  8 ++++++++
 block/blk-settings.c                 |  1 +
 block/blk-sysfs.c                    |  7 +++++++
 block/blk-zoned.c                    | 15 +++++++++++++++
 include/linux/blkdev.h               |  1 +
 5 files changed, 32 insertions(+)
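
For those drivers, reading the new limit would look roughly like this
(illustrative only; bdev_zone_capacity() is a hypothetical helper name and is
not part of this patch):

#include <linux/blkdev.h>

/*
 * Hypothetical helper, for illustration only: this series just adds the
 * queue_limits field and the sysfs attribute shown in the diff below.
 */
static inline unsigned int bdev_zone_capacity(struct block_device *bdev)
{
	return bdev_get_queue(bdev)->limits.zone_capacity;
}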

Comments

Niklas Cassel April 20, 2023, 9:23 a.m. UTC | #1
On Tue, Apr 18, 2023 at 03:40:01PM -0700, Bart Van Assche wrote:
> Make the zone capacity available in struct queue_limits for those
> drivers that need it.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  Documentation/ABI/stable/sysfs-block |  8 ++++++++
>  block/blk-settings.c                 |  1 +
>  block/blk-sysfs.c                    |  7 +++++++
>  block/blk-zoned.c                    | 15 +++++++++++++++
>  include/linux/blkdev.h               |  1 +
>  5 files changed, 32 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index c57e5b7cb532..4527d0514fdb 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -671,6 +671,14 @@ Description:
>  		regular block devices.
>  
>  
> +What:		/sys/block/<disk>/queue/zone_capacity
> +Date:		March 2023
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RO] The number of 512-byte sectors in a zone that can be read
> +		or written. This number is less than or equal to the zone size.
> +
> +
>  What:		/sys/block/<disk>/queue/zone_write_granularity
>  Date:		January 2021
>  Contact:	linux-block@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..96f5dc63a815 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -685,6 +685,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  						   b->max_secure_erase_sectors);
>  	t->zone_write_granularity = max(t->zone_write_granularity,
>  					b->zone_write_granularity);
> +	t->zone_capacity = max(t->zone_capacity, b->zone_capacity);
>  	t->zoned = max(t->zoned, b->zoned);
>  	return ret;
>  }

(snip)

> @@ -496,12 +498,23 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  				disk->disk_name);
>  			return -ENODEV;
>  		}
> +		if (zone->capacity != args->zone_capacity) {
> +			pr_warn("%s: Invalid zoned device with non constant zone capacity\n",
> +				disk->disk_name);
> +			return -ENODEV;

Hello Bart,

The NVMe Zoned Namespace Command Set Specification:
https://nvmexpress.org/wp-content/uploads/NVM-Express-Zoned-Namespace-Command-Set-Specification-1.1c-2022.10.03-Ratified.pdf

specifies the Zone Capacity (ZCAP) field in each Zone Descriptor.
The Descriptors are part of the Report Zones Data Structure.

This means that while the zone size is the same for all zones,
the zone capacity can be different for each zone.

While the single NVMe ZNS SSD that I've encountered so far did
coincidentally have the same zone capacity for all zones, this
is not required by the specification.
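
To illustrate, here is a rough (untested) sketch using the in-kernel report
zones interface; the point is that the capacity has to be read from each
struct blk_zone rather than assumed from the first zone:

#include <linux/blkdev.h>

/* Untested sketch: report all zones and flag any zone whose capacity
 * differs from the capacity of the first zone. */
static int zcap_cmp_cb(struct blk_zone *zone, unsigned int idx, void *data)
{
	u64 *first_cap = data;

	if (idx == 0)
		*first_cap = zone->capacity;
	else if (zone->capacity != *first_cap)
		pr_info("zone %u: capacity %llu differs from %llu\n",
			idx, zone->capacity, *first_cap);
	return 0;
}

/* e.g.: blkdev_report_zones(bdev, 0, BLK_ALL_ZONES, zcap_cmp_cb, &first_cap); */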

The NVMe driver does reject a ZNS device that has support for
Variable Zone Capacity (which is defined in the ZOC field):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/zns.c?h=v6.3-rc7#n95

Variable Zone Capacity simply means that the zone capacities
cannot change without an NVM format.

However, even when Variable Zone Capacity is not supported,
an NVMe ZNS device can still have different zone capacities,
and AFAICT, such devices are currently supported.

With your change above, we would start rejecting such devices.


Is this reduction of supported NVMe ZNS SSD devices really desired?

If it is, then I would at least expect the commit message to explain why
we are reducing the scope of the currently supported NVMe ZNS devices.


Kind regards,
Niklas
Bart Van Assche April 20, 2023, 5:12 p.m. UTC | #2
On 4/20/23 02:23, Niklas Cassel wrote:
> With your change above, we would start rejecting such devices.
> 
> Is this reduction of supported NVMe ZNS SSD devices really desired?

Hi Niklas,

This is not my intention. A possible solution is to modify the NVMe 
driver and SCSI core such that the "zone is full" information is stored 
in struct request when a command completes. That will remove the need 
for the mq-deadline scheduler to know the zone capacity.
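
Something like the following sketch is what I have in mind (purely
illustrative; neither the PDU layout nor the helper below exists today):

#include <linux/blk-mq.h>

/* Illustrative only: the LLD records at completion time whether the command
 * made its zone full, based on the per-zone capacity it already knows from
 * its own zone report, so the scheduler never needs the capacity itself. */
struct zoned_cmd_pdu {			/* hypothetical per-request driver data */
	bool made_zone_full;
};

static void record_zone_full(struct request *rq, sector_t zone_start,
			     sector_t zone_capacity)
{
	struct zoned_cmd_pdu *pdu = blk_mq_rq_to_pdu(rq);

	pdu->made_zone_full = blk_rq_pos(rq) + blk_rq_sectors(rq) >=
			      zone_start + zone_capacity;
}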

Bart.
Damien Le Moal April 20, 2023, 10 p.m. UTC | #3
On 4/21/23 02:12, Bart Van Assche wrote:
> On 4/20/23 02:23, Niklas Cassel wrote:
>> With your change above, we would start rejecting such devices.
>>
>> Is this reduction of supported NVMe ZNS SSD devices really desired?
> 
> Hi Niklas,
> 
> This is not my intention. A possible solution is to modify the NVMe 
> driver and SCSI core such that the "zone is full" information is stored 
> in struct request when a command completes. That will remove the need 
> for the mq-deadline scheduler to know the zone capacity.

I am not following... Why would the scheduler need to know the zone capacity ?

If the user does stupid things like accessing sectors between zone capacity and
zone size or trying to write to a full zone, the commands will be failed by the
drive and I do not see why the scheduler needs to care about that.

> 
> Bart.
Bart Van Assche April 20, 2023, 10:51 p.m. UTC | #4
On 4/20/23 15:00, Damien Le Moal wrote:
> On 4/21/23 02:12, Bart Van Assche wrote:
>> On 4/20/23 02:23, Niklas Cassel wrote:
>>> With your change above, we would start rejecting such devices.
>>>
>>> Is this reduction of supported NVMe ZNS SSD devices really desired?
>>
>> Hi Niklas,
>>
>> This is not my intention. A possible solution is to modify the NVMe
>> driver and SCSI core such that the "zone is full" information is stored
>> in struct request when a command completes. That will remove the need
>> for the mq-deadline scheduler to know the zone capacity.
> 
> I am not following... Why would the scheduler need to know the zone capacity ?
> 
> If the user does stupid things like accessing sectors between zone capacity and
> zone size or trying to write to a full zone, the commands will be failed by the
> drive and I do not see why the scheduler need to care about that.

Hi Damien,

Restricting the number of active zones in the I/O scheduler (patch 
11/11) requires some knowledge of the zone condition.

According to ZBC-2, for sequential write preferred zones the additional 
sense code ZONE TRANSITION TO FULL must be reported if the zone 
condition changes from not full into full. There is no such requirement 
for sequential write required zones. Additionally, I'm not aware of any 
reporting mechanism in the NVMe specification for changes in the zone 
condition.

The overhead of submitting a REPORT ZONES command after every I/O 
completion would be unacceptable.

Is there any other solution for tracking the zone condition other than 
comparing the LBA at which a WRITE command finished with the zone capacity?
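
For concreteness, the comparison I have in mind looks roughly like this
(sketch only; it assumes the single per-queue zone_capacity limit that this
patch adds):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* Sketch: a completed WRITE has made its zone full once it ends at or beyond
 * zone start + zone capacity. Zone sizes are a power of two in Linux, so the
 * zone start can be derived by masking. */
static bool write_filled_zone(struct request *rq)
{
	struct request_queue *q = rq->q;
	sector_t zone_start = blk_rq_pos(rq) & ~(blk_queue_zone_sectors(q) - 1);

	return blk_rq_pos(rq) + blk_rq_sectors(rq) >=
	       zone_start + q->limits.zone_capacity;
}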

Did I perhaps overlook something?

Thanks,

Bart.
Damien Le Moal April 20, 2023, 11:37 p.m. UTC | #5
On 4/21/23 07:51, Bart Van Assche wrote:
> On 4/20/23 15:00, Damien Le Moal wrote:
>> On 4/21/23 02:12, Bart Van Assche wrote:
>>> On 4/20/23 02:23, Niklas Cassel wrote:
>>>> With your change above, we would start rejecting such devices.
>>>>
>>>> Is this reduction of supported NVMe ZNS SSD devices really desired?
>>>
>>> Hi Niklas,
>>>
>>> This is not my intention. A possible solution is to modify the NVMe
>>> driver and SCSI core such that the "zone is full" information is stored
>>> in struct request when a command completes. That will remove the need
>>> for the mq-deadline scheduler to know the zone capacity.
>>
>> I am not following... Why would the scheduler need to know the zone capacity ?
>>
>> If the user does stupid things like accessing sectors between zone capacity and
>> zone size or trying to write to a full zone, the commands will be failed by the
>> drive and I do not see why the scheduler need to care about that.
> 
> Hi Damien,
> 
> Restricting the number of active zones in the I/O scheduler (patch 
> 11/11) requires some knowledge of the zone condition.

Why would you need to handle the max active zone number restriction in the
scheduler ? That is the user responsibility. btrfs does it (that is still buggy
though, working on it).

> According to ZBC-2, for sequential write preferred zones the additional 
> sense code ZONE TRANSITION TO FULL must be reported if the zone 
> condition changes from not full into full. There is no such requirement 
> for sequential write required zones. Additionally, I'm not aware of any 
> reporting mechanism in the NVMe specification for changes in the zone 
> condition.

Sequential write preferred zones are a ZBC concept, and ZBC does not have the
concept of active zones. In general, for ZBC HDDs, ignoring the max number of
open zones is
fine. There is no performance impact that can be measured, unless the user goes
full random write on the device. But in that case, the user is already asking
for bad perf anyway.

I suspect you are thinking about all this in the context of UFS devices ?
My point stands though. Trying to manage the active zone limit at the scheduler
level is not a good idea as there are no guarantees that the user will
eventually issue all the write commands to make zones full, and thus turn them
inactive. This is the responsibility of the user to manage that, so above the
block IO scheduler.

> The overhead of submitting a REPORT ZONES command after every I/O 
> completion would be unacceptable.
> 
> Is there any other solution for tracking the zone condition other than 
> comparing the LBA at which a WRITE command finished with the zone capacity?

The sd driver does some minimal tracking already which is used for zone append
emulation.

> 
> Did I perhaps overlook something?
> 
> Thanks,
> 
> Bart.
Bart Van Assche April 20, 2023, 11:44 p.m. UTC | #6
On 4/20/23 16:37, Damien Le Moal wrote:
> Why would you need to handle the max active zone number restriction in the
> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> though, working on it).

Hi Damien,

If the user (filesystem) restricts the number of active zones, the code 
for restricting the number of active zones will have to be duplicated 
into every filesystem that supports zoned devices. Wouldn't it be better 
if the I/O scheduler tracks the number of active zones?

Thanks,

Bart.
Damien Le Moal April 20, 2023, 11:53 p.m. UTC | #7
On 4/21/23 08:44, Bart Van Assche wrote:
> On 4/20/23 16:37, Damien Le Moal wrote:
>> Why would you need to handle the max active zone number restriction in the
>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>> though, working on it).
> 
> Hi Damien,
> 
> If the user (filesystem) restricts the number of active zones, the code 
> for restricting the number of active zones will have to be duplicated 
> into every filesystem that supports zoned devices. Wouldn't it be better 
> if the I/O scheduler tracks the number of active zones?

I do not think so. The reason is that for a file system, the block allocator
must be aware of any active zone limit of the underlying device to make the best
decision possible regarding where to allocate blocks for files and metadata. For
btrfs, we added "active block groups" management for that purpose. And we have
tracking of a block group active state so that the block allocator can start
using new block groups (inactive ones) when previously used ones become full. We
also have a "finish block group" for cases when there is not enough remaining
free blocks in a group/zone (this does a finish zone operation to make a
non-full zone full, that is, inactive).

Even if the block IO scheduler were to track active zones, the FS would still
need to do its own tracking (e.g. to be able to finish zones when needed). So I
do not see the point in having the block scheduler doing anything about active
zones.
Jaegeuk Kim April 21, 2023, 12:29 a.m. UTC | #8
On 04/21, Damien Le Moal wrote:
> On 4/21/23 08:44, Bart Van Assche wrote:
> > On 4/20/23 16:37, Damien Le Moal wrote:
> >> Why would you need to handle the max active zone number restriction in the
> >> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> >> though, working on it).
> > 
> > Hi Damien,
> > 
> > If the user (filesystem) restricts the number of active zones, the code 
> > for restricting the number of active zones will have to be duplicated 
> > into every filesystem that supports zoned devices. Wouldn't it be better 
> > if the I/O scheduler tracks the number of active zones?
> 
> I do not think so. The reason is that for a file system, the block allocator
> must be aware of any active zone limit of the underlying device to make the best
> decision possible regarding where to allocate blocks for files and metadata. For

Well, I'm wondering what kind of decision FS can make when allocating zones?

> btrfs, we added "active block groups" management for that purpose. And we have
> tracking of a block group active state so that the block allocator can start
> using new block groups (inactive ones) when previously used ones become full. We
> also have a "finish block group" for cases when there is not enough remaining
> free blocks in a group/zone (this does a finish zone operation to make a
> non-full zone full, that is, inactive).
> 
> Even if the block IO scheduler were to track active zones, the FS would still
> need to do its own tracking (e.g. to be able to finish zones when needed). So I

Why does FS also need to track the # of open zones redundantly? I have two
concerns if FS needs to force it:
1) performance - waiting for finish_zone before allocating a new zone can break
the IO pipeline and block FS operations.
2) multiple partition support - if F2FS uses two partitions, one on a
conventional partition while the other is on a zoned partition, we have to
maintain such a tracking mechanism on the zoned partition only, which adds
some code complexity.

In general, doesn't it make sense that FS (not zoned-device FS) just needs to
guarantee sequential writes per zone, while IO scheduler needs to limit the #
of open zones gracefully?

> do not see the point in having the block scheduler doing anything about active
> zones.
>
Damien Le Moal April 21, 2023, 1:52 a.m. UTC | #9
On 4/21/23 09:29, Jaegeuk Kim wrote:
> On 04/21, Damien Le Moal wrote:
>> On 4/21/23 08:44, Bart Van Assche wrote:
>>> On 4/20/23 16:37, Damien Le Moal wrote:
>>>> Why would you need to handle the max active zone number restriction in the
>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>>>> though, working on it).
>>>
>>> Hi Damien,
>>>
>>> If the user (filesystem) restricts the number of active zones, the code 
>>> for restricting the number of active zones will have to be duplicated 
>>> into every filesystem that supports zoned devices. Wouldn't it be better 
>>> if the I/O scheduler tracks the number of active zones?
>>
>> I do not think so. The reason is that for a file system, the block allocator
>> must be aware of any active zone limit of the underlying device to make the best
>> decision possible regarding where to allocate blocks for files and metadata. For
> 
> Well, I'm wondering what kind of decision FS can make when allocating zones?

Not sure what you mean... It is very similar to the regular block device case.
The FS does block allocation based on whatever block placement policy it
wants/needs to implement. With zoned devices, the FS block management objects
are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and the
active zone limit simply adds one more constraint regarding the selection of
block groups for allocating blocks. This is a resource management issue.

>> btrfs, we added "active block groups" management for that purpose. And we have
>> tracking of a block group active state so that the block allocator can start
>> using new block groups (inactive ones) when previously used ones become full. We
>> also have a "finish block group" for cases when there is not enough remaining
>> free blocks in a group/zone (this does a finish zone operation to make a
>> non-full zone full, that is, inactive).
>>
>> Even if the block IO scheduler were to track active zones, the FS would still
>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> 
> Why does FS also need to track the # of open zones redundantly? I have two

Because the FS should not be issuing writes to a zone that cannot be activated
on the device, e.g. starting to write to an empty zone when there are already N
zones being written or partly written, with N >= max active zones, will result
in IO failures. Even if you have active zone tracking in the block IO scheduler,
there is absolutely NOTHING that the scheduler can do to prevent such errors.
E.g. think of this simple case:
1) Let's take a device with max active zones = N (N != 0)
2) The FS implements a block allocation policy which results in new files being
written to empty block groups, with 1 block group == 1 zone
3) User writes N files of 4KB

After step 3, the device has N active zones. So if the user tries to write a new
file, it will fail (cannot write to an empty zone as that will result in an
error because that zone cannot be activated by the device). AND the FS cannot
write metadata for these files into a metadata block group.

There is nothing that the IO scheduler can do about all this. The FS has to be
more intelligent and do block allocation also based on the current
active/inactive state of the zones used by block groups.
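
To put the allocator-side constraint in code (illustrative only, not actual
btrfs or f2fs structures):

#include <linux/types.h>

/* Illustrative FS-side state: the allocator must refuse to start writing an
 * empty block group (== zone) when that would exceed the limit reported by
 * bdev_max_active_zones(). */
struct zoned_alloc_state {
	unsigned int active_zones;	/* zones the FS is currently writing */
	unsigned int max_active_zones;	/* 0 means no limit */
};

static bool can_activate_new_zone(const struct zoned_alloc_state *s)
{
	return !s->max_active_zones || s->active_zones < s->max_active_zones;
}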

> concerns if FS needs to force it:
> 1) performance - waiting for finish_zone before allocating a new zone can break
> the IO pipeline and block FS operations.

The need to perform a zone finish is purely an optimization if, for instance,
you want to reduce fragmentation. E.g., if there is only 4K of free space left
in a zone and need to write a 1MB extent, you can write the last 4K of that
zone, making it inactive and write the remaining 1MB - 4KB in another zone and
you are guaranteed that this other zone can be written since you just
deactivated one zone.

But if you do not want to fragment that 1MB extent, then you must finish that
zone with only 4KB left first, to ensure that another zone can be activated.

This of course also depends on the current number of active zones N. If N < max
active zone limit, then there is no need to finish a zone.
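
As an illustrative sketch, the decision boils down to:

#include <linux/types.h>

/* Illustrative decision: with free_in_zone sectors left in the active zone
 * and an extent of extent_len sectors to write, either split the extent (the
 * first part fills and thus deactivates the old zone) or finish the old zone
 * first so that the whole extent goes contiguously into a newly activated
 * zone. */
static bool should_finish_zone_first(sector_t free_in_zone,
				     sector_t extent_len,
				     bool avoid_fragmentation)
{
	if (extent_len <= free_in_zone)
		return false;		/* extent fits, nothing to decide */
	return avoid_fragmentation;	/* otherwise splitting is enough */
}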

> 2) multiple partition support - if F2FS uses two partitions, one on conventional
> partition while the other on zoned partition, we have to maintain such tracking
> mechanism on zoned partition only which gives some code complexity.

Conventional zones have no concept of active zones. All active zone resources
can be allocated to the sequential zones. And zoned block devices do not support
partitions anyway.

> In general, doesn't it make sense that FS (not zoned-device FS) just needs to
> guarantee sequential writes per zone, while IO scheduler needs to limit the #
> of open zones gracefully?

No. That will never work. See the example above: you can end up in a situation
where the drive becomes read-only (all writes failing) if the FS does not direct
block allocation & writes to zones that are already active. No amount of
trickery in the IO scheduler can change that fact.

If you want to hide the active zone limit from the FS, then what is needed is a
device mapper that remaps blocks. That is a lot more overhead than implementing
that support in the FS, and the FS can do a much better job at optimizing block
placement.
Jaegeuk Kim April 21, 2023, 8:15 p.m. UTC | #10
On 04/21, Damien Le Moal wrote:
> On 4/21/23 09:29, Jaegeuk Kim wrote:
> > On 04/21, Damien Le Moal wrote:
> >> On 4/21/23 08:44, Bart Van Assche wrote:
> >>> On 4/20/23 16:37, Damien Le Moal wrote:
> >>>> Why would you need to handle the max active zone number restriction in the
> >>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> >>>> though, working on it).
> >>>
> >>> Hi Damien,
> >>>
> >>> If the user (filesystem) restricts the number of active zones, the code 
> >>> for restricting the number of active zones will have to be duplicated 
> >>> into every filesystem that supports zoned devices. Wouldn't it be better 
> >>> if the I/O scheduler tracks the number of active zones?
> >>
> >> I do not think so. The reason is that for a file system, the block allocator
> >> must be aware of any active zone limit of the underlying device to make the best
> >> decision possible regarding where to allocate blocks for files and metadata. For
> > 
> > Well, I'm wondering what kind of decision FS can make when allocating zones?
> 
> Not sure what you mean... It is very similar to regular block device case. The
> FS does block allocation based on whatever block placement policy it wants/need
> to implement. With zoned devices, the FS block management object are mapped to
> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone
> limits simply adds one more constraint regarding the selection of block groups
> for allocating blocks. This is a resource management issue.

Ok, so it seems I overlooked there might be something in the zone allocation
policy. So, f2fs already manages 6 open zones by design.

> 
> >> btrfs, we added "active block groups" management for that purpose. And we have
> >> tracking of a block group active state so that the block allocator can start
> >> using new block groups (inactive ones) when previously used ones become full. We
> >> also have a "finish block group" for cases when there is not enough remaining
> >> free blocks in a group/zone (this does a finish zone operation to make a
> >> non-full zone full, that is, inactive).
> >>
> >> Even if the block IO scheduler were to track active zones, the FS would still
> >> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> > 
> > Why does FS also need to track the # of open zones redundantly? I have two
> 
> Because the FS should not be issuing writes to a zone that cannot be activated
> on the device, e.g. starting writing to an empty zone when there are already N
> zones being written or partly written, with N >= max active zones, will result
> in IO failures. Even if you have active zone tracking in the block IO scheduler,
> there is absolutely NOTHING that the scheduler can do to prevent such errors.
> E.g. think of this simple case:
> 1) Let's take a device with max active zones = N (N != 0)
> 2) The FS implements a block allocation policy which results in new files being
> written to empty block groups, with 1 block group == 1 zone
> 3) User writes N files of 4KB
> 
> After step 3, the device has N active zones. So if the user tries to write a new
> file, it will fail (cannot write to an empty zone as that will result in an
> error because that zone cannot be activated by the device). AND the FS cannot
> write metadata for these files into a metadata block group.

I think it needs to consider block allocation vs. data writes separately. That
being said,

1) FS zone allocation: as you described, the FS needs to allocate blocks per
zone, and should finish *allocating* blocks entirely in a zone before
allocating a new one once it hits the limit. Fortunately, F2FS does that by
design, so I don't see any need to manage the open zone limitation.

2) data writes: the IO scheduler needs to control the write pipeline to get the
best performance while just checking the max open zones seamlessly.

With that, FS doesn't need to wait for IO completion when allocating a new
zone.

> 
> There is nothing that the IO scheduler can do about all this. The FS has to be
> more intelligent and do block allocation also based on the current
> active/inactive state of the zones used by block groups.

TBH, I can't find any benefit to managing such active/inactive states in the FS.
Am I missing something in btrfs especially?

> 
> > concerns if FS needs to force it:
> > 1) performance - waiting for finish_zone before allocating a new zone can break
> > the IO pipeline and block FS operations.
> 
> The need to perform a zone finish is purely an optimization if, for instance,
> you want to reduce fragmentation. E.g., if there is only 4K of free space left
> in a zone and need to write a 1MB extent, you can write the last 4K of that
> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
> you are guaranteed that this other zone can be written since you just
> deactivated one zone.
> 
> But if you do not want to fragment that 1MB extent, then you must finish that
> zone with only 4KB left first, to ensure that another zone can be activated.

So, why should the FS be aware of that? I was expecting that, once the FS
submitted a 1MB extent, the block layer or IO scheduler would gracefully finish
the old zone and open a new one that matches the on-disk write pointers.

> 
> This of course also depends on the current number of active zones N. If N < max
> active zone limit, then there is no need to finish a zone.
> 
> > 2) multiple partition support - if F2FS uses two partitions, one on conventional
> > partition while the other on zoned partition, we have to maintain such tracking
> > mechanism on zoned partition only which gives some code complexity.
> 
> Conventional zones have no concept of active zones. All active zone resources
> can be allocated to the sequential zones. And zoned block devices do not support
> partitions anyway.
> 
> > In general, doesn't it make sense that FS (not zoned-device FS) just needs to
> > guarantee sequential writes per zone, while IO scheduler needs to limit the #
> > of open zones gracefully?
> 
> No. That will never work. See the example above: you can endup in a situation
> where the drive becomes read-only (all writes failing) if the FS does not direct
> block allocation & writes to zones that are already active. No amount of
> trickery in the IO scheduler can change that fact.
> 
> If you want to hide the active zone limit to the FS, then what is needed is a
> device mapper that remaps blocks. That is a lot more overhead that implementing
> that support in the FS, and the FS can do a much better job at optimizing block
> placement.

Oh, I meant FS like f2fs supporting zoned device.
Damien Le Moal April 21, 2023, 10:25 p.m. UTC | #11
On 4/22/23 05:15, Jaegeuk Kim wrote:
> On 04/21, Damien Le Moal wrote:
>> On 4/21/23 09:29, Jaegeuk Kim wrote:
>>> On 04/21, Damien Le Moal wrote:
>>>> On 4/21/23 08:44, Bart Van Assche wrote:
>>>>> On 4/20/23 16:37, Damien Le Moal wrote:
>>>>>> Why would you need to handle the max active zone number restriction in the
>>>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>>>>>> though, working on it).
>>>>>
>>>>> Hi Damien,
>>>>>
>>>>> If the user (filesystem) restricts the number of active zones, the code 
>>>>> for restricting the number of active zones will have to be duplicated 
>>>>> into every filesystem that supports zoned devices. Wouldn't it be better 
>>>>> if the I/O scheduler tracks the number of active zones?
>>>>
>>>> I do not think so. The reason is that for a file system, the block allocator
>>>> must be aware of any active zone limit of the underlying device to make the best
>>>> decision possible regarding where to allocate blocks for files and metadata. For
>>>
>>> Well, I'm wondering what kind of decision FS can make when allocating zones?
>>
>> Not sure what you mean... It is very similar to regular block device case. The
>> FS does block allocation based on whatever block placement policy it wants/need
>> to implement. With zoned devices, the FS block management object are mapped to
>> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone
>> limits simply adds one more constraint regarding the selection of block groups
>> for allocating blocks. This is a resource management issue.
> 
> Ok, so it seems I overlooked there might be something in the zone allocation
> policy. So, f2fs already manages 6 open zones by design.

Yes, so as long as the device allows for at least 6 active zones, there are no
issues with f2fs.

>>>> btrfs, we added "active block groups" management for that purpose. And we have
>>>> tracking of a block group active state so that the block allocator can start
>>>> using new block groups (inactive ones) when previously used ones become full. We
>>>> also have a "finish block group" for cases when there is not enough remaining
>>>> free blocks in a group/zone (this does a finish zone operation to make a
>>>> non-full zone full, that is, inactive).
>>>>
>>>> Even if the block IO scheduler were to track active zones, the FS would still
>>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
>>>
>>> Why does FS also need to track the # of open zones redundantly? I have two
>>
>> Because the FS should not be issuing writes to a zone that cannot be activated
>> on the device, e.g. starting writing to an empty zone when there are already N
>> zones being written or partly written, with N >= max active zones, will result
>> in IO failures. Even if you have active zone tracking in the block IO scheduler,
>> there is absolutely NOTHING that the scheduler can do to prevent such errors.
>> E.g. think of this simple case:
>> 1) Let's take a device with max active zones = N (N != 0)
>> 2) The FS implements a block allocation policy which results in new files being
>> written to empty block groups, with 1 block group == 1 zone
>> 3) User writes N files of 4KB
>>
>> After step 3, the device has N active zones. So if the user tries to write a new
>> file, it will fail (cannot write to an empty zone as that will result in an
>> error because that zone cannot be activated by the device). AND the FS cannot
>> write metadata for these files into a metadata block group.
> 
> I think it needs to consider block allocation vs. data writes separately. That
> being said,

That mostly depends on the FS design.

> 
> 1) FS zone allocation: as you described, FS needs to allocate blocks per zone,
> and should finish to *allocate* blocks entirely in the zone, when allocating a
> new one if it meets the limit. Fortunately, F2FS is doing that by design, so
> I don't see any need to manage the open zone limitation.

Correct for f2fs case. btrfs is different in that respect.

> 2) data writes: IO scheduler needs to control write pipeline to get the best
> performance while just checking the max open zones seamlessly.

There is absolutely no need for the IO scheduler to check open/active zones
state. More below.

> With that, FS doesn't need to wait for IO completion when allocating a new
> zone.

Incorrect. I showed you a simple example of why. You can also consider a more
complex scenario and think about what can happen: multiple writers doing
buffered IOs through the page cache and suddenly doing an fsync. If you have
more writers than the device can have active zones, depending on how blocks are
allocated, you'll need to wait before issuing writes for some of them to ensure
that zones can be activated. This is *NOT* a performance impact, as this
synchronization is needed; it means that you are already pounding the drive
hard. Issuing more IOs will not make the drive go faster.

>> There is nothing that the IO scheduler can do about all this. The FS has to be
>> more intelligent and do block allocation also based on the current
>> active/inactive state of the zones used by block groups.
> 
> TBH, I can't find any benefit to manage such the active/inactive states in FS.
> Am I mssing something in btrfs especially?

btrfs block management is a little more complex than f2fs. For one thing, btrfs
is 100% copy on write (unlike f2fs), which means that we absolutely MUST ensure
that we can always write metadata block groups and the super block (multiple
copies). So we need some "reserved" active zone resources for that. And for file
data, given that block allocation may work much faster than actually writing
the device, you need to control the writeback process to throttle it within the
available active zone resources. This is naturally done in f2fs given that there
are at most only 6 segments/zones used at any time for writing. But btrfs needs
additional code.

>>> concerns if FS needs to force it:
>>> 1) performance - waiting for finish_zone before allocating a new zone can break
>>> the IO pipeline and block FS operations.
>>
>> The need to perform a zone finish is purely an optimization if, for instance,
>> you want to reduce fragmentation. E.g., if there is only 4K of free space left
>> in a zone and need to write a 1MB extent, you can write the last 4K of that
>> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
>> you are guaranteed that this other zone can be written since you just
>> deactivated one zone.
>>
>> But if you do not want to fragment that 1MB extent, then you must finish that
>> zone with only 4KB left first, to ensure that another zone can be activated.
> 
> So, why should FS be aware of that? I was expecting, once FS submitted 1MB
> extent, block or IO scheduler will gracefully finish the old zone and open a
> new one which is matched to the in-disk write pointers.

The block IO scheduler is just that, a scheduler. It should NEVER be the source
of a new command. You cannot have the block IO scheduler issue commands. That is
not how the block layer works.

And it seems that you are assuming that block IOs make it to the scheduler in
about the same order as issued by the FS. There are no guarantees of that
happening when considering a set of different zones as the various issuers may
block on request allocation, or due to a device mapper between FS and device,
etc. Plenty of reasons for the overall write pattern to change between FS and
device. Not per zone for regular writes though, that is preserved. But that is
not the case for zone append writes that btrfs uses.

And you are forgetting the case of applications using the drive directly. You
cannot rely on the application working correctly and have the IO scheduler do
some random things about open/active zones. That will never work.
Christoph Hellwig April 24, 2023, 6:01 a.m. UTC | #12
On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> >> for allocating blocks. This is a resource management issue.
> > 
> > Ok, so it seems I overlooked there might be something in the zone allocation
> > policy. So, f2fs already manages 6 open zones by design.
> 
> Yes, so as long as the device allows for at least 6 active zones, there are no
> issues with f2fs.

I don't think it's quite as rosy, because f2fs can still schedule
I/O to the old zone after already scheduling I/O to a new zone for
any of these 6 slots.  It'll need code to wait for all I/O to the old
zone to finish first, similar to btrfs.
Jaegeuk Kim April 24, 2023, 5:48 p.m. UTC | #13
On 04/22, Damien Le Moal wrote:
> On 4/22/23 05:15, Jaegeuk Kim wrote:
> > On 04/21, Damien Le Moal wrote:
> >> On 4/21/23 09:29, Jaegeuk Kim wrote:
> >>> On 04/21, Damien Le Moal wrote:
> >>>> On 4/21/23 08:44, Bart Van Assche wrote:
> >>>>> On 4/20/23 16:37, Damien Le Moal wrote:
> >>>>>> Why would you need to handle the max active zone number restriction in the
> >>>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> >>>>>> though, working on it).
> >>>>>
> >>>>> Hi Damien,
> >>>>>
> >>>>> If the user (filesystem) restricts the number of active zones, the code 
> >>>>> for restricting the number of active zones will have to be duplicated 
> >>>>> into every filesystem that supports zoned devices. Wouldn't it be better 
> >>>>> if the I/O scheduler tracks the number of active zones?
> >>>>
> >>>> I do not think so. The reason is that for a file system, the block allocator
> >>>> must be aware of any active zone limit of the underlying device to make the best
> >>>> decision possible regarding where to allocate blocks for files and metadata. For
> >>>
> >>> Well, I'm wondering what kind of decision FS can make when allocating zones?
> >>
> >> Not sure what you mean... It is very similar to regular block device case. The
> >> FS does block allocation based on whatever block placement policy it wants/need
> >> to implement. With zoned devices, the FS block management object are mapped to
> >> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone
> >> limits simply adds one more constraint regarding the selection of block groups
> >> for allocating blocks. This is a resource management issue.
> > 
> > Ok, so it seems I overlooked there might be something in the zone allocation
> > policy. So, f2fs already manages 6 open zones by design.
> 
> Yes, so as long as the device allows for at least 6 active zones, there are no
> issues with f2fs.
> 
> >>>> btrfs, we added "active block groups" management for that purpose. And we have
> >>>> tracking of a block group active state so that the block allocator can start
> >>>> using new block groups (inactive ones) when previously used ones become full. We
> >>>> also have a "finish block group" for cases when there is not enough remaining
> >>>> free blocks in a group/zone (this does a finish zone operation to make a
> >>>> non-full zone full, that is, inactive).
> >>>>
> >>>> Even if the block IO scheduler were to track active zones, the FS would still
> >>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> >>>
> >>> Why does FS also need to track the # of open zones redundantly? I have two
> >>
> >> Because the FS should not be issuing writes to a zone that cannot be activated
> >> on the device, e.g. starting writing to an empty zone when there are already N
> >> zones being written or partly written, with N >= max active zones, will result
> >> in IO failures. Even if you have active zone tracking in the block IO scheduler,
> >> there is absolutely NOTHING that the scheduler can do to prevent such errors.
> >> E.g. think of this simple case:
> >> 1) Let's take a device with max active zones = N (N != 0)
> >> 2) The FS implements a block allocation policy which results in new files being
> >> written to empty block groups, with 1 block group == 1 zone
> >> 3) User writes N files of 4KB
> >>
> >> After step 3, the device has N active zones. So if the user tries to write a new
> >> file, it will fail (cannot write to an empty zone as that will result in an
> >> error because that zone cannot be activated by the device). AND the FS cannot
> >> write metadata for these files into a metadata block group.
> > 
> > I think it needs to consider block allocation vs. data writes separately. That
> > being said,
> 
> That mostly depends on the FS design.
> 
> > 
> > 1) FS zone allocation: as you described, FS needs to allocate blocks per zone,
> > and should finish to *allocate* blocks entirely in the zone, when allocating a
> > new one if it meets the limit. Fortunately, F2FS is doing that by design, so
> > I don't see any need to manage the open zone limitation.
> 
> Correct for f2fs case. btrfs is different in that respect.
> 
> > 2) data writes: IO scheduler needs to control write pipeline to get the best
> > performance while just checking the max open zones seamlessly.
> 
> There is absolutely no need for the IO scheduler to check open/active zones
> state. More below.
> 
> > With that, FS doesn't need to wait for IO completion when allocating a new
> > zone.
> 
> Incorrect. I showed you a simple example of why. You can also consider a more
> complex scenario and think about what can happen: multiple writers doing
> buffered IOs through the page cache and suddenly doing an fsync. If you have
> more writers than can have active zones, depending on how blocks are allocated,
> you'll need to wait before issuing writes for some to ensure that zones can be
> activated. This is *NOT* a performance impact as this synchronization is needed,
> it means that you already are pounding the drive hard. Issuing more IOs will not
> make the drive go faster.

There's no issue at all in F2FS regarding the # of open zones limitation now,
and I don't see any problem with a working zoned disk either. Again, what I'm
talking about is the single case where the FS needs to stop block allocation on
a new zone until the submitted writes to old zones have finished, even though
the FS never violates fully sequential writes within a zone. Yes, it's about
performance, not a functional matter.

> 
> >> There is nothing that the IO scheduler can do about all this. The FS has to be
> >> more intelligent and do block allocation also based on the current
> >> active/inactive state of the zones used by block groups.
> > 
> > TBH, I can't find any benefit to manage such the active/inactive states in FS.
> > Am I mssing something in btrfs especially?
> 
> btrfs block management is a little more complex than f2fs. For one thing, btrfs
> is 100% copy on write (unlike f2fs), which means that we absolutely MUST ensure
> that we can always write metadata block groups and the super block (multiple
> copies). So we need some "reserved" active zone resources for that. And for file
> data, given the that block allocation may work much faster than actually writing
> the device, you need to control the writeback process to throttle it within the
> available active zone resources. This is naturally done in f2fs given that there
> are at most only 6 segments/zones used at any time for writing. But btrfs needs
> additional code.
> 
> >>> concerns if FS needs to force it:
> >>> 1) performance - waiting for finish_zone before allocating a new zone can break
> >>> the IO pipeline and block FS operations.
> >>
> >> The need to perform a zone finish is purely an optimization if, for instance,
> >> you want to reduce fragmentation. E.g., if there is only 4K of free space left
> >> in a zone and need to write a 1MB extent, you can write the last 4K of that
> >> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
> >> you are guaranteed that this other zone can be written since you just
> >> deactivated one zone.
> >>
> >> But if you do not want to fragment that 1MB extent, then you must finish that
> >> zone with only 4KB left first, to ensure that another zone can be activated.
> > 
> > So, why should FS be aware of that? I was expecting, once FS submitted 1MB
> > extent, block or IO scheduler will gracefully finish the old zone and open a
> > new one which is matched to the in-disk write pointers.
> 
> The block IO scheduler is just that, a scheduler. It should NEVER be the source
> of a new command. You cannot have the block IO scheduler issue commands. That is
> not how the block layer works.
> 
> And it seems that you are assuming that block IOs make it to the scheduler in
> about the same order as issued by the FS. There are no guarantees of that
> happening when considering a set of different zones as the various issuers may
> block on request allocation, or due to a device mapper between FS and device,
> etc. Plenty of reasons for the overall write pattern to change between FS and
> device. Not per zone for regular writes though, that is preserved. But that is
> not the case for zone append writes that btrfs uses.
> 
> And you are forgetting the case of applications using the drive directly. You
> cannot rely on the application working correctly and have the IO scheduler do
> some random things about open/active zones. That will never work.
Jaegeuk Kim April 24, 2023, 5:58 p.m. UTC | #14
On 04/24, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> > >> for allocating blocks. This is a resource management issue.
> > > 
> > > Ok, so it seems I overlooked there might be something in the zone allocation
> > > policy. So, f2fs already manages 6 open zones by design.
> > 
> > Yes, so as long as the device allows for at least 6 active zones, there are no
> > issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

F2FS should serialize all the writes across 6 active open zones. If not, I think
it's a bug. The problem here is 1) open zone #1 through zone #6, 2) allocate all
blocks in zone #1, 3) submit all writes in zone #1, 4) allocate blocks in zone
#7, 5) submit all writes in zone #7, and so on.

In this scenario, I'm asking why F2FS needs to wait for entire write completion
between 3) and 4), which will impact performance a lot since 4) blocks syscalls.
Jaegeuk Kim April 24, 2023, 7:05 p.m. UTC | #15
On 04/24, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> > >> for allocating blocks. This is a resource management issue.
> > > 
> > > Ok, so it seems I overlooked there might be something in the zone allocation
> > > policy. So, f2fs already manages 6 open zones by design.
> > 
> > Yes, so as long as the device allows for at least 6 active zones, there are no
> > issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

Hmm, I was concerned about the small zone size impacting the bandwidth, but I
feel that I can reduce the # of logs in F2FS in that case. So there's some
trade-off. Let me take another look to see if I'm missing anything else. :(
Damien Le Moal April 25, 2023, 1:38 p.m. UTC | #16
On 4/24/23 15:01, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
>>>> for allocating blocks. This is a resource management issue.
>>>
>>> Ok, so it seems I overlooked there might be something in the zone allocation
>>> policy. So, f2fs already manages 6 open zones by design.
>>
>> Yes, so as long as the device allows for at least 6 active zones, there are no
>> issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

Indeed. Good point.

Patch

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..4527d0514fdb 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -671,6 +671,14 @@  Description:
 		regular block devices.
 
 
+What:		/sys/block/<disk>/queue/zone_capacity
+Date:		March 2023
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] The number of 512-byte sectors in a zone that can be read
+		or written. This number is less than or equal to the zone size.
+
+
 What:		/sys/block/<disk>/queue/zone_write_granularity
 Date:		January 2021
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..96f5dc63a815 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -685,6 +685,7 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 						   b->max_secure_erase_sectors);
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
+	t->zone_capacity = max(t->zone_capacity, b->zone_capacity);
 	t->zoned = max(t->zoned, b->zoned);
 	return ret;
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..0443b8f536f4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -337,6 +337,11 @@  static ssize_t queue_nr_zones_show(struct request_queue *q, char *page)
 	return queue_var_show(disk_nr_zones(q->disk), page);
 }
 
+static ssize_t queue_zone_capacity_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.zone_capacity, page);
+}
+
 static ssize_t queue_max_open_zones_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(bdev_max_open_zones(q->disk->part0), page);
@@ -587,6 +592,7 @@  QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
 QUEUE_RO_ENTRY(queue_zoned, "zoned");
 QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
+QUEUE_RO_ENTRY(queue_zone_capacity, "zone_capacity");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
@@ -644,6 +650,7 @@  static struct attribute *queue_attrs[] = {
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
+	&queue_zone_capacity_entry.attr,
 	&queue_max_open_zones_entry.attr,
 	&queue_max_active_zones_entry.attr,
 	&queue_nomerges_entry.attr,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 9b9cd6adfd1b..a319acbe377b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -463,6 +463,7 @@  struct blk_revalidate_zone_args {
 	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
 	sector_t	zone_sectors;
+	sector_t	zone_capacity;
 	sector_t	sector;
 };
 
@@ -489,6 +490,7 @@  static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		}
 
 		args->zone_sectors = zone->len;
+		args->zone_capacity = zone->capacity;
 		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
@@ -496,12 +498,23 @@  static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				disk->disk_name);
 			return -ENODEV;
 		}
+		if (zone->capacity != args->zone_capacity) {
+			pr_warn("%s: Invalid zoned device with non constant zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 	} else {
 		if (zone->len > args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with larger last zone size\n",
 				disk->disk_name);
 			return -ENODEV;
 		}
+		if (zone->capacity >
+		    min(args->zone_sectors, args->zone_capacity)) {
+			pr_warn("%s: Invalid zoned device with too large last zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 	}
 
 	/* Check for holes in the zone report */
@@ -604,6 +617,7 @@  int blk_revalidate_disk_zones(struct gendisk *disk,
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
 		blk_queue_chunk_sectors(q, args.zone_sectors);
+		q->limits.zone_capacity = args.zone_capacity;
 		disk->nr_zones = args.nr_zones;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
@@ -635,6 +649,7 @@  void disk_clear_zone_settings(struct gendisk *disk)
 	disk->max_open_zones = 0;
 	disk->max_active_zones = 0;
 	q->limits.chunk_sectors = 0;
+	q->limits.zone_capacity = 0;
 	q->limits.zone_write_granularity = 0;
 	q->limits.max_zone_append_sectors = 0;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 261538319bbf..4fb0e6d7fdc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -297,6 +297,7 @@  struct queue_limits {
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
+	unsigned int		zone_capacity;
 
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;