Message ID | 20230418224002.1195163-11-bvanassche@acm.org (mailing list archive)
---|---
State | New, archived
Series | mq-deadline: Improve support for zoned block devices
On Tue, Apr 18, 2023 at 03:40:01PM -0700, Bart Van Assche wrote: > Make the zone capacity available in struct queue_limits for those > drivers that need it. > > Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Ming Lei <ming.lei@redhat.com> > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > --- > Documentation/ABI/stable/sysfs-block | 8 ++++++++ > block/blk-settings.c | 1 + > block/blk-sysfs.c | 7 +++++++ > block/blk-zoned.c | 15 +++++++++++++++ > include/linux/blkdev.h | 1 + > 5 files changed, 32 insertions(+) > > diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block > index c57e5b7cb532..4527d0514fdb 100644 > --- a/Documentation/ABI/stable/sysfs-block > +++ b/Documentation/ABI/stable/sysfs-block > @@ -671,6 +671,14 @@ Description: > regular block devices. > > > +What: /sys/block/<disk>/queue/zone_capacity > +Date: March 2023 > +Contact: linux-block@vger.kernel.org > +Description: > + [RO] The number of 512-byte sectors in a zone that can be read > + or written. This number is less than or equal to the zone size. > + > + > What: /sys/block/<disk>/queue/zone_write_granularity > Date: January 2021 > Contact: linux-block@vger.kernel.org > diff --git a/block/blk-settings.c b/block/blk-settings.c > index 896b4654ab00..96f5dc63a815 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -685,6 +685,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, > b->max_secure_erase_sectors); > t->zone_write_granularity = max(t->zone_write_granularity, > b->zone_write_granularity); > + t->zone_capacity = max(t->zone_capacity, b->zone_capacity); > t->zoned = max(t->zoned, b->zoned); > return ret; > } (snip) > @@ -496,12 +498,23 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx, > disk->disk_name); > return -ENODEV; > } > + if (zone->capacity != args->zone_capacity) { > + pr_warn("%s: Invalid zoned device with non constant zone capacity\n", > + disk->disk_name); > + return -ENODEV; Hello Bart, The NVMe Zoned Namespace Command Set Specification: https://nvmexpress.org/wp-content/uploads/NVM-Express-Zoned-Namespace-Command-Set-Specification-1.1c-2022.10.03-Ratified.pdf specifies the Zone Capacity (ZCAP) field in each Zone Descriptor. The Descriptors are part of the Report Zones Data Structure. This means that while the zone size is the same for all zones, the zone capacity can be different for each zone. While the single NVMe ZNS SSD that I've encountered so far did coincidentally have the same zone capacity for all zones, this is not required by the specification. The NVMe driver does reject a ZNS device that has support for Variable Zone Capacity (which is defined in the ZOC field): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/zns.c?h=v6.3-rc7#n95 Variable Zone Capacity simply means that the zone capacities cannot change without an NVM format. However, even when Variable Zone Capacity is not supported, an NVMe ZNS device can still have different zone capacities, and AFAICT, such devices are currently supported. With your change above, we would start rejecting such devices. Is this reduction of supported NVMe ZNS SSD devices really desired? If it is, then I would at least expect the commit message to explain why we are reducing the scope of the currently supported NVMe ZNS devices. Kind regards, Niklas
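To illustrate the point Niklas makes about per-zone capacity: the block layer already reports a capacity field for every zone in struct blk_zone, so nothing in the report format forces all zones to share one value. A minimal sketch of a report-zones callback that records the smallest capacity seen across zones (the helper names are illustrative, not part of the patch or the kernel):

```c
#include <linux/blkdev.h>

/*
 * Hypothetical report-zones callback: record the smallest zone capacity
 * seen, illustrating that the reported capacity may differ per zone.
 */
static int min_zcap_cb(struct blk_zone *zone, unsigned int idx, void *data)
{
	sector_t *min_cap = data;

	if (*min_cap == 0 || zone->capacity < *min_cap)
		*min_cap = zone->capacity;
	return 0;
}

/* Usage sketch: returns the number of zones reported or a negative errno. */
static int report_min_zone_capacity(struct block_device *bdev, sector_t *min_cap)
{
	*min_cap = 0;
	return blkdev_report_zones(bdev, 0, BLK_ALL_ZONES, min_zcap_cb, min_cap);
}
```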
On 4/20/23 02:23, Niklas Cassel wrote: > With your change above, we would start rejecting such devices. > > Is this reduction of supported NVMe ZNS SSD devices really desired? Hi Niklas, This is not my intention. A possible solution is to modify the NVMe driver and SCSI core such that the "zone is full" information is stored in struct request when a command completes. That will remove the need for the mq-deadline scheduler to know the zone capacity. Bart.
On 4/21/23 02:12, Bart Van Assche wrote: > On 4/20/23 02:23, Niklas Cassel wrote: >> With your change above, we would start rejecting such devices. >> >> Is this reduction of supported NVMe ZNS SSD devices really desired? > > Hi Niklas, > > This is not my intention. A possible solution is to modify the NVMe > driver and SCSI core such that the "zone is full" information is stored > in struct request when a command completes. That will remove the need > for the mq-deadline scheduler to know the zone capacity. I am not following... Why would the scheduler need to know the zone capacity? If the user does stupid things like accessing sectors between zone capacity and zone size or trying to write to a full zone, the commands will be failed by the drive and I do not see why the scheduler needs to care about that. > > Bart.
On 4/20/23 15:00, Damien Le Moal wrote: > On 4/21/23 02:12, Bart Van Assche wrote: >> On 4/20/23 02:23, Niklas Cassel wrote: >>> With your change above, we would start rejecting such devices. >>> >>> Is this reduction of supported NVMe ZNS SSD devices really desired? >> >> Hi Niklas, >> >> This is not my intention. A possible solution is to modify the NVMe >> driver and SCSI core such that the "zone is full" information is stored >> in struct request when a command completes. That will remove the need >> for the mq-deadline scheduler to know the zone capacity. > > I am not following... Why would the scheduler need to know the zone capacity ? > > If the user does stupid things like accessing sectors between zone capacity and > zone size or trying to write to a full zone, the commands will be failed by the > drive and I do not see why the scheduler need to care about that. Hi Damien, Restricting the number of active zones in the I/O scheduler (patch 11/11) requires some knowledge of the zone condition. According to ZBC-2, for sequential write preferred zones the additional sense code ZONE TRANSITION TO FULL must be reported if the zone condition changes from not full into full. There is no such requirement for sequential write required zones. Additionally, I'm not aware of any reporting mechanism in the NVMe specification for changes in the zone condition. The overhead of submitting a REPORT ZONES command after every I/O completion would be unacceptable. Is there any other solution for tracking the zone condition other than comparing the LBA at which a WRITE command finished with the zone capacity? Did I perhaps overlook something? Thanks, Bart.
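For illustration of the completion-time check Bart describes above (comparing where a WRITE ended against the zone capacity), a rough sketch follows. The helper is hypothetical and assumes the driver already knows the start and capacity of the zone the request targets:

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/*
 * Hypothetical completion-path helper: return true if a successfully
 * completed write ends exactly at the writable end of its zone, i.e. the
 * zone just transitioned to FULL. zone_start and zone_capacity are in
 * 512-byte sectors and are assumed to be tracked elsewhere by the driver.
 */
static bool write_made_zone_full(struct request *rq, sector_t zone_start,
				 sector_t zone_capacity)
{
	sector_t end = blk_rq_pos(rq) + blk_rq_sectors(rq);

	switch (req_op(rq)) {
	case REQ_OP_WRITE:
	case REQ_OP_WRITE_ZEROES:
	case REQ_OP_ZONE_APPEND:
		return end == zone_start + zone_capacity;
	default:
		return false;
	}
}
```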
On 4/21/23 07:51, Bart Van Assche wrote: > On 4/20/23 15:00, Damien Le Moal wrote: >> On 4/21/23 02:12, Bart Van Assche wrote: >>> On 4/20/23 02:23, Niklas Cassel wrote: >>>> With your change above, we would start rejecting such devices. >>>> >>>> Is this reduction of supported NVMe ZNS SSD devices really desired? >>> >>> Hi Niklas, >>> >>> This is not my intention. A possible solution is to modify the NVMe >>> driver and SCSI core such that the "zone is full" information is stored >>> in struct request when a command completes. That will remove the need >>> for the mq-deadline scheduler to know the zone capacity. >> >> I am not following... Why would the scheduler need to know the zone capacity ? >> >> If the user does stupid things like accessing sectors between zone capacity and >> zone size or trying to write to a full zone, the commands will be failed by the >> drive and I do not see why the scheduler need to care about that. > > Hi Damien, > > Restricting the number of active zones in the I/O scheduler (patch > 11/11) requires some knowledge of the zone condition. Why would you need to handle the max active zone number restriction in the scheduler? That is the user's responsibility. btrfs does it (that is still buggy though, working on it). > According to ZBC-2, for sequential write preferred zones the additional > sense code ZONE TRANSITION TO FULL must be reported if the zone > condition changes from not full into full. There is no such requirement > for sequential write required zones. Additionally, I'm not aware of any > reporting mechanism in the NVMe specification for changes in the zone > condition. Sequential write preferred zones are a ZBC concept, and ZBC does not have the concept of active zones. In general, for ZBC HDDs, ignoring the max number of open zones is fine. There is no performance impact that can be measured, unless the user goes full random write on the device. But in that case, the user is already asking for bad perf anyway. I suspect you are thinking about all this in the context of UFS devices? My point stands though. Trying to manage the active zone limit at the scheduler level is not a good idea as there are no guarantees that the user will eventually issue all the write commands to make zones full, and thus turn them inactive. It is the user's responsibility to manage that, above the block IO scheduler. > The overhead of submitting a REPORT ZONES command after every I/O > completion would be unacceptable. > > Is there any other solution for tracking the zone condition other than > comparing the LBA at which a WRITE command finished with the zone capacity? The sd driver already does some minimal tracking, which is used for zone append emulation. > > Did I perhaps overlook something? > > Thanks, > > Bart.
On 4/20/23 16:37, Damien Le Moal wrote: > Why would you need to handle the max active zone number restriction in the > scheduler ? That is the user responsibility. btrfs does it (that is still buggy > though, working on it). Hi Damien, If the user (filesystem) restricts the number of active zones, the code for restricting the number of active zones will have to be duplicated into every filesystem that supports zoned devices. Wouldn't it be better if the I/O scheduler tracks the number of active zones? Thanks, Bart.
On 4/21/23 08:44, Bart Van Assche wrote: > On 4/20/23 16:37, Damien Le Moal wrote: >> Why would you need to handle the max active zone number restriction in the >> scheduler ? That is the user responsibility. btrfs does it (that is still buggy >> though, working on it). > > Hi Damien, > > If the user (filesystem) restricts the number of active zones, the code > for restricting the number of active zones will have to be duplicated > into every filesystem that supports zoned devices. Wouldn't it be better > if the I/O scheduler tracks the number of active zones? I do not think so. The reason is that for a file system, the block allocator must be aware of any active zone limit of the underlying device to make the best decision possible regarding where to allocate blocks for files and metadata. For btrfs, we added "active block groups" management for that purpose. And we have tracking of a block group's active state so that the block allocator can start using new block groups (inactive ones) when previously used ones become full. We also have a "finish block group" operation for cases when there are not enough remaining free blocks in a group/zone (this does a finish zone operation to make a non-full zone full, that is, inactive). Even if the block IO scheduler were to track active zones, the FS would still need to do its own tracking (e.g. to be able to finish zones when needed). So I do not see the point in having the block scheduler do anything about active zones.
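For context, the "finish block group" step Damien mentions boils down to an explicit zone finish issued through the block layer. A minimal sketch, with a made-up wrapper name and error handling omitted; the blkdev_zone_mgmt() signature shown is the one from the kernel version discussed in this thread:

```c
#include <linux/blkdev.h>

/*
 * Illustrative wrapper (name made up): transition the zone starting at
 * zone_start to FULL so that it no longer counts against the device's
 * active zone limit. Zone management operations work on whole zones.
 */
static int finish_one_zone(struct block_device *bdev, sector_t zone_start)
{
	return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_FINISH, zone_start,
				bdev_zone_sectors(bdev), GFP_NOFS);
}
```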
On 04/21, Damien Le Moal wrote: > On 4/21/23 08:44, Bart Van Assche wrote: > > On 4/20/23 16:37, Damien Le Moal wrote: > >> Why would you need to handle the max active zone number restriction in the > >> scheduler ? That is the user responsibility. btrfs does it (that is still buggy > >> though, working on it). > > > > Hi Damien, > > > > If the user (filesystem) restricts the number of active zones, the code > > for restricting the number of active zones will have to be duplicated > > into every filesystem that supports zoned devices. Wouldn't it be better > > if the I/O scheduler tracks the number of active zones? > > I do not think so. The reason is that for a file system, the block allocator > must be aware of any active zone limit of the underlying device to make the best > decision possible regarding where to allocate blocks for files and metadata. For Well, I'm wondering what kind of decisions the FS can make when allocating zones? > btrfs, we added "active block groups" management for that purpose. And we have > tracking of a block group active state so that the block allocator can start > using new block groups (inactive ones) when previously used ones become full. We > also have a "finish block group" for cases when there is not enough remaining > free blocks in a group/zone (this does a finish zone operation to make a > non-full zone full, that is, inactive). > > Even if the block IO scheduler were to track active zones, the FS would still > need to do its own tracking (e.g. to be able to finish zones when needed). So I Why does the FS also need to track the # of open zones redundantly? I have two concerns if the FS needs to enforce it: 1) performance - waiting for finish_zone before allocating a new zone can break the IO pipeline and block FS operations. 2) multiple partition support - if F2FS uses two partitions, one on a conventional partition and the other on a zoned partition, we have to maintain such a tracking mechanism on the zoned partition only, which adds some code complexity. In general, doesn't it make sense that the FS (not a zoned-device FS) just needs to guarantee sequential writes per zone, while the IO scheduler limits the # of open zones gracefully? > do not see the point in having the block scheduler doing anything about active > zones. >
On 4/21/23 09:29, Jaegeuk Kim wrote: > On 04/21, Damien Le Moal wrote: >> On 4/21/23 08:44, Bart Van Assche wrote: >>> On 4/20/23 16:37, Damien Le Moal wrote: >>>> Why would you need to handle the max active zone number restriction in the >>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy >>>> though, working on it). >>> >>> Hi Damien, >>> >>> If the user (filesystem) restricts the number of active zones, the code >>> for restricting the number of active zones will have to be duplicated >>> into every filesystem that supports zoned devices. Wouldn't it be better >>> if the I/O scheduler tracks the number of active zones? >> >> I do not think so. The reason is that for a file system, the block allocator >> must be aware of any active zone limit of the underlying device to make the best >> decision possible regarding where to allocate blocks for files and metadata. For > > Well, I'm wondering what kind of decision FS can make when allocating zones? Not sure what you mean... It is very similar to the regular block device case. The FS does block allocation based on whatever block placement policy it wants/needs to implement. With zoned devices, the FS block management objects are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and the active zone limit simply adds one more constraint regarding the selection of block groups for allocating blocks. This is a resource management issue. >> btrfs, we added "active block groups" management for that purpose. And we have >> tracking of a block group active state so that the block allocator can start >> using new block groups (inactive ones) when previously used ones become full. We >> also have a "finish block group" for cases when there is not enough remaining >> free blocks in a group/zone (this does a finish zone operation to make a >> non-full zone full, that is, inactive). >> >> Even if the block IO scheduler were to track active zones, the FS would still >> need to do its own tracking (e.g. to be able to finish zones when needed). So I > > Why does FS also need to track the # of open zones redundantly? I have two Because the FS should not be issuing writes to a zone that cannot be activated on the device: starting to write to an empty zone when there are already N zones being written or partly written, with N >= max active zones, will result in IO failures. Even if you have active zone tracking in the block IO scheduler, there is absolutely NOTHING that the scheduler can do to prevent such errors. E.g. think of this simple case: 1) Let's take a device with max active zones = N (N != 0) 2) The FS implements a block allocation policy which results in new files being written to empty block groups, with 1 block group == 1 zone 3) User writes N files of 4KB After step 3, the device has N active zones. So if the user tries to write a new file, it will fail (cannot write to an empty zone as that will result in an error because that zone cannot be activated by the device). AND the FS cannot write metadata for these files into a metadata block group. There is nothing that the IO scheduler can do about all this. The FS has to be more intelligent and do block allocation also based on the current active/inactive state of the zones used by block groups. > concerns if FS needs to force it: > 1) performance - waiting for finish_zone before allocating a new zone can break > the IO pipeline and block FS operations. 
The need to perform a zone finish is purely an optimization if, for instance, you want to reduce fragmentation. E.g., if there is only 4K of free space left in a zone and you need to write a 1MB extent, you can write the last 4K of that zone, making it inactive, and write the remaining 1MB - 4KB in another zone, and you are guaranteed that this other zone can be written since you just deactivated one zone. But if you do not want to fragment that 1MB extent, then you must finish that zone with only 4KB left first, to ensure that another zone can be activated. This of course also depends on the current number of active zones N. If N < max active zone limit, then there is no need to finish a zone. > 2) multiple partition support - if F2FS uses two partitions, one on conventional > partition while the other on zoned partition, we have to maintain such tracking > mechanism on zoned partition only which gives some code complexity. Conventional zones have no concept of active zones. All active zone resources can be allocated to the sequential zones. And zoned block devices do not support partitions anyway. > In general, doesn't it make sense that FS (not zoned-device FS) just needs to > guarantee sequential writes per zone, while IO scheduler needs to limit the # > of open zones gracefully? No. That will never work. See the example above: you can end up in a situation where the drive becomes read-only (all writes failing) if the FS does not direct block allocation & writes to zones that are already active. No amount of trickery in the IO scheduler can change that fact. If you want to hide the active zone limit from the FS, then what is needed is a device mapper that remaps blocks. That is a lot more overhead than implementing that support in the FS, and the FS can do a much better job at optimizing block placement.
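A highly simplified sketch of the filesystem-side accounting Damien is arguing for: the allocator consults the device limit via bdev_max_active_zones() and its own count of zones it is writing to before directing writes at an empty zone. Everything except that block layer helper is hypothetical:

```c
#include <linux/blkdev.h>
#include <linux/spinlock.h>

/* Hypothetical per-filesystem bookkeeping of zones the FS is writing to. */
struct fs_zone_info {
	spinlock_t	lock;
	unsigned int	nr_fs_active;	/* zones the FS has made active */
	unsigned int	max_active;	/* 0 means "no device limit" */
};

static void fs_zone_info_init(struct fs_zone_info *zi, struct block_device *bdev)
{
	spin_lock_init(&zi->lock);
	zi->nr_fs_active = 0;
	zi->max_active = bdev_max_active_zones(bdev);
}

/* Called before the allocator directs writes at an empty (inactive) zone. */
static bool fs_try_activate_zone(struct fs_zone_info *zi)
{
	bool ok = true;

	spin_lock(&zi->lock);
	if (zi->max_active && zi->nr_fs_active >= zi->max_active)
		ok = false;	/* caller must fill or finish a zone first */
	else
		zi->nr_fs_active++;
	spin_unlock(&zi->lock);
	return ok;
}

/* Called when a zone the FS was writing to becomes FULL (or is finished). */
static void fs_deactivate_zone(struct fs_zone_info *zi)
{
	spin_lock(&zi->lock);
	zi->nr_fs_active--;
	spin_unlock(&zi->lock);
}
```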
On 04/21, Damien Le Moal wrote: > On 4/21/23 09:29, Jaegeuk Kim wrote: > > On 04/21, Damien Le Moal wrote: > >> On 4/21/23 08:44, Bart Van Assche wrote: > >>> On 4/20/23 16:37, Damien Le Moal wrote: > >>>> Why would you need to handle the max active zone number restriction in the > >>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy > >>>> though, working on it). > >>> > >>> Hi Damien, > >>> > >>> If the user (filesystem) restricts the number of active zones, the code > >>> for restricting the number of active zones will have to be duplicated > >>> into every filesystem that supports zoned devices. Wouldn't it be better > >>> if the I/O scheduler tracks the number of active zones? > >> > >> I do not think so. The reason is that for a file system, the block allocator > >> must be aware of any active zone limit of the underlying device to make the best > >> decision possible regarding where to allocate blocks for files and metadata. For > > > > Well, I'm wondering what kind of decision FS can make when allocating zones? > > Not sure what you mean... It is very similar to regular block device case. The > FS does block allocation based on whatever block placement policy it wants/need > to implement. With zoned devices, the FS block management object are mapped to > zones (btrfs: block group == zone, f2fs: section == zone) and the active zone > limits simply adds one more constraint regarding the selection of block groups > for allocating blocks. This is a resource management issue. Ok, so it seems I overlooked there might be something in the zone allocation policy. So, f2fs already manages 6 open zones by design. > > >> btrfs, we added "active block groups" management for that purpose. And we have > >> tracking of a block group active state so that the block allocator can start > >> using new block groups (inactive ones) when previously used ones become full. We > >> also have a "finish block group" for cases when there is not enough remaining > >> free blocks in a group/zone (this does a finish zone operation to make a > >> non-full zone full, that is, inactive). > >> > >> Even if the block IO scheduler were to track active zones, the FS would still > >> need to do its own tracking (e.g. to be able to finish zones when needed). So I > > > > Why does FS also need to track the # of open zones redundantly? I have two > > Because the FS should not be issuing writes to a zone that cannot be activated > on the device, e.g. starting writing to an empty zone when there are already N > zones being written or partly written, with N >= max active zones, will result > in IO failures. Even if you have active zone tracking in the block IO scheduler, > there is absolutely NOTHING that the scheduler can do to prevent such errors. > E.g. think of this simple case: > 1) Let's take a device with max active zones = N (N != 0) > 2) The FS implements a block allocation policy which results in new files being > written to empty block groups, with 1 block group == 1 zone > 3) User writes N files of 4KB > > After step 3, the device has N active zones. So if the user tries to write a new > file, it will fail (cannot write to an empty zone as that will result in an > error because that zone cannot be activated by the device). AND the FS cannot > write metadata for these files into a metadata block group. I think it needs to consider block allocation vs. data writes separately. 
That being said, 1) FS zone allocation: as you described, the FS needs to allocate blocks per zone, and should finish *allocating* blocks entirely in the zone when allocating a new one if it meets the limit. Fortunately, F2FS is doing that by design, so I don't see any need to manage the open zone limitation. 2) data writes: the IO scheduler needs to control the write pipeline to get the best performance while just checking the max open zones seamlessly. With that, the FS doesn't need to wait for IO completion when allocating a new zone. > > There is nothing that the IO scheduler can do about all this. The FS has to be > more intelligent and do block allocation also based on the current > active/inactive state of the zones used by block groups. TBH, I can't find any benefit in managing such active/inactive states in the FS. Am I missing something in btrfs especially? > > > concerns if FS needs to force it: > > 1) performance - waiting for finish_zone before allocating a new zone can break > > the IO pipeline and block FS operations. > > The need to perform a zone finish is purely an optimization if, for instance, > you want to reduce fragmentation. E.g., if there is only 4K of free space left > in a zone and need to write a 1MB extent, you can write the last 4K of that > zone, making it inactive and write the remaining 1MB - 4KB in another zone and > you are guaranteed that this other zone can be written since you just > deactivated one zone. > > But if you do not want to fragment that 1MB extent, then you must finish that > zone with only 4KB left first, to ensure that another zone can be activated. So, why should the FS be aware of that? I was expecting that, once the FS submitted a 1MB extent, the block layer or IO scheduler would gracefully finish the old zone and open a new one matching the on-disk write pointers. > > This of course also depends on the current number of active zones N. If N < max > active zone limit, then there is no need to finish a zone. > > > 2) multiple partition support - if F2FS uses two partitions, one on conventional > > partition while the other on zoned partition, we have to maintain such tracking > > mechanism on zoned partition only which gives some code complexity. > > Conventional zones have no concept of active zones. All active zone resources > can be allocated to the sequential zones. And zoned block devices do not support > partitions anyway. > > > In general, doesn't it make sense that FS (not zoned-device FS) just needs to > > guarantee sequential writes per zone, while IO scheduler needs to limit the # > > of open zones gracefully? > > No. That will never work. See the example above: you can endup in a situation > where the drive becomes read-only (all writes failing) if the FS does not direct > block allocation & writes to zones that are already active. No amount of > trickery in the IO scheduler can change that fact. > > If you want to hide the active zone limit to the FS, then what is needed is a > device mapper that remaps blocks. That is a lot more overhead that implementing > that support in the FS, and the FS can do a much better job at optimizing block > placement. Oh, I meant an FS like f2fs that supports zoned devices.
On 4/22/23 05:15, Jaegeuk Kim wrote: > On 04/21, Damien Le Moal wrote: >> On 4/21/23 09:29, Jaegeuk Kim wrote: >>> On 04/21, Damien Le Moal wrote: >>>> On 4/21/23 08:44, Bart Van Assche wrote: >>>>> On 4/20/23 16:37, Damien Le Moal wrote: >>>>>> Why would you need to handle the max active zone number restriction in the >>>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy >>>>>> though, working on it). >>>>> >>>>> Hi Damien, >>>>> >>>>> If the user (filesystem) restricts the number of active zones, the code >>>>> for restricting the number of active zones will have to be duplicated >>>>> into every filesystem that supports zoned devices. Wouldn't it be better >>>>> if the I/O scheduler tracks the number of active zones? >>>> >>>> I do not think so. The reason is that for a file system, the block allocator >>>> must be aware of any active zone limit of the underlying device to make the best >>>> decision possible regarding where to allocate blocks for files and metadata. For >>> >>> Well, I'm wondering what kind of decision FS can make when allocating zones? >> >> Not sure what you mean... It is very similar to regular block device case. The >> FS does block allocation based on whatever block placement policy it wants/need >> to implement. With zoned devices, the FS block management object are mapped to >> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone >> limits simply adds one more constraint regarding the selection of block groups >> for allocating blocks. This is a resource management issue. > > Ok, so it seems I overlooked there might be something in the zone allocation > policy. So, f2fs already manages 6 open zones by design. Yes, so as long as the device allows for at least 6 active zones, there are no issues with f2fs. >>>> btrfs, we added "active block groups" management for that purpose. And we have >>>> tracking of a block group active state so that the block allocator can start >>>> using new block groups (inactive ones) when previously used ones become full. We >>>> also have a "finish block group" for cases when there is not enough remaining >>>> free blocks in a group/zone (this does a finish zone operation to make a >>>> non-full zone full, that is, inactive). >>>> >>>> Even if the block IO scheduler were to track active zones, the FS would still >>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I >>> >>> Why does FS also need to track the # of open zones redundantly? I have two >> >> Because the FS should not be issuing writes to a zone that cannot be activated >> on the device, e.g. starting writing to an empty zone when there are already N >> zones being written or partly written, with N >= max active zones, will result >> in IO failures. Even if you have active zone tracking in the block IO scheduler, >> there is absolutely NOTHING that the scheduler can do to prevent such errors. >> E.g. think of this simple case: >> 1) Let's take a device with max active zones = N (N != 0) >> 2) The FS implements a block allocation policy which results in new files being >> written to empty block groups, with 1 block group == 1 zone >> 3) User writes N files of 4KB >> >> After step 3, the device has N active zones. So if the user tries to write a new >> file, it will fail (cannot write to an empty zone as that will result in an >> error because that zone cannot be activated by the device). AND the FS cannot >> write metadata for these files into a metadata block group. 
> > I think it needs to consider block allocation vs. data writes separately. That > being said, That mostly depends on the FS design. > > 1) FS zone allocation: as you described, FS needs to allocate blocks per zone, > and should finish to *allocate* blocks entirely in the zone, when allocating a > new one if it meets the limit. Fortunately, F2FS is doing that by design, so > I don't see any need to manage the open zone limitation. Correct for the f2fs case. btrfs is different in that respect. > 2) data writes: IO scheduler needs to control write pipeline to get the best > performance while just checking the max open zones seamlessly. There is absolutely no need for the IO scheduler to check the open/active zone state. More below. > With that, FS doesn't need to wait for IO completion when allocating a new > zone. Incorrect. I showed you a simple example of why. You can also consider a more complex scenario and think about what can happen: multiple writers doing buffered IOs through the page cache and suddenly doing an fsync. If you have more writers than the device can have active zones, depending on how blocks are allocated, you'll need to wait before issuing writes for some to ensure that zones can be activated. This is *NOT* a performance impact, as this synchronization is needed; it means that you are already pounding the drive hard. Issuing more IOs will not make the drive go faster. >> There is nothing that the IO scheduler can do about all this. The FS has to be >> more intelligent and do block allocation also based on the current >> active/inactive state of the zones used by block groups. > > TBH, I can't find any benefit to manage such the active/inactive states in FS. > Am I mssing something in btrfs especially? btrfs block management is a little more complex than f2fs's. For one thing, btrfs is 100% copy on write (unlike f2fs), which means that we absolutely MUST ensure that we can always write metadata block groups and the super block (multiple copies). So we need some "reserved" active zone resources for that. And for file data, given that block allocation may work much faster than actually writing to the device, you need to control the writeback process to throttle it within the available active zone resources. This is naturally done in f2fs given that there are at most only 6 segments/zones used at any time for writing. But btrfs needs additional code. >>> concerns if FS needs to force it: >>> 1) performance - waiting for finish_zone before allocating a new zone can break >>> the IO pipeline and block FS operations. >> >> The need to perform a zone finish is purely an optimization if, for instance, >> you want to reduce fragmentation. E.g., if there is only 4K of free space left >> in a zone and need to write a 1MB extent, you can write the last 4K of that >> zone, making it inactive and write the remaining 1MB - 4KB in another zone and >> you are guaranteed that this other zone can be written since you just >> deactivated one zone. >> >> But if you do not want to fragment that 1MB extent, then you must finish that >> zone with only 4KB left first, to ensure that another zone can be activated. > > So, why should FS be aware of that? I was expecting, once FS submitted 1MB > extent, block or IO scheduler will gracefully finish the old zone and open a > new one which is matched to the in-disk write pointers. The block IO scheduler is just that, a scheduler. It should NEVER be the source of a new command. You cannot have the block IO scheduler issue commands. 
That is not how the block layer works. And it seems that you are assuming that block IOs make it to the scheduler in about the same order as issued by the FS. There are no guarantees of that happening when considering a set of different zones as the various issuers may block on request allocation, or due to a device mapper between FS and device, etc. Plenty of reasons for the overall write pattern to change between FS and device. Not per zone for regular writes though, that is preserved. But that is not the case for zone append writes that btrfs uses. And you are forgetting the case of applications using the drive directly. You cannot rely on the application working correctly and have the IO scheduler do some random things about open/active zones. That will never work.
On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote: > >> for allocating blocks. This is a resource management issue. > > > > Ok, so it seems I overlooked there might be something in the zone allocation > > policy. So, f2fs already manages 6 open zones by design. > > Yes, so as long as the device allows for at least 6 active zones, there are no > issues with f2fs. I don't think it's quite as rosy, because f2fs can still schedule I/O to the old zone after already scheduling I/O to a new zone for any of these 6 slots. It'll need code to wait for all I/O to the old zone to finish first, similar to btrfs.
On 04/22, Damien Le Moal wrote: > On 4/22/23 05:15, Jaegeuk Kim wrote: > > On 04/21, Damien Le Moal wrote: > >> On 4/21/23 09:29, Jaegeuk Kim wrote: > >>> On 04/21, Damien Le Moal wrote: > >>>> On 4/21/23 08:44, Bart Van Assche wrote: > >>>>> On 4/20/23 16:37, Damien Le Moal wrote: > >>>>>> Why would you need to handle the max active zone number restriction in the > >>>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy > >>>>>> though, working on it). > >>>>> > >>>>> Hi Damien, > >>>>> > >>>>> If the user (filesystem) restricts the number of active zones, the code > >>>>> for restricting the number of active zones will have to be duplicated > >>>>> into every filesystem that supports zoned devices. Wouldn't it be better > >>>>> if the I/O scheduler tracks the number of active zones? > >>>> > >>>> I do not think so. The reason is that for a file system, the block allocator > >>>> must be aware of any active zone limit of the underlying device to make the best > >>>> decision possible regarding where to allocate blocks for files and metadata. For > >>> > >>> Well, I'm wondering what kind of decision FS can make when allocating zones? > >> > >> Not sure what you mean... It is very similar to regular block device case. The > >> FS does block allocation based on whatever block placement policy it wants/need > >> to implement. With zoned devices, the FS block management object are mapped to > >> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone > >> limits simply adds one more constraint regarding the selection of block groups > >> for allocating blocks. This is a resource management issue. > > > > Ok, so it seems I overlooked there might be something in the zone allocation > > policy. So, f2fs already manages 6 open zones by design. > > Yes, so as long as the device allows for at least 6 active zones, there are no > issues with f2fs. > > >>>> btrfs, we added "active block groups" management for that purpose. And we have > >>>> tracking of a block group active state so that the block allocator can start > >>>> using new block groups (inactive ones) when previously used ones become full. We > >>>> also have a "finish block group" for cases when there is not enough remaining > >>>> free blocks in a group/zone (this does a finish zone operation to make a > >>>> non-full zone full, that is, inactive). > >>>> > >>>> Even if the block IO scheduler were to track active zones, the FS would still > >>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I > >>> > >>> Why does FS also need to track the # of open zones redundantly? I have two > >> > >> Because the FS should not be issuing writes to a zone that cannot be activated > >> on the device, e.g. starting writing to an empty zone when there are already N > >> zones being written or partly written, with N >= max active zones, will result > >> in IO failures. Even if you have active zone tracking in the block IO scheduler, > >> there is absolutely NOTHING that the scheduler can do to prevent such errors. > >> E.g. think of this simple case: > >> 1) Let's take a device with max active zones = N (N != 0) > >> 2) The FS implements a block allocation policy which results in new files being > >> written to empty block groups, with 1 block group == 1 zone > >> 3) User writes N files of 4KB > >> > >> After step 3, the device has N active zones. 
So if the user tries to write a new > >> file, it will fail (cannot write to an empty zone as that will result in an > >> error because that zone cannot be activated by the device). AND the FS cannot > >> write metadata for these files into a metadata block group. > > > > I think it needs to consider block allocation vs. data writes separately. That > > being said, > > That mostly depends on the FS design. > > > > > 1) FS zone allocation: as you described, FS needs to allocate blocks per zone, > > and should finish to *allocate* blocks entirely in the zone, when allocating a > > new one if it meets the limit. Fortunately, F2FS is doing that by design, so > > I don't see any need to manage the open zone limitation. > > Correct for f2fs case. btrfs is different in that respect. > > > 2) data writes: IO scheduler needs to control write pipeline to get the best > > performance while just checking the max open zones seamlessly. > > There is absolutely no need for the IO scheduler to check open/active zones > state. More below. > > > With that, FS doesn't need to wait for IO completion when allocating a new > > zone. > > Incorrect. I showed you a simple example of why. You can also consider a more > complex scenario and think about what can happen: multiple writers doing > buffered IOs through the page cache and suddenly doing an fsync. If you have > more writers than can have active zones, depending on how blocks are allocated, > you'll need to wait before issuing writes for some to ensure that zones can be > activated. This is *NOT* a performance impact as this synchronization is needed, > it means that you already are pounding the drive hard. Issuing more IOs will not > make the drive go faster. There's no issue at all in F2FS regarding to # of open zone limitation now, even I don't see any problem with a working zoned disk. Again, what I'm talking about is single case where FS needs to stop block allocation on new zone till finishing the submitted writes to old zones, even though FS didn't violate the full sequential writes in a zone all the time. Yes, it's about performance, not functional matter. > > >> There is nothing that the IO scheduler can do about all this. The FS has to be > >> more intelligent and do block allocation also based on the current > >> active/inactive state of the zones used by block groups. > > > > TBH, I can't find any benefit to manage such the active/inactive states in FS. > > Am I mssing something in btrfs especially? > > btrfs block management is a little more complex than f2fs. For one thing, btrfs > is 100% copy on write (unlike f2fs), which means that we absolutely MUST ensure > that we can always write metadata block groups and the super block (multiple > copies). So we need some "reserved" active zone resources for that. And for file > data, given the that block allocation may work much faster than actually writing > the device, you need to control the writeback process to throttle it within the > available active zone resources. This is naturally done in f2fs given that there > are at most only 6 segments/zones used at any time for writing. But btrfs needs > additional code. > > >>> concerns if FS needs to force it: > >>> 1) performance - waiting for finish_zone before allocating a new zone can break > >>> the IO pipeline and block FS operations. > >> > >> The need to perform a zone finish is purely an optimization if, for instance, > >> you want to reduce fragmentation. 
E.g., if there is only 4K of free space left > >> in a zone and need to write a 1MB extent, you can write the last 4K of that > >> zone, making it inactive and write the remaining 1MB - 4KB in another zone and > >> you are guaranteed that this other zone can be written since you just > >> deactivated one zone. > >> > >> But if you do not want to fragment that 1MB extent, then you must finish that > >> zone with only 4KB left first, to ensure that another zone can be activated. > > > > So, why should FS be aware of that? I was expecting, once FS submitted 1MB > > extent, block or IO scheduler will gracefully finish the old zone and open a > > new one which is matched to the in-disk write pointers. > > The block IO scheduler is just that, a scheduler. It should NEVER be the source > of a new command. You cannot have the block IO scheduler issue commands. That is > not how the block layer works. > > And it seems that you are assuming that block IOs make it to the scheduler in > about the same order as issued by the FS. There are no guarantees of that > happening when considering a set of different zones as the various issuers may > block on request allocation, or due to a device mapper between FS and device, > etc. Plenty of reasons for the overall write pattern to change between FS and > device. Not per zone for regular writes though, that is preserved. But that is > not the case for zone append writes that btrfs uses. > > And you are forgetting the case of applications using the drive directly. You > cannot rely on the application working correctly and have the IO scheduler do > some random things about open/active zones. That will never work.
On 04/24, Christoph Hellwig wrote: > On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote: > > >> for allocating blocks. This is a resource management issue. > > > > > > Ok, so it seems I overlooked there might be something in the zone allocation > > > policy. So, f2fs already manages 6 open zones by design. > > > > Yes, so as long as the device allows for at least 6 active zones, there are no > > issues with f2fs. > > I don't think it's quite as rosy, because f2fs can still schedule > I/O to the old zone after already scheduling I/O to a new zone for > any of these 6 slots. It'll need code to wait for all I/O to the old > zone to finish first, similar to btrfs. F2FS should serialize all the writes across 6 active open zones. If not, I think it's a bug. The problem here is 1) open zone#1 through zone #6, 2) allocate all blocks in zone #1, 3) submit all writes in zone #1, 4) allocate blocks in zone #7, 5) submit all writes in zone #7, and so on. In this scenario, I'm asking why F2FS needs to wait for entire write completion between 3) and 4), which will impact performance a lot since 4) blocks syscalls.
On 04/24, Christoph Hellwig wrote: > On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote: > > >> for allocating blocks. This is a resource management issue. > > > > > > Ok, so it seems I overlooked there might be something in the zone allocation > > > policy. So, f2fs already manages 6 open zones by design. > > > > Yes, so as long as the device allows for at least 6 active zones, there are no > > issues with f2fs. > > I don't think it's quite as rosy, because f2fs can still schedule > I/O to the old zone after already scheduling I/O to a new zone for > any of these 6 slots. It'll need code to wait for all I/O to the old > zone to finish first, similar to btrfs. Hmm, I was concerned about the small zone size impacting the bandwidth, but I feel that I can reduce the # of logs in F2FS in that case. So there's some trade-off. Let me take another look to see if I'm missing anything else. :(
On 4/24/23 15:01, Christoph Hellwig wrote: > On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote: >>>> for allocating blocks. This is a resource management issue. >>> >>> Ok, so it seems I overlooked there might be something in the zone allocation >>> policy. So, f2fs already manages 6 open zones by design. >> >> Yes, so as long as the device allows for at least 6 active zones, there are no >> issues with f2fs. > > I don't think it's quite as rosy, because f2fs can still schedule > I/O to the old zone after already scheduling I/O to a new zone for > any of these 6 slots. It'll need code to wait for all I/O to the old > zone to finish first, similar to btrfs. Indeed. Good point.
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block index c57e5b7cb532..4527d0514fdb 100644 --- a/Documentation/ABI/stable/sysfs-block +++ b/Documentation/ABI/stable/sysfs-block @@ -671,6 +671,14 @@ Description: regular block devices. +What: /sys/block/<disk>/queue/zone_capacity +Date: March 2023 +Contact: linux-block@vger.kernel.org +Description: + [RO] The number of 512-byte sectors in a zone that can be read + or written. This number is less than or equal to the zone size. + + What: /sys/block/<disk>/queue/zone_write_granularity Date: January 2021 Contact: linux-block@vger.kernel.org diff --git a/block/blk-settings.c b/block/blk-settings.c index 896b4654ab00..96f5dc63a815 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -685,6 +685,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, b->max_secure_erase_sectors); t->zone_write_granularity = max(t->zone_write_granularity, b->zone_write_granularity); + t->zone_capacity = max(t->zone_capacity, b->zone_capacity); t->zoned = max(t->zoned, b->zoned); return ret; } diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index a64208583853..0443b8f536f4 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -337,6 +337,11 @@ static ssize_t queue_nr_zones_show(struct request_queue *q, char *page) return queue_var_show(disk_nr_zones(q->disk), page); } +static ssize_t queue_zone_capacity_show(struct request_queue *q, char *page) +{ + return queue_var_show(q->limits.zone_capacity, page); +} + static ssize_t queue_max_open_zones_show(struct request_queue *q, char *page) { return queue_var_show(bdev_max_open_zones(q->disk->part0), page); @@ -587,6 +592,7 @@ QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity"); QUEUE_RO_ENTRY(queue_zoned, "zoned"); QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones"); +QUEUE_RO_ENTRY(queue_zone_capacity, "zone_capacity"); QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones"); QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones"); @@ -644,6 +650,7 @@ static struct attribute *queue_attrs[] = { &queue_nonrot_entry.attr, &queue_zoned_entry.attr, &queue_nr_zones_entry.attr, + &queue_zone_capacity_entry.attr, &queue_max_open_zones_entry.attr, &queue_max_active_zones_entry.attr, &queue_nomerges_entry.attr, diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 9b9cd6adfd1b..a319acbe377b 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -463,6 +463,7 @@ struct blk_revalidate_zone_args { unsigned long *seq_zones_wlock; unsigned int nr_zones; sector_t zone_sectors; + sector_t zone_capacity; sector_t sector; }; @@ -489,6 +490,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx, } args->zone_sectors = zone->len; + args->zone_capacity = zone->capacity; args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len); } else if (zone->start + args->zone_sectors < capacity) { if (zone->len != args->zone_sectors) { @@ -496,12 +498,23 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx, disk->disk_name); return -ENODEV; } + if (zone->capacity != args->zone_capacity) { + pr_warn("%s: Invalid zoned device with non constant zone capacity\n", + disk->disk_name); + return -ENODEV; + } } else { if (zone->len > args->zone_sectors) { pr_warn("%s: Invalid zoned device with larger last zone size\n", disk->disk_name); return -ENODEV; } + if (zone->capacity > + min(args->zone_sectors, args->zone_capacity)) { + pr_warn("%s: Invalid zoned device with too large last zone capacity\n", + disk->disk_name); + 
return -ENODEV; + } } /* Check for holes in the zone report */ @@ -604,6 +617,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk, blk_mq_freeze_queue(q); if (ret > 0) { blk_queue_chunk_sectors(q, args.zone_sectors); + q->limits.zone_capacity = args.zone_capacity; disk->nr_zones = args.nr_zones; swap(disk->seq_zones_wlock, args.seq_zones_wlock); swap(disk->conv_zones_bitmap, args.conv_zones_bitmap); @@ -635,6 +649,7 @@ void disk_clear_zone_settings(struct gendisk *disk) disk->max_open_zones = 0; disk->max_active_zones = 0; q->limits.chunk_sectors = 0; + q->limits.zone_capacity = 0; q->limits.zone_write_granularity = 0; q->limits.max_zone_append_sectors = 0; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 261538319bbf..4fb0e6d7fdc8 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -297,6 +297,7 @@ struct queue_limits { unsigned int discard_granularity; unsigned int discard_alignment; unsigned int zone_write_granularity; + unsigned int zone_capacity; unsigned short max_segments; unsigned short max_integrity_segments;
Make the zone capacity available in struct queue_limits for those drivers that need it. Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- Documentation/ABI/stable/sysfs-block | 8 ++++++++ block/blk-settings.c | 1 + block/blk-sysfs.c | 7 +++++++ block/blk-zoned.c | 15 +++++++++++++++ include/linux/blkdev.h | 1 + 5 files changed, 32 insertions(+)
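If the patch were applied as posted, the new attribute could be read from userspace like any other queue limit. A small example in C; "nvme0n1" is a placeholder device name:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the zone capacity (in 512-byte sectors) this patch would expose. */
int main(void)
{
	FILE *f = fopen("/sys/block/nvme0n1/queue/zone_capacity", "r");
	unsigned long long zcap;

	if (!f) {
		perror("open zone_capacity");
		return EXIT_FAILURE;
	}
	if (fscanf(f, "%llu", &zcap) != 1) {
		fclose(f);
		fprintf(stderr, "failed to parse zone_capacity\n");
		return EXIT_FAILURE;
	}
	fclose(f);
	printf("zone capacity: %llu sectors (%llu bytes)\n", zcap, zcap * 512);
	return EXIT_SUCCESS;
}
```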