[v15,00/13] support zoned block devices with non-power-of-2 zone sizes

Message ID 20220923173618.6899-1-p.raghav@samsung.com

Message

Pankaj Raghav Sept. 23, 2022, 5:36 p.m. UTC
Hi Jens,
  Please consider this patch series for the 6.1 release.

- Background and Motivation:

The zoned storage implementation in Linux, introduced in v4.10, first
targeted SMR drives, which have a power-of-2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used since v3.16 to prevent IO
merging across chunks beyond the specified size (commit 762380ad9322
("block: add notion of a chunk size for request merging")). This general
block layer po2 requirement for blk_queue_chunk_sectors() was removed in
v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors' to be
non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, the zone capacity (cap) is not
the same as the po2 zone size. When zone cap != zone size, unmapped LBAs
are introduced to cover the space between the zone cap and the zone size.
The po2 requirement does not make sense for this type of zoned storage
device. This patch series aims to remove these unmapped LBAs for zoned
devices whose zone cap is npo2. This is done by relaxing the po2 zone
size constraint in the kernel and allowing zoned devices with npo2 zone
sizes if zone cap == zone size.
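
For illustration, the sketch below shows the arithmetic this change
implies; it is a minimal sketch, not code from this series, and the
helper names are hypothetical. With a po2 zone size, the zone number and
in-zone offset of a sector can be derived with a shift and a mask; with
an npo2 zone size, the generic form needs a 64-bit division and
remainder, which is broadly what making helpers such as bdev_nr_zones()
and disk_zone_no() generic in patches 1-3 amounts to.

  #include <stdint.h>

  typedef uint64_t sector_t;

  /* Hypothetical helpers, not the kernel's; shown only to contrast the math. */

  /* po2 zone size: zone number and in-zone offset via shift/mask. */
  static inline sector_t zone_no_po2(sector_t sector, unsigned int zone_bits)
  {
          return sector >> zone_bits;             /* zone_size == 1 << zone_bits */
  }

  static inline sector_t zone_offset_po2(sector_t sector, sector_t zone_size)
  {
          return sector & (zone_size - 1);        /* valid only for po2 zone_size */
  }

  /* Generic (npo2-capable) form: 64-bit division and remainder. */
  static inline sector_t zone_no_generic(sector_t sector, sector_t zone_size)
  {
          return sector / zone_size;
  }

  static inline sector_t zone_offset_generic(sector_t sector, sector_t zone_size)
  {
          return sector % zone_size;
  }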

Removing the po2 requirement from zoned storage should now be possible,
provided that no userspace or performance regressions are introduced.
Stop-gap patches have already been merged into f2fs-tools to proactively
reject npo2 zone sizes until proper support is added [1].

There were two previous efforts to add support for npo2 devices: 1) via
device-level emulation [2], which was rejected with the conclusion that
support for non-po2 zoned devices should be added across the complete
stack [3]; 2) adding support across the complete stack by removing the
constraint in the block layer and NVMe layer, with support for btrfs,
zonefs, etc., which was rejected with the conclusion that a dm target
should be added for FS support [0] to reduce the regression impact.

This series adds support for npo2 zoned devices in the block and NVMe
layers, and a new **dm target** is added: dm-po2zoned-target. This new
target will initially be used for filesystems such as btrfs and f2fs
until native npo2 zone support is added.

- Patchset description:
Patches 1-3 deal with removing the po2 constraint from the
block layer.

Patches 4-5 deal with removing the constraint from NVMe (nvmet and
nvme zns).

Patch 6 removes the po2 constraint in null_blk.

Patch 7 adds npo2 support to zonefs.

Patches 8-13 add support for npo2 zoned devices in the DM layer and
add a new target, dm-po2zoned-target, which converts a zoned device with
an npo2 zone size into a zoned target with a po2 zone size.
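
To make the emulation concrete, here is a minimal sketch of the
remapping arithmetic such a target needs, under the assumption that the
exposed po2 zone size is the device's npo2 zone size rounded up to the
next power of two; this is an illustration only, not the dm-po2zoned
code, and the names are hypothetical.

  #include <stdint.h>

  typedef uint64_t sector_t;

  /*
   * Illustration only (hypothetical names, not the dm-po2zoned code).
   *
   * dev_zone_sectors: npo2 zone size of the underlying device
   * emu_zone_sectors: po2 zone size exposed by the target, assumed to be
   *                   dev_zone_sectors rounded up to the next power of two
   */
  static sector_t emu_to_dev_sector(sector_t emu_sector,
                                    sector_t emu_zone_sectors,
                                    sector_t dev_zone_sectors)
  {
          sector_t zone   = emu_sector / emu_zone_sectors;       /* emulated zone number */
          sector_t offset = emu_sector & (emu_zone_sectors - 1); /* offset in that zone  */

          /* Zone numbers match; device zones start at npo2 multiples. */
          return zone * dev_zone_sectors + offset;
  }

  /*
   * Sectors in [dev_zone_sectors, emu_zone_sectors) of each emulated zone
   * have no backing LBAs: reads there must be completed by the target
   * itself (e.g. zero-filled) and writes rejected.
   */
  static int in_emulated_area(sector_t emu_sector,
                              sector_t emu_zone_sectors,
                              sector_t dev_zone_sectors)
  {
          return (emu_sector & (emu_zone_sectors - 1)) >= dev_zone_sectors;
  }

Exposing po2-sized zones this way lets existing consumers that assume a
po2 zone size (such as btrfs and f2fs today) sit on top of an npo2
device unchanged.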

The patch series is based on linux-next tag: next-20220921

Testing:
The new target was tested with blktests and the zonefs test suite in
QEMU and on a real ZNS device with an npo2 zone size.

Performance Measurement on null_blk:
Device:
zone size = 128M, blocksize=4k

FIO cmd:
fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are seen in sequential write with IO depth 8 and in
random read with IO depth 16, but overall there is no noticeable
performance difference.

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null when obtained from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Make the rounddown operation generic so it works on 32-bit archs
  (Bart)
- Add comments where the generic calculation is used directly instead of
  having special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs 1M
  instead of BTRFS_STRIPE_LEN (David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use the DIV_ROUND_UP* helpers instead of round_up as it breaks the
  32-bit arch build in null_blk (kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
  open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
  fixes and rearrangement
- Removed some unnecessary comments.

Changes since v8:
- Rename dm-po2z to dm-po2zone
- set max_io_len for the target to po2 zone size sector
- Simplify dm-po2zone target by removing some superfluous conditions
- Added documentation for the new dm-po2zone target
- Change pr_warn to pr_err for critical errors
- Split patch 2 and 11 with their corresponding prep patches
- Minor spelling and grammatical improvements

Changes since v9:
- Add a check for a zoned device in dm-po2zone ctr.
- Rephrased some commit messages and documentation for clarity

Changes since v10:
- Simplified dm_poz_map function (Damien)

Changes since v11:
- Rename bio_in_emulated_zone_area and some formatting adjustments
  (Damien)

Changes since v12:
- Changed the name from dm-po2zone to dm-po2zoned to have a common
  naming convention for zoned devices (Mike)
- Return directly from the dm_po2z_map function instead of having
  returns from different functions (Mike)
- Change target type to target feature flag in commit header (Mike)
- Added dm_po2z_status function and NOWAIT flag to the target
- Added some extra information to the target's documentation.

Changes since v13:
- Use goto for cleanup in dm-po2zoned target (Mike)
- Added dtr to dm-po2zoned target
- Expose zone capacity instead of po2 zone size for
  DMSTATUS_TYPE_INFO (Mike)

Changes since v14:
- Make sure to put the device if ctr fails after dm_get_device() (Mike)

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
  block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in
    blkdev.h
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  null_blk: allow zoned devices with non power-of-2 zone sizes
  zonefs: allow non power of 2 zoned devices
  dm-zone: use generic helpers to calculate offset from zone start
  dm-table: allow zoned devices with non power-of-2 zone sizes
  dm: call dm_zone_endio after the target endio callback for zoned
    devices
  dm: introduce DM_EMULATED_ZONES target feature flag
  dm: add power-of-2 target for zoned devices with non power-of-2 zone
    sizes

 .../admin-guide/device-mapper/dm-po2zoned.rst |  79 +++++
 .../admin-guide/device-mapper/index.rst       |   1 +
 block/blk-core.c                              |   2 +-
 block/blk-zoned.c                             |  37 ++-
 drivers/block/null_blk/main.c                 |   5 +-
 drivers/block/null_blk/null_blk.h             |   1 +
 drivers/block/null_blk/zoned.c                |  18 +-
 drivers/md/Kconfig                            |  10 +
 drivers/md/Makefile                           |   2 +
 drivers/md/dm-po2zoned-target.c               | 293 ++++++++++++++++++
 drivers/md/dm-table.c                         |  20 +-
 drivers/md/dm-zone.c                          |   8 +-
 drivers/md/dm-zoned-target.c                  |   8 +
 drivers/md/dm.c                               |   8 +-
 drivers/nvme/host/zns.c                       |  14 +-
 drivers/nvme/target/zns.c                     |   3 +-
 fs/zonefs/super.c                             |   6 +-
 fs/zonefs/zonefs.h                            |   1 -
 include/linux/blkdev.h                        |  80 +++--
 include/linux/device-mapper.h                 |   9 +
 20 files changed, 530 insertions(+), 75 deletions(-)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
 create mode 100644 drivers/md/dm-po2zoned-target.c

Comments

Pankaj Raghav Sept. 29, 2022, 6:31 a.m. UTC | #1
> Hi Jens,
>   Please consider this patch series for the 6.1 release.
> 

Hi Jens, Christoph, and Keith,
 All the patches have a Reviewed-by tag at this point. Can we queue this up
for 6.1?

--
Pankaj

Jens Axboe Sept. 30, 2022, 3:13 p.m. UTC | #2
On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>> Hi Jens,
>>   Please consider this patch series for the 6.1 release.
>>
> 
> Hi Jens, Christoph, and Keith,
>  All the patches have a Reviewed-by tag at this point. Can we queue this up
> for 6.1?

It's getting pretty late for 6.1 and I'd really like to have both Christoph
and Martin sign off on these changes.
Bart Van Assche Sept. 30, 2022, 7:38 p.m. UTC | #3
On 9/30/22 08:13, Jens Axboe wrote:
> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>> Hi Jens,
>>>    Please consider this patch series for the 6.1 release.
>>>
>>
>> Hi Jens, Christoph, and Keith,
>>   All the patches have a Reviewed-by tag at this point. Can we queue this up
>> for 6.1?
> 
> It's getting pretty late for 6.1 and I'd really like to have both Christoph
> and Martin sign off on these changes.

Hi Jens,

Agreed that it's getting late for 6.1.

Since this has not been mentioned in the cover letter, I want to add 
that in the near future we will need these patches for Android devices. 
JEDEC is working on supporting zoned storage for UFS devices, the 
storage devices used in all modern Android phones. Although it would be 
possible to make the offset between zone starts a power of two by 
inserting gap zones between data zones, UFS vendors asked not to do this 
and hence need support for zone sizes that are not a power of two. An 
advantage of not having to deal with gap zones is better filesystem 
performance since filesystem extents cannot span gap zones. Having to 
split filesystem extents because of gap zones reduces filesystem 
performance.

Thanks,

Bart.


Jens Axboe Sept. 30, 2022, 9:24 p.m. UTC | #4
On 9/30/22 1:38 PM, Bart Van Assche wrote:
> On 9/30/22 08:13, Jens Axboe wrote:
>> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>>> Hi Jens,
>>>>    Please consider this patch series for the 6.1 release.
>>>>
>>>
>>> Hi Jens, Christoph, and Keith,
>>>   All the patches have a Reviewed-by tag at this point. Can we queue this up
>>> for 6.1?
>>
>> It's getting pretty late for 6.1 and I'd really like to have both Christoph
>> and Martin sign off on these changes.
> 
> Hi Jens,
> 
> Agreed that it's getting late for 6.1.
> 
> Since this has not been mentioned in the cover letter, I want to add
> that in the near future we will need these patches for Android
> devices. JEDEC is working on supporting zoned storage for UFS devices,
> the storage devices used in all modern Android phones. Although it
> would be possible to make the offset between zone starts a power of
> two by inserting gap zones between data zones, UFS vendors asked not
> to do this and hence need support for zone sizes that are not a power
> of two. An advantage of not having to deal with gap zones is better
> filesystem performance since filesystem extents cannot span gap zones.
> Having to split filesystem extents because of gap zones reduces
> filesystem performance.

Noted. I'll find some time to review this as well separately, once we're
on the other side of the merge window.
Damien Le Moal Oct. 1, 2022, 12:45 a.m. UTC | #5
On 10/1/22 04:38, Bart Van Assche wrote:
> On 9/30/22 08:13, Jens Axboe wrote:
>> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>>> Hi Jens,
>>>>    Please consider this patch series for the 6.1 release.
>>>>
>>>
>>> Hi Jens, Christoph, and Keith,
>>>   All the patches have a Reviewed-by tag at this point. Can we queue this up
>>> for 6.1?
>>
>> It's getting pretty late for 6.1 and I'd really like to have both Christoph
>> and Martin sign off on these changes.
> 
> Hi Jens,
> 
> Agreed that it's getting late for 6.1.
> 
> Since this has not been mentioned in the cover letter, I want to add 
> that in the near future we will need these patches for Android devices. 
> JEDEC is working on supporting zoned storage for UFS devices, the 
> storage devices used in all modern Android phones. Although it would be 
> possible to make the offset between zone starts a power of two by 
> inserting gap zones between data zones, UFS vendors asked not to do this 
> and hence need support for zone sizes that are not a power of two. An 
> advantage of not having to deal with gap zones is better filesystem 
> performance since filesystem extents cannot span gap zones. Having to 
> split filesystem extents because of gap zones reduces filesystem 
> performance.

As mentioned many times, my opinion is that a good implementation should
*not* have any extent span zone boundaries. So personally, I do not
consider such argument as a valid justification for the non-power-of-2
zone size support.

> 
> Thanks,
> 
> Bart.
> 
>
Bart Van Assche Oct. 1, 2022, 2:14 a.m. UTC | #6
On 9/30/22 17:45, Damien Le Moal wrote:
> On 10/1/22 04:38, Bart Van Assche wrote:
>> Since this has not been mentioned in the cover letter, I want to add
>> that in the near future we will need these patches for Android devices.
>> JEDEC is working on supporting zoned storage for UFS devices, the
>> storage devices used in all modern Android phones. Although it would be
>> possible to make the offset between zone starts a power of two by
>> inserting gap zones between data zones, UFS vendors asked not to do this
>> and hence need support for zone sizes that are not a power of two. An
>> advantage of not having to deal with gap zones is better filesystem
>> performance since filesystem extents cannot span gap zones. Having to
>> split filesystem extents because of gap zones reduces filesystem
>> performance.
> 
> As mentioned many times, my opinion is that a good implementation should
> *not* have any extent span zone boundaries. So personally, I do not
> consider such argument as a valid justification for the non-power-of-2
> zone size support.

Hi Damien,

Although the filesystem extent issue probably can be solved in software, 
the argument that UFS vendors strongly prefer not to have gap zones and 
hence need support for zone sizes that are not a power of two remains.

Thanks,

Bart.

Bart Van Assche Oct. 24, 2022, 7:02 p.m. UTC | #7
On 9/30/22 14:24, Jens Axboe wrote:
> Noted. I'll find some time to review this as well separately, once we're
> on the other side of the merge window.

Hi Jens,

Now that we are on the other side of the merge window: do you perhaps 
want Pankaj to repost this patch series? From what I have heard in 
several fora (JEDEC, SNIA) all flash storage vendors except one (WDC) 
are in favor of a contiguous LBA space and hence are in favor of 
supporting zone sizes that are not a power of two.

As you may know in JEDEC we are working on standardizing zoned storage 
for UFS devices. We (JEDEC JC-64.1 committee members) would like to know 
whether or not we should require that the UFS zone size should be a 
power of two.

Thank you,

Bart.


Can Guo Jan. 4, 2023, 5:29 a.m. UTC | #8
Hi Pankaj,

On 9/24/2022 1:36 AM, Pankaj Raghav wrote:
> Hi Jens,
>    Please consider this patch series for the 6.1 release.
>
> - Background and Motivation:
>
> The zone storage implementation in Linux, introduced since v4.10, first
> targetted SMR drives which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> to be non-power-of-2").
>
> NAND, which is the media used in newer zoned storage devices, does not
> naturally align to po2. In these devices, zone capacity(cap) is not the
> same as the po2 zone size. When the zone cap != zone size, then unmapped
> LBAs are introduced to cover the space between the zone cap and zone size.
> po2 requirement does not make sense for these type of zone storage devices.
> This patch series aims to remove these unmapped LBAs for zoned devices when
> zone cap is npo2. This is done by relaxing the po2 zone size constraint
> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
> == zone size.

I came across the function sd_zbc_check_capacity() in sd_zbc.c; it still
errors out in the npo2 case.

I don't see this series touching sd_zbc.c. Is there a plan or an existing
change to relax this check?


	if (!is_power_of_2(zone_blocks)) {
		sd_printk(KERN_ERR, sdkp,
			  "Zone size %llu is not a power of two.\n",
			  zone_blocks);
		return -EINVAL;
	}


Thanks.

Regards,

Can Guo.