
[v7,00/11] Introduce Zone Append for writing to zoned block devices

Message ID 20200417121536.5393-1-johannes.thumshirn@wdc.com (mailing list archive)

Message

Johannes Thumshirn April 17, 2020, 12:15 p.m. UTC
The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start
sector of the write points at the start LBA of the zone to write to.
Upon completion the block device responds with the position at which the
data has been placed in the zone. From a high-level perspective this can
be seen as analogous to a file system's block allocator, where the user
writes to a file and the file system takes care of the data placement on
the device.

In order to fully exploit the new zone append command in file systems and
other interfaces above the block layer, we chose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file systems and other interfaces above the block layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by keeping a cache of each zone's write pointer, so
that a zone append issued to the disk can be translated into a regular
write starting at the write pointer. The zone start LBA carried by the
zone append is used to look up the zone in the write pointer offset
cache, and the cached offset is then added to that LBA to get the actual
position to write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND
request into a WRITE(16) command. Upon successful completion of the
WRITE(16), the cache is updated to the new write pointer location and the
written sector is noted in the request. On error the cache entry is
marked invalid and an update of the write pointer is scheduled before the
next write is issued.
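
To illustrate the idea, here is a minimal sketch of that translation
step. All names and structures below are made up for illustration; the
actual implementation lives in drivers/scsi/sd_zbc.c and differs in
detail:

#include <linux/types.h>
#include <linux/log2.h>
#include <linux/blkdev.h>

#define EXAMPLE_INVALID_WP_OFST	((u32)-1)

struct example_zoned_disk {
	sector_t zone_sectors;	/* zone size in 512 B sectors (power of 2) */
	u32 *zone_wp_ofst;	/* cached wp offset per zone, in sectors */
};

/*
 * Translate the start sector of a zone append (always a zone start LBA)
 * into the target sector of a regular WRITE(16).  Returns true and sets
 * *write_sector on success, or false if the cache entry was invalidated
 * and a write pointer update has to be scheduled first.
 */
static bool example_append_to_write(struct example_zoned_disk *d,
				    sector_t zone_start,
				    sector_t *write_sector)
{
	u32 zno = zone_start >> ilog2(d->zone_sectors);	/* zone number */
	u32 ofst = d->zone_wp_ofst[zno];

	if (ofst == EXAMPLE_INVALID_WP_OFST)
		return false;

	*write_sector = zone_start + ofst;	/* zone start + cached wp offset */
	return true;
}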

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone; everything else can be
calculated. On an example drive with 52156 zones, the additional memory
consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51 4 KiB
pages. The performance impact is negligible for a spinning drive.
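
A hedged sketch of what such a cache allocation amounts to (illustrative
helper name, not the actual revalidation code):

#include <linux/types.h>
#include <linux/mm.h>

/*
 * One u32 write pointer offset per zone:
 * 52156 zones * 4 bytes = 208624 bytes, i.e. 51 pages of 4 KiB.
 */
static u32 *example_alloc_wp_ofst_cache(unsigned int nr_zones)
{
	return kvcalloc(nr_zones, sizeof(u32), GFP_KERNEL);
}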

For null_blk the emulation is much simpler, as null_blk's zoned block
device emulation already caches the write pointer position, so we only
need to report the position back to the upper layers. Additional caching
is not needed here.
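
As a rough sketch (again with made-up names, not the actual null_blk
code), the emulated append boils down to:

#include <linux/types.h>
#include <linux/errno.h>

struct example_zone {
	sector_t start;	/* zone start sector */
	sector_t len;	/* zone size in sectors */
	sector_t wp;	/* emulated write pointer */
};

/*
 * The emulated write pointer is already tracked in memory, so the append
 * simply lands there and the position is reported back to the caller.
 */
static int example_emulated_zone_append(struct example_zone *zone,
					unsigned int nr_sectors,
					sector_t *written_sector)
{
	if (zone->wp + nr_sectors > zone->start + zone->len)
		return -EIO;	/* append would cross the zone boundary */

	*written_sector = zone->wp;	/* position reported to upper layers */
	zone->wp += nr_sectors;		/* advance the emulated write pointer */
	return 0;
}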

Furthermore we have converted zonefs to use ZONE_APPEND for synchronous
direct I/O. Asynchronous I/O still uses the normal path via iomap.
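
For reference, a synchronous zone append direct write roughly looks like
the sketch below (hedged: illustrative function, error handling trimmed,
not the actual zonefs code; it assumes the written position is reported
back in bio->bi_iter.bi_sector on completion, as this series does for
REQ_OP_ZONE_APPEND):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/uio.h>

static int example_zone_append_sync_dio(struct block_device *bdev,
					sector_t zone_start_sector,
					struct iov_iter *from)
{
	struct bio *bio;
	int ret;

	bio = bio_alloc(GFP_NOFS, iov_iter_npages(from, BIO_MAX_PAGES));
	bio_set_dev(bio, bdev);
	/* Zone append is addressed by the zone start LBA, not the wp. */
	bio->bi_iter.bi_sector = zone_start_sector;
	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;

	ret = bio_iov_iter_get_pages(bio, from);
	if (!ret)
		ret = submit_bio_wait(bio);
	/*
	 * On successful completion bio->bi_iter.bi_sector holds the sector
	 * where the data actually landed; zonefs uses it to check and
	 * advance the file size.
	 */
	bio_put(bio);
	return ret;
}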

The series is based on v5.7-rc1, but it should be trivial to rebase onto
Jens' for-next branch once it re-opens.

As Christoph asked for a branch, I pushed it to a git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git zone-append.v7
https://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git/log/?h=zone-append.v7

Changes to v6:
- Added Daniel's Reviewed-by's
- Addressed Christoph's comment on whitespace changes in 4/11
- Renamed driver_cb in 6/11
- Fixed lines over 80 characters in 8/11
- Damien simplified sd_zbc_revalidate_zones() in 8/11

Changes to v5:
- Added patch to fix the memleak on failed scsi command setup
- Added prep patch from Christoph for bio_add_hw_page
- Added Christoph's suggestions for adding append pages to bios
- Fixed compile warning with !CONFIG_BLK_DEV_ZONED
- Damien re-worked revalidate zone
- Added Christoph's suggestions for rescanning write pointers to update cache

Changes to v4:
- Added page merging for zone-append bios (Christoph)
- Removed different locking schemes for zone management operations (Christoph)
- Changed wp_ofst assignment from blk_revalidate_zones (Christoph)
- Smaller nitpicks (Christoph)
- Documented my changes to Keith's patch so it's clear where I messed up and he
  doesn't get blamed
- Added Damien as a Co-developer to the sd emulation patch as he wrote as much
  code for it as I did (if not more)

Changes since v3:
- Remove impact of zone-append from bio_full() and bio_add_page()
  fast-path (Christoph)
- All of the zone write pointer offset caching is handled in SCSI now
  (Christoph) 
- Drop null_blk patches that Damien sent separately (Christoph)
- Use EXPORT_SYMBOL_GPL for new exports (Christoph)	

Changes since v2:
- Remove iomap implementation and directly issue zone-appends from within
  zonefs (Christoph)
- Drop already merged patch
- Rebase onto new for-next branch

Changes since v1:
- Too much to mention, treat as a completely new series.

Christoph Hellwig (1):
  block: rename __bio_add_pc_page to bio_add_hw_page

Damien Le Moal (2):
  block: Modify revalidate zones
  null_blk: Support REQ_OP_ZONE_APPEND

Johannes Thumshirn (7):
  scsi: free sgtables in case command setup fails
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce blk_req_zone_write_trylock
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: sd_zbc: emulate ZONE_APPEND commands
  block: export bio_release_pages and bio_iov_iter_get_pages
  zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    | 126 ++++++++---
 block/blk-core.c               |  52 +++++
 block/blk-map.c                |   5 +-
 block/blk-mq.c                 |  27 +++
 block/blk-settings.c           |  23 ++
 block/blk-sysfs.c              |  13 ++
 block/blk-zoned.c              |  23 +-
 block/blk.h                    |   4 +-
 drivers/block/null_blk_zoned.c |  39 +++-
 drivers/scsi/scsi_lib.c        |  17 +-
 drivers/scsi/sd.c              |  24 +-
 drivers/scsi/sd.h              |  43 +++-
 drivers/scsi/sd_zbc.c          | 397 +++++++++++++++++++++++++++++----
 fs/zonefs/super.c              |  80 ++++++-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  25 ++-
 16 files changed, 804 insertions(+), 108 deletions(-)

Comments

Theodore Ts'o April 17, 2020, 4:03 p.m. UTC | #1
On Fri, Apr 17, 2020 at 09:15:25PM +0900, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
> 
> When writing to a zoned block device using zone append, the start
> sector of the write is pointing at the start LBA of the zone to write to.
> Upon completion the block device will respond with the position the data
> has been placed in the zone. This from a high level perspective can be
> seen like a file system's block allocator, where the user writes to a
> file and the file-system takes care of the data placement on the device.

What sort of reordering can take place due to I/O schedulers and/or
racing write appends from different CPUs?  Is it purely
userspace's responsibility to avoid racing writes to a particular
zone?

						- Ted
Johannes Thumshirn April 17, 2020, 5:48 p.m. UTC | #2
On 17/04/2020 18:04, Theodore Y. Ts'o wrote:
> What sort of reordering can take place due to I/O schedulers and/or
> racing write appends from different CPUs?  Is it purely
> userspace's responsibility to avoid racing writes to a particular
> zone?

For normal writes (i.e. REQ_OP_WRITE) the zone will be locked by the
I/O scheduler (namely mq-deadline, as it's currently the only one with
support for zoned block devices). For REQ_OP_ZONE_APPEND we have a
trylock scheme in the SCSI emulation of ZONE_APPEND, which is fine, as
SCSI disks are single queue only.
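
Roughly, the dispatch side does something like the sketch below
(illustrative code only, not the exact sd_zbc.c implementation; it just
shows the trylock added in this series gating the emulated append):

#include <linux/blkdev.h>
#include <linux/blk_types.h>

static blk_status_t example_prep_zone_append(struct request *rq)
{
	/* Only one in-flight write per zone; otherwise retry dispatch later. */
	if (!blk_req_zone_write_trylock(rq))
		return BLK_STS_RESOURCE;

	/*
	 * The zone is now locked against other writes: look up the cached
	 * write pointer offset and turn the append into a WRITE(16).
	 */
	return BLK_STS_OK;
}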

A scenario that can of course happen is concurrent REQ_OP_ZONE_APPEND
writes from different CPUs to the same zone. This is fine *iff* the sum
of all writes stays within the free space of the zone. If one of the
writes crosses a zone boundary it'll get EIO, obviously.

For "userspace's responsibility", I'd re-phrase this as "a consumer's 
responsibility", as we don't have an interface which aims at user-space 
yet. The only consumer this series implements is zonefs, although we did 
have an AIO implementation for early testing and io_uring shouldn't be 
too hard to implement.
Theodore Ts'o April 18, 2020, 1 a.m. UTC | #3
On Fri, Apr 17, 2020 at 05:48:20PM +0000, Johannes Thumshirn wrote:
> For "userspace's responsibility", I'd re-phrase this as "a consumer's 
> responsibility", as we don't have an interface which aims at user-space 
> yet. The only consumer this series implements is zonefs, although we did 
> have an AIO implementation for early testing and io_uring shouldn't be 
> too hard to implement.

Ah, I had assumed that the userspace interface exposed would be opening
the block device with the O_APPEND flag.  (Which raises interesting
questions if the block device is also opened without O_APPEND and some
other thread is writing to the same zone, in which case the order in
which requests are processed would control whether the I/O would
fail.)

					- Ted
Johannes Thumshirn April 18, 2020, 8:57 a.m. UTC | #4
On 18/04/2020 03:01, Theodore Y. Ts'o wrote:
> Ah, I had assumed that the userspace interface exposed would be opening
> the block device with the O_APPEND flag.

We actually did do this in early testing, but wouldn't this break 
established block device semantics?
Bart Van Assche April 18, 2020, 3:56 p.m. UTC | #5
On 2020-04-17 05:15, Johannes Thumshirn wrote:
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone, everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
> pages. The performance impact is negligible for a spinning drive.

What will happen if e.g. syzkaller mixes write() system calls with SG_IO
writes? Can that cause a mismatch between the cached write pointer and
the write pointer maintained by the drive? If so, should SG_IO perhaps
be disallowed if the write pointer is cached?

Thanks,

Bart.
Damien Le Moal April 19, 2020, 10:51 p.m. UTC | #6
On 2020/04/18 10:01, Theodore Y. Ts'o wrote:
> On Fri, Apr 17, 2020 at 05:48:20PM +0000, Johannes Thumshirn wrote:
>> For "userspace's responsibility", I'd re-phrase this as "a consumer's 
>> responsibility", as we don't have an interface which aims at user-space 
>> yet. The only consumer this series implements is zonefs, although we did 
>> have an AIO implementation for early testing and io_uring shouldn't be 
>> too hard to implement.
> 
> Ah, I had assumed that the userspace interface exposed would be opening
> the block device with the O_APPEND flag.  (Which raises interesting
> questions if the block device is also opened without O_APPEND and some
> other thread was writing to the same zone, in which case the order in
> which requests are processed would control whether the I/O would
> fail.)

O_APPEND has no effect for raw block device files since the file size is always
0. While we did use this flag initially for quick tests of a user space
interface, it was a hack. Any proper implementation of a user space interface
will probably need a new RWF_ flag that can be passed to AIOs (io_submit() and
io_uring) and preadv2()/pwritev2() calls.

As for the case of one application doing regular writes and another doing zone
append writes to the same zone, you are correct, there will be errors. But not
for the zone append writes: they will all succeed since, by definition, they do
not need the current zone write pointer and always append at the zone's current
wp, wherever it is (as long as the zone is not full, that is). Most of the
regular writes will likely fail since, without synchronization between the
applications, the write pointer for the target zone would constantly change
under the issuer of the regular writes, even if that issuer uses report zones
before any write operation.

There is no automatic synchronization in the kernel for this and we do not
intend to add any: such a bad use case is similar to two non-synchronized
writers issuing regular writes to the same zone. This cannot work correctly
without mutual exclusion in the I/O issuing path, and that is the
responsibility of the user, be it an application process or an in-kernel
component.

As Johannes pointed out, once BIOs are submitted, the kernel does guarantee
ordered dispatching of writes per zone with zone write locking (mq-deadline).
Damien Le Moal April 20, 2020, 12:21 a.m. UTC | #7
On 2020/04/19 0:56, Bart Van Assche wrote:
> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>> In order to reduce memory consumption, the only cached item is the offset
>> of the write pointer from the start of the zone, everything else can be
>> calculated. On an example drive with 52156 zones, the additional memory
>> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
>> pages. The performance impact is negligible for a spinning drive.
> 
> What will happen if e.g. syzkaller mixes write() system calls with SG_IO
> writes? Can that cause a mismatch between the cached write pointer and
> the write pointer maintained by the drive? If so, should SG_IO perhaps
> be disallowed if the write pointer is cached?

Bart,

Yes, SG_IO write will change the WP on the device side, causing the driver WP
cache to go out of sync with the device. We actually use that for testing and
generating write errors.

But that is not a problem limited to this new write pointer caching scheme. Any
zoned drive user can hit this problem (dm-zoned or f2fs currently). More
generally speaking, SG_IO writes can corrupt data/metadata on any regular disk
without (for instance) the file system noticing until the corrupted
data/metadata is accessed. SG_IO to a disk is not disabled if the disk has a
mounted file system. Also, since the series adds unconditional write pointer
caching when CONFIG_BLK_DEV_ZONED is enabled, disallowing SG_IO whenever the
write pointer is cached would disable SG_IO permanently in this case, and
negatively impact a lot of userspace tools relying on it (FW updates, power
management, SMART, etc.).
Douglas Gilbert April 20, 2020, 1:06 a.m. UTC | #8
On 2020-04-19 8:21 p.m., Damien Le Moal wrote:
> On 2020/04/19 0:56, Bart Van Assche wrote:
>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>> In order to reduce memory consumption, the only cached item is the offset
>>> of the write pointer from the start of the zone, everything else can be
>>> calculated. On an example drive with 52156 zones, the additional memory
>>> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
>>> pages. The performance impact is negligible for a spinning drive.
>>
>> What will happen if e.g. syzkaller mixes write() system calls with SG_IO
>> writes? Can that cause a mismatch between the cached write pointer and
>> the write pointer maintained by the drive? If so, should SG_IO perhaps
>> be disallowed if the write pointer is cached?
> 
> Bart,
> 
> Yes, SG_IO write will change the WP on the device side, causing the driver WP
> cache to go out of sync with the device. We actually use that for testing and
> generating write errors.
> 
> But that is not a problem limited to this new write pointer caching scheme. Any
> zoned drive user can hit this problem (dm-zoned or f2fs currently). More
> generally speaking, SG_IO writes can corrupt data/metadata on any regular disk
> without (for instance) the file system noticing until the corrupted
> data/metadata is accessed. SG_IO to a disk is not disabled if the disk has a
> mounted file system. Also, since the series adds unconditional write pointer
> caching with CONFIG_BLK_DEV_ZONED enabled, that would disable SG_IO permanently
> in this case, and negatively impact a lot of userspace tools relying on it (FW
> update, power management, SMART, etc).

Putting my smartmontools maintainer hat on, I don't like the sound of that.
We have spent time in the past discussing black and white lists, but they
can't cope with vendor specific commands. Then there are commands that both
sd and smartmontools might have good cause to use; MODE SELECT comes to
mind. Changing the setting of the URSWRZ_M bit in the Zoned Block Device
Control page, for example, could upset the apple cart.

Doug Gilbert