
[00/11] Introduce Zone Append for writing to zoned block devices

Message ID 20200310094653.33257-1-johannes.thumshirn@wdc.com (mailing list archive)

Message

Johannes Thumshirn March 10, 2020, 9:46 a.m. UTC
The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start
sector of the write points to the start LBA of the zone to write to.
Upon completion the block device responds with the position at which the
data has been placed in the zone. From a high-level perspective this is
similar to a file system's block allocator, where the user writes to a
file and the file system takes care of the data placement on the device.

In order to fully exploit the new zone append command in file systems and
other interfaces above the block layer, we chose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file systems and other interfaces above the block layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by caching each zone's write pointer, so that a zone
append issued to the disk can be translated into a regular write starting
at the write pointer. The zone start LBA of the zone append request is
used to look up the zone in the write pointer offset cache, and the
cached offset is added to that LBA to obtain the actual position to
write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND request into a
WRITE(16) command. Upon successful completion of the WRITE(16), the cache
is updated with the new write pointer location and the written sector is
recorded in the request. On error the cache entry is marked invalid and
an update of the write pointer is scheduled before the next write to
that zone is issued.
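
To illustrate the idea, here is a minimal user-space sketch of the caching
scheme described above. All identifiers (zone_append_to_write, wp_ofs,
WP_OFS_INVALID, the zone geometry) are invented for the example and do not
reflect the actual names used in the patches; locking and cache
revalidation are only hinted at in comments.

	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Illustrative model: one 32-bit write pointer offset (in sectors,
	 * relative to the zone start) cached per zone.
	 */
	#define ZONE_SECTORS	(256 * 2048ULL)	/* example: 256 MiB zones, 512 B sectors */
	#define NR_ZONES	52156
	#define WP_OFS_INVALID	UINT32_MAX	/* entry needs revalidation */

	static uint32_t wp_ofs[NR_ZONES];	/* 52156 * 4 bytes = 208624 bytes */

	/* Translate a zone append aimed at the zone start into an absolute write LBA. */
	static int zone_append_to_write(uint64_t zone_start_lba, uint32_t nr_sectors,
					uint64_t *write_lba)
	{
		uint32_t zno = zone_start_lba / ZONE_SECTORS;

		if (wp_ofs[zno] == WP_OFS_INVALID)
			return -1;	/* real code would refresh the cache via a zone report */
		if ((uint64_t)wp_ofs[zno] + nr_sectors > ZONE_SECTORS)
			return -1;	/* would cross the zone boundary */

		*write_lba = zone_start_lba + wp_ofs[zno];
		return 0;
	}

	/*
	 * On successful completion, advance the cached write pointer and report
	 * the sector the data landed on; on error, invalidate the entry.
	 */
	static void zone_append_complete(uint64_t zone_start_lba, uint32_t nr_sectors,
					 int error, uint64_t *written_lba)
	{
		uint32_t zno = zone_start_lba / ZONE_SECTORS;

		if (error) {
			wp_ofs[zno] = WP_OFS_INVALID;
			return;
		}
		*written_lba = zone_start_lba + wp_ofs[zno];
		wp_ofs[zno] += nr_sectors;
	}

	int main(void)
	{
		uint64_t lba, written = 0;

		/* Append 8 sectors to the zone starting at LBA 0. */
		if (!zone_append_to_write(0, 8, &lba)) {
			/* submit WRITE(16) at 'lba' ...; assume it succeeded */
			zone_append_complete(0, 8, 0, &written);
			printf("data written at LBA %llu\n", (unsigned long long)written);
		}
		return 0;
	}

Note that the sketch ignores concurrency: in the real implementation,
writes to a zone have to be serialized (see the
blk_req_zone_write_trylock patch in this series) so that the cached
offset stays consistent between translation and completion.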

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone; everything else can be
calculated from it. On an example drive with 52156 zones, the additional
memory consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51
4 KiB pages. The performance impact is negligible for a spinning drive.

For null_blk the emulation is way simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Testing has been conducted by translating RWF_APPEND DIOs into
REQ_OP_ZONE_APPEND commands in the block device's direct I/O function, and
by injecting errors through bypassing the block layer interface and
writing directly to the disk via the SCSI generic interface.
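
For reference, a test program along these lines could look like the sketch
below: an O_DIRECT write issued with RWF_APPEND via pwritev2(), which the
test-only translation described above turns into a REQ_OP_ZONE_APPEND (on a
stock kernel it remains a plain write). The device path, the 4 KiB I/O
size, and the offset 0 (standing in for a zone start) are placeholders,
and a reasonably recent glibc/kernel is assumed for pwritev2() and
RWF_APPEND.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		const char *dev = argc > 1 ? argv[1] : "/dev/nullb0";
		struct iovec iov;
		ssize_t ret;
		void *buf;
		int fd;

		fd = open(dev, O_WRONLY | O_DIRECT);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		if (posix_memalign(&buf, 4096, 4096)) {
			close(fd);
			return 1;
		}
		memset(buf, 0xaa, 4096);

		iov.iov_base = buf;
		iov.iov_len = 4096;

		/*
		 * RWF_APPEND direct I/O: with the test translation applied this
		 * is issued as a zone append against the zone at offset 0.
		 */
		ret = pwritev2(fd, &iov, 1, 0, RWF_APPEND);
		if (ret < 0)
			perror("pwritev2");
		else
			printf("wrote %zd bytes\n", ret);

		free(buf);
		close(fd);
		return 0;
	}

Injecting a write via the SCSI generic interface between such appends
moves the on-disk write pointer behind the cache's back, which is what
exercises the cache invalidation and revalidation path.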

The whole series is relative to Jens' block-5.6 branch 14afc5936197
("block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()").

Damien Le Moal (2):
  null_blk: Support REQ_OP_ZONE_APPEND
  block: Introduce zone write pointer offset caching

Johannes Thumshirn (8):
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce bio_add_append_page
  block: introduce BLK_STS_ZONE_RESOURCE
  block: introduce blk_req_zone_write_trylock
  block: factor out requeue handling from dispatch code
  block: delay un-dispatchable request
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: sd_zbc: emulate ZONE_APPEND commands

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    |  41 +++-
 block/blk-core.c               |  49 +++++
 block/blk-map.c                |   2 +-
 block/blk-mq.c                 |  54 +++++-
 block/blk-settings.c           |  16 ++
 block/blk-sysfs.c              |  15 +-
 block/blk-zoned.c              |  83 +++++++-
 block/blk.h                    |   4 +-
 drivers/block/null_blk_main.c  |   9 +-
 drivers/block/null_blk_zoned.c |  21 +-
 drivers/scsi/scsi_lib.c        |   1 +
 drivers/scsi/sd.c              |  28 ++-
 drivers/scsi/sd.h              |  35 +++-
 drivers/scsi/sd_zbc.c          | 344 +++++++++++++++++++++++++++++++--
 include/linux/bio.h            |   3 +-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  42 +++-
 17 files changed, 701 insertions(+), 60 deletions(-)

Comments

Christoph Hellwig March 10, 2020, 4:42 p.m. UTC | #1
On Tue, Mar 10, 2020 at 06:46:42PM +0900, Johannes Thumshirn wrote:
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
> 
> Testing has been conducted by translating RWF_APPEND DIOs into
> REQ_OP_ZONE_APPEND commands in the block device's direct I/O function and
> injecting errors by bypassing the block layer interface and directly
> writing to the disc via the SCSI generic interface.

We really need a user of this to be useful upstream.  Didn't you plan
to look into converting zonefs/iomap to use it?  Without that it is
at best a RFC.  Even better would be converting zonefs and the f2fs
zoned code so that can get rid of the old per-zone serialization in
the I/O scheduler entirely.
Damien Le Moal March 11, 2020, 12:37 a.m. UTC | #2
On 2020/03/11 1:42, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 06:46:42PM +0900, Johannes Thumshirn wrote:
>> For null_blk the emulation is way simpler, as null_blk's zoned block
>> device emulation support already caches the write pointer position, so we
>> only need to report the position back to the upper layers. Additional
>> caching is not needed here.
>>
>> Testing has been conducted by translating RWF_APPEND DIOs into
>> REQ_OP_ZONE_APPEND commands in the block device's direct I/O function and
>> injecting errors by bypassing the block layer interface and directly
>> writing to the disc via the SCSI generic interface.
> 
> We really need a user of this to be useful upstream.  Didn't you plan
> to look into converting zonefs/iomap to use it?  Without that it is
> at best a RFC.  Even better would be converting zonefs and the f2fs
> zoned code so that can get rid of the old per-zone serialization in
> the I/O scheduler entirely.

I do not think we can get rid of it entirely as it is needed for applications
using regular writes on raw zoned block devices. But the zone write locking will
be completely bypassed for zone append writes issued by file systems.
Christoph Hellwig March 11, 2020, 6:24 a.m. UTC | #3
On Wed, Mar 11, 2020 at 12:37:33AM +0000, Damien Le Moal wrote:
> I do not think we can get rid of it entirely as it is needed for applications
> using regular writes on raw zoned block devices. But the zone write locking will
> be completely bypassed for zone append writes issued by file systems.

But applications that are aware of zones should not be sending multiple
write commands to a zone anyway.  We certainly can't use zone write
locking for nvme if we want to be able to use multiple queues.
Damien Le Moal March 11, 2020, 6:40 a.m. UTC | #4
On 2020/03/11 15:25, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 12:37:33AM +0000, Damien Le Moal wrote:
>> I do not think we can get rid of it entirely as it is needed for applications
>> using regular writes on raw zoned block devices. But the zone write locking will
>> be completely bypassed for zone append writes issued by file systems.
> 
> But applications that are aware of zones should not be sending multiple
> write commands to a zone anyway.  We certainly can't use zone write
> locking for nvme if we want to be able to use multiple queues.
> 

True, and that is the main use case I am seeing in the field.

However, even for this to work properly, we will also need a special
bio_add_page() function for regular writes to zones, similar to the one for
zone append, to ensure that a large BIO does not become multiple requests,
won't we? Otherwise, submitting a write bio will generate multiple requests
that may get reordered on dispatch and on requeue (on SAS or on SATA).

Furthermore, we already have AIO support. Customers in the field use it with
the fio libaio engine to test drives and for application development. So I am
afraid that removing the zone write locking now would break user space, no?

For nvme, we want to allow the "none" elevator as the default rather than
mq-deadline which is now the default for all zoned block devices. This is a very
simple change to the default elevator selection we can add based on the nonrot
queue flag.
Johannes Thumshirn March 11, 2020, 7:22 a.m. UTC | #5
On 10/03/2020 17:42, Christoph Hellwig wrote:
[...]
> We really need a user of this to be useful upstream.  Didn't you plan
> to look into converting zonefs/iomap to use it?  Without that it is
> at best a RFC.  Even better would be converting zonefs and the f2fs
> zoned code so that can get rid of the old per-zone serialization in
> the I/O scheduler entirely.

Yes, I'm currently working on iomap/zonefs support for zone append.