Message ID | 20200417121536.5393-1-johannes.thumshirn@wdc.com (mailing list archive) |
---|---|
Series | Introduce Zone Append for writing to zoned block devices |
On Fri, Apr 17, 2020 at 09:15:25PM +0900, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
>
> When writing to a zoned block device using zone append, the start
> sector of the write is pointing at the start LBA of the zone to write to.
> Upon completion the block device will respond with the position the data
> has been placed in the zone. This from a high level perspective can be
> seen like a file system's block allocator, where the user writes to a
> file and the file-system takes care of the data placement on the device.

What sort of reordering can take place due to I/O schedulers and/or
racing write appends from different CPUs? Is it purely the userspace's
responsibility to avoid racing writes to a particular zone?

	- Ted
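For readers unfamiliar with the mechanism under discussion, here is a rough sketch (not code from this series, and assuming the block layer API of roughly that era) of how an in-kernel consumer might issue a zone append write and learn from the completion where the data landed. REQ_OP_ZONE_APPEND and the completion semantics follow the cover letter; the helper names and flags around them are illustrative only.

```c
/*
 * Rough sketch only (not code from this series): issue a zone append
 * write and learn from the completion where the device placed the data.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

static void zone_append_end_io(struct bio *bio)
{
	/*
	 * On completion of a zone append bio, bi_iter.bi_sector holds the
	 * start sector the device chose inside the zone.
	 */
	pr_info("zone append landed at sector %llu\n",
		(unsigned long long)bio->bi_iter.bi_sector);
	bio_put(bio);
}

static void submit_zone_append(struct block_device *bdev, struct page *page,
			       unsigned int len, sector_t zone_start_sector)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);

	bio_set_dev(bio, bdev);
	/* The submitted sector points at the zone start, not at the write pointer. */
	bio->bi_iter.bi_sector = zone_start_sector;
	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
	/*
	 * A real submitter must also respect the device's maximum zone
	 * append size; that bookkeeping is omitted here.
	 */
	bio_add_page(bio, page, len, 0);
	bio->bi_end_io = zone_append_end_io;
	submit_bio(bio);
}
```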
On 17/04/2020 18:04, Theodore Y. Ts'o wrote:
> What sort of reordering can take place due to I/O schedulers and/or
> racing write appends from different CPUs? Is it purely the
> userspace's responsibility to avoid racing writes to a particular
> zone?

For normal writes (i.e. REQ_OP_WRITE) the zone will be locked by the I/O
scheduler (namely mq-deadline, as it's currently the only one with
support for zoned block devices).

For REQ_OP_ZONE_APPEND we have a trylock scheme in the SCSI emulation of
ZONE_APPEND, which is fine, as SCSI disks are single queue only.

A scenario that can of course happen is concurrent REQ_OP_ZONE_APPEND
writes from different CPUs to the same zone. This is fine *iff* the sum
of all writes stays within the free space of the zone. If one of the
writes crosses a zone boundary it'll get EIO, obviously.

For "userspace's responsibility", I'd re-phrase this as "a consumer's
responsibility", as we don't have an interface which aims at user-space
yet. The only consumer this series implements is zonefs, although we did
have an AIO implementation for early testing, and io_uring shouldn't be
too hard to implement.
On Fri, Apr 17, 2020 at 05:48:20PM +0000, Johannes Thumshirn wrote:
> For "userspace's responsibility", I'd re-phrase this as "a consumer's
> responsibility", as we don't have an interface which aims at user-space
> yet. The only consumer this series implements is zonefs, although we did
> have an AIO implementation for early testing, and io_uring shouldn't be
> too hard to implement.

Ah, I had assumed that the userspace interface exposed would be opening
the block device with the O_APPEND flag. (Which raises interesting
questions if the block device is also opened without O_APPEND and some
other thread was writing to the same zone, in which case the order in
which requests are processed would control whether the I/O would fail.)

	- Ted
On 18/04/2020 03:01, Theodore Y. Ts'o wrote:
> Ah, I had assumed that the userspace interface exposed would be opening
> the block device with the O_APPEND flag.

We actually did do this in early testing, but wouldn't this break
established block device semantics?
On 2020-04-17 05:15, Johannes Thumshirn wrote:
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone, everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
> pages. The performance impact is negligible for a spinning drive.

What will happen if e.g. syzkaller mixes write() system calls with SG_IO
writes? Can that cause a mismatch between the cached write pointer and
the write pointer maintained by the drive? If so, should SG_IO perhaps
be disallowed if the write pointer is cached?

Thanks,

Bart.
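For context, a minimal sketch of the caching scheme the quoted text describes: one 32-bit write pointer offset per zone, with the absolute write pointer recomputed on demand. The structure and function names are made up for illustration and are not the series' actual code.

```c
/*
 * Illustrative sketch of the described caching scheme: one 32-bit write
 * pointer offset per zone, everything else recomputed on demand.
 */
#include <linux/blkdev.h>
#include <linux/types.h>

struct wp_cache {
	unsigned int	nr_zones;	/* e.g. 52156 on the example drive */
	sector_t	zone_sectors;	/* all zones are the same size */
	u32		*wp_ofst;	/* nr_zones * 4 bytes = 208624 bytes here */
};

/* Absolute write pointer = zone start + cached offset. */
static inline sector_t wp_cache_wp_sector(struct wp_cache *c, unsigned int zno)
{
	return (sector_t)zno * c->zone_sectors + c->wp_ofst[zno];
}
```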
On 2020/04/18 10:01, Theodore Y. Ts'o wrote:
> On Fri, Apr 17, 2020 at 05:48:20PM +0000, Johannes Thumshirn wrote:
>> For "userspace's responsibility", I'd re-phrase this as "a consumer's
>> responsibility", as we don't have an interface which aims at user-space
>> yet. The only consumer this series implements is zonefs, although we did
>> have an AIO implementation for early testing, and io_uring shouldn't be
>> too hard to implement.
>
> Ah, I had assumed that the userspace interface exposed would be opening
> the block device with the O_APPEND flag. (Which raises interesting
> questions if the block device is also opened without O_APPEND and some
> other thread was writing to the same zone, in which case the order in
> which requests are processed would control whether the I/O would fail.)

O_APPEND has no effect for raw block device files since the file size is
always 0. While we did use this flag initially for quick tests of a user
space interface, it was a hack. Any proper implementation of a user space
interface will probably need a new RWF_ flag that can be passed to AIOs
(io_submit() and io_uring) and preadv2()/pwritev2() calls.

As for the case of one application doing regular writes and another doing
zone append writes to the same zone, you are correct, there will be
errors. But not for the zone append writes: they will all succeed since,
by definition, these do not need the current zone write pointer and
always append at the zone's current wp, wherever it is (as long as the
zone is not full, that is). Most of the regular writes will likely fail
since, without synchronization between the applications, the write
pointer for the target zone would constantly change under the issuer of
the regular writes, even if that issuer uses report zones before any
write operation.

There is no automatic synchronization in the kernel for this and we do
not intend to add any: such a bad use case is similar to two
non-synchronized writers issuing regular writes to the same zone. This
cannot work correctly without mutual exclusion in the I/O issuing path,
and that is the responsibility of the user, be it an application process
or an in-kernel component. As Johannes pointed out, once BIOs are
submitted, the kernel does guarantee ordered dispatching of writes per
zone with zone write locking (mq-deadline).
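To make the shape of such an interface concrete, here is a purely hypothetical userspace sketch. RWF_ZONE_APPEND does not exist; it only stands in for whatever per-I/O flag would eventually be defined, so on a current kernel this pwritev2() call would simply fail with EOPNOTSUPP. The device path is also just an example (a zoned null_blk instance).

```c
/*
 * Purely hypothetical sketch: RWF_ZONE_APPEND is a placeholder for a
 * not-yet-existing per-I/O flag; current kernels reject it with
 * EOPNOTSUPP. Error handling is trimmed for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define RWF_ZONE_APPEND 0x00000040	/* placeholder value, not a real flag */

int main(void)
{
	static char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	off_t zone_start = 0;	/* byte offset of the target zone's start */
	ssize_t ret;
	int fd;

	fd = open("/dev/nullb0", O_WRONLY);
	if (fd < 0)
		return 1;

	memset(buf, 'A', sizeof(buf));

	/*
	 * The caller addresses the zone; the device picks the final position.
	 * A real interface would also have to report that position back,
	 * e.g. via the aio or io_uring completion.
	 */
	ret = pwritev2(fd, &iov, 1, zone_start, RWF_ZONE_APPEND);
	printf("pwritev2 returned %zd\n", ret);

	close(fd);
	return 0;
}
```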
On 2020/04/19 0:56, Bart Van Assche wrote:
> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>> In order to reduce memory consumption, the only cached item is the offset
>> of the write pointer from the start of the zone, everything else can be
>> calculated. On an example drive with 52156 zones, the additional memory
>> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
>> pages. The performance impact is negligible for a spinning drive.
>
> What will happen if e.g. syzkaller mixes write() system calls with SG_IO
> writes? Can that cause a mismatch between the cached write pointer and
> the write pointer maintained by the drive? If so, should SG_IO perhaps
> be disallowed if the write pointer is cached?

Bart,

Yes, an SG_IO write will change the WP on the device side, causing the
driver WP cache to go out of sync with the device. We actually use that
for testing and generating write errors.

But that is not a problem limited to this new write pointer caching
scheme. Any zoned drive user can hit this problem (dm-zoned or f2fs
currently). More generally speaking, SG_IO writes can corrupt
data/metadata on any regular disk without (for instance) the file system
noticing until the corrupted data/metadata is accessed. SG_IO to a disk
is not disabled if the disk has a mounted file system. Also, since the
series adds unconditional write pointer caching with CONFIG_BLK_DEV_ZONED
enabled, disallowing SG_IO when the write pointer is cached would disable
it permanently in this case, and negatively impact a lot of userspace
tools relying on it (FW update, power management, SMART, etc).
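As an aside, the device-side write pointer that the cached value can drift away from is visible from userspace through the existing BLKREPORTZONE ioctl. A minimal sketch, with the device path as a placeholder, that could be used to spot such a mismatch after an out-of-band SG_IO write:

```c
/*
 * Minimal sketch: read the device-side write pointer of the first zone
 * via BLKREPORTZONE. The device path is a placeholder.
 */
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct blk_zone_report *rep;
	int fd, ret = 1;

	fd = open("/dev/sdb", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Room for the report header plus a single zone descriptor. */
	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	if (!rep)
		goto out_close;

	rep->sector = 0;	/* start reporting from the first zone */
	rep->nr_zones = 1;

	if (!ioctl(fd, BLKREPORTZONE, rep) && rep->nr_zones) {
		printf("zone 0: start %llu, wp %llu (sectors)\n",
		       (unsigned long long)rep->zones[0].start,
		       (unsigned long long)rep->zones[0].wp);
		ret = 0;
	}

	free(rep);
out_close:
	close(fd);
	return ret;
}
```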
On 2020-04-19 8:21 p.m., Damien Le Moal wrote:
> On 2020/04/19 0:56, Bart Van Assche wrote:
>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>> In order to reduce memory consumption, the only cached item is the offset
>>> of the write pointer from the start of the zone, everything else can be
>>> calculated. On an example drive with 52156 zones, the additional memory
>>> consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
>>> pages. The performance impact is negligible for a spinning drive.
>>
>> What will happen if e.g. syzkaller mixes write() system calls with SG_IO
>> writes? Can that cause a mismatch between the cached write pointer and
>> the write pointer maintained by the drive? If so, should SG_IO perhaps
>> be disallowed if the write pointer is cached?
>
> Bart,
>
> Yes, an SG_IO write will change the WP on the device side, causing the
> driver WP cache to go out of sync with the device. We actually use that
> for testing and generating write errors.
>
> But that is not a problem limited to this new write pointer caching
> scheme. Any zoned drive user can hit this problem (dm-zoned or f2fs
> currently). More generally speaking, SG_IO writes can corrupt
> data/metadata on any regular disk without (for instance) the file system
> noticing until the corrupted data/metadata is accessed. SG_IO to a disk
> is not disabled if the disk has a mounted file system. Also, since the
> series adds unconditional write pointer caching with
> CONFIG_BLK_DEV_ZONED enabled, disallowing SG_IO when the write pointer
> is cached would disable it permanently in this case, and negatively
> impact a lot of userspace tools relying on it (FW update, power
> management, SMART, etc).

Putting my smartmontools maintainer hat on, I don't like the sound of
that. We have spent time in the past discussing black and white lists,
but they can't cope with vendor-specific commands. Then there are
commands that both sd and smartmontools might have good cause to use;
MODE SELECT comes to mind. Changing the setting of the URSWRZ_M bit in
the Zoned Block Device Control page, for example, could upset the apple
cart.

Doug Gilbert