[v3,0/4] zone-append support in io-uring and aio

Message ID 1593974870-18919-1-git-send-email-joshi.k@samsung.com (mailing list archive)

Kanchan Joshi July 5, 2020, 6:47 p.m. UTC
Changes since v2:
- Use file append infra (O_APPEND/RWF_APPEND) to trigger zone-append
(Christoph, Wilcox)
- Added Block I/O path changes (Damien). Avoided append split into multi-bio.
- Added patch to extend zone-append in block-layer to support bvec iov_iter.
Append using io-uring fixed-buffer is enabled with this.
- Made io-uring support code more concise, added changes mentioned by Pavel.

v2: https://lore.kernel.org/io-uring/1593105349-19270-1-git-send-email-joshi.k@samsung.com/

Changes since v1:
- No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
- linux-aio changes vanish because of no new opcode
- Fixed the overflow and other issues mentioned by Damien
- Simplified uring support code, fixed the issues mentioned by Pavel
- Added error checks for io-uring fixed-buffer and sync kiocb

v1: https://lore.kernel.org/io-uring/1592414619-5646-1-git-send-email-joshi.k@samsung.com/

Cover letter (updated):

This patchset enables zone-append through io-uring/linux-aio on the block I/O
path. The purpose is to let applications that use a zoned block device directly
consume zone-append.
An application can send a write with the existing O_APPEND/RWF_APPEND; on a
zoned block device this triggers zone-append. On a regular block device,
existing behavior is retained. However, the infrastructure allows zone-append
to be triggered on any file if FMODE_ZONE_APPEND (a new kernel-only fmode) is
set during open.

With zone-append, the written location within the zone is known only after
completion. So apart from the usual return value of write, an additional means
is needed to obtain the actual written location.

In aio, this is returned to the application in the res2 field of io_event:

struct io_event {
        __u64           data;           /* the data field from the iocb */
        __u64           obj;            /* what iocb this event came from */
        __s64           res;            /* result code for this event */
        __s64           res2;           /* secondary result */
};

In io-uring, cqe->flags is repurposed for the zone-append result:

struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
};

The 32-bit flags field is not sufficient to hold a byte offset covering the
zone size represented by chunk_sectors.
Discussions on LKML led to the following ways to go about it:
Option 1: Return zone-relative offset in sector/512b unit
Option 2: Return zone-relative offset in bytes
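
To make the size constraint concrete, here is a minimal user-space sketch of the
arithmetic. It assumes the zone size is a 32-bit count of 512-byte sectors, as
the mention of chunk_sectors suggests; the helper names and ZA_SECTOR_SHIFT are
hypothetical and not part of the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, matching the "sector/512b unit" above. */
#define ZA_SECTOR_SHIFT 9

/* Hypothetical helper: size in bytes of a zone that spans zone_sectors sectors. */
static inline uint64_t za_zone_bytes(uint32_t zone_sectors)
{
        /* Widen before shifting so the result is not truncated to 32 bits. */
        return (uint64_t)zone_sectors << ZA_SECTOR_SHIFT;
}

/* Hypothetical helper: true if any byte offset up to the zone size fits in a __u32. */
static inline bool za_byte_offset_fits_u32(uint32_t zone_sectors)
{
        return za_zone_bytes(zone_sectors) <= UINT32_MAX;
}
```

Under these assumptions a zone of 8388608 sectors (4 GiB) already exceeds what
a 32-bit byte offset can describe, while the same 32 bits counted in sectors
cover any zone size that fits in chunk_sectors.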

With option #1, io-uring changes remain minimal and relatively clean, and extra
checks and conversions are avoided in the I/O path. A change to the ki_complete
interface is also avoided (the last parameter, ret2, is of type long, which
cannot store the return value in bytes). The downside of this choice is that
the return value is in 512b units and not in bytes. To hide that, a wrapper
needs to be written in user-space that converts the cqe->flags value to bytes
and combines it with the zone start.
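
The user-space wrapper described for option #1 could look roughly like the
sketch below. It is only an illustration: za_written_offset() and
ZA_SECTOR_SHIFT are hypothetical names, and it assumes the caller already knows
the zone start in bytes and has checked cqe->res for success.

```c
#include <stdint.h>

/* Assumption: cqe->flags carries the zone-relative offset in 512-byte sectors. */
#define ZA_SECTOR_SHIFT 9

/*
 * Hypothetical wrapper: convert the zone-relative sector offset reported in
 * cqe->flags to bytes and add the zone start to get the absolute written
 * location in bytes.
 */
static inline uint64_t za_written_offset(uint64_t zone_start_bytes,
                                         uint32_t cqe_flags)
{
        /* Widen to 64 bits before shifting to avoid losing the high bits. */
        return zone_start_bytes + ((uint64_t)cqe_flags << ZA_SECTOR_SHIFT);
}
```

An application would call this once per completion, e.g.
za_written_offset(zone_start, cqe->flags), after confirming that cqe->res
reports a successful write.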

Option #2 requires pulling some bits from cqe->res and combining those with
cqe->flags to store the result in bytes. This bitwise scattering needs to be
done by the kernel in the I/O path, and the application still needs a
relatively heavyweight wrapper to assemble the pieces so that both cqe->res and
the append location are derived correctly.

This patchset picks option #1.

Kanchan Joshi (2):
  fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND
  block: enable zone-append for iov_iter of bvec type

Selvakumar S (2):
  block: add zone append handling for direct I/O path
  io_uring: add support for zone-append

 block/bio.c        | 31 ++++++++++++++++++++++++++++---
 fs/block_dev.c     | 49 ++++++++++++++++++++++++++++++++++++++++---------
 fs/io_uring.c      | 21 +++++++++++++++++++--
 include/linux/fs.h | 14 ++++++++++++--
 4 files changed, 99 insertions(+), 16 deletions(-)