[1/4] io_uring: add IORING_OP_READ{WRITE}V_PI cmd
diff mbox series

Message ID 20200226083719.4389-2-bob.liu@oracle.com
State New
Headers show
Series
  • userspace PI passthrough via io_uring
Related show

Commit Message

Bob Liu Feb. 26, 2020, 8:37 a.m. UTC
Add read and write with protect information command.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 fs/io_uring.c                 | 12 ++++++++++++
 include/linux/fs.h            |  1 +
 include/uapi/linux/io_uring.h |  2 ++
 3 files changed, 15 insertions(+)

Comments

Jens Axboe Feb. 26, 2020, 2:24 p.m. UTC | #1
On 2/26/20 1:37 AM, Bob Liu wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index a3300e1..98fa3f1 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -62,6 +62,8 @@ enum {
>  	IORING_OP_NOP,
>  	IORING_OP_READV,
>  	IORING_OP_WRITEV,
> +	IORING_OP_READV_PI,
> +	IORING_OP_WRITEV_PI,
>  	IORING_OP_FSYNC,
>  	IORING_OP_READ_FIXED,
>  	IORING_OP_WRITE_FIXED,

So this one renumbers everything past IORING_OP_WRITEV, breaking the
ABI in a very bad way. I'm guessing that was entirely unintentional?
Any new command must go at the end of the list.

You're also not updating io_op_defs[] with the two new commands,
which means it won't compile at all. I'm guessing you tested this on
an older version of the kernel which didn't have io_op_defs[]?
Christoph Hellwig Feb. 26, 2020, 3:57 p.m. UTC | #2
On Wed, Feb 26, 2020 at 07:24:06AM -0700, Jens Axboe wrote:
> On 2/26/20 1:37 AM, Bob Liu wrote:
> > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > index a3300e1..98fa3f1 100644
> > --- a/include/uapi/linux/io_uring.h
> > +++ b/include/uapi/linux/io_uring.h
> > @@ -62,6 +62,8 @@ enum {
> >  	IORING_OP_NOP,
> >  	IORING_OP_READV,
> >  	IORING_OP_WRITEV,
> > +	IORING_OP_READV_PI,
> > +	IORING_OP_WRITEV_PI,
> >  	IORING_OP_FSYNC,
> >  	IORING_OP_READ_FIXED,
> >  	IORING_OP_WRITE_FIXED,
> 
> So this one renumbers everything past IORING_OP_WRITEV, breaking the
> ABI in a very bad way. I'm guessing that was entirely unintentional?
> Any new command must go at the end of the list.
> 
> You're also not updating io_op_defs[] with the two new commands,
> which means it won't compile at all. I'm guessing you tested this on
> an older version of the kernel which didn't have io_op_defs[]?

And the real question is why we need ops insted of just a flag and
using previously reserved fields for the PI pointer.
Jens Axboe Feb. 26, 2020, 3:58 p.m. UTC | #3
On 2/26/20 8:57 AM, Christoph Hellwig wrote:
> On Wed, Feb 26, 2020 at 07:24:06AM -0700, Jens Axboe wrote:
>> On 2/26/20 1:37 AM, Bob Liu wrote:
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index a3300e1..98fa3f1 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -62,6 +62,8 @@ enum {
>>>  	IORING_OP_NOP,
>>>  	IORING_OP_READV,
>>>  	IORING_OP_WRITEV,
>>> +	IORING_OP_READV_PI,
>>> +	IORING_OP_WRITEV_PI,
>>>  	IORING_OP_FSYNC,
>>>  	IORING_OP_READ_FIXED,
>>>  	IORING_OP_WRITE_FIXED,
>>
>> So this one renumbers everything past IORING_OP_WRITEV, breaking the
>> ABI in a very bad way. I'm guessing that was entirely unintentional?
>> Any new command must go at the end of the list.
>>
>> You're also not updating io_op_defs[] with the two new commands,
>> which means it won't compile at all. I'm guessing you tested this on
>> an older version of the kernel which didn't have io_op_defs[]?
> 
> And the real question is why we need ops insted of just a flag and
> using previously reserved fields for the PI pointer.

Yeah, should probably be a RWF_ flag instead, and a 64-bit SQE field
for the PI data. The 'last iovec is PI' is kind of icky.
Darrick J. Wong Feb. 26, 2020, 4:03 p.m. UTC | #4
On Wed, Feb 26, 2020 at 08:58:46AM -0700, Jens Axboe wrote:
> On 2/26/20 8:57 AM, Christoph Hellwig wrote:
> > On Wed, Feb 26, 2020 at 07:24:06AM -0700, Jens Axboe wrote:
> >> On 2/26/20 1:37 AM, Bob Liu wrote:
> >>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> >>> index a3300e1..98fa3f1 100644
> >>> --- a/include/uapi/linux/io_uring.h
> >>> +++ b/include/uapi/linux/io_uring.h
> >>> @@ -62,6 +62,8 @@ enum {
> >>>  	IORING_OP_NOP,
> >>>  	IORING_OP_READV,
> >>>  	IORING_OP_WRITEV,
> >>> +	IORING_OP_READV_PI,
> >>> +	IORING_OP_WRITEV_PI,
> >>>  	IORING_OP_FSYNC,
> >>>  	IORING_OP_READ_FIXED,
> >>>  	IORING_OP_WRITE_FIXED,
> >>
> >> So this one renumbers everything past IORING_OP_WRITEV, breaking the
> >> ABI in a very bad way. I'm guessing that was entirely unintentional?
> >> Any new command must go at the end of the list.
> >>
> >> You're also not updating io_op_defs[] with the two new commands,
> >> which means it won't compile at all. I'm guessing you tested this on
> >> an older version of the kernel which didn't have io_op_defs[]?
> > 
> > And the real question is why we need ops insted of just a flag and
> > using previously reserved fields for the PI pointer.
> 
> Yeah, should probably be a RWF_ flag instead, and a 64-bit SQE field
> for the PI data. The 'last iovec is PI' is kind of icky.

Heh, I was about to send in nearly the same comment. :)

--D

> -- 
> Jens Axboe
>
Christoph Hellwig Feb. 26, 2020, 4:53 p.m. UTC | #5
On Wed, Feb 26, 2020 at 08:58:46AM -0700, Jens Axboe wrote:
> Yeah, should probably be a RWF_ flag instead, and a 64-bit SQE field
> for the PI data. The 'last iovec is PI' is kind of icky.

Abusing an iovec (although I though of the first once when looking
into it) looks really horrible, but has two huge advantages:

 - it doesn't require passing another argument all the way down
   the I/O stack
 - it works with all the vectored interfaces that take a flag
   argument, so not just io_uring, but also preadv2/pwritev2 and aio.
   And while I don't care too much about the last I think preadv2
   and pwritev2 are valuable to support.
Bob Liu Feb. 27, 2020, 9:05 a.m. UTC | #6
On 2/26/20 10:24 PM, Jens Axboe wrote:
> On 2/26/20 1:37 AM, Bob Liu wrote:
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index a3300e1..98fa3f1 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -62,6 +62,8 @@ enum {
>>  	IORING_OP_NOP,
>>  	IORING_OP_READV,
>>  	IORING_OP_WRITEV,
>> +	IORING_OP_READV_PI,
>> +	IORING_OP_WRITEV_PI,
>>  	IORING_OP_FSYNC,
>>  	IORING_OP_READ_FIXED,
>>  	IORING_OP_WRITE_FIXED,
> 
> So this one renumbers everything past IORING_OP_WRITEV, breaking the
> ABI in a very bad way. I'm guessing that was entirely unintentional?
> Any new command must go at the end of the list.
> 
> You're also not updating io_op_defs[] with the two new commands,
> which means it won't compile at all. I'm guessing you tested this on
> an older version of the kernel which didn't have io_op_defs[]?
> 

Yep, will rebase to the latest version next time.
Bob Liu Feb. 27, 2020, 9:19 a.m. UTC | #7
On 2/27/20 12:53 AM, Christoph Hellwig wrote:
> On Wed, Feb 26, 2020 at 08:58:46AM -0700, Jens Axboe wrote:
>> Yeah, should probably be a RWF_ flag instead, and a 64-bit SQE field
>> for the PI data. The 'last iovec is PI' is kind of icky.
> 
> Abusing an iovec (although I though of the first once when looking
> into it) looks really horrible, but has two huge advantages:
> 
>  - it doesn't require passing another argument all the way down
>    the I/O stack
>  - it works with all the vectored interfaces that take a flag
>    argument, so not just io_uring, but also preadv2/pwritev2 and aio.
>    And while I don't care too much about the last I think preadv2
>    and pwritev2 are valuable to support.
> 

Indeed, actually the 'last iovec is PI' idea was learned from Darrick's original
patch which support PI passthrough via aio.
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg27537.html

Patch
diff mbox series

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 562e3a1..c43d96a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -630,12 +630,14 @@  static inline bool io_prep_async_work(struct io_kiocb *req,
 
 	switch (req->opcode) {
 	case IORING_OP_WRITEV:
+	case IORING_OP_WRITEV_PI:
 	case IORING_OP_WRITE_FIXED:
 		/* only regular files should be hashed for writes */
 		if (req->flags & REQ_F_ISREG)
 			do_hashed = true;
 		/* fall-through */
 	case IORING_OP_READV:
+	case IORING_OP_READV_PI:
 	case IORING_OP_READ_FIXED:
 	case IORING_OP_SENDMSG:
 	case IORING_OP_RECVMSG:
@@ -1505,6 +1507,12 @@  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+	if (req->opcode == IORING_OP_READV_PI ||
+			req->opcode == IORING_OP_WRITEV_PI) {
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EINVAL;
+		kiocb->ki_flags |= IOCB_USE_PI;
+	}
 
 	ioprio = READ_ONCE(sqe->ioprio);
 	if (ioprio) {
@@ -3065,10 +3073,12 @@  static int io_req_defer_prep(struct io_kiocb *req,
 	case IORING_OP_NOP:
 		break;
 	case IORING_OP_READV:
+	case IORING_OP_READV_PI:
 	case IORING_OP_READ_FIXED:
 		ret = io_read_prep(req, sqe, true);
 		break;
 	case IORING_OP_WRITEV:
+	case IORING_OP_WRITEV_PI:
 	case IORING_OP_WRITE_FIXED:
 		ret = io_write_prep(req, sqe, true);
 		break;
@@ -3156,6 +3166,7 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	case IORING_OP_NOP:
 		ret = io_nop(req);
 		break;
+	case IORING_OP_READV_PI:
 	case IORING_OP_READV:
 	case IORING_OP_READ_FIXED:
 		if (sqe) {
@@ -3166,6 +3177,7 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		ret = io_read(req, nxt, force_nonblock);
 		break;
 	case IORING_OP_WRITEV:
+	case IORING_OP_WRITEV_PI:
 	case IORING_OP_WRITE_FIXED:
 		if (sqe) {
 			ret = io_write_prep(req, sqe, force_nonblock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349..65fda07 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@  enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_USE_PI		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a3300e1..98fa3f1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -62,6 +62,8 @@  enum {
 	IORING_OP_NOP,
 	IORING_OP_READV,
 	IORING_OP_WRITEV,
+	IORING_OP_READV_PI,
+	IORING_OP_WRITEV_PI,
 	IORING_OP_FSYNC,
 	IORING_OP_READ_FIXED,
 	IORING_OP_WRITE_FIXED,