
[17/17] nvme: enable non-inline passthru commands

Message ID 20220308152105.309618-18-joshi.k@samsung.com (mailing list archive)
State New, archived
Series io_uring passthru over nvme

Commit Message

Kanchan Joshi March 8, 2022, 3:21 p.m. UTC
From: Anuj Gupta <anuj20.g@samsung.com>

On submission, just fetch the command from the userspace pointer and reuse
everything else. On completion, update the result field inside the
passthru command.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig March 10, 2022, 8:36 a.m. UTC | #1
On Tue, Mar 08, 2022 at 08:51:05PM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <anuj20.g@samsung.com>
> 
> On submission, just fetch the command from the userspace pointer and reuse
> everything else. On completion, update the result field inside the
> passthru command.

What is that supposed to mean?  What is the reason to do it?  Remember
to always document the why in commit logs.

>  
> +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)

Overly long line.
Kanchan Joshi March 10, 2022, 11:50 a.m. UTC | #2
On Thu, Mar 10, 2022 at 2:06 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Mar 08, 2022 at 08:51:05PM +0530, Kanchan Joshi wrote:
> > From: Anuj Gupta <anuj20.g@samsung.com>
> >
> > On submission, just fetch the command from the userspace pointer and reuse
> > everything else. On completion, update the result field inside the
> > passthru command.
>
> What is that supposed to mean?  What is the reason to do it?  Remember
> to always document the why in commit logs.

I covered some of it in patch 6, but yes, I need to expand the
reasoning here. I felt that in retrospect too.
So there are two ways/modes of submitting commands:

Mode 1: inline into the sqe. This is the default way, where the passthru
command is placed inside a big sqe, which has 80 bytes of space.
The only problem is that the passthru command has this 'result' field
(structure below for quick reference) which is embedded by value and
not a pointer (unlike the addr and metadata fields).

struct nvme_passthru_cmd64 {
	__u8	opcode;
	__u8	flags;
	__u16	rsvd1;
	__u32	nsid;
	__u32	cdw2;
	__u32	cdw3;
	__u64	metadata;
	__u64	addr;
	__u32	metadata_len;
	__u32	data_len;
	__u32	cdw10;
	__u32	cdw11;
	__u32	cdw12;
	__u32	cdw13;
	__u32	cdw14;
	__u32	cdw15;
	__u32	timeout_ms;
	__u32	rsvd2;
	__u64	result;
};
In the sync ioctl, we always update this result field by doing put_user on
completion.
For the async ioctl, since the command is inside the sqe, its lifetime is
only up to submission. The SQE may get reused post submission, leaving no
way to update the "result" field on completion. Had this field been a
pointer, we could have saved it on submission and updated it on
completion. But that would require redesigning this structure and
adding a new ioctl in nvme.

Coming back, even though the sync ioctl always updates this result to
user-space, only a few nvme io commands (e.g. zone-append, copy,
zone-mgmt-send) can return this additional result (spec-wise).
Therefore in nvme, when we are dealing with inline-sqe commands from
io_uring, we never attempt to update the result. And since we don't
update the result, we limit support to only read/write passthru
commands, and fail any other command during submission itself (Patch
2).

Mode 2: Non-inline/indirect (pointer to the command in the sqe) submission.
User-space places a pointer to the passthru command, and a flag in
io_uring saying that this is not inline.
For this, in nvme (this patch) we always update the 'result' on
completion and can therefore support all passthru commands.

I hope this makes the reasoning clear.
Plumbing-wise, non-inline support does not create churn (almost all
the infra of inline-command handling is reused). The only extras are a
copy_from_user on submission and a put_user on completion.
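
For reference, the submission-side difference boils down to this
(condensed sketch of the hunk in this patch, not the verbatim code);
the completion side then only adds the put_user of the result:

	struct nvme_passthru_cmd64 cmd, *cptr;

	if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
		/* indirect: sqe carries a user pointer, copy the command in */
		struct nvme_passthru_cmd64 __user *ucmd =
			(struct nvme_passthru_cmd64 __user *)ioucmd->cmd;

		if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
			return -EFAULT;
		cptr = &cmd;	/* kernel copy; 'result' goes back to ucmd later */
	} else {
		/* inline: the command lives inside the big sqe itself */
		cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
	}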

> > +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)
>
> Overly long line.

Under 100, but sure, can fold it under 80.
Christoph Hellwig March 10, 2022, 2:19 p.m. UTC | #3
On Thu, Mar 10, 2022 at 05:20:13PM +0530, Kanchan Joshi wrote:
> In the sync ioctl, we always update this result field by doing put_user on
> completion.
> For the async ioctl, since the command is inside the sqe, its lifetime is
> only up to submission. The SQE may get reused post submission, leaving no
> way to update the "result" field on completion. Had this field been a
> pointer, we could have saved it on submission and updated it on
> completion. But that would require redesigning this structure and
> adding a new ioctl in nvme.

Why would it require adding an ioctl to nvme?  The whole io_uring
async_cmd infrastructure is completely independent from ioctls.

> Coming back, even though the sync ioctl always updates this result to
> user-space, only a few nvme io commands (e.g. zone-append, copy,
> zone-mgmt-send) can return this additional result (spec-wise).
> Therefore in nvme, when we are dealing with inline-sqe commands from
> io_uring, we never attempt to update the result. And since we don't
> update the result, we limit support to only read/write passthru
> commands, and fail any other command during submission itself (Patch
> 2).

Yikes.  That is outright horrible.  Passthrough needs to be
command-agnostic and future-proof to any newly added nvme command.

> > Overly long line.
> 
> Under 100, but sure, can fold it under 80.

You can only use 100 sparingly if it makes the code more readable.  Which
I know is fuzzy, and in practice never does.  Certainly not in nvme and
block code.
Kanchan Joshi March 10, 2022, 6:43 p.m. UTC | #4
On Thu, Mar 10, 2022 at 7:49 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Mar 10, 2022 at 05:20:13PM +0530, Kanchan Joshi wrote:
> > In the sync ioctl, we always update this result field by doing put_user on
> > completion.
> > For the async ioctl, since the command is inside the sqe, its lifetime is
> > only up to submission. The SQE may get reused post submission, leaving no
> > way to update the "result" field on completion. Had this field been a
> > pointer, we could have saved it on submission and updated it on
> > completion. But that would require redesigning this structure and
> > adding a new ioctl in nvme.
>
> Why would it require adding an ioctl to nvme?  The whole io_uring
> async_cmd infrastructure is completely independent from ioctls.

io_uring is surely not peeking into the ioctl and its command structure,
but it offers the facility to use its sqe to store that ioctl command
inline.
Problem is, the inline facility does not go very well with this
particular nvme-passthru ioctl (NVME_IOCTL_IO64_CMD).
And that's because this ioctl requires additional "__u64 result;" to
be updated within "struct nvme_passthru_cmd64".
To update that during completion, we need, at the least, the result
field to be a pointer "__u64 result_ptr" inside the struct
nvme_passthru_cmd64.
Do you see that is possible without adding a new passthru ioctl in nvme?

> > Coming back, even though the sync ioctl always updates this result to
> > user-space, only a few nvme io commands (e.g. zone-append, copy,
> > zone-mgmt-send) can return this additional result (spec-wise).
> > Therefore in nvme, when we are dealing with inline-sqe commands from
> > io_uring, we never attempt to update the result. And since we don't
> > update the result, we limit support to only read/write passthru
> > commands, and fail any other command during submission itself (Patch
> > 2).
>
> Yikes.  That is outright horrible.  Passthrough needs to be
> command-agnostic and future-proof to any newly added nvme command.

This patch (along with patch 16) does exactly that: it makes passthru
command-agnostic and future-proof. All nvme commands will work with
it.
The only difference is that the application needs to pass a pointer to
the ioctl command rather than placing it inline inside the sqe.

Overall, I think at the io_uring infra level both submission modes make sense:
big-sqe based inline submission (more efficient for <= 80 bytes) and
normal-sqe based non-inline/indirect submission.
At the nvme level, we have to pick (depending on the ioctl in hand). Currently
we are playing with both and constructing a sort of fast-path (for all
commands) and another faster-path (only for read/write commands).
Should we (at the nvme level) rather opt out and use only the indirect mode
(because it works for all commands), or must we build a way to enable
the inline one for all commands?

> > > Overly long line.
> >
> > Under 100, but sure, can fold it under 80.
>
> You can only use 100 sparingly if it makes the code more readable.  Which
> I know is fuzzy, and in practice never does.  Certainly not in nvme and
> block code.

That clears it up, thanks.
Christoph Hellwig March 11, 2022, 6:27 a.m. UTC | #5
On Fri, Mar 11, 2022 at 12:13:24AM +0530, Kanchan Joshi wrote:
> Problem is, the inline facility does not go very well with this
> particular nvme-passthru ioctl (NVME_IOCTL_IO64_CMD).

And it doesn't have to, because there is absolutely no need to reuse
the existing structures!  Quite to the contrary, trying to reuse the
structure and opcode makes things confusing as hell.

> And that's because this ioctl requires additional "__u64 result;" to
> be updated within "struct nvme_passthru_cmd64".
> To update that during completion, we need, at the least, the result
> field to be a pointer "__u64 result_ptr" inside the struct
> nvme_passthru_cmd64.
> Do you see that is possible without adding a new passthru ioctl in nvme?

We don't need a new passthrough ioctl in nvme.  We need to decouple the
uring cmd properly.  And properly in this case means not to add a
result pointer, but to drop the result from the _input_ structure
entirely, and instead optionally support a larger CQ entry that contains
it, just like the first patch does for the SQ.
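
(For illustration only - a rough sketch of the shape this suggests; no
such structure has been posted at this point, and the later patch in
this thread ends up expressing it as two extra 64-bit fields appended
to the regular CQE:)

	struct io_uring_cqe32 {		/* hypothetical name */
		__u64	user_data;
		__s32	res;
		__u32	flags;
		__u64	extra1;		/* e.g. nvme passthru result */
		__u64	extra2;
	};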
Kanchan Joshi March 22, 2022, 5:10 p.m. UTC | #6
On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <hch@lst.de> wrote:
> > And that's because this ioctl requires additional "__u64 result;" to
> > be updated within "struct nvme_passthru_cmd64".
> > To update that during completion, we need, at the least, the result
> > field to be a pointer "__u64 result_ptr" inside the struct
> > nvme_passthru_cmd64.
> > Do you see that is possible without adding a new passthru ioctl in nvme?
>
> We don't need a new passthrough ioctl in nvme.
Right. Maybe it is easier for applications if they get to use the same
ioctl opcode/structure that they know well already.

> We need to decouple the
> uring cmd properly.  And properly in this case means not to add a
> result pointer, but to drop the result from the _input_ structure
> entirely, and instead optionally support a larger CQ entry that contains
> it, just like the first patch does for the SQ.

Creating a large CQE was my thought too. Gave that another stab.
Dealing with two types of CQE felt nasty to fit in liburing's api-set
(which is cqe-heavy).

Jens: Do you already have thoughts (go/no-go) for this route?

From all that we discussed, maybe the path forward could be this:
- inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
for now if we cannot go the big-cqe route.
- use only indirect-cmd as this requires nothing special, just regular
sqe and cqe. We can support all passthru commands with a lot less
code. No new ioctl in nvme, so same semantics. For common commands
(i.e. read/write) we can still avoid updating the result (put_user
cost will go).

Please suggest if we should approach this any differently in v2.

Thanks,
Christoph Hellwig March 24, 2022, 6:32 a.m. UTC | #7
On Tue, Mar 22, 2022 at 10:40:27PM +0530, Kanchan Joshi wrote:
> On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <hch@lst.de> wrote:
> > > And that's because this ioctl requires additional "__u64 result;" to
> > > be updated within "struct nvme_passthru_cmd64".
> > > To update that during completion, we need, at the least, the result
> > > field to be a pointer "__u64 result_ptr" inside the struct
> > > nvme_passthru_cmd64.
> > > Do you see that is possible without adding a new passthru ioctl in nvme?
> >
> > We don't need a new passthrough ioctl in nvme.
> Right. Maybe it is easier for applications if they get to use the same
> ioctl opcode/structure that they know well already.

I disagree.  Reusing the same opcode and/or structure for something
fundamentally different creates major confusion.  Don't do it.

> From all that we discussed, maybe the path forward could be this:
> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> for now if we cannot go the big-cqe route.
> - use only indirect-cmd as this requires nothing special, just regular
> sqe and cqe. We can support all passthru commands with a lot less
> code. No new ioctl in nvme, so same semantics. For common commands
> (i.e. read/write) we can still avoid updating the result (put_user
> cost will go).
> 
> Please suggest if we should approach this any differently in v2.

Personally I think larger SQEs and CQEs are the only sensible interface
here.  Everything else just fails like a horrible hack I would not want
to support in NVMe.
Clay Mayers March 24, 2022, 9:09 p.m. UTC | #8
> From: Kanchan Joshi
> Sent: Tuesday, March 8, 2022 7:21 AM
> To: axboe@kernel.dk; hch@lst.de; kbusch@kernel.org;
> asml.silence@gmail.com
> Cc: io-uring@vger.kernel.org; linux-nvme@lists.infradead.org; linux-
> block@vger.kernel.org; sbates@raithlin.com; logang@deltatee.com;
> pankydev8@gmail.com; javier@javigon.com; mcgrof@kernel.org;
> a.manzanares@samsung.com; joshiiitr@gmail.com; anuj20.g@samsung.com
> Subject: [PATCH 17/17] nvme: enable non-inline passthru commands
> 
> From: Anuj Gupta <anuj20.g@samsung.com>
> 
> On submission, just fetch the command from the userspace pointer and reuse
> everything else. On completion, update the result field inside the passthru
> command.
> 
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index
> 701feaecabbe..ddb7e5864be6 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -65,6 +65,14 @@ static void nvme_pt_task_cb(struct io_uring_cmd
> *ioucmd)
>  	}
>  	kfree(pdu->meta);
> 
> +	if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
> +		struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd-
> >cmd;
> +		u64 result = le64_to_cpu(nvme_req(req)->result.u64);
> +
> +		if (put_user(result, &ptcmd64->result))
> +			status = -EFAULT;

When the thread that submitted the io_uring_cmd has exited, the CB is
called by a system worker instead so put_user() fails.  The cqe is still
completed and the process sees a failed i/o status, but the i/o did not
fail.  The same is true for meta data being returned in patch 5.

I can't say if it's a requirement to support this case.  It does break our
current prototype but we can adjust.

> +	}
> +
>  	io_uring_cmd_done(ioucmd, status);
>  }
> 
> @@ -143,6 +151,13 @@ static inline bool nvme_is_fixedb_passthru(struct
> io_uring_cmd *ioucmd)
>  	return ((ioucmd) && (ioucmd->flags &
> IO_URING_F_UCMD_FIXEDBUFS));  }
> 
> +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct
> +nvme_command *cmd) {
> +	return ((ioucmd->flags & IO_URING_F_UCMD_INDIRECT) ||
> +			(cmd->common.opcode == nvme_cmd_write ||
> +			 cmd->common.opcode == nvme_cmd_read)); }
> +
>  static int nvme_submit_user_cmd(struct request_queue *q,
>  		struct nvme_command *cmd, u64 ubuffer,
>  		unsigned bufflen, void __user *meta_buffer, unsigned
> meta_len, @@ -193,8 +208,7 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
>  		}
>  	}
>  	if (ioucmd) { /* async dispatch */
> -		if (cmd->common.opcode == nvme_cmd_write ||
> -				cmd->common.opcode == nvme_cmd_read) {
> +		if (is_inline_rw(ioucmd, cmd)) {
>  			if (bio && is_polling_enabled(ioucmd, req)) {
>  				ioucmd->bio = bio;
>  				bio->bi_opf |= REQ_POLLED;
> @@ -204,7 +218,7 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
>  			blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
>  			return 0;
>  		} else {
> -			/* support only read and write for now. */
> +			/* support only read and write for inline */
>  			ret = -EINVAL;
>  			goto out_meta;
>  		}
> @@ -372,7 +386,14 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl,
> struct nvme_ns *ns,
>  	} else {
>  		if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
>  			return -EINVAL;
> -		cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> +		if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
> +			ucmd = (struct nvme_passthru_cmd64 __user
> *)ioucmd->cmd;
> +			if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> +				return -EFAULT;
> +			cptr = &cmd;
> +		} else {
> +			cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> +		}
>  	}
>  	if (cptr->flags & NVME_HIPRI)
>  		rq_flags |= REQ_POLLED;
> --
> 2.25.1
Jens Axboe March 24, 2022, 11:36 p.m. UTC | #9
On 3/24/22 3:09 PM, Clay Mayers wrote:
>> From: Kanchan Joshi
>> Sent: Tuesday, March 8, 2022 7:21 AM
>> To: axboe@kernel.dk; hch@lst.de; kbusch@kernel.org;
>> asml.silence@gmail.com
>> Cc: io-uring@vger.kernel.org; linux-nvme@lists.infradead.org; linux-
>> block@vger.kernel.org; sbates@raithlin.com; logang@deltatee.com;
>> pankydev8@gmail.com; javier@javigon.com; mcgrof@kernel.org;
>> a.manzanares@samsung.com; joshiiitr@gmail.com; anuj20.g@samsung.com
>> Subject: [PATCH 17/17] nvme: enable non-inline passthru commands
>>
>> From: Anuj Gupta <anuj20.g@samsung.com>
>>
>> On submission, just fetch the command from the userspace pointer and reuse
>> everything else. On completion, update the result field inside the passthru
>> command.
>>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> ---
>>  drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
>>  1 file changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index
>> 701feaecabbe..ddb7e5864be6 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -65,6 +65,14 @@ static void nvme_pt_task_cb(struct io_uring_cmd
>> *ioucmd)
>>  	}
>>  	kfree(pdu->meta);
>>
>> +	if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
>> +		struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd-
>>> cmd;
>> +		u64 result = le64_to_cpu(nvme_req(req)->result.u64);
>> +
>> +		if (put_user(result, &ptcmd64->result))
>> +			status = -EFAULT;
> 
> When the thread that submitted the io_uring_cmd has exited, the CB is
> called by a system worker instead so put_user() fails.  The cqe is
> still completed and the process sees a failed i/o status, but the i/o
> did not fail.  The same is true for meta data being returned in patch
> 5.
> 
> I can't say if it's a requirement to support this case.  It does break
> our current prototype but we can adjust.

Just don't do that then - it's all very much task based. If the task
goes away and completions haven't been reaped, don't count on anything
sane happening in terms of them completing successfully or not.

The common case for this happening is offloading submit to a submit
thread, which is utterly pointless with io_uring anyway.
Kanchan Joshi March 25, 2022, 1:39 p.m. UTC | #10
On Thu, Mar 24, 2022 at 07:32:18AM +0100, Christoph Hellwig wrote:
>On Tue, Mar 22, 2022 at 10:40:27PM +0530, Kanchan Joshi wrote:
>> On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <hch@lst.de> wrote:
>> > > And that's because this ioctl requires additional "__u64 result;" to
>> > > be updated within "struct nvme_passthru_cmd64".
>> > > To update that during completion, we need, at the least, the result
>> > > field to be a pointer "__u64 result_ptr" inside the struct
>> > > nvme_passthru_cmd64.
>> > > Do you see that is possible without adding a new passthru ioctl in nvme?
>> >
>> > We don't need a new passthrough ioctl in nvme.
>> Right. Maybe it is easier for applications if they get to use the same
>> ioctl opcode/structure that they know well already.
>
>I disagree.  Reusing the same opcode and/or structure for something
>fundamentally different creates major confusion.  Don't do it.

Ok. If you are open to take new opcode/struct route, that is all we
require to pair with big-sqe and have this sorted. How about this -

+/* same as nvme_passthru_cmd64 but expecting result field to be pointer */
+struct nvme_passthru_cmd64_ind {
+       __u8    opcode;
+       __u8    flags;
+       __u16   rsvd1;
+       __u32   nsid;
+       __u32   cdw2;
+       __u32   cdw3;
+       __u64   metadata;
+       __u64   addr;
+       __u32   metadata_len;
+       union {
+               __u32   data_len; /* for non-vectored io */
+               __u32   vec_cnt; /* for vectored io */
+       };
+       __u32   cdw10;
+       __u32   cdw11;
+       __u32   cdw12;
+       __u32   cdw13;
+       __u32   cdw14;
+       __u32   cdw15;
+       __u32   timeout_ms;
+       __u32   rsvd2;
+       __u64   presult; /* pointer to result */
+};
+
 #define nvme_admin_cmd nvme_passthru_cmd

+#define NVME_IOCTL_IO64_CMD_IND        _IOWR('N', 0x50, struct nvme_passthru_cmd64_ind)

Not heavy on code volume either, because only one statement (updating
the result) changes and we reuse everything else.
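
(To make "only one statement" concrete - a sketch only, modeled on the
completion callback in this series; pdu->presult here is a hypothetical
field used to stash the user pointer at submission time, since the big
sqe itself is not guaranteed to still be around at completion:)

	/* submission: command is inline in the big sqe */
	struct nvme_passthru_cmd64_ind *cptr = (void *)ioucmd->cmd;

	pdu->presult = cptr->presult;	/* stash the user pointer */

	/* completion callback, nvme_pt_task_cb equivalent */
	u64 result = le64_to_cpu(nvme_req(req)->result.u64);

	if (pdu->presult &&
	    put_user(result, (__u64 __user *)u64_to_user_ptr(pdu->presult)))
		status = -EFAULT;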

>> From all that we discussed, maybe the path forward could be this:
>> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
>> for now if we cannot go the big-cqe route.
>> - use only indirect-cmd as this requires nothing special, just regular
>> sqe and cqe. We can support all passthru commands with a lot less
>> code. No new ioctl in nvme, so same semantics. For common commands
>> (i.e. read/write) we can still avoid updating the result (put_user
>> cost will go).
>>
>> Please suggest if we should approach this any differently in v2.
>
>Personally I think larger SQEs and CQEs are the only sensible interface
>here.  Everything else just fails like a horrible hack I would not want
>to support in NVMe.

So far we have gathered three choices: 

(a) big-sqe + new opcode/struct in nvme
(b) big-sqe + big-cqe
(c) regular-sqe + regular-cqe

I can post an RFC on the big-cqe work if that is what it takes to evaluate
clearly what path to take. But really, it is much more code compared
to choices (a) and (c). Differentiating one CQE from another does not
seem very maintenance-friendly, particularly in liburing.

For (c), I did not get what part feels like a horrible hack.
It is the same as how we do sync passthru - read the passthru command from
user-space memory, and update the result in it on completion.
But yes, (a) seems like the best option to me.
Kanchan Joshi March 28, 2022, 4:44 a.m. UTC | #11
> >I disagree.  Reusing the same opcode and/or structure for something
> >fundamentally different creates major confusion.  Don't do it.
>
> Ok. If you are open to take new opcode/struct route, that is all we
> require to pair with big-sqe and have this sorted. How about this -
>
> +/* same as nvme_passthru_cmd64 but expecting result field to be pointer */
> +struct nvme_passthru_cmd64_ind {
> +       __u8    opcode;
> +       __u8    flags;
> +       __u16   rsvd1;
> +       __u32   nsid;
> +       __u32   cdw2;
> +       __u32   cdw3;
> +       __u64   metadata;
> +       __u64   addr;
> +       __u32   metadata_len;
> +       union {
> +               __u32   data_len; /* for non-vectored io */
> +               __u32   vec_cnt; /* for vectored io */
> +       };
> +       __u32   cdw10;
> +       __u32   cdw11;
> +       __u32   cdw12;
> +       __u32   cdw13;
> +       __u32   cdw14;
> +       __u32   cdw15;
> +       __u32   timeout_ms;
> +       __u32   rsvd2;
> +       __u64   presult; /* pointer to result */
> +};
> +
>  #define nvme_admin_cmd nvme_passthru_cmd
>
> +#define NVME_IOCTL_IO64_CMD_IND        _IOWR('N', 0x50, struct nvme_passthru_cmd64_ind)
>
> Not heavy on code volume either, because only one statement (updating
> the result) changes and we reuse everything else.
>
> >> From all that we discussed, maybe the path forward could be this:
> >> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> >> for now if we cannot go the big-cqe route.
> >> - use only indirect-cmd as this requires nothing special, just regular
> >> sqe and cqe. We can support all passthru commands with a lot less
> >> code. No new ioctl in nvme, so same semantics. For common commands
> >> (i.e. read/write) we can still avoid updating the result (put_user
> >> cost will go).
> >>
> >> Please suggest if we should approach this any differently in v2.
> >
> >Personally I think larger SQEs and CQEs are the only sensible interface
> >here.  Everything else just fails like a horrible hack I would not want
> >to support in NVMe.
>
> So far we have gathered three choices:
>
> (a) big-sqe + new opcode/struct in nvme
> (b) big-sqe + big-cqe
> (c) regular-sqe + regular-cqe
>
> I can post an RFC on the big-cqe work if that is what it takes to evaluate
> clearly what path to take. But really, it is much more code compared
> to choices (a) and (c). Differentiating one CQE from another does not
> seem very maintenance-friendly, particularly in liburing.
>
> For (c), I did not get what part feels like a horrible hack.
> It is the same as how we do sync passthru - read the passthru command from
> user-space memory, and update the result in it on completion.
> But yes, (a) seems like the best option to me.

Thinking a bit more on "(b) big-sqe + big-cqe". Will that also require
a new ioctl (other than NVME_IOCTL_IO64_CMD) in nvme? Because
semantics will be slightly different (i.e. not updating the result
inside the passthrough command but sending it out-of-band to
io_uring). Or am I just overthinking it.
Christoph Hellwig March 30, 2022, 12:59 p.m. UTC | #12
On Mon, Mar 28, 2022 at 10:14:13AM +0530, Kanchan Joshi wrote:
> Thinking a bit more on "(b) big-sqe + big-cqe". Will that also require
> a new ioctl (other than NVME_IOCTL_IO64_CMD) in nvme? Because
> semantics will be slightly different (i.e. not updating the result
> inside the passthrough command but sending it out-of-band to
> io_uring). Or am I just overthinking it.

Again, there should be absolutely no coupling between ioctls and
io_uring async cmds.  The only thing trying to reuse structures or
constants does is to create a lot of confusion.
Christoph Hellwig March 30, 2022, 1:02 p.m. UTC | #13
On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> Ok. If you are open to take new opcode/struct route, that is all we
> require to pair with big-sqe and have this sorted. How about this -

I would much, much, much prefer to support a bigger CQE.  Having
a pointer in there just creates a fair amount of overhead and
really does not fit into the model nvme and io_uring use.

But yes, if we did not go down that route that would be the structure
that is needed.
Kanchan Joshi March 30, 2022, 1:14 p.m. UTC | #14
On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> > Ok. If you are open to take new opcode/struct route, that is all we
> > require to pair with big-sqe and have this sorted. How about this -
>
> I would much, much, much prefer to support a bigger CQE.  Having
> a pointer in there just creates a fair amount of overhead and
> really does not fit into the model nvme and io_uring use.

Sure, will post the code with bigger-cqe first.

> But yes, if we did not go down that route that would be the structure
> that is needed.
Got it. Thanks for confirming.
Jens Axboe April 1, 2022, 1:22 a.m. UTC | #15
On 3/22/22 11:10 AM, Kanchan Joshi wrote:
>> We need to decouple the
>> uring cmd properly.  And properly in this case means not to add a
>> result pointer, but to drop the result from the _input_ structure
>> entirely, and instead optionally support a larger CQ entry that contains
>> it, just like the first patch does for the SQ.
> 
> Creating a large CQE was my thought too. Gave that another stab.
> Dealing with two types of CQE felt nasty to fit in liburing's api-set
> (which is cqe-heavy).
> 
> Jens: Do you already have thoughts (go/no-go) for this route?

Yes, I think we should just add support for 32-byte CQEs as well. Only
pondering I've done here is if it makes sense to manage them separately,
or if you should just get both big sqe and cqe support in one setting.
For passthrough, you'd want both. But e.g. for zoned writes, you can make
do with normal sized sqes and only do larger cqes.

I did actually benchmark big sqes in peak testing, and found them to
perform about the same, no noticeable difference. Which does make sense,
as normal IO with big sqe would only touch the normal sized sqe and
leave the other one unwritten and unread. Since they are cacheline
sized, there's no extra load there.

For big cqes, that's a bit different and I'd expect a bit of a
performance hit for that. We can currently fit 4 of them into a
cacheline, with the change it'd be 2. The same number of ops/sec would
hence touch twice as many cachelines for completions.
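
(Spelling out the arithmetic above, assuming 64-byte cachelines and the
current 16-byte struct io_uring_cqe:)

	64 / 16 = 4 completions per cacheline today
	64 / 32 = 2 completions per cacheline with 32-byte CQEs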

But I still think it's way better than having to copy back part of the
completion info out-of-band vs just doing it inline, and it's more
efficient too for that case for sure.

> From all that we discussed, maybe the path forward could be this:
> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> for now if we cannot go the big-cqe route.

We should go big cqe for sure, it'll help clean up a bunch of things
too.
Jens Axboe April 1, 2022, 1:23 a.m. UTC | #16
On 3/30/22 7:02 AM, Christoph Hellwig wrote:
> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>> Ok. If you are open to take new opcode/struct route, that is all we
>> require to pair with big-sqe and have this sorted. How about this -
> 
> I would much, much, much prefer to support a bigger CQE.  Having
> a pointer in there just creates a fair amount of overhead and
> really does not fit into the model nvme and io_uring use.
> 
> But yes, if we did not go down that route that would be the structure
> that is needed.

IMHO doing 32-byte CQEs is the only sane choice here, I would not
entertain anything else.
Jens Axboe April 1, 2022, 1:25 a.m. UTC | #17
On 3/30/22 7:14 AM, Kanchan Joshi wrote:
> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>> Ok. If you are open to take new opcode/struct route, that is all we
>>> require to pair with big-sqe and have this sorted. How about this -
>>
>> I would much, much, much prefer to support a bigger CQE.  Having
>> a pointer in there just creates a fair amount of overhead and
>> really does not fit into the model nvme and io_uring use.
> 
> Sure, will post the code with bigger-cqe first.

I can add the support, should be pretty trivial. And do the liburing
side as well, so we have a sane base.

Then I'd suggest to collapse a few of the patches in the series,
the ones that simply modify or fix gaps in previous ones. Order
the series so we build the support and then add nvme support
nicely on top of that.

I'll send out a message on a rebased big sqe/cqe branch, will do
that once 5.18-rc1 is released so we can get it updated to a
current tree as well.
Kanchan Joshi April 1, 2022, 2:33 a.m. UTC | #18
On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
> > On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <hch@lst.de> wrote:
> >>
> >> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> >>> Ok. If you are open to take new opcode/struct route, that is all we
> >>> require to pair with big-sqe and have this sorted. How about this -
> >>
> >> I would much, much, much prefer to support a bigger CQE.  Having
> >> a pointer in there just creates a fair amount of overhead and
> >> really does not fit into the model nvme and io_uring use.
> >
> > Sure, will post the code with bigger-cqe first.
>
> I can add the support, should be pretty trivial. And do the liburing
> side as well, so we have a sane base.

 I will post the big-cqe based work today. It works with fio.
 It does not deal with liburing (which seems tricky), but hopefully it
can help us move forward anyway .

> Then I'd suggest to collapse a few of the patches in the series,
> the ones that simply modify or fix gaps in previous ones. Order
> the series so we build the support and then add nvme support
> nicely on top of that.

I think we already did away with patches which were fixing only the
gaps. But yes, patches still add infra for features incrementally.
Do you mean having all io_uring infra (async, plug, poll) squashed
into a single io_uring patch?
On a related note, I was thinking of deferring fixed-buffer and
bio-cache support for now.
Jens Axboe April 1, 2022, 2:44 a.m. UTC | #19
On 3/31/22 8:33 PM, Kanchan Joshi wrote:
> On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
>>> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>>>> Ok. If you are open to take new opcode/struct route, that is all we
>>>>> require to pair with big-sqe and have this sorted. How about this -
>>>>
>>>> I would much, much, much prefer to support a bigger CQE.  Having
>>>> a pointer in there just creates a fair amount of overhead and
>>>> really does not fit into the model nvme and io_uring use.
>>>
>>> Sure, will post the code with bigger-cqe first.
>>
>> I can add the support, should be pretty trivial. And do the liburing
>> side as well, so we have a sane base.
> 
>  I will post the big-cqe based work today. It works with fio.
>  It does not deal with liburing (which seems tricky), but hopefully it
> can help us move forward anyway .

Let's compare then, since I just did the support too :-)

Some limitations in what I pushed:

1) Doesn't support the inline completion path. Undecided if this is
super important or not, the priority here for me was to not pollute the
general completion path.

2) Doesn't support overflow. That can certainly be done, only
complication here is that we need 2x64bit in the io_kiocb for that.
Perhaps something can get reused for that, not impossible. But figured
it wasn't important enough for a first run.

I also did the liburing support, but haven't pushed it yet. That's
another case where some care has to be taken to avoid making the general
path slower.

Oh, it's here, usual branch:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe

and based on top of the pending 5.18 bits and the current 5.19 bits.

>> Then I'd suggest to collapse a few of the patches in the series,
>> the ones that simply modify or fix gaps in previous ones. Order
>> the series so we build the support and then add nvme support
>> nicely on top of that.
> 
> I think we already did away with patches which were fixing only the
> gaps. But yes, patches still add infra for features incrementally.
> Do you mean having all io_uring infra (async, plug, poll) squashed
> into a single io_uring patch?

At least async and plug, I'll double check on the poll bit.

> On a related note, I was thinking of deferring fixed-buffer and
> bio-cache support for now.

Yes, I think that can be done as a round 2. Keep the current one
simpler.
Jens Axboe April 1, 2022, 3:05 a.m. UTC | #20
On 3/31/22 8:44 PM, Jens Axboe wrote:
> On 3/31/22 8:33 PM, Kanchan Joshi wrote:
>> On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
>>>> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <hch@lst.de> wrote:
>>>>>
>>>>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>>>>> Ok. If you are open to take new opcode/struct route, that is all we
>>>>>> require to pair with big-sqe and have this sorted. How about this -
>>>>>
>>>>> I would much, much, much prefer to support a bigger CQE.  Having
>>>>> a pointer in there just creates a fair amount of overhead and
>>>>> really does not fit into the model nvme and io_uring use.
>>>>
>>>> Sure, will post the code with bigger-cqe first.
>>>
>>> I can add the support, should be pretty trivial. And do the liburing
>>> side as well, so we have a sane base.
>>
>>  I will post the big-cqe based work today. It works with fio.
>>  It does not deal with liburing (which seems tricky), but hopefully it
>> can help us move forward anyway .
> 
> Let's compare then, since I just did the support too :-)
> 
> Some limitations in what I pushed:
> 
> 1) Doesn't support the inline completion path. Undecided if this is
> super important or not, the priority here for me was to not pollute the
> general completion path.
> 
> 2) Doesn't support overflow. That can certainly be done, only
> complication here is that we need 2x64bit in the io_kiocb for that.
> Perhaps something can get reused for that, not impossible. But figured
> it wasn't important enough for a first run.
> 
> I also did the liburing support, but haven't pushed it yet. That's
> another case where some care has to be taken to avoid making the general
> path slower.
> 
> Oh, it's here, usual branch:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> 
> and based on top of the pending 5.18 bits and the current 5.19 bits.

Do post your version too, would be interesting to compare. I just wired
mine up to NOP, hasn't seen any testing beyond just verifying that we do
pass back the extra data.

Added inline completion as well. Kind of interesting in that performance
actually seems to be _better_ with CQE32 for my initial testing, just
using NOP. More testing surely needed, will run it on actual hardware
too as I have a good idea what performance should look like there.

I also think it's currently broken for request deferral and timeouts,
but those are just minor tweaks that need to be made to account for the
cq head being doubly incremented on bigger CQEs.
Kanchan Joshi April 1, 2022, 6:29 a.m. UTC | #21
On Fri, Apr 1, 2022 at 6:52 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/22/22 11:10 AM, Kanchan Joshi wrote:
> >> We need to decouple the
> >> uring cmd properly.  And properly in this case means not to add a
> >> result pointer, but to drop the result from the _input_ structure
> >> entirely, and instead optionally support a larger CQ entry that contains
> >> it, just like the first patch does for the SQ.
> >
> > Creating a large CQE was my thought too. Gave that another stab.
> > Dealing with two types of CQE felt nasty to fit in liburing's api-set
> > (which is cqe-heavy).
> >
> > Jens: Do you already have thoughts (go/no-go) for this route?
>
> Yes, I think we should just add support for 32-byte CQEs as well. Only
> pondering I've done here is if it makes sense to manage them separately,
> or if you should just get both big sqe and cqe support in one setting.
> For passthrough, you'd want both. But e.g. for zoned writes, you can make
> do with normal sized sqes and only do larger cqes.

I had the same thought - that we may have other use-cases returning a
second result.
For now I am doing 32-byte cqe with the same big-sqe flag, but an
independent flag can be done easily.

Combinations are:
(a) big-sqe with big-cqe  (for nvme-passthru)
(b) big-sqe without big-cqe (inline submission but not requiring second result)
(c) regular-sqe with big-cqe (for zone-append)
(d) regular-sqe with regular-cqe (for cases when inline submission is
not enough e.g. > 80 bytes of cmd)

At this point (d) seems rare. And the other three can be done with two flags.
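
(For illustration, a minimal userspace sketch of combination (a),
assuming liburing and setup-flag names along the lines of what the
branches referenced in this thread use, i.e. IORING_SETUP_SQE128 and
IORING_SETUP_CQE32:)

	#include <liburing.h>

	struct io_uring ring;

	/* (a) big sqe + big cqe - what nvme passthru wants */
	int ret = io_uring_queue_init(64, &ring,
				      IORING_SETUP_SQE128 | IORING_SETUP_CQE32);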
Kanchan Joshi April 1, 2022, 6:32 a.m. UTC | #22
On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>> Ok. If you are open to take new opcode/struct route, that is all we
> >>>>> require to pair with big-sqe and have this sorted. How about this -
> >>>>
> >>>> I would much, much, much prefer to support a bigger CQE.  Having
> >>>> a pointer in there just creates a fair amount of overhead and
> >>>> really does not fit into the model nvme and io_uring use.
> >>>
> >>> Sure, will post the code with bigger-cqe first.
> >>
> >> I can add the support, should be pretty trivial. And do the liburing
> >> side as well, so we have a sane base.
> >
> >  I will post the big-cqe based work today. It works with fio.
> >  It does not deal with liburing (which seems tricky), but hopefully it
> > can help us move forward anyway .
>
> Let's compare then, since I just did the support too :-)

:-) awesome

> Some limitations in what I pushed:
>
> 1) Doesn't support the inline completion path. Undecided if this is
> super important or not, the priority here for me was to not pollute the
> general completion path.
>
> 2) Doesn't support overflow. That can certainly be done, only
> complication here is that we need 2x64bit in the io_kiocb for that.
> Perhaps something can get reused for that, not impossible. But figured
> it wasn't important enough for a first run.

We have the handling in my version. But that part is not tested, since
that situation did not occur naturally.
Maybe it requires slowing down completion-reaping (in user-space) to
trigger that.

> I also did the liburing support, but haven't pushed it yet. That's
> another case where some care has to be taken to avoid making the general
> path slower.
>
> Oh, it's here, usual branch:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
>
> and based on top of the pending 5.18 bits and the current 5.19 bits.
>
> >> Then I'd suggest to collapse a few of the patches in the series,
> >> the ones that simply modify or fix gaps in previous ones. Order
> >> the series so we build the support and then add nvme support
> >> nicely on top of that.
> >
> > I think we already did away with patches which were fixing only the
> > gaps. But yes, patches still add infra for features incrementally.
> > Do you mean having all io_uring infra (async, plug, poll) squashed
> > into a single io_uring patch?
>
> At least async and plug, I'll double check on the poll bit.

Sounds right, the plug should definitely go in the async one.
Kanchan Joshi April 19, 2022, 5:31 p.m. UTC | #23
Hi Jens,
Few thoughts below toward the next version -

On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <axboe@kernel.dk> wrote:
[snip]
> >>> Sure, will post the code with bigger-cqe first.
> >>
> >> I can add the support, should be pretty trivial. And do the liburing
> >> side as well, so we have a sane base.
> >
> >  I will post the big-cqe based work today. It works with fio.
> >  It does not deal with liburing (which seems tricky), but hopefully it
> > can help us move forward anyway .
>
> Let's compare then, since I just did the support too :-)

The major difference is generic support (rather than uring-cmd only) and
not touching the regular completion path. So the plan is to use your patch
for the next version with some bits added (e.g. overflow-handling and
avoiding the extra CQE tail increment). Hope that sounds fine.

We have things working on top of your current branch
"io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
xattr merge) and CQE infra is different (post cqe-caching in ctx) -
things needed to be done a bit differently. But all this is now tested
better with liburing support/util (plan is to post that too).
Jens Axboe April 19, 2022, 6:19 p.m. UTC | #24
On 4/19/22 11:31 AM, Kanchan Joshi wrote:
> Hi Jens,
> Few thoughts below toward the next version -
> 
> On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <axboe@kernel.dk> wrote:
> [snip]
>>>>> Sure, will post the code with bigger-cqe first.
>>>>
>>>> I can add the support, should be pretty trivial. And do the liburing
>>>> side as well, so we have a sane base.
>>>
>>>  I will post the big-cqe based work today. It works with fio.
>>>  It does not deal with liburing (which seems tricky), but hopefully it
>>> can help us move forward anyway .
>>
>> Let's compare then, since I just did the support too :-)
> 
> The major difference is generic support (rather than uring-cmd only) and
> not touching the regular completion path. So the plan is to use your patch
> for the next version with some bits added (e.g. overflow-handling and
> avoiding the extra CQE tail increment). Hope that sounds fine.

I'll sanitize my branch today or tomorrow, it has overflow and proper cq
ring management now, just hasn't been posted yet. So it should be
complete.

> We have things working on top of your current branch
> "io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
> xattr merge) and CQE infra is different (post cqe-caching in ctx) -
> things needed to be done a bit differently. But all this is now tested
> better with liburing support/util (plan is to post that too).

Just still grab the 16 bytes, we don't care about addr3 for passthrough.
Should be no changes required there.
Kanchan Joshi April 20, 2022, 3:14 p.m. UTC | #25
On Tue, Apr 19, 2022 at 12:19:01PM -0600, Jens Axboe wrote:
>On 4/19/22 11:31 AM, Kanchan Joshi wrote:
>> Hi Jens,
>> Few thoughts below toward the next version -
>>
>> On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>> [snip]
>>>>>> Sure, will post the code with bigger-cqe first.
>>>>>
>>>>> I can add the support, should be pretty trivial. And do the liburing
>>>>> side as well, so we have a sane base.
>>>>
>>>>  I will post the big-cqe based work today. It works with fio.
>>>>  It does not deal with liburing (which seems tricky), but hopefully it
>>>> can help us move forward anyway .
>>>
>>> Let's compare then, since I just did the support too :-)
>>
>> Major difference is generic support (rather than uring-cmd only) and
>> not touching the regular completion path. So plan is to use your patch
>> for the next version with some bits added (e.g. overflow-handling and
>> avoiding extra CQE tail increment). Hope that sounds fine.
>
>I'll sanitize my branch today or tomorrow, it has overflow and proper cq
>ring management now, just hasn't been posted yet. So it should be
>complete.

Ok, thanks.
Here is the revised patch that works for me (perhaps you don't need it, but
worth it if it saves any cycles). It would require one change in the
previous (big-sqe) patch:

 enum io_uring_cmd_flags {
        IO_URING_F_COMPLETE_DEFER       = 1,
        IO_URING_F_UNLOCKED             = 2,
+       IO_URING_F_SQE128               = 4,

Subject: [PATCH 2/7] io_uring: add support for 32-byte CQEs

Normal CQEs are 16-bytes in length, which is fine for all the commands
we support. However, in preparation for supporting passthrough IO,
provide an option for setting up a ring with 32-byte CQEs and add a
helper for completing them.

Rather than always use the slower locked path, wire up use of the
deferred completion path that normal CQEs can take. This reuses the
hash list node for the storage we need to hold the two 64-bit values
that must be passed back.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                   | 206 ++++++++++++++++++++++++++++----
 include/trace/events/io_uring.h |  18 ++-
 include/uapi/linux/io_uring.h   |  12 ++
 3 files changed, 212 insertions(+), 24 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5e2b7485f380..6c1a69ae74a4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -206,6 +206,7 @@ enum io_uring_cmd_flags {
        IO_URING_F_COMPLETE_DEFER       = 1,
        IO_URING_F_UNLOCKED             = 2,
        IO_URING_F_SQE128               = 4,
+       IO_URING_F_CQE32                = 8,
        /* int's last bit, sign checks are usually faster than a bit test */
        IO_URING_F_NONBLOCK             = INT_MIN,
 };
@@ -221,8 +222,8 @@ struct io_mapped_ubuf {
 struct io_ring_ctx;

 struct io_overflow_cqe {
-       struct io_uring_cqe cqe;
        struct list_head list;
+       struct io_uring_cqe cqe; /* this must be kept at end */
 };

 struct io_fixed_file {
@@ -954,7 +955,13 @@ struct io_kiocb {
        atomic_t                        poll_refs;
        struct io_task_work             io_task_work;
        /* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-       struct hlist_node               hash_node;
+       union {
+               struct hlist_node       hash_node;
+               struct {
+                       u64             extra1;
+                       u64             extra2;
+               };
+       };
        /* internal polling, see IORING_FEAT_FAST_POLL */
        struct async_poll               *apoll;
        /* opcode allocated if it needs to store data for async defer */
@@ -1902,6 +1909,40 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
        return __io_get_cqe(ctx);
 }

+static noinline struct io_uring_cqe *__io_get_cqe32(struct io_ring_ctx *ctx)
+{
+       struct io_rings *rings = ctx->rings;
+       unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
+       unsigned int free, queued, len;
+
+       /* userspace may cheat modifying the tail, be safe and do min */
+       queued = min(__io_cqring_events(ctx), ctx->cq_entries);
+       free = ctx->cq_entries - queued;
+       /* we need a contiguous range, limit based on the current array offset */
+       len = min(free, ctx->cq_entries - off);
+       if (!len)
+               return NULL;
+
+       ctx->cached_cq_tail++;
+       /* double increment for 32 CQEs */
+       ctx->cqe_cached = &rings->cqes[off << 1];
+       ctx->cqe_sentinel = ctx->cqe_cached + (len << 1);
+       return ctx->cqe_cached;
+}
+
+static inline struct io_uring_cqe *io_get_cqe32(struct io_ring_ctx *ctx)
+{
+       struct io_uring_cqe *cqe32;
+       if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
+               ctx->cached_cq_tail++;
+               cqe32 = ctx->cqe_cached;
+       } else
+               cqe32 = __io_get_cqe32(ctx);
+       /* double increment for 32b CQE*/
+       ctx->cqe_cached += 2;
+       return cqe32;
+}
+
 static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
        struct io_ev_fd *ev_fd;
@@ -1977,15 +2018,21 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
        posted = false;
        spin_lock(&ctx->completion_lock);
        while (!list_empty(&ctx->cq_overflow_list)) {
-               struct io_uring_cqe *cqe = io_get_cqe(ctx);
+               struct io_uring_cqe *cqe;
                struct io_overflow_cqe *ocqe;
+               /* copy more for big-cqe */
+               int cqeshift = ctx->flags & IORING_SETUP_CQE32 ? 1 : 0;

+               if (cqeshift)
+                       cqe = io_get_cqe32(ctx);
+               else
+                       cqe = io_get_cqe(ctx);
                if (!cqe && !force)
                        break;
                ocqe = list_first_entry(&ctx->cq_overflow_list,
                                        struct io_overflow_cqe, list);
                if (cqe)
-                       memcpy(cqe, &ocqe->cqe, sizeof(*cqe));
+                       memcpy(cqe, &ocqe->cqe, sizeof(*cqe) << cqeshift);
                else
                        io_account_cq_overflow(ctx);

@@ -2074,11 +2121,18 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 }

 static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
-                                    s32 res, u32 cflags)
+                                    s32 res, u32 cflags, u64 extra1,
+                                    u64 extra2)
 {
        struct io_overflow_cqe *ocqe;
+       int size = sizeof(*ocqe);
+       bool cqe32 = ctx->flags & IORING_SETUP_CQE32;

-       ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT);
+       /* allocate more for 32b CQE */
+       if (cqe32)
+               size += sizeof(struct io_uring_cqe);
+
+       ocqe = kmalloc(size, GFP_ATOMIC | __GFP_ACCOUNT);
        if (!ocqe) {
                /*
                 * If we're in ring overflow flush mode, or in task cancel mode,
@@ -2097,6 +2151,10 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
        ocqe->cqe.user_data = user_data;
        ocqe->cqe.res = res;
        ocqe->cqe.flags = cflags;
+       if (cqe32) {
+               ocqe->cqe.b[0].extra1 = extra1;
+               ocqe->cqe.b[0].extra2 = extra2;
+       }
        list_add_tail(&ocqe->list, &ctx->cq_overflow_list);
        return true;
 }
@@ -2118,7 +2176,35 @@ static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data,
                WRITE_ONCE(cqe->flags, cflags);
                return true;
        }
-       return io_cqring_event_overflow(ctx, user_data, res, cflags);
+       return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
+}
+
+static inline bool __io_fill_cqe32_req_filled(struct io_ring_ctx *ctx,
+                                           struct io_kiocb *req)
+{
+       struct io_uring_cqe *cqe;
+       u64 extra1 = req->extra1;
+       u64 extra2 = req->extra2;
+
+       trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
+                               req->cqe.res, req->cqe.flags, extra1,
+                               extra2);
+
+       /*
+        * If we can't get a cq entry, userspace overflowed the
+        * submission (by quite a lot). Increment the overflow count in
+        * the ring.
+        */
+       cqe = io_get_cqe32(ctx);
+       if (likely(cqe)) {
+               memcpy(cqe, &req->cqe, sizeof(*cqe));
+               cqe->b[0].extra1 = extra1;
+               cqe->b[0].extra2 = extra2;
+               return true;
+       }
+       return io_cqring_event_overflow(ctx, req->cqe.user_data,
+                                       req->cqe.res, req->cqe.flags, extra1,
+                                       extra2);
 }

 static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
@@ -2127,7 +2213,7 @@ static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
        struct io_uring_cqe *cqe;

        trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
-                               req->cqe.res, req->cqe.flags);
+                               req->cqe.res, req->cqe.flags, 0, 0);

        /*
         * If we can't get a cq entry, userspace overflowed the
@@ -2140,12 +2226,13 @@ static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
                return true;
        }
        return io_cqring_event_overflow(ctx, req->cqe.user_data,
-                                       req->cqe.res, req->cqe.flags);
+                                       req->cqe.res, req->cqe.flags, 0, 0);
 }

 static inline bool __io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags)
 {
-       trace_io_uring_complete(req->ctx, req, req->cqe.user_data, res, cflags);
+       trace_io_uring_complete(req->ctx, req, req->cqe.user_data, res, cflags,
+                       0, 0);
        return __io_fill_cqe(req->ctx, req->cqe.user_data, res, cflags);
 }

@@ -2159,22 +2246,52 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
                                     s32 res, u32 cflags)
 {
        ctx->cq_extra++;
-       trace_io_uring_complete(ctx, NULL, user_data, res, cflags);
+       trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
        return __io_fill_cqe(ctx, user_data, res, cflags);
 }

-static void __io_req_complete_post(struct io_kiocb *req, s32 res,
-                                  u32 cflags)
+static void __io_fill_cqe32_req(struct io_kiocb *req, s32 res, u32 cflags,
+                               u64 extra1, u64 extra2)
 {
        struct io_ring_ctx *ctx = req->ctx;
+       struct io_uring_cqe *cqe;
+
+       if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32)))
+               return;
+       if (req->flags & REQ_F_CQE_SKIP)
+               return;
+
+       trace_io_uring_complete(ctx, req, req->cqe.user_data, res, cflags,
+                       extra1, extra2);

-       if (!(req->flags & REQ_F_CQE_SKIP))
-               __io_fill_cqe_req(req, res, cflags);
+       /*
+        * If we can't get a cq entry, userspace overflowed the
+        * submission (by quite a lot). Increment the overflow count in
+        * the ring.
+        */
+       cqe = io_get_cqe32(ctx);
+       if (likely(cqe)) {
+               WRITE_ONCE(cqe->user_data, req->cqe.user_data);
+               WRITE_ONCE(cqe->res, res);
+               WRITE_ONCE(cqe->flags, cflags);
+               WRITE_ONCE(cqe->b[0].extra1, extra1);
+               WRITE_ONCE(cqe->b[0].extra2, extra2);
+               return;
+       }
+
+       io_cqring_event_overflow(ctx, req->cqe.user_data, res, cflags, extra1,
+                       extra2);
+}
+
+static void __io_req_complete_put(struct io_kiocb *req)
+{
        /*
         * If we're the last reference to this request, add to our locked
         * free_list cache.
         */
        if (req_ref_put_and_test(req)) {
+               struct io_ring_ctx *ctx = req->ctx;
+
                if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) {
                        if (req->flags & IO_DISARM_MASK)
                                io_disarm_next(req);
@@ -2197,6 +2314,33 @@ static void __io_req_complete_post(struct io_kiocb *req, s32 res,
        }
 }

+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
+                                  u32 cflags)
+{
+       if (!(req->flags & REQ_F_CQE_SKIP))
+               __io_fill_cqe_req(req, res, cflags);
+       __io_req_complete_put(req);
+}
+
+static void io_req_complete_post32(struct io_kiocb *req, s32 res,
+                                  u32 cflags, u64 extra1, u64 extra2)
+{
+       struct io_ring_ctx *ctx = req->ctx;
+       bool posted = false;
+
+       spin_lock(&ctx->completion_lock);
+
+       if (!(req->flags & REQ_F_CQE_SKIP)) {
+               __io_fill_cqe32_req(req, res, cflags, extra1, extra2);
+               io_commit_cqring(ctx);
+               posted = true;
+       }
+       __io_req_complete_put(req);
+       spin_unlock(&ctx->completion_lock);
+       if (posted)
+               io_cqring_ev_posted(ctx);
+}
+
 static void io_req_complete_post(struct io_kiocb *req, s32 res,
                                 u32 cflags)
 {
@@ -2226,6 +2370,19 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags,
                io_req_complete_post(req, res, cflags);
 }

+static inline void __io_req_complete32(struct io_kiocb *req,
+                                      unsigned issue_flags, s32 res,
+                                      u32 cflags, u64 extra1, u64 extra2)
+{
+       if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
+               io_req_complete_state(req, res, cflags);
+               req->extra1 = extra1;
+               req->extra2 = extra2;
+       } else {
+               io_req_complete_post32(req, res, cflags, extra1, extra2);
+       }
+}
+
 static inline void io_req_complete(struct io_kiocb *req, s32 res)
 {
        __io_req_complete(req, 0, res, 0);
@@ -2779,8 +2936,12 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
                        struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                    comp_list);

-                       if (!(req->flags & REQ_F_CQE_SKIP))
+                       if (req->flags & REQ_F_CQE_SKIP)
+                               continue;
+                       if (!(ctx->flags & IORING_SETUP_CQE32))
                                __io_fill_cqe_req_filled(ctx, req);
+                       else
+                               __io_fill_cqe32_req_filled(ctx, req);
                }

                io_commit_cqring(ctx);
@@ -9632,8 +9793,8 @@ static void *io_mem_alloc(size_t size)
        return (void *) __get_free_pages(gfp, get_order(size));
 }

-static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
-                               size_t *sq_offset)
+static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned sq_entries,
+                               unsigned cq_entries, size_t *sq_offset)
 {
        struct io_rings *rings;
        size_t off, sq_array_size;
@@ -9641,6 +9802,11 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
        off = struct_size(rings, cqes, cq_entries);
        if (off == SIZE_MAX)
                return SIZE_MAX;
+       if (ctx->flags & IORING_SETUP_CQE32) {
+               if ((off << 1) < off)
+                       return SIZE_MAX;
+               off <<= 1;
+       }

 #ifdef CONFIG_SMP
        off = ALIGN(off, SMP_CACHE_BYTES);
@@ -11297,7 +11463,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
        ctx->sq_entries = p->sq_entries;
        ctx->cq_entries = p->cq_entries;

-       size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset);
+       size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset);
        if (size == SIZE_MAX)
                return -EOVERFLOW;

@@ -11540,7 +11706,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
                        IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
                        IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
                        IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
-                       IORING_SETUP_SQE128))
+                       IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
                return -EINVAL;

        return  io_uring_create(entries, &p, params);
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 8477414d6d06..2eb4f4e47de4 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -318,13 +318,16 @@ TRACE_EVENT(io_uring_fail_link,
  * @user_data:         user data associated with the request
  * @res:               result of the request
  * @cflags:            completion flags
+ * @extra1:            extra 64-bit data for CQE32
+ * @extra2:            extra 64-bit data for CQE32
  *
  */
 TRACE_EVENT(io_uring_complete,

-       TP_PROTO(void *ctx, void *req, u64 user_data, int res, unsigned cflags),
+       TP_PROTO(void *ctx, void *req, u64 user_data, int res, unsigned cflags,
+                u64 extra1, u64 extra2),

-       TP_ARGS(ctx, req, user_data, res, cflags),
+       TP_ARGS(ctx, req, user_data, res, cflags, extra1, extra2),

        TP_STRUCT__entry (
                __field(  void *,       ctx             )
@@ -332,6 +335,8 @@ TRACE_EVENT(io_uring_complete,
                __field(  u64,          user_data       )
                __field(  int,          res             )
                __field(  unsigned,     cflags          )
+               __field(  u64,          extra1          )
+               __field(  u64,          extra2          )
        ),

        TP_fast_assign(
@@ -340,12 +345,17 @@ TRACE_EVENT(io_uring_complete,
                __entry->user_data      = user_data;
                __entry->res            = res;
                __entry->cflags         = cflags;
+               __entry->extra1         = extra1;
+               __entry->extra2         = extra2;
        ),

-       TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x",
+       TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
+                 "extra1 %llu extra2 %llu ",
                __entry->ctx, __entry->req,
                __entry->user_data,
-               __entry->res, __entry->cflags)
+               __entry->res, __entry->cflags,
+               (unsigned long long) __entry->extra1,
+               (unsigned long long) __entry->extra2)
 );

 /**
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 88a5c67d6666..1fe0ad3668d1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -111,6 +111,7 @@ enum {
 #define IORING_SETUP_R_DISABLED        (1U << 6)       /* start with ring disabled */
 #define IORING_SETUP_SUBMIT_ALL        (1U << 7)       /* continue submit on error */
 #define IORING_SETUP_SQE128    (1U << 8)       /* SQEs are 128b */
+#define IORING_SETUP_CQE32     (1U << 9)       /* CQEs are 32b */

 enum {
        IORING_OP_NOP,
@@ -200,6 +201,11 @@ enum {
 #define IORING_POLL_UPDATE_EVENTS      (1U << 1)
 #define IORING_POLL_UPDATE_USER_DATA   (1U << 2)

+struct io_uring_cqe_extra {
+       __u64   extra1;
+       __u64   extra2;
+};
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */
@@ -207,6 +213,12 @@ struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
+
+       /*
+        * If the ring is initialized with IORING_SETUP_CQE32, then this field
+        * contains 16-bytes of padding, doubling the size of the CQE.
+        */
+       struct io_uring_cqe_extra       b[0];
 };



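For completeness, a minimal userspace sketch of reaping one 32-byte CQE with the
layout added above. This is illustrative only: it assumes a kernel with this
series applied and a liburing build that simply forwards the setup flag;
IORING_SETUP_CQE32 and the 16 extra bytes following the base CQE come from the
uapi change in this patch, not from any released liburing API.

/* minimal sketch, not part of the series: reap a single 32-byte CQE */
#include <liburing.h>
#include <stdio.h>

#ifndef IORING_SETUP_CQE32
#define IORING_SETUP_CQE32	(1U << 9)	/* from the uapi change above */
#endif

int main(void)
{
	struct io_uring_params p = { .flags = IORING_SETUP_CQE32 };
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;

	/* a NOP is enough to get one completion back */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	io_uring_submit(&ring);

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		/* with CQE32, 16 extra bytes follow the 16-byte base CQE */
		const __u64 *extra = (const __u64 *)(cqe + 1);

		printf("res=%d extra1=%llu extra2=%llu\n", cqe->res,
		       (unsigned long long)extra[0],
		       (unsigned long long)extra[1]);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}

For a plain NOP both extra fields come back as zero; the point is only to show
where the additional 16 bytes sit relative to the base CQE.
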
>> We have things working on top of your current branch
>> "io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
>> xattr merge) and CQE infra is different (post cqe-caching in ctx) -
>> things needed to be done a bit differently. But all this is now tested
>> better with liburing support/util (plan is to post that too).
>
>Just still grab the 16 bytes, we don't care about addr3 for passthrough.
>Should be no changes required there.
I was thinking of uring-cmd in general, but even then it does not seem
to collide with xattr. Got your point.
The intent was to remove the 8b "result" field from the passthru command,
since the 32b CQE makes that field redundant, and we are adding a new
opcode in nvme anyway. Maybe we should still reduce the passthru command
to 64b (rather than 72); not very sure.
Kanchan Joshi April 20, 2022, 3:28 p.m. UTC | #26
> >Just still grab the 16 bytes, we don't care about addr3 for passthrough.
> >Should be no changes required there.
> I was thinking of uring-cmd in general, but even then it does not seem
> to collide with xattr. Got your point.
> The intent was to remove the 8b "result" field from the passthru command,
> since the 32b CQE makes that field redundant, and we are adding a new
> opcode in nvme anyway. Maybe we should still reduce the passthru command
> to 64b (rather than 72); not very sure.
Correction to the above: reduce the passthru command to 72b (rather than 80b).
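
To make the size arithmetic concrete, a couple of compile-time checks against
the current uapi struct. This is a hedged sketch assuming C11 static_assert;
the 72b variant itself is only being discussed here, nothing of the sort is
posted.

/* sketch only: size/offset checks for the current nvme_passthru_cmd64 */
#include <linux/nvme_ioctl.h>
#include <stddef.h>
#include <assert.h>

/* 80 bytes today, with 'result' as the trailing 8-byte member ... */
static_assert(sizeof(struct nvme_passthru_cmd64) == 80,
	      "current passthru command, result included");
static_assert(offsetof(struct nvme_passthru_cmd64, result) == 72,
	      "everything before 'result' fits in 72 bytes");
/*
 * ... so a variant without 'result' would be 72 bytes, leaving 8 bytes
 * spare in the 80-byte command area of a 128-byte SQE.
 */

int main(void)
{
	return 0;
}
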

Patch

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 701feaecabbe..ddb7e5864be6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -65,6 +65,14 @@  static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
 	}
 	kfree(pdu->meta);
 
+	if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
+		struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd->cmd;
+		u64 result = le64_to_cpu(nvme_req(req)->result.u64);
+
+		if (put_user(result, &ptcmd64->result))
+			status = -EFAULT;
+	}
+
 	io_uring_cmd_done(ioucmd, status);
 }
 
@@ -143,6 +151,13 @@  static inline bool nvme_is_fixedb_passthru(struct io_uring_cmd *ioucmd)
 	return ((ioucmd) && (ioucmd->flags & IO_URING_F_UCMD_FIXEDBUFS));
 }
 
+static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)
+{
+	return ((ioucmd->flags & IO_URING_F_UCMD_INDIRECT) ||
+			(cmd->common.opcode == nvme_cmd_write ||
+			 cmd->common.opcode == nvme_cmd_read));
+}
+
 static int nvme_submit_user_cmd(struct request_queue *q,
 		struct nvme_command *cmd, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
@@ -193,8 +208,7 @@  static int nvme_submit_user_cmd(struct request_queue *q,
 		}
 	}
 	if (ioucmd) { /* async dispatch */
-		if (cmd->common.opcode == nvme_cmd_write ||
-				cmd->common.opcode == nvme_cmd_read) {
+		if (is_inline_rw(ioucmd, cmd)) {
 			if (bio && is_polling_enabled(ioucmd, req)) {
 				ioucmd->bio = bio;
 				bio->bi_opf |= REQ_POLLED;
@@ -204,7 +218,7 @@  static int nvme_submit_user_cmd(struct request_queue *q,
 			blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
 			return 0;
 		} else {
-			/* support only read and write for now. */
+			/* support only read and write for inline */
 			ret = -EINVAL;
 			goto out_meta;
 		}
@@ -372,7 +386,14 @@  static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	} else {
 		if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
 			return -EINVAL;
-		cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
+		if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
+			ucmd = (struct nvme_passthru_cmd64 __user *)ioucmd->cmd;
+			if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+				return -EFAULT;
+			cptr = &cmd;
+		} else {
+			cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
+		}
 	}
 	if (cptr->flags & NVME_HIPRI)
 		rq_flags |= REQ_POLLED;
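
For illustration, a hedged userspace-side sketch of preparing such an indirect
command. Only the command setup is shown; how its pointer is carried in the SQE
comes from earlier patches in the series and is omitted here, and the
opcode/cdw layout is plain NVMe read, nothing added by this patch. With the
indirect path the struct must stay allocated until completion, since the driver
writes the NVMe result back into it via put_user().

/* sketch only: build a passthru read command for indirect submission */
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdlib.h>

static struct nvme_passthru_cmd64 *prep_indirect_read(__u32 nsid, void *buf,
						       __u32 len, __u64 slba,
						       __u16 nlb)
{
	/* heap allocation: must outlive the SQE, unlike the inline case */
	struct nvme_passthru_cmd64 *cmd = calloc(1, sizeof(*cmd));

	if (!cmd)
		return NULL;
	cmd->opcode = 0x02;			/* nvme_cmd_read */
	cmd->nsid = nsid;
	cmd->addr = (__u64)(uintptr_t)buf;
	cmd->data_len = len;
	cmd->cdw10 = (__u32)slba;		/* starting LBA, lower 32 bits */
	cmd->cdw11 = (__u32)(slba >> 32);	/* starting LBA, upper 32 bits */
	cmd->cdw12 = nlb;			/* number of LBAs, 0's based */
	/* cmd->result is filled in by the driver on completion */
	return cmd;
}

The returned pointer (not the 80-byte struct itself) would then be placed in
the SQE's command area by whatever helper the earlier patches provide; on
completion the caller reads cmd->result and frees the allocation.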