diff mbox series

[1/2] iouring: one capable call per iouring instance

Message ID 20231204175342.3418422-1-kbusch@meta.com (mailing list archive)
State New
Headers show
Series [1/2] iouring: one capable call per iouring instance | expand

Commit Message

Keith Busch Dec. 4, 2023, 5:53 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

The uring_cmd operation is often used for privileged actions, so drivers
subscribing to this interface check capable() for each command. The
capable() function is not fast path friendly for many kernel configs,
and this can really harm performance. Stash the capable sys admin
attribute in the io_uring context and set a new issue_flag for the
uring_cmd interface.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/io_uring_types.h | 4 ++++
 io_uring/io_uring.c            | 1 +
 io_uring/uring_cmd.c           | 2 ++
 3 files changed, 7 insertions(+)

Comments

Jens Axboe Dec. 4, 2023, 6:05 p.m. UTC | #1
On 12/4/23 10:53 AM, Keith Busch wrote:
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>  		ctx->syscall_iopoll = 1;
>  
>  	ctx->compat = in_compat_syscall();
> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>  	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>  		ctx->user = get_uid(current_user());

Hmm, what happens if the app starts as eg root for initialization
purposes and drops caps after? That would have previously have caused
passthrough to fail, but now it will work. Perhaps this is fine, after
all this isn't unusual for eg opening device or doing other init special
work?

In any case, that should definitely be explicitly mentioned in the
commit message for a change like that.
Jens Axboe Dec. 4, 2023, 6:15 p.m. UTC | #2
On 12/4/23 10:53 AM, Keith Busch wrote:
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>  		ctx->syscall_iopoll = 1;
>  
>  	ctx->compat = in_compat_syscall();
> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>  	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>  		ctx->user = get_uid(current_user());
>  
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8a38b9f75d841..764f0e004aa00 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>  		issue_flags |= IO_URING_F_CQE32;
>  	if (ctx->compat)
>  		issue_flags |= IO_URING_F_COMPAT;
> +	if (ctx->sys_admin)
> +		issue_flags |= IO_URING_F_SYS_ADMIN;
>  	if (ctx->flags & IORING_SETUP_IOPOLL) {
>  		if (!file->f_op->uring_cmd_iopoll)
>  			return -EOPNOTSUPP;

Since we know have two flags that would be cached from init time, rather
than add a second branch for this, why not have:

io_uring_create()
{
	[...]
	if (in_compat_syscall())
		ctx->issue_flags |= IO_URING_F_COMPAT;
	if (capable(CAP_SYS_ADMIN))
		ctx->issue_flags |= IO_URING_F_SYS_ADMIN;
	[...]
}

and get rid of ->compat and ->sys_admin, and then have:

io_uring_cmd()
{
	issue_flags |= ctx->issue_flags;
}

Then we can also drop checking for IORING_SETUP_SQE128/CQE32 as well,
dropping two more fast path branches.
Jeff Moyer Dec. 4, 2023, 6:40 p.m. UTC | #3
I added a CC: linux-security-module@vger

Hi, Keith,

Keith Busch <kbusch@meta.com> writes:

> From: Keith Busch <kbusch@kernel.org>
>
> The uring_cmd operation is often used for privileged actions, so drivers
> subscribing to this interface check capable() for each command. The
> capable() function is not fast path friendly for many kernel configs,
> and this can really harm performance. Stash the capable sys admin
> attribute in the io_uring context and set a new issue_flag for the
> uring_cmd interface.

I have a few questions.  What privileged actions are performance
sensitive?  I would hope that anything requiring privileges would not be
in a fast path (but clearly that's not the case).  What performance
benefits did you measure with this patch set in place (and on what
workloads)?  What happens when a ring fd is passed to another process?

Finally, as Jens mentioned, I would expect dropping priviliges to, you
know, drop privileges.  I don't think a commit message is going to be
enough documentation for a change like this.

Cheers,
Jeff

>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  include/linux/io_uring_types.h | 4 ++++
>  io_uring/io_uring.c            | 1 +
>  io_uring/uring_cmd.c           | 2 ++
>  3 files changed, 7 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index bebab36abce89..d64d6916753f0 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -36,6 +36,9 @@ enum io_uring_cmd_flags {
>  	/* set when uring wants to cancel a previously issued command */
>  	IO_URING_F_CANCEL		= (1 << 11),
>  	IO_URING_F_COMPAT		= (1 << 12),
> +
> +	/* ring validated as CAP_SYS_ADMIN capable */
> +	IO_URING_F_SYS_ADMIN		= (1 << 13),
>  };
>  
>  struct io_wq_work_node {
> @@ -240,6 +243,7 @@ struct io_ring_ctx {
>  		unsigned int		poll_activated: 1;
>  		unsigned int		drain_disabled: 1;
>  		unsigned int		compat: 1;
> +		unsigned int		sys_admin: 1;
>  
>  		struct task_struct	*submitter_task;
>  		struct io_rings		*rings;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>  		ctx->syscall_iopoll = 1;
>  
>  	ctx->compat = in_compat_syscall();
> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>  	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>  		ctx->user = get_uid(current_user());
>  
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8a38b9f75d841..764f0e004aa00 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>  		issue_flags |= IO_URING_F_CQE32;
>  	if (ctx->compat)
>  		issue_flags |= IO_URING_F_COMPAT;
> +	if (ctx->sys_admin)
> +		issue_flags |= IO_URING_F_SYS_ADMIN;
>  	if (ctx->flags & IORING_SETUP_IOPOLL) {
>  		if (!file->f_op->uring_cmd_iopoll)
>  			return -EOPNOTSUPP;
Pavel Begunkov Dec. 4, 2023, 6:45 p.m. UTC | #4
On 12/4/23 18:05, Jens Axboe wrote:
> On 12/4/23 10:53 AM, Keith Busch wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 1d254f2c997de..4aa10b64f539e 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>>   		ctx->syscall_iopoll = 1;
>>   
>>   	ctx->compat = in_compat_syscall();
>> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>>   	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>>   		ctx->user = get_uid(current_user());
> 
> Hmm, what happens if the app starts as eg root for initialization
> purposes and drops caps after? That would have previously have caused
> passthrough to fail, but now it will work. Perhaps this is fine, after
> all this isn't unusual for eg opening device or doing other init special
> work?

The side effects would be quite a surprise when you initialize the ring
from a privileged process and then pass it to a less capable one. Ring
sharing would also be affected. Privilege downgrade also sounds like
a valid concern. The first two will be solved if restricted to
IORING_SETUP_DEFER_TASKRUN rings and

io_is_capable() {
	return ctx->sys_admin || capable();
}

And it still doesn't seem great bypassing it, when the question is
rather why it's expensive? I've seen before in the wild a fat BPF
program running on every call, is that what happens?

> In any case, that should definitely be explicitly mentioned in the
> commit message for a change like that.
>
Keith Busch Dec. 4, 2023, 6:57 p.m. UTC | #5
On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> I added a CC: linux-security-module@vger
> Keith Busch <kbusch@meta.com> writes:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > The uring_cmd operation is often used for privileged actions, so drivers
> > subscribing to this interface check capable() for each command. The
> > capable() function is not fast path friendly for many kernel configs,
> > and this can really harm performance. Stash the capable sys admin
> > attribute in the io_uring context and set a new issue_flag for the
> > uring_cmd interface.
> 
> I have a few questions.  What privileged actions are performance
> sensitive? I would hope that anything requiring privileges would not
> be in a fast path (but clearly that's not the case).

Protocol specifics that don't have a generic equivalent. For example,
NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
but you use it like normal reads and writes so has to be as fast as the
generic interfaces.

The same interfaces can be abused, so access needs to be restricted.

> What performance benefits did you measure with this patch set in place
> (and on what workloads)? 

Quite a bit. Here's a random read high-depth workload on a single
device test:

Before: 970k IOPs
After: 1750k IOPs

> What happens when a ring fd is passed to another process?
> 
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges.  I don't think a commit message is going to be
> enough documentation for a change like this.

Yeah, point taken.
Jens Axboe Dec. 4, 2023, 7:01 p.m. UTC | #6
On 12/4/23 11:40 AM, Jeff Moyer wrote:
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges.  I don't think a commit message is going to be
> enough documentation for a change like this.

Only thing I can think of here is to cache the state in
task->io_uring->something, and then ensure those are invalidated
whenever caps change. It's one of those cases where that's probably only
done once, but we do need to be able to catch it. Not convinced that
caching it at ring creation is sane enough, even if it is kind of like
opening devices before privs are dropped where you could not otherwise
re-open them later on.
Jeff Moyer Dec. 4, 2023, 7:22 p.m. UTC | #7
Jens Axboe <axboe@kernel.dk> writes:

> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>> know, drop privileges.  I don't think a commit message is going to be
>> enough documentation for a change like this.
>
> Only thing I can think of here is to cache the state in
> task->io_uring->something, and then ensure those are invalidated
> whenever caps change.

I looked through the capable() code, and there is no way that I could
find to be notified of changes.

> It's one of those cases where that's probably only done once, but we
> do need to be able to catch it. Not convinced that caching it at ring
> creation is sane enough, even if it is kind of like opening devices
> before privs are dropped where you could not otherwise re-open them
> later on.

Agreed.

-Jeff
Jens Axboe Dec. 4, 2023, 7:33 p.m. UTC | #8
On 12/4/23 12:22 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>>> know, drop privileges.  I don't think a commit message is going to be
>>> enough documentation for a change like this.
>>
>> Only thing I can think of here is to cache the state in
>> task->io_uring->something, and then ensure those are invalidated
>> whenever caps change.
> 
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.

Right, what I meant is that you'd need to add an io_uring_cap_change()
or something that gets called, and that iterates the rings associated
with that task and clears the flag. Ugly...
Keith Busch Dec. 4, 2023, 7:37 p.m. UTC | #9
On Mon, Dec 04, 2023 at 02:22:22PM -0500, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
> > On 12/4/23 11:40 AM, Jeff Moyer wrote:
> >> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> >> know, drop privileges.  I don't think a commit message is going to be
> >> enough documentation for a change like this.
> >
> > Only thing I can think of here is to cache the state in
> > task->io_uring->something, and then ensure those are invalidated
> > whenever caps change.
> 
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.

Something like LSM_HOOK_INIT on 'capset', but needs to work without
CONFIG_SECURITY.
Ming Lei Dec. 5, 2023, 4:14 a.m. UTC | #10
On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > I added a CC: linux-security-module@vger
> > Keith Busch <kbusch@meta.com> writes:
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > The uring_cmd operation is often used for privileged actions, so drivers
> > > subscribing to this interface check capable() for each command. The
> > > capable() function is not fast path friendly for many kernel configs,
> > > and this can really harm performance. Stash the capable sys admin
> > > attribute in the io_uring context and set a new issue_flag for the
> > > uring_cmd interface.
> > 
> > I have a few questions.  What privileged actions are performance
> > sensitive? I would hope that anything requiring privileges would not
> > be in a fast path (but clearly that's not the case).
> 
> Protocol specifics that don't have a generic equivalent. For example,
> NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> but you use it like normal reads and writes so has to be as fast as the
> generic interfaces.

But normal read/write pt command doesn't require ADMIN any more since 
commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?


Thanks, 
Ming
Keith Busch Dec. 5, 2023, 4:31 a.m. UTC | #11
On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > I added a CC: linux-security-module@vger
> > > Keith Busch <kbusch@meta.com> writes:
> > > > From: Keith Busch <kbusch@kernel.org>
> > > >
> > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > subscribing to this interface check capable() for each command. The
> > > > capable() function is not fast path friendly for many kernel configs,
> > > > and this can really harm performance. Stash the capable sys admin
> > > > attribute in the io_uring context and set a new issue_flag for the
> > > > uring_cmd interface.
> > > 
> > > I have a few questions.  What privileged actions are performance
> > > sensitive? I would hope that anything requiring privileges would not
> > > be in a fast path (but clearly that's not the case).
> > 
> > Protocol specifics that don't have a generic equivalent. For example,
> > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > but you use it like normal reads and writes so has to be as fast as the
> > generic interfaces.
> 
> But normal read/write pt command doesn't require ADMIN any more since 
> commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?

Good question. The "capable" check had always been first so even with
the relaxed permissions, it was still paying the price. I have changed
that order in commit staged here (not yet upstream):

  http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622

Note that only prevents the costly capable() check if the inexpensive
checks could make a determination. That's still not solving the problem
long term since we aim for forward compatibility where we have no idea
which opcodes, admin identifications, or vendor specifics could be
deemed "safe" for non-root users in the future, so those conditions
would always fall back to the more expensive check that this patch was
trying to mitigate for admin processes.
Ming Lei Dec. 5, 2023, 5:25 a.m. UTC | #12
On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > > I added a CC: linux-security-module@vger
> > > > Keith Busch <kbusch@meta.com> writes:
> > > > > From: Keith Busch <kbusch@kernel.org>
> > > > >
> > > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > > subscribing to this interface check capable() for each command. The
> > > > > capable() function is not fast path friendly for many kernel configs,
> > > > > and this can really harm performance. Stash the capable sys admin
> > > > > attribute in the io_uring context and set a new issue_flag for the
> > > > > uring_cmd interface.
> > > > 
> > > > I have a few questions.  What privileged actions are performance
> > > > sensitive? I would hope that anything requiring privileges would not
> > > > be in a fast path (but clearly that's not the case).
> > > 
> > > Protocol specifics that don't have a generic equivalent. For example,
> > > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > > but you use it like normal reads and writes so has to be as fast as the
> > > generic interfaces.
> > 
> > But normal read/write pt command doesn't require ADMIN any more since 
> > commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> > why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?
> 
> Good question. The "capable" check had always been first so even with
> the relaxed permissions, it was still paying the price. I have changed
> that order in commit staged here (not yet upstream):
> 
>   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622

With this change, I guess you shouldn't see the following big gap, right?

> Before: 970k IOPs
> After: 1750k IOPs

> 
> Note that only prevents the costly capable() check if the inexpensive
> checks could make a determination. That's still not solving the problem
> long term since we aim for forward compatibility where we have no idea
> which opcodes, admin identifications, or vendor specifics could be
> deemed "safe" for non-root users in the future, so those conditions
> would always fall back to the more expensive check that this patch was
> trying to mitigate for admin processes.

Not sure I get the idea, it is related with nvme's permission model for
user pt command, and:

1) it should be always checked in entry of nvme user pt command

2) only the following two types of commands require ADMIN, per commit
855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")

    - any admin-cmd is not allowed
    - vendor-specific and fabric commmand are not allowed

Can you provide more details why the expensive check can't be avoided for
fast read/write user IO commands?

Thanks, 
Ming
Keith Busch Dec. 5, 2023, 3:45 p.m. UTC | #13
On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > Good question. The "capable" check had always been first so even with
> > the relaxed permissions, it was still paying the price. I have changed
> > that order in commit staged here (not yet upstream):
> > 
> >   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
> 
> With this change, I guess you shouldn't see the following big gap, right?

Correct.
 
> > Before: 970k IOPs
> > After: 1750k IOPs
 
> > Note that only prevents the costly capable() check if the inexpensive
> > checks could make a determination. That's still not solving the problem
> > long term since we aim for forward compatibility where we have no idea
> > which opcodes, admin identifications, or vendor specifics could be
> > deemed "safe" for non-root users in the future, so those conditions
> > would always fall back to the more expensive check that this patch was
> > trying to mitigate for admin processes.
> 
> Not sure I get the idea, it is related with nvme's permission model for
> user pt command, and:
> 
> 1) it should be always checked in entry of nvme user pt command
> 
> 2) only the following two types of commands require ADMIN, per commit
> 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
> 
>     - any admin-cmd is not allowed
>     - vendor-specific and fabric commmand are not allowed
> 
> Can you provide more details why the expensive check can't be avoided for
> fast read/write user IO commands?

It's not necessarily about the read/write passthrough commands. It's for
commands we don't know about today. Do we want to revisit this problem
every time spec provides another operation? Are vendor unique solutions
not allowed to get high IOPs access?

Secondly, some people have rediscovered you can abuse this interface to
corrupt kernel memory, so there are considerations to restricting this
to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
have to go that route.

Lastly (and least important), there are a lot of checks happening in the
"quick" path that I am trying to replace with a single check. While each
invidividual check isn't too bad, they start to add up.
Kanchan Joshi Dec. 5, 2023, 4:21 p.m. UTC | #14
On 12/4/2023 11:35 PM, Jens Axboe wrote:
> On 12/4/23 10:53 AM, Keith Busch wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 1d254f2c997de..4aa10b64f539e 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>>   		ctx->syscall_iopoll = 1;
>>   
>>   	ctx->compat = in_compat_syscall();
>> +	ctx->sys_admin = capable(CAP_SYS_ADMIN);
>>   	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>>   		ctx->user = get_uid(current_user());
> Hmm, what happens if the app starts as eg root for initialization
> purposes and drops caps after? That would have previously have caused
> passthrough to fail, but now it will work.


Does it sound any better if this 'super ring' type of ability is asked 
explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE.
It does not change the old behavior. It also implies that capable user 
knows what it asked for, so no need to keep things in sync if the 
process drops caps after ring setup is done.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4aa10b64f539..589e614144b6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3981,6 +3981,8 @@ static __cold int io_uring_create(unsigned 
entries, struct io_uring_params *p,

         ctx->compat = in_compat_syscall();
+       if (ctx->flags & IORING_SETUP_CAPABLE_ONCE && 
capable(CAP_SYS_ADMIN))
+               ctx->sys_admin = 1;
         if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
                 ctx->user = get_uid(current_user());
Ming Lei Dec. 6, 2023, 3:08 a.m. UTC | #15
On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > > Good question. The "capable" check had always been first so even with
> > > the relaxed permissions, it was still paying the price. I have changed
> > > that order in commit staged here (not yet upstream):
> > > 
> > >   http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
> > 
> > With this change, I guess you shouldn't see the following big gap, right?
> 
> Correct.
>  
> > > Before: 970k IOPs
> > > After: 1750k IOPs
>  
> > > Note that only prevents the costly capable() check if the inexpensive
> > > checks could make a determination. That's still not solving the problem
> > > long term since we aim for forward compatibility where we have no idea
> > > which opcodes, admin identifications, or vendor specifics could be
> > > deemed "safe" for non-root users in the future, so those conditions
> > > would always fall back to the more expensive check that this patch was
> > > trying to mitigate for admin processes.
> > 
> > Not sure I get the idea, it is related with nvme's permission model for
> > user pt command, and:
> > 
> > 1) it should be always checked in entry of nvme user pt command
> > 
> > 2) only the following two types of commands require ADMIN, per commit
> > 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
> > 
> >     - any admin-cmd is not allowed
> >     - vendor-specific and fabric commmand are not allowed
> > 
> > Can you provide more details why the expensive check can't be avoided for
> > fast read/write user IO commands?
> 
> It's not necessarily about the read/write passthrough commands. It's for
> commands we don't know about today. Do we want to revisit this problem
> every time spec provides another operation? Are vendor unique solutions
> not allowed to get high IOPs access?

Except for read/write, what other commands are performance sensitive?

> 
> Secondly, some people have rediscovered you can abuse this interface to
> corrupt kernel memory, so there are considerations to restricting this

Just wondering why ADMIN won't corrupt kernel memory, and only normal
user can, looks it is kernel bug instead of permission related issue.

> to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> have to go that route.

If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
optimize it in task_struct?


Thanks, 
Ming
Keith Busch Dec. 6, 2023, 3:31 p.m. UTC | #16
On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> > 
> > It's not necessarily about the read/write passthrough commands. It's for
> > commands we don't know about today. Do we want to revisit this problem
> > every time spec provides another operation? Are vendor unique solutions
> > not allowed to get high IOPs access?
> 
> Except for read/write, what other commands are performance sensitive?

It varies by command set, but this question is irrelevant. I'm not
interested in gatekeeping the fast path.
 
> > Secondly, some people have rediscovered you can abuse this interface to
> > corrupt kernel memory, so there are considerations to restricting this
> 
> Just wondering why ADMIN won't corrupt kernel memory, and only normal
> user can, looks it is kernel bug instead of permission related issue.

Admin can corrupt memory as easily as a normal user through this
interface. We just don't want such capabilities to be available to
regular users.

And it's a user bug: user told the kernel to map buffer of size X, but
the device transfers size Y into it. Kernel can't do anything about that
(other than remove the interface, but such an action will break many
existing users) because we fundamentally do not know the true transfer
size of a random command. Many NVMe commands don't explicitly encode
transfer lengths, so disagreement between host and device on implicit
lengths risk corruption. It's a protocol "feature".

> > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > have to go that route.
> 
> If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> optimize it in task_struct?

That's an interesting point to look into. I was hoping to not touch such
a common struct, but I'm open to all options.
Keith Busch Dec. 6, 2023, 9:09 p.m. UTC | #17
On Tue, Dec 05, 2023 at 09:51:13PM +0530, Kanchan Joshi wrote:
> Does it sound any better if this 'super ring' type of ability is asked 
> explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE.

Just my opinion: that option sounds good!
Ming Lei Dec. 7, 2023, 1:23 a.m. UTC | #18
On Wed, Dec 06, 2023 at 08:31:54AM -0700, Keith Busch wrote:
> On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> > > 
> > > It's not necessarily about the read/write passthrough commands. It's for
> > > commands we don't know about today. Do we want to revisit this problem
> > > every time spec provides another operation? Are vendor unique solutions
> > > not allowed to get high IOPs access?
> > 
> > Except for read/write, what other commands are performance sensitive?
> 
> It varies by command set, but this question is irrelevant. I'm not
> interested in gatekeeping the fast path.

IMO, it doesn't make sense to run such optimization for commands which aren't
performance sensitive.

>  
> > > Secondly, some people have rediscovered you can abuse this interface to
> > > corrupt kernel memory, so there are considerations to restricting this
> > 
> > Just wondering why ADMIN won't corrupt kernel memory, and only normal
> > user can, looks it is kernel bug instead of permission related issue.
> 
> Admin can corrupt memory as easily as a normal user through this
> interface. We just don't want such capabilities to be available to
> regular users.
> 
> And it's a user bug: user told the kernel to map buffer of size X, but
> the device transfers size Y into it. Kernel can't do anything about that
> (other than remove the interface, but such an action will break many
> existing users) because we fundamentally do not know the true transfer
> size of a random command. Many NVMe commands don't explicitly encode
> transfer lengths, so disagreement between host and device on implicit
> lengths risk corruption. It's a protocol "feature".

Got it, thanks for the explanation, and looks one big defect of
NVMe protocol or the device implementation.

> 
> > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > > have to go that route.
> > 
> > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> > optimize it in task_struct?
> 
> That's an interesting point to look into. I was hoping to not touch such
> a common struct, but I'm open to all options.
 
capability is per-thread, and it is updated in current process/pthread, so
the correct place to cache this info is 'task_struct'.


Thanks,
Ming
Christoph Hellwig Dec. 7, 2023, 5:48 p.m. UTC | #19
On Thu, Dec 07, 2023 at 09:23:06AM +0800, Ming Lei wrote:
> Got it, thanks for the explanation, and looks one big defect of
> NVMe protocol or the device implementation.

It is.  And NVMe has plenty more problems like that and people keep
adding this kind of crap even today :(
diff mbox series

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index bebab36abce89..d64d6916753f0 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -36,6 +36,9 @@  enum io_uring_cmd_flags {
 	/* set when uring wants to cancel a previously issued command */
 	IO_URING_F_CANCEL		= (1 << 11),
 	IO_URING_F_COMPAT		= (1 << 12),
+
+	/* ring validated as CAP_SYS_ADMIN capable */
+	IO_URING_F_SYS_ADMIN		= (1 << 13),
 };
 
 struct io_wq_work_node {
@@ -240,6 +243,7 @@  struct io_ring_ctx {
 		unsigned int		poll_activated: 1;
 		unsigned int		drain_disabled: 1;
 		unsigned int		compat: 1;
+		unsigned int		sys_admin: 1;
 
 		struct task_struct	*submitter_task;
 		struct io_rings		*rings;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1d254f2c997de..4aa10b64f539e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3980,6 +3980,7 @@  static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 		ctx->syscall_iopoll = 1;
 
 	ctx->compat = in_compat_syscall();
+	ctx->sys_admin = capable(CAP_SYS_ADMIN);
 	if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
 		ctx->user = get_uid(current_user());
 
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8a38b9f75d841..764f0e004aa00 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -164,6 +164,8 @@  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 		issue_flags |= IO_URING_F_CQE32;
 	if (ctx->compat)
 		issue_flags |= IO_URING_F_COMPAT;
+	if (ctx->sys_admin)
+		issue_flags |= IO_URING_F_SYS_ADMIN;
 	if (ctx->flags & IORING_SETUP_IOPOLL) {
 		if (!file->f_op->uring_cmd_iopoll)
 			return -EOPNOTSUPP;