Message ID | 20231204175342.3418422-1-kbusch@meta.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/2] iouring: one capable call per iouring instance | expand |
On 12/4/23 10:53 AM, Keith Busch wrote: > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 1d254f2c997de..4aa10b64f539e 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, > ctx->syscall_iopoll = 1; > > ctx->compat = in_compat_syscall(); > + ctx->sys_admin = capable(CAP_SYS_ADMIN); > if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) > ctx->user = get_uid(current_user()); Hmm, what happens if the app starts as eg root for initialization purposes and drops caps after? That would have previously have caused passthrough to fail, but now it will work. Perhaps this is fine, after all this isn't unusual for eg opening device or doing other init special work? In any case, that should definitely be explicitly mentioned in the commit message for a change like that.
On 12/4/23 10:53 AM, Keith Busch wrote: > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 1d254f2c997de..4aa10b64f539e 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, > ctx->syscall_iopoll = 1; > > ctx->compat = in_compat_syscall(); > + ctx->sys_admin = capable(CAP_SYS_ADMIN); > if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) > ctx->user = get_uid(current_user()); > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 8a38b9f75d841..764f0e004aa00 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) > issue_flags |= IO_URING_F_CQE32; > if (ctx->compat) > issue_flags |= IO_URING_F_COMPAT; > + if (ctx->sys_admin) > + issue_flags |= IO_URING_F_SYS_ADMIN; > if (ctx->flags & IORING_SETUP_IOPOLL) { > if (!file->f_op->uring_cmd_iopoll) > return -EOPNOTSUPP; Since we know have two flags that would be cached from init time, rather than add a second branch for this, why not have: io_uring_create() { [...] if (in_compat_syscall()) ctx->issue_flags |= IO_URING_F_COMPAT; if (capable(CAP_SYS_ADMIN)) ctx->issue_flags |= IO_URING_F_SYS_ADMIN; [...] } and get rid of ->compat and ->sys_admin, and then have: io_uring_cmd() { issue_flags |= ctx->issue_flags; } Then we can also drop checking for IORING_SETUP_SQE128/CQE32 as well, dropping two more fast path branches.
I added a CC: linux-security-module@vger Hi, Keith, Keith Busch <kbusch@meta.com> writes: > From: Keith Busch <kbusch@kernel.org> > > The uring_cmd operation is often used for privileged actions, so drivers > subscribing to this interface check capable() for each command. The > capable() function is not fast path friendly for many kernel configs, > and this can really harm performance. Stash the capable sys admin > attribute in the io_uring context and set a new issue_flag for the > uring_cmd interface. I have a few questions. What privileged actions are performance sensitive? I would hope that anything requiring privileges would not be in a fast path (but clearly that's not the case). What performance benefits did you measure with this patch set in place (and on what workloads)? What happens when a ring fd is passed to another process? Finally, as Jens mentioned, I would expect dropping priviliges to, you know, drop privileges. I don't think a commit message is going to be enough documentation for a change like this. Cheers, Jeff > > Signed-off-by: Keith Busch <kbusch@kernel.org> > --- > include/linux/io_uring_types.h | 4 ++++ > io_uring/io_uring.c | 1 + > io_uring/uring_cmd.c | 2 ++ > 3 files changed, 7 insertions(+) > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h > index bebab36abce89..d64d6916753f0 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -36,6 +36,9 @@ enum io_uring_cmd_flags { > /* set when uring wants to cancel a previously issued command */ > IO_URING_F_CANCEL = (1 << 11), > IO_URING_F_COMPAT = (1 << 12), > + > + /* ring validated as CAP_SYS_ADMIN capable */ > + IO_URING_F_SYS_ADMIN = (1 << 13), > }; > > struct io_wq_work_node { > @@ -240,6 +243,7 @@ struct io_ring_ctx { > unsigned int poll_activated: 1; > unsigned int drain_disabled: 1; > unsigned int compat: 1; > + unsigned int sys_admin: 1; > > struct task_struct *submitter_task; > struct io_rings *rings; > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 1d254f2c997de..4aa10b64f539e 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, > ctx->syscall_iopoll = 1; > > ctx->compat = in_compat_syscall(); > + ctx->sys_admin = capable(CAP_SYS_ADMIN); > if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) > ctx->user = get_uid(current_user()); > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 8a38b9f75d841..764f0e004aa00 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) > issue_flags |= IO_URING_F_CQE32; > if (ctx->compat) > issue_flags |= IO_URING_F_COMPAT; > + if (ctx->sys_admin) > + issue_flags |= IO_URING_F_SYS_ADMIN; > if (ctx->flags & IORING_SETUP_IOPOLL) { > if (!file->f_op->uring_cmd_iopoll) > return -EOPNOTSUPP;
On 12/4/23 18:05, Jens Axboe wrote: > On 12/4/23 10:53 AM, Keith Busch wrote: >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index 1d254f2c997de..4aa10b64f539e 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, >> ctx->syscall_iopoll = 1; >> >> ctx->compat = in_compat_syscall(); >> + ctx->sys_admin = capable(CAP_SYS_ADMIN); >> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) >> ctx->user = get_uid(current_user()); > > Hmm, what happens if the app starts as eg root for initialization > purposes and drops caps after? That would have previously have caused > passthrough to fail, but now it will work. Perhaps this is fine, after > all this isn't unusual for eg opening device or doing other init special > work? The side effects would be quite a surprise when you initialize the ring from a privileged process and then pass it to a less capable one. Ring sharing would also be affected. Privilege downgrade also sounds like a valid concern. The first two will be solved if restricted to IORING_SETUP_DEFER_TASKRUN rings and io_is_capable() { return ctx->sys_admin || capable(); } And it still doesn't seem great bypassing it, when the question is rather why it's expensive? I've seen before in the wild a fat BPF program running on every call, is that what happens? > In any case, that should definitely be explicitly mentioned in the > commit message for a change like that. >
On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote: > I added a CC: linux-security-module@vger > Keith Busch <kbusch@meta.com> writes: > > From: Keith Busch <kbusch@kernel.org> > > > > The uring_cmd operation is often used for privileged actions, so drivers > > subscribing to this interface check capable() for each command. The > > capable() function is not fast path friendly for many kernel configs, > > and this can really harm performance. Stash the capable sys admin > > attribute in the io_uring context and set a new issue_flag for the > > uring_cmd interface. > > I have a few questions. What privileged actions are performance > sensitive? I would hope that anything requiring privileges would not > be in a fast path (but clearly that's not the case). Protocol specifics that don't have a generic equivalent. For example, NVMe FDP is reachable only through the uring_cmd and ioctl interfaces, but you use it like normal reads and writes so has to be as fast as the generic interfaces. The same interfaces can be abused, so access needs to be restricted. > What performance benefits did you measure with this patch set in place > (and on what workloads)? Quite a bit. Here's a random read high-depth workload on a single device test: Before: 970k IOPs After: 1750k IOPs > What happens when a ring fd is passed to another process? > > Finally, as Jens mentioned, I would expect dropping priviliges to, you > know, drop privileges. I don't think a commit message is going to be > enough documentation for a change like this. Yeah, point taken.
On 12/4/23 11:40 AM, Jeff Moyer wrote: > Finally, as Jens mentioned, I would expect dropping priviliges to, you > know, drop privileges. I don't think a commit message is going to be > enough documentation for a change like this. Only thing I can think of here is to cache the state in task->io_uring->something, and then ensure those are invalidated whenever caps change. It's one of those cases where that's probably only done once, but we do need to be able to catch it. Not convinced that caching it at ring creation is sane enough, even if it is kind of like opening devices before privs are dropped where you could not otherwise re-open them later on.
Jens Axboe <axboe@kernel.dk> writes: > On 12/4/23 11:40 AM, Jeff Moyer wrote: >> Finally, as Jens mentioned, I would expect dropping priviliges to, you >> know, drop privileges. I don't think a commit message is going to be >> enough documentation for a change like this. > > Only thing I can think of here is to cache the state in > task->io_uring->something, and then ensure those are invalidated > whenever caps change. I looked through the capable() code, and there is no way that I could find to be notified of changes. > It's one of those cases where that's probably only done once, but we > do need to be able to catch it. Not convinced that caching it at ring > creation is sane enough, even if it is kind of like opening devices > before privs are dropped where you could not otherwise re-open them > later on. Agreed. -Jeff
On 12/4/23 12:22 PM, Jeff Moyer wrote: > Jens Axboe <axboe@kernel.dk> writes: > >> On 12/4/23 11:40 AM, Jeff Moyer wrote: >>> Finally, as Jens mentioned, I would expect dropping priviliges to, you >>> know, drop privileges. I don't think a commit message is going to be >>> enough documentation for a change like this. >> >> Only thing I can think of here is to cache the state in >> task->io_uring->something, and then ensure those are invalidated >> whenever caps change. > > I looked through the capable() code, and there is no way that I could > find to be notified of changes. Right, what I meant is that you'd need to add an io_uring_cap_change() or something that gets called, and that iterates the rings associated with that task and clears the flag. Ugly...
On Mon, Dec 04, 2023 at 02:22:22PM -0500, Jeff Moyer wrote: > Jens Axboe <axboe@kernel.dk> writes: > > > On 12/4/23 11:40 AM, Jeff Moyer wrote: > >> Finally, as Jens mentioned, I would expect dropping priviliges to, you > >> know, drop privileges. I don't think a commit message is going to be > >> enough documentation for a change like this. > > > > Only thing I can think of here is to cache the state in > > task->io_uring->something, and then ensure those are invalidated > > whenever caps change. > > I looked through the capable() code, and there is no way that I could > find to be notified of changes. Something like LSM_HOOK_INIT on 'capset', but needs to work without CONFIG_SECURITY.
On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote: > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote: > > I added a CC: linux-security-module@vger > > Keith Busch <kbusch@meta.com> writes: > > > From: Keith Busch <kbusch@kernel.org> > > > > > > The uring_cmd operation is often used for privileged actions, so drivers > > > subscribing to this interface check capable() for each command. The > > > capable() function is not fast path friendly for many kernel configs, > > > and this can really harm performance. Stash the capable sys admin > > > attribute in the io_uring context and set a new issue_flag for the > > > uring_cmd interface. > > > > I have a few questions. What privileged actions are performance > > sensitive? I would hope that anything requiring privileges would not > > be in a fast path (but clearly that's not the case). > > Protocol specifics that don't have a generic equivalent. For example, > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces, > but you use it like normal reads and writes so has to be as fast as the > generic interfaces. But normal read/write pt command doesn't require ADMIN any more since commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"), why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)? Thanks, Ming
On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote: > On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote: > > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote: > > > I added a CC: linux-security-module@vger > > > Keith Busch <kbusch@meta.com> writes: > > > > From: Keith Busch <kbusch@kernel.org> > > > > > > > > The uring_cmd operation is often used for privileged actions, so drivers > > > > subscribing to this interface check capable() for each command. The > > > > capable() function is not fast path friendly for many kernel configs, > > > > and this can really harm performance. Stash the capable sys admin > > > > attribute in the io_uring context and set a new issue_flag for the > > > > uring_cmd interface. > > > > > > I have a few questions. What privileged actions are performance > > > sensitive? I would hope that anything requiring privileges would not > > > be in a fast path (but clearly that's not the case). > > > > Protocol specifics that don't have a generic equivalent. For example, > > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces, > > but you use it like normal reads and writes so has to be as fast as the > > generic interfaces. > > But normal read/write pt command doesn't require ADMIN any more since > commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"), > why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)? Good question. The "capable" check had always been first so even with the relaxed permissions, it was still paying the price. I have changed that order in commit staged here (not yet upstream): http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622 Note that only prevents the costly capable() check if the inexpensive checks could make a determination. That's still not solving the problem long term since we aim for forward compatibility where we have no idea which opcodes, admin identifications, or vendor specifics could be deemed "safe" for non-root users in the future, so those conditions would always fall back to the more expensive check that this patch was trying to mitigate for admin processes.
On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote: > On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote: > > On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote: > > > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote: > > > > I added a CC: linux-security-module@vger > > > > Keith Busch <kbusch@meta.com> writes: > > > > > From: Keith Busch <kbusch@kernel.org> > > > > > > > > > > The uring_cmd operation is often used for privileged actions, so drivers > > > > > subscribing to this interface check capable() for each command. The > > > > > capable() function is not fast path friendly for many kernel configs, > > > > > and this can really harm performance. Stash the capable sys admin > > > > > attribute in the io_uring context and set a new issue_flag for the > > > > > uring_cmd interface. > > > > > > > > I have a few questions. What privileged actions are performance > > > > sensitive? I would hope that anything requiring privileges would not > > > > be in a fast path (but clearly that's not the case). > > > > > > Protocol specifics that don't have a generic equivalent. For example, > > > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces, > > > but you use it like normal reads and writes so has to be as fast as the > > > generic interfaces. > > > > But normal read/write pt command doesn't require ADMIN any more since > > commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"), > > why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)? > > Good question. The "capable" check had always been first so even with > the relaxed permissions, it was still paying the price. I have changed > that order in commit staged here (not yet upstream): > > http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622 With this change, I guess you shouldn't see the following big gap, right? > Before: 970k IOPs > After: 1750k IOPs > > Note that only prevents the costly capable() check if the inexpensive > checks could make a determination. That's still not solving the problem > long term since we aim for forward compatibility where we have no idea > which opcodes, admin identifications, or vendor specifics could be > deemed "safe" for non-root users in the future, so those conditions > would always fall back to the more expensive check that this patch was > trying to mitigate for admin processes. Not sure I get the idea, it is related with nvme's permission model for user pt command, and: 1) it should be always checked in entry of nvme user pt command 2) only the following two types of commands require ADMIN, per commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands") - any admin-cmd is not allowed - vendor-specific and fabric commmand are not allowed Can you provide more details why the expensive check can't be avoided for fast read/write user IO commands? Thanks, Ming
On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote: > On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote: > > Good question. The "capable" check had always been first so even with > > the relaxed permissions, it was still paying the price. I have changed > > that order in commit staged here (not yet upstream): > > > > http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622 > > With this change, I guess you shouldn't see the following big gap, right? Correct. > > Before: 970k IOPs > > After: 1750k IOPs > > Note that only prevents the costly capable() check if the inexpensive > > checks could make a determination. That's still not solving the problem > > long term since we aim for forward compatibility where we have no idea > > which opcodes, admin identifications, or vendor specifics could be > > deemed "safe" for non-root users in the future, so those conditions > > would always fall back to the more expensive check that this patch was > > trying to mitigate for admin processes. > > Not sure I get the idea, it is related with nvme's permission model for > user pt command, and: > > 1) it should be always checked in entry of nvme user pt command > > 2) only the following two types of commands require ADMIN, per commit > 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands") > > - any admin-cmd is not allowed > - vendor-specific and fabric commmand are not allowed > > Can you provide more details why the expensive check can't be avoided for > fast read/write user IO commands? It's not necessarily about the read/write passthrough commands. It's for commands we don't know about today. Do we want to revisit this problem every time spec provides another operation? Are vendor unique solutions not allowed to get high IOPs access? Secondly, some people have rediscovered you can abuse this interface to corrupt kernel memory, so there are considerations to restricting this to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we have to go that route. Lastly (and least important), there are a lot of checks happening in the "quick" path that I am trying to replace with a single check. While each invidividual check isn't too bad, they start to add up.
On 12/4/2023 11:35 PM, Jens Axboe wrote: > On 12/4/23 10:53 AM, Keith Busch wrote: >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index 1d254f2c997de..4aa10b64f539e 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, >> ctx->syscall_iopoll = 1; >> >> ctx->compat = in_compat_syscall(); >> + ctx->sys_admin = capable(CAP_SYS_ADMIN); >> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) >> ctx->user = get_uid(current_user()); > Hmm, what happens if the app starts as eg root for initialization > purposes and drops caps after? That would have previously have caused > passthrough to fail, but now it will work. Does it sound any better if this 'super ring' type of ability is asked explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE. It does not change the old behavior. It also implies that capable user knows what it asked for, so no need to keep things in sync if the process drops caps after ring setup is done. diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 4aa10b64f539..589e614144b6 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3981,6 +3981,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, ctx->compat = in_compat_syscall(); + if (ctx->flags & IORING_SETUP_CAPABLE_ONCE && capable(CAP_SYS_ADMIN)) + ctx->sys_admin = 1; if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) ctx->user = get_uid(current_user());
On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote: > On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote: > > On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote: > > > Good question. The "capable" check had always been first so even with > > > the relaxed permissions, it was still paying the price. I have changed > > > that order in commit staged here (not yet upstream): > > > > > > http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622 > > > > With this change, I guess you shouldn't see the following big gap, right? > > Correct. > > > > Before: 970k IOPs > > > After: 1750k IOPs > > > > Note that only prevents the costly capable() check if the inexpensive > > > checks could make a determination. That's still not solving the problem > > > long term since we aim for forward compatibility where we have no idea > > > which opcodes, admin identifications, or vendor specifics could be > > > deemed "safe" for non-root users in the future, so those conditions > > > would always fall back to the more expensive check that this patch was > > > trying to mitigate for admin processes. > > > > Not sure I get the idea, it is related with nvme's permission model for > > user pt command, and: > > > > 1) it should be always checked in entry of nvme user pt command > > > > 2) only the following two types of commands require ADMIN, per commit > > 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands") > > > > - any admin-cmd is not allowed > > - vendor-specific and fabric commmand are not allowed > > > > Can you provide more details why the expensive check can't be avoided for > > fast read/write user IO commands? > > It's not necessarily about the read/write passthrough commands. It's for > commands we don't know about today. Do we want to revisit this problem > every time spec provides another operation? Are vendor unique solutions > not allowed to get high IOPs access? Except for read/write, what other commands are performance sensitive? > > Secondly, some people have rediscovered you can abuse this interface to > corrupt kernel memory, so there are considerations to restricting this Just wondering why ADMIN won't corrupt kernel memory, and only normal user can, looks it is kernel bug instead of permission related issue. > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we > have to go that route. If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not optimize it in task_struct? Thanks, Ming
On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote: > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote: > > > > It's not necessarily about the read/write passthrough commands. It's for > > commands we don't know about today. Do we want to revisit this problem > > every time spec provides another operation? Are vendor unique solutions > > not allowed to get high IOPs access? > > Except for read/write, what other commands are performance sensitive? It varies by command set, but this question is irrelevant. I'm not interested in gatekeeping the fast path. > > Secondly, some people have rediscovered you can abuse this interface to > > corrupt kernel memory, so there are considerations to restricting this > > Just wondering why ADMIN won't corrupt kernel memory, and only normal > user can, looks it is kernel bug instead of permission related issue. Admin can corrupt memory as easily as a normal user through this interface. We just don't want such capabilities to be available to regular users. And it's a user bug: user told the kernel to map buffer of size X, but the device transfers size Y into it. Kernel can't do anything about that (other than remove the interface, but such an action will break many existing users) because we fundamentally do not know the true transfer size of a random command. Many NVMe commands don't explicitly encode transfer lengths, so disagreement between host and device on implicit lengths risk corruption. It's a protocol "feature". > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we > > have to go that route. > > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not > optimize it in task_struct? That's an interesting point to look into. I was hoping to not touch such a common struct, but I'm open to all options.
On Tue, Dec 05, 2023 at 09:51:13PM +0530, Kanchan Joshi wrote: > Does it sound any better if this 'super ring' type of ability is asked > explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE. Just my opinion: that option sounds good!
On Wed, Dec 06, 2023 at 08:31:54AM -0700, Keith Busch wrote: > On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote: > > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote: > > > > > > It's not necessarily about the read/write passthrough commands. It's for > > > commands we don't know about today. Do we want to revisit this problem > > > every time spec provides another operation? Are vendor unique solutions > > > not allowed to get high IOPs access? > > > > Except for read/write, what other commands are performance sensitive? > > It varies by command set, but this question is irrelevant. I'm not > interested in gatekeeping the fast path. IMO, it doesn't make sense to run such optimization for commands which aren't performance sensitive. > > > > Secondly, some people have rediscovered you can abuse this interface to > > > corrupt kernel memory, so there are considerations to restricting this > > > > Just wondering why ADMIN won't corrupt kernel memory, and only normal > > user can, looks it is kernel bug instead of permission related issue. > > Admin can corrupt memory as easily as a normal user through this > interface. We just don't want such capabilities to be available to > regular users. > > And it's a user bug: user told the kernel to map buffer of size X, but > the device transfers size Y into it. Kernel can't do anything about that > (other than remove the interface, but such an action will break many > existing users) because we fundamentally do not know the true transfer > size of a random command. Many NVMe commands don't explicitly encode > transfer lengths, so disagreement between host and device on implicit > lengths risk corruption. It's a protocol "feature". Got it, thanks for the explanation, and looks one big defect of NVMe protocol or the device implementation. > > > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we > > > have to go that route. > > > > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not > > optimize it in task_struct? > > That's an interesting point to look into. I was hoping to not touch such > a common struct, but I'm open to all options. capability is per-thread, and it is updated in current process/pthread, so the correct place to cache this info is 'task_struct'. Thanks, Ming
On Thu, Dec 07, 2023 at 09:23:06AM +0800, Ming Lei wrote: > Got it, thanks for the explanation, and looks one big defect of > NVMe protocol or the device implementation. It is. And NVMe has plenty more problems like that and people keep adding this kind of crap even today :(
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index bebab36abce89..d64d6916753f0 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -36,6 +36,9 @@ enum io_uring_cmd_flags { /* set when uring wants to cancel a previously issued command */ IO_URING_F_CANCEL = (1 << 11), IO_URING_F_COMPAT = (1 << 12), + + /* ring validated as CAP_SYS_ADMIN capable */ + IO_URING_F_SYS_ADMIN = (1 << 13), }; struct io_wq_work_node { @@ -240,6 +243,7 @@ struct io_ring_ctx { unsigned int poll_activated: 1; unsigned int drain_disabled: 1; unsigned int compat: 1; + unsigned int sys_admin: 1; struct task_struct *submitter_task; struct io_rings *rings; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 1d254f2c997de..4aa10b64f539e 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, ctx->syscall_iopoll = 1; ctx->compat = in_compat_syscall(); + ctx->sys_admin = capable(CAP_SYS_ADMIN); if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) ctx->user = get_uid(current_user()); diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 8a38b9f75d841..764f0e004aa00 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) issue_flags |= IO_URING_F_CQE32; if (ctx->compat) issue_flags |= IO_URING_F_COMPAT; + if (ctx->sys_admin) + issue_flags |= IO_URING_F_SYS_ADMIN; if (ctx->flags & IORING_SETUP_IOPOLL) { if (!file->f_op->uring_cmd_iopoll) return -EOPNOTSUPP;