Message ID | 20200107170034.16165-4-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Series | io_uring: add support for open/close |
Hi Jens,

> This works just like openat(2), except it can be performed async. For
> the normal case of a non-blocking path lookup this will complete
> inline. If we have to do IO to perform the open, it'll be done from
> async context.

Have you already thought about the credentials being used for the async
open? The application could call setuid() and similar calls to change
the credentials of the userspace process/threads. For applications
like Samba to use this async openat, it would be required
to specify the credentials for each open, as we have to multiplex
requests from multiple user sessions in one process.

This applies to any non-fd-based syscall, and also to an async connect
to a unix domain socket.

Do you have comments on this?

Thanks!
metze

On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
> Have you already thought about the credentials being used for the async
> open? The application could call setuid() and similar calls to change
> the credentials of the userspace process/threads. For applications
> like Samba to use this async openat, it would be required
> to specify the credentials for each open, as we have to multiplex
> requests from multiple user sessions in one process.
>
> This applies to any non-fd-based syscall, and also to an async connect
> to a unix domain socket.
>
> Do you have comments on this?

The open works like any of the other commands: it inherits the
credentials that the ring was set up with. Same with the memory context,
file table, etc. There's currently no way to have multiple personalities
within a single ring.

Sounds like you'd like an option for having multiple personalities
within a single ring? I think it would be better to have a ring per
personality instead. One thing we could do to make this more lightweight
is to have rings that are associated, so that we can share a lot of the
backend processing between them.

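The "credentials are inherited at ring setup" behaviour described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical and nothing here calls the real io_uring API. It only shows that a snapshot taken at setup time is unaffected by later identity changes, which is what makes a ring per personality workable.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct cred; only the fields we need. */
struct toy_cred {
	unsigned int euid;
	unsigned int egid;
};

/* Toy stand-in for the calling task's current credentials (starts as root). */
static struct toy_cred toy_current = { 0, 0 };

/* Toy ring: keeps a private copy of the credentials seen at setup time. */
struct toy_ring {
	struct toy_cred creds;
};

/* Models a seteuid()/setegid() pair changing the caller's identity. */
static void toy_become(unsigned int uid, unsigned int gid)
{
	toy_current.euid = uid;
	toy_current.egid = gid;
}

/* Models ring setup: snapshot whoever is calling right now. */
static struct toy_ring toy_ring_setup(void)
{
	struct toy_ring ring;

	ring.creds = toy_current;
	return ring;
}

/* The euid a request on this ring would run as in this model. */
static unsigned int toy_ring_euid(const struct toy_ring *ring)
{
	assert(ring != NULL);
	return ring->creds.euid;
}
```

In this model, a ring created while the process is user 555 keeps reporting 555 even after the process has switched back to root.
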
On 08.01.20 at 17:20, Jens Axboe wrote:
> The open works like any of the other commands: it inherits the
> credentials that the ring was set up with. Same with the memory context,
> file table, etc. There's currently no way to have multiple personalities
> within a single ring.

Ah, it's user = get_uid(current_user()); and ctx->user = user in
io_uring_create(), right?

> Sounds like you'd like an option for having multiple personalities
> within a single ring?

I'm not sure anymore; I wasn't aware of the above.

> I think it would be better to have a ring per personality instead.

We could do that. I guess we could use per-user rings for path-based
operations and a single ring for fd-based operations.

> One thing we could do to make this more lightweight
> is to have rings that are associated, so that we can share a lot of the
> backend processing between them.

My current idea is to use the ring fd and pass it to our main epoll loop.

Can you be more specific about what an API for associated rings could
look like?

Thanks!
metze

On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
> Ah, it's user = get_uid(current_user()); and ctx->user = user in
> io_uring_create(), right?

That's just for the accounting; the credentials come from:

ctx->creds = get_current_cred();

> Can you be more specific about what an API for associated rings could
> look like?

The API would be exactly the same; there would just be some way to
associate rings when you create them. Probably a new field in struct
io_uring_params (and an associated flag), which would tell io_uring that
two separate rings are really the same "user". This would allow io_uring
to use the same io-wq workqueues, for example.

This depends on being able to set up the rings with the right
personalities, i.e. on them being known upfront. From your description,
I'm not so sure that's the case? If not, then we would indeed need
something that can pass in the credentials on a per-command basis. Not
sure what that would look like.

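The "associated rings" idea above (a flag plus a new field in the setup parameters, so that two rings share one backend) can be sketched as a userspace toy. All names here (TOY_SETUP_ASSOCIATE, associate_with, toy_wq) are assumptions made up for illustration; the thread does not define the actual flag or field, and this is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag: "attach to the backend of an existing ring". */
#define TOY_SETUP_ASSOCIATE (1u << 0)

struct toy_ring;

/* Toy backend standing in for the io-wq workqueues. */
struct toy_wq {
	unsigned int users;	/* number of rings sharing this backend */
};

/* Toy version of the setup parameters with the proposed extras. */
struct toy_params {
	unsigned int flags;
	struct toy_ring *associate_with;	/* hypothetical new field */
};

struct toy_ring {
	struct toy_wq *wq;
};

static struct toy_ring *toy_ring_setup(const struct toy_params *p)
{
	struct toy_ring *ring = malloc(sizeof(*ring));

	if (ring == NULL)
		return NULL;

	if (p->flags & TOY_SETUP_ASSOCIATE) {
		/* Share the existing backend instead of creating a new one. */
		assert(p->associate_with != NULL);
		ring->wq = p->associate_with->wq;
	} else {
		ring->wq = malloc(sizeof(*ring->wq));
		if (ring->wq == NULL) {
			free(ring);
			return NULL;
		}
		ring->wq->users = 0;
	}
	ring->wq->users++;
	return ring;
}

static void toy_ring_free(struct toy_ring *ring)
{
	/* The backend goes away together with its last user. */
	if (--ring->wq->users == 0)
		free(ring->wq);
	free(ring);
}
```

The submission interface of each ring is untouched in this model; only the setup path changes, which matches the description of the API staying the same.
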
On 08.01.20 at 17:40, Jens Axboe wrote:
> That's just for the accounting; the credentials come from:
>
> ctx->creds = get_current_cred();

Ok, I was just looking at an old checkout.

In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
the async processing. Does a non-blocking openat also use ctx->creds?

> The API would be exactly the same; there would just be some way to
> associate rings when you create them. Probably a new field in struct
> io_uring_params (and an associated flag), which would tell io_uring that
> two separate rings are really the same "user". This would allow io_uring
> to use the same io-wq workqueues, for example.

Ok, this would be just for better performance / better usage of
resources, right?

> This depends on being able to set up the rings with the right
> personalities, i.e. on them being known upfront. From your description,
> I'm not so sure that's the case? If not, then we would indeed need
> something that can pass in the credentials on a per-command basis. Not
> sure what that would look like.

We know the credentials, and using a ring per user should be ok.

metze

On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
> the async processing. Does a non-blocking openat also use ctx->creds?

There are basically two sets here: one set is in the ring, and the other
is the identity that the async thread (briefly) assumes if we have to go
async. Right now they are the same thing, and hence we don't need to
play any tricks from the system call submitting SQEs to assume any other
identity than the one we have.

> Ok, this would be just for better performance / better usage of
> resources, right?

Exactly.

> We know the credentials, and using a ring per user should be ok.

Sounds good!

On 08.01.20 at 23:53, Jens Axboe wrote:
> There are basically two sets here: one set is in the ring, and the other
> is the identity that the async thread (briefly) assumes if we have to go
> async. Right now they are the same thing, and hence we don't need to
> play any tricks from the system call submitting SQEs to assume any other
> identity than the one we have.

I see two cases using it: io_sq_thread() and
io_wq_create()->io_worker_handle_work() call override_creds().

But aren't non-blocking syscalls executed in the context of the thread
calling io_uring_enter()->io_submit_sqes()?
I only see some magic around ctx->sqo_mm for that case, but ctx->creds
doesn't seem to be used there. And my design would require that.

metze

On 1/8/20 4:03 PM, Stefan Metzmacher wrote:
> I see two cases using it: io_sq_thread() and
> io_wq_create()->io_worker_handle_work() call override_creds().
>
> But aren't non-blocking syscalls executed in the context of the thread
> calling io_uring_enter()->io_submit_sqes()?
> I only see some magic around ctx->sqo_mm for that case, but ctx->creds
> doesn't seem to be used there. And my design would require that.

For now, the sq thread (which is used if you use IORING_SETUP_SQPOLL)
requires fixed files, so it can't be used with open at the
moment anyway. But if/when enabled, it'll assume the same credentials
as the async context and syscall path.

On 09.01.20 at 00:05, Jens Axboe wrote:
> For now, the sq thread (which is used if you use IORING_SETUP_SQPOLL)
> requires fixed files, so it can't be used with open at the
> moment anyway. But if/when enabled, it'll assume the same credentials
> as the async context and syscall path.

I'm sorry, but I'm still unsure we're talking about the same thing
(or maybe I'm missing some basics here).

My understanding of io_uring_enter() is that it will execute as many
non-blocking calls as it can without switching to any other kernel thread.

And my fear is that openat will use get_current_cred() instead of
ctx->creds.

Am I missing something?

Thanks!
metze

On 1/8/20 4:11 PM, Stefan Metzmacher wrote:
> My understanding of io_uring_enter() is that it will execute as many
> non-blocking calls as it can without switching to any other kernel thread.

Correct, any SQE that we can do without switching, we will.

> And my fear is that openat will use get_current_cred() instead of
> ctx->creds.

OK, I think I follow your concern. So you'd like to set up the rings from
a _different_ user, and then later on use them to submit SQEs on behalf of
a specific user. So sort of the same as our initial discussion, except
the mapping would be static. The difference being that you might set up
the ring from a different user than the user that would be submitting IO
on it?

If so, then we do need something to support that, probably an
IORING_REGISTER_CREDS or similar. This would allow you to replace the
creds you currently have in ctx->creds with whatever new one.

> Am I missing something?

I think we're talking about the same thing, just different views of it :-)

> OK, I think I follow your concern. So you'd like to set up the rings from
> a _different_ user, and then later on use them to submit SQEs on behalf of
> a specific user. So sort of the same as our initial discussion, except
> the mapping would be static. The difference being that you might set up
> the ring from a different user than the user that would be submitting IO
> on it?

Our current (much simplified here) flow is this:

  # we start as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # Start an openat2 on behalf of user555
  openat2()
  # we unbecome the user again and run as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # we become the user444 and
  # create our desired credential token
  seteuid(444); setegid(444); setgroups()...
  # Start an openat2 on behalf of user444
  openat2()
  # we unbecome the user again and run as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # Start an openat2 on behalf of user555
  openat2()
  # we unbecome the user again and run as root
  seteuid(0); setegid(0); setgroups()...

It means we have to do about 7 syscalls in order
to open a file on behalf of a user.
(In reality we cache things and avoid set*id()
calls most of the time, but I want to demonstrate the
simplified design here.)

With io_uring I'd like to use a flow like this:

  # we start as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # we become the user444 and
  # create our desired credential token
  seteuid(444); setegid(444); setgroups()...
  # we snapshot the credentials to the new ring for user444
  ring444 = io_uring_setup()
  # we unbecome the user again and run as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # we snapshot the credentials to the new ring for user555
  ring555 = io_uring_setup()
  # we unbecome the user again and run as root
  seteuid(0); setegid(0); setgroups()...
  ...
  # Start an openat2 on behalf of user555
  io_uring_enter(ring555, OP_OPENAT2...)
  ...
  # Start an openat2 on behalf of user444
  io_uring_enter(ring444, OP_OPENAT2...)
  ...
  # Start an openat2 on behalf of user555
  io_uring_enter(ring555, OP_OPENAT2...)

So instead of constantly doing 7 syscalls per open,
we would be down to at most one. And I would assume
that io_uring_enter() would do the temporary credential switch
for me also in the non-blocking case.

> If so, then we do need something to support that, probably an
> IORING_REGISTER_CREDS or similar. This would allow you to replace the
> creds you currently have in ctx->creds with whatever new one.

I don't want to change ctx->creds, but I want it to be used consistently.

What I think is missing is something like this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 32aee149f652..55dbb154915a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6359,10 +6359,27 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		struct mm_struct *cur_mm;
 
 		mutex_lock(&ctx->uring_lock);
+		if (current->mm != ctx->sqo_mm) {
+			// TODO: something like this...
+			restore_mm = current->mm;
+			use_mm(ctx->sqo_mm);
+		}
 		/* already have mm, so io_submit_sqes() won't try to grab it */
 		cur_mm = ctx->sqo_mm;
+		if (current_cred() != ctx->creds) {
+			// TODO: something like this...
+			restore_cred = override_creds(ctx->creds);
+		}
 		submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
 					   &cur_mm, false);
+		if (restore_cred != NULL) {
+			revert_creds(restore_cred);
+		}
+		if (restore_mm != NULL) {
+			// TODO: something like this...
+			unuse_mm(ctx->sqo_mm);
+			use_mm(restore_mm);
+		}
 		mutex_unlock(&ctx->uring_lock);
 
 		if (submitted != to_submit)

I'm not sure if the current->mm part is needed; I just added it for
completeness and as a hint that io_op_defs[req->opcode].needs_mm is there
and that a needs_creds could also be added (if it helps with performance).

Is it possible to trigger a change of current->mm from userspace?

An IORING_REGISTER_CREDS would only be useful if it's possible to
register a set of credentials and then use per io_uring_sqe credentials.
That would also be fine for me, but I'm not sure it's needed for now.

Apart from IORING_REGISTER_CREDS, I think a change like the one above is
needed in order to avoid potential security problems.

> I think we're talking about the same thing, just different views of it :-)

I hope it's clear from my side now :-)

Thanks!
metze

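The save/override/revert pattern that the patch sketch above puts around the inline submission can be tried out in a userspace model. This is a minimal sketch, assuming simplified toy_* stand-ins for override_creds() and revert_creds(); it is not kernel code and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct cred. */
struct toy_cred {
	unsigned int euid;
};

/* Toy stand-in for the calling task's active credentials. */
static const struct toy_cred *toy_current_cred;

/* Install new credentials and hand back the old ones (models override_creds()). */
static const struct toy_cred *toy_override_creds(const struct toy_cred *new_cred)
{
	const struct toy_cred *old = toy_current_cred;

	toy_current_cred = new_cred;
	return old;
}

/* Put the saved credentials back (models revert_creds()). */
static void toy_revert_creds(const struct toy_cred *old)
{
	toy_current_cred = old;
}

struct toy_ring {
	const struct toy_cred *creds;	/* snapshot taken at ring setup */
	unsigned int last_open_euid;	/* euid the last inline open ran as */
};

/* Inline (non-blocking) open: runs as whatever is active right now. */
static void toy_do_open(struct toy_ring *ring)
{
	assert(toy_current_cred != NULL);
	ring->last_open_euid = toy_current_cred->euid;
}

/* Models the submit path with the proposed temporary credential switch. */
static void toy_enter(struct toy_ring *ring)
{
	const struct toy_cred *restore = NULL;

	if (toy_current_cred != ring->creds)
		restore = toy_override_creds(ring->creds);

	toy_do_open(ring);

	if (restore != NULL)
		toy_revert_creds(restore);
}
```

With the caller running as root and a ring created for user 555, the open in this model runs as 555 and the caller is root again once toy_enter() returns, which is the consistency property asked for in the mail.
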
On 1/9/20 3:40 AM, Stefan Metzmacher wrote:
>>> I'm sorry, but I'm still unsure we're talking about the same thing
>>> (or maybe I'm missing some basics here).
>>>
>>> My understanding of io_uring_enter() is that it will execute as many
>>> non-blocking calls as it can without switching to any other kernel thread.
>>
>> Correct, any SQE that we can do without switching, we will.
>>
>>> And my fear is that openat will use get_current_cred() instead of
>>> ctx->creds.
>>
>> OK, I think I follow your concern. So you'd like to set up the rings from
>> a _different_ user, and then later on use it for submission of SQEs for
>> a specific user. So sort of the same as our initial discussion, except
>> the mapping would be static. The difference being that you might set up
>> the ring from a different user than the user that would be submitting IO
>> on it?
>
> Our current (much simplified here) flow is this:
>
>   # we start as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # Start an openat2 on behalf of user555
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
>   ...
>   # we become the user444 and
>   # create our desired credential token
>   seteuid(444); setegid(444); setgroups()...
>   # Start an openat2 on behalf of user444
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # Start an openat2 on behalf of user555
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
>
> It means we have to do about 7 syscalls in order
> to open a file on behalf of a user.
> (In reality we cache things and avoid set*id()
> calls most of the time, but I want to demonstrate the
> simplified design here)
>
> With io_uring I'd like to use a flow like this:
>
>   # we start as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user444 and
>   # create our desired credential token
>   seteuid(444); setegid(444); setgroups()...
>   # we snapshot the credentials to the new ring for user444
>   ring444 = io_uring_setup()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # we snapshot the credentials to the new ring for user555
>   ring555 = io_uring_setup()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # Start an openat2 on behalf of user555
>   io_uring_enter(ring555, OP_OPENAT2...)
>   ...
>   # Start an openat2 on behalf of user444
>   io_uring_enter(ring444, OP_OPENAT2...)
>   ...
>   # Start an openat2 on behalf of user555
>   io_uring_enter(ring555, OP_OPENAT2...)
>
> So instead of constantly doing 7 syscalls per open,
> we would be down to at most one. And I would assume
> that io_uring_enter() would do the temporary credential switch
> for me also in the non-blocking case.

OK, thanks for spelling the use case out, makes it easier to understand
what you need in terms of what we currently can't do.

>> If so, then we do need something to support that, probably an
>> IORING_REGISTER_CREDS or similar. This would allow you to replace the
>> creds you currently have in ctx->creds with whatever new one.
>
> I don't want to change ctx->creds, but I want it to be used consistently.
>
> What I think is missing is something like this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 32aee149f652..55dbb154915a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6359,10 +6359,27 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
> fd, u32, to_submit,
>  		struct mm_struct *cur_mm;
>
>  		mutex_lock(&ctx->uring_lock);
> +		if (current->mm != ctx->sqo_mm) {
> +			// TODO: something like this...
> +			restore_mm = current->mm;
> +			use_mm(ctx->sqo_mm);
> +		}
>  		/* already have mm, so io_submit_sqes() won't try to
> grab it */
>  		cur_mm = ctx->sqo_mm;
> +		if (current_cred() != ctx->creds) {
> +			// TODO: something like this...
> +			restore_cred = override_creds(ctx->creds);
> +		}
>  		submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
>  					   &cur_mm, false);
> +		if (restore_cred != NULL) {
> +			revert_creds(restore_cred);
> +		}
> +		if (restore_mm != NULL) {
> +			// TODO: something like this...
> +			unuse_mm(ctx->sqo_mm);
> +			use_mm(restore_mm);
> +		}
>  		mutex_unlock(&ctx->uring_lock);
>
>  		if (submitted != to_submit)
>
> I'm not sure if current->mm is needed, I just added it for completeness
> and as hint that io_op_defs[req->opcode].needs_mm is there and a
> needs_creds could also be added (if it helps with performance)
>
> Is it possible to trigger a change of current->mm from userspace?
>
> An IORING_REGISTER_CREDS would only be useful if it's possible to
> register a set of credentials and then use per io_uring_sqe credentials.
> That would also be fine for me, but I'm not sure it's needed for now.

I think it'd be a cleaner way of doing the same thing as your patch
does. It seems a little odd to do this by default (having the ring
change personalities depending on who's using it), but from an opt-in
point of view, I think it makes more sense.

That would make the IORING_REGISTER_ call something like
IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
ring would just assume the identity of the task that's calling
io_uring_enter().

Note that this also has to be passed through to the io-wq handler, as
the mappings there are currently static as well.
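The override-and-revert pattern under discussion can be modeled in plain userspace C. The sketch below is a toy: struct toy_cred, struct toy_ring, toy_current and the toy_* functions are made-up stand-ins for the kernel's struct cred, ring context, current_cred(), override_creds() and revert_creds(), and nothing here touches real credentials. It only shows that the submission runs with the ring's snapshot and that the caller's identity is restored afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's struct cred or io_ring_ctx. */
struct toy_cred { unsigned int uid; unsigned int gid; };
struct toy_ring { const struct toy_cred *creds; /* snapshot taken at setup */ };

/* Models the credentials of the calling task. */
const struct toy_cred *toy_current;

/* Install new credentials and hand back the previous ones. */
static const struct toy_cred *toy_override(const struct toy_cred *new_cred)
{
	const struct toy_cred *old = toy_current;

	toy_current = new_cred;
	return old;
}

/* Put the previously saved credentials back. */
static void toy_revert(const struct toy_cred *old)
{
	toy_current = old;
}

/* Stand-in for an inline open: reports which uid the operation ran as. */
static unsigned int toy_do_open(void)
{
	return toy_current->uid;
}

/* Submit on `ring`, switching to the ring's credentials only if needed. */
unsigned int toy_submit(struct toy_ring *ring)
{
	const struct toy_cred *restore = NULL;
	unsigned int ran_as;

	if (toy_current != ring->creds)
		restore = toy_override(ring->creds);
	ran_as = toy_do_open();
	if (restore != NULL)
		toy_revert(restore);
	return ran_as;
}
```

In this model a task running as root that submits on a ring created as uid 555 has its open performed as 555 and is root again when toy_submit() returns. The model covers only the inline path; as noted above, the io-wq workers would need the same treatment.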
On 09.01.20 at 22:31, Jens Axboe wrote:
> On 1/9/20 3:40 AM, Stefan Metzmacher wrote:
> [...]
>> An IORING_REGISTER_CREDS would only be useful if it's possible to
>> register a set of credentials and then use per io_uring_sqe credentials.
>> That would also be fine for me, but I'm not sure it's needed for now.
>
> I think it'd be a cleaner way of doing the same thing as your patch
> does. It seems a little odd to do this by default (having the ring
> change personalities depending on who's using it), but from an opt-in
> point of view, I think it makes more sense.
>
> That would make the IORING_REGISTER_ call something like
> IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
> ring would just assume the identity of the task that's calling
> io_uring_enter().
>
> Note that this also has to be passed through to the io-wq handler, as
> the mappings there are currently static as well.

What's the next step here?

I think the current state is a security problem!

The inline execution either needs to change the creds temporarily
or io_uring_enter() needs a general check that the current creds match
the creds of the ring and return -EPERM or something similar.

Thanks!
metze
On 1/16/20 3:42 PM, Stefan Metzmacher wrote:
>>> I'm not sure if current->mm is needed, I just added it for completeness
>>> and as hint that io_op_defs[req->opcode].needs_mm is there and a
>>> needs_creds could also be added (if it helps with performance)
>>>
>>> Is it possible to trigger a change of current->mm from userspace?
>>>
>>> An IORING_REGISTER_CREDS would only be useful if it's possible to
>>> register a set of credentials and then use per io_uring_sqe credentials.
>>> That would also be fine for me, but I'm not sure it's needed for now.
>>
>> I think it'd be a cleaner way of doing the same thing as your patch
>> does. It seems a little odd to do this by default (having the ring
>> change personalities depending on who's using it), but from an opt-in
>> point of view, I think it makes more sense.
>>
>> That would make the IORING_REGISTER_ call something like
>> IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
>> ring would just assume the identity of the task that's calling
>> io_uring_enter().
>>
>> Note that this also has to be passed through to the io-wq handler, as
>> the mappings there are currently static as well.
>
> What's the next step here?

Not sure, need to find some time to work on this!

> I think the current state is a security problem!
>
> The inline execution either needs to change the creds temporarily
> or io_uring_enter() needs a general check that the current creds match
> the creds of the ring and return -EPERM or something similar.

Hmm, if you transfer the fd to someone else, you also give them access
to your credentials etc. We could make that -EPERM, if the owner of the
ring isn't the one invoking the submit. But that doesn't really help the
SQPOLL case, which simply consumes SQE entries. There can be no checking
there.
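The -EPERM idea can be illustrated with a small userspace analogue. This is a sketch under assumptions, not the kernel check: struct ring_owner and both functions are hypothetical names, and only the effective uid and gid are compared, whereas kernel credentials carry more state than that.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical userspace wrapper state; not part of io_uring or liburing. */
struct ring_owner {
	uid_t euid;
	gid_t egid;
};

/* Record the identity of whoever sets the ring up. */
void ring_owner_snapshot(struct ring_owner *owner)
{
	owner->euid = geteuid();
	owner->egid = getegid();
}

/* Return 0 if the caller still matches the snapshot, -EPERM otherwise. */
int ring_owner_check(const struct ring_owner *owner)
{
	if (owner->euid != geteuid() || owner->egid != getegid())
		return -EPERM;
	return 0;
}
```

A wrapper would call ring_owner_check() before each submission. As pointed out above, a check at submit time cannot cover SQPOLL, where entries are consumed without the submitter entering the kernel.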
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1822bf9aba12..53ff67ab5c4b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -70,6 +70,8 @@
 #include <linux/sizes.h>
 #include <linux/hugetlb.h>
 #include <linux/highmem.h>
+#include <linux/namei.h>
+#include <linux/fsnotify.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -353,6 +355,15 @@ struct io_sr_msg {
 	int				msg_flags;
 };
 
+struct io_open {
+	struct file			*file;
+	int				dfd;
+	umode_t				mode;
+	const char __user		*fname;
+	struct filename			*filename;
+	int				flags;
+};
+
 struct io_async_connect {
 	struct sockaddr_storage		address;
 };
@@ -371,12 +382,17 @@ struct io_async_rw {
 	ssize_t				size;
 };
 
+struct io_async_open {
+	struct filename			*filename;
+};
+
 struct io_async_ctx {
 	union {
 		struct io_async_rw	rw;
 		struct io_async_msghdr	msg;
 		struct io_async_connect	connect;
 		struct io_timeout_data	timeout;
+		struct io_async_open	open;
 	};
 };
 
@@ -397,6 +413,7 @@ struct io_kiocb {
 		struct io_timeout	timeout;
 		struct io_connect	connect;
 		struct io_sr_msg	sr_msg;
+		struct io_open		open;
 	};
 
 	struct io_async_ctx		*io;
@@ -2135,6 +2152,79 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt,
 	return 0;
 }
 
+static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	int ret;
+
+	if (sqe->ioprio || sqe->buf_index)
+		return -EINVAL;
+
+	req->open.dfd = READ_ONCE(sqe->fd);
+	req->open.mode = READ_ONCE(sqe->len);
+	req->open.fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	req->open.flags = READ_ONCE(sqe->open_flags);
+
+	req->open.filename = getname(req->open.fname);
+	if (IS_ERR(req->open.filename)) {
+		ret = PTR_ERR(req->open.filename);
+		req->open.filename = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
+static void io_openat_async(struct io_wq_work **workptr)
+{
+	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	struct filename *filename = req->open.filename;
+
+	io_wq_submit_work(workptr);
+	putname(filename);
+}
+
+static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt,
+		     bool force_nonblock)
+{
+	struct open_flags op;
+	struct open_how how;
+	struct file *file;
+	int ret;
+
+	how = build_open_how(req->open.flags, req->open.mode);
+	ret = build_open_flags(&how, &op);
+	if (ret)
+		goto err;
+	if (force_nonblock)
+		op.lookup_flags |= LOOKUP_NONBLOCK;
+
+	ret = get_unused_fd_flags(how.flags);
+	if (ret < 0)
+		goto err;
+
+	file = do_filp_open(req->open.dfd, req->open.filename, &op);
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
+		if (ret == -EAGAIN) {
+			req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
+			req->work.func = io_openat_async;
+			return -EAGAIN;
+		}
+	} else {
+		fsnotify_open(file);
+		fd_install(ret, file);
+	}
+err:
+	if (!io_wq_current_is_worker())
+		putname(req->open.filename);
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_cqring_add_event(req, ret);
+	io_put_req_find_next(req, nxt);
+	return 0;
+}
+
 static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -3160,6 +3250,9 @@ static int io_req_defer_prep(struct io_kiocb *req,
 	case IORING_OP_FALLOCATE:
 		ret = io_fallocate_prep(req, sqe);
 		break;
+	case IORING_OP_OPENAT:
+		ret = io_openat_prep(req, sqe);
+		break;
 	default:
 		printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
 				req->opcode);
@@ -3322,6 +3415,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 		ret = io_fallocate(req, nxt, force_nonblock);
 		break;
+	case IORING_OP_OPENAT:
+		if (sqe) {
+			ret = io_openat_prep(req, sqe);
+			if (ret)
+				break;
+		}
+		ret = io_openat(req, nxt, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -3403,7 +3504,7 @@ static bool io_req_op_valid(int op)
 	return op >= IORING_OP_NOP && op < IORING_OP_LAST;
 }
 
-static int io_req_needs_file(struct io_kiocb *req)
+static int io_req_needs_file(struct io_kiocb *req, int fd)
 {
 	switch (req->opcode) {
 	case IORING_OP_NOP:
@@ -3413,6 +3514,8 @@ static int io_req_needs_file(struct io_kiocb *req)
 	case IORING_OP_ASYNC_CANCEL:
 	case IORING_OP_LINK_TIMEOUT:
 		return 0;
+	case IORING_OP_OPENAT:
+		return fd != -1;
 	default:
 		if (io_req_op_valid(req->opcode))
 			return 1;
@@ -3442,7 +3545,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 	if (flags & IOSQE_IO_DRAIN)
 		req->flags |= REQ_F_IO_DRAIN;
 
-	ret = io_req_needs_file(req);
+	ret = io_req_needs_file(req, fd);
 	if (ret <= 0)
 		return ret;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bdbe2b130179..02af580754ce 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -34,6 +34,7 @@ struct io_uring_sqe {
 		__u32		timeout_flags;
 		__u32		accept_flags;
 		__u32		cancel_flags;
+		__u32		open_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -77,6 +78,7 @@ enum {
 	IORING_OP_LINK_TIMEOUT,
 	IORING_OP_CONNECT,
 	IORING_OP_FALLOCATE,
+	IORING_OP_OPENAT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 107 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   2 +
 2 files changed, 107 insertions(+), 2 deletions(-)
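To make the "just like openat(2)" mapping concrete, here is a small userspace sketch of how the SQE fields read by io_openat_prep() in this patch line up with the openat(2) arguments. struct toy_open_sqe is a made-up mirror of those four fields, not the uapi struct io_uring_sqe, and the open is performed synchronously rather than through a ring. The return convention (fd on success, negative errno on failure) follows what the patch posts as the completion result.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy mirror of the SQE fields the patch reads; NOT the uapi struct. */
struct toy_open_sqe {
	int32_t  fd;         /* directory fd, e.g. AT_FDCWD */
	uint64_t addr;       /* pointer to the pathname */
	uint32_t len;        /* file mode, only relevant with O_CREAT */
	uint32_t open_flags; /* O_* flags */
};

/*
 * Decode the fields in the same way io_openat_prep() assigns them, then
 * perform the open synchronously. Returns the fd, or -errno on failure.
 */
int toy_openat_from_sqe(const struct toy_open_sqe *sqe)
{
	const char *fname = (const char *)(uintptr_t)sqe->addr;
	int fd = openat(sqe->fd, fname, (int)sqe->open_flags, (mode_t)sqe->len);

	return fd < 0 ? -errno : fd;
}
```

Filling the struct with fd = AT_FDCWD, addr pointing at "." and open_flags = O_RDONLY yields a directory fd, while a missing path yields -ENOENT.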