
[3/6] io_uring: add support for IORING_OP_OPENAT

Message ID 20200107170034.16165-4-axboe@kernel.dk (mailing list archive)
State New, archived
Series io_uring: add support for open/close

Commit Message

Jens Axboe Jan. 7, 2020, 5 p.m. UTC
This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 107 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   2 +
 2 files changed, 107 insertions(+), 2 deletions(-)
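
As a reminder of the syscall being mirrored, here is a minimal userspace sketch of plain openat(2), resolving a name relative to a directory fd. It is not part of the patch; the helper name open_relative() is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): open `name` relative to the
 * directory `dir_path` using openat(2).
 * Returns the new file descriptor, or -1 on failure.
 */
int open_relative(const char *dir_path, const char *name, int flags)
{
    int dirfd;
    int fd;

    assert(dir_path != NULL && name != NULL);

    /* Obtain a directory fd to serve as the lookup base. */
    dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* The path lookup happens here; this is the step that may block on IO. */
    fd = openat(dirfd, name, flags);

    /* The base fd is only needed for the lookup itself. */
    close(dirfd);
    return fd;
}
```

IORING_OP_OPENAT is described above as taking the same inputs, with the lookup moved to async context only when it cannot complete inline.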

Comments

Stefan Metzmacher Jan. 8, 2020, 1:05 p.m. UTC | #1
Hi Jens,

> This works just like openat(2), except it can be performed async. For
> the normal case of a non-blocking path lookup this will complete
> inline. If we have to do IO to perform the open, it'll be done from
> async context.

Have you already thought about the credentials being used for the async
open? The application could call setuid() and similar calls to change
the credentials of the userspace process/threads. In order for
applications like samba to use this async openat, it would be required
to specify the credentials for each open, as we have to multiplex
requests from multiple user sessions in one process.

This applies to any non-fd based syscall, and also to an async connect
to a unix domain socket.

Do you have comments on this?

Thanks!
metze
Jens Axboe Jan. 8, 2020, 4:20 p.m. UTC | #2
On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> This works just like openat(2), except it can be performed async. For
>> the normal case of a non-blocking path lookup this will complete
>> inline. If we have to do IO to perform the open, it'll be done from
>> async context.
> 
> Did you already thought about the credentials being used for the async
> open? The application could call setuid() and similar calls to change
> the credentials of the userspace process/threads. In order for
> applications like samba to use this async openat, it would be required
> to specify the credentials for each open, as we have to multiplex
> requests from multiple user sessions in one process.
> 
> This applies to non-fd based syscall. Also for an async connect
> to a unix domain socket.
> 
> Do you have comments on this?

The open works like any of the other commands, it inherits the
credentials that the ring was setup with. Same with the memory context,
file table, etc. There's currently no way to have multiple personalities
within a single ring.

Sounds like you'd like an option for having multiple personalities
within a single ring? I think it would be better to have a ring per
personality instead. One thing we could do to make this more lightweight
is to have rings that are associated, so that we can share a lot of the
backend processing between them.
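
The "inherits the credentials that the ring was setup with" behaviour can be pictured with a small userspace model. This is only a toy sketch, not kernel code: struct toy_cred, struct toy_ring and all toy_* functions are invented for illustration, and the model assumes a single thread.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct cred; only the fields needed here. */
struct toy_cred {
    unsigned int uid;
    unsigned int gid;
};

/* Toy stand-in for a ring: it keeps its own copy of the credentials. */
struct toy_ring {
    struct toy_cred creds;
};

/* Credentials of the single calling thread in this model. */
struct toy_cred toy_current = { 0, 0 };

/* Model of seteuid()/setegid(): change the caller's identity. */
void toy_become(unsigned int uid, unsigned int gid)
{
    toy_current.uid = uid;
    toy_current.gid = gid;
}

/* Model of ring creation: snapshot whatever identity the caller has now. */
struct toy_ring toy_ring_setup(void)
{
    struct toy_ring ring;

    ring.creds = toy_current;
    return ring;
}

/* Identity a request on this ring would run under in this model. */
unsigned int toy_ring_uid(const struct toy_ring *ring)
{
    assert(ring != NULL);
    return ring->creds.uid;
}
```

In this model, a ring created while the caller is user 555 keeps reporting uid 555 after the caller has switched back to root, which is the "ring per personality" arrangement suggested above.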
Stefan Metzmacher Jan. 8, 2020, 4:32 p.m. UTC | #3
Am 08.01.20 um 17:20 schrieb Jens Axboe:
> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>> This works just like openat(2), except it can be performed async. For
>>> the normal case of a non-blocking path lookup this will complete
>>> inline. If we have to do IO to perform the open, it'll be done from
>>> async context.
>>
>> Did you already thought about the credentials being used for the async
>> open? The application could call setuid() and similar calls to change
>> the credentials of the userspace process/threads. In order for
>> applications like samba to use this async openat, it would be required
>> to specify the credentials for each open, as we have to multiplex
>> requests from multiple user sessions in one process.
>>
>> This applies to non-fd based syscall. Also for an async connect
>> to a unix domain socket.
>>
>> Do you have comments on this?
> 
> The open works like any of the other commands, it inherits the
> credentials that the ring was setup with. Same with the memory context,
> file table, etc. There's currently no way to have multiple personalities
> within a single ring.

Ah, it's user = get_uid(current_user()); and ctx->user = user in
io_uring_create(), right?

> Sounds like you'd like an option for having multiple personalities
> within a single ring?

I'm not sure anymore, I wasn't aware of the above.

> I think it would be better to have a ring per personality instead.

We could do that. I guess we could use per user rings for path based
operations and a single ring for fd based operations.

> One thing we could do to make this more lightweight
> is to have rings that are associated, so that we can share a lot of the
> backend processing between them.

My current idea is to use the ring fd and pass it to our main epoll loop.

Can you be more specific about what an API for associated rings could
look like?

Thanks!
metze
Jens Axboe Jan. 8, 2020, 4:40 p.m. UTC | #4
On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>> Hi Jens,
>>>
>>>> This works just like openat(2), except it can be performed async. For
>>>> the normal case of a non-blocking path lookup this will complete
>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>> async context.
>>>
>>> Did you already thought about the credentials being used for the async
>>> open? The application could call setuid() and similar calls to change
>>> the credentials of the userspace process/threads. In order for
>>> applications like samba to use this async openat, it would be required
>>> to specify the credentials for each open, as we have to multiplex
>>> requests from multiple user sessions in one process.
>>>
>>> This applies to non-fd based syscall. Also for an async connect
>>> to a unix domain socket.
>>>
>>> Do you have comments on this?
>>
>> The open works like any of the other commands, it inherits the
>> credentials that the ring was setup with. Same with the memory context,
>> file table, etc. There's currently no way to have multiple personalities
>> within a single ring.
> 
> Ah, it's user = get_uid(current_user()); and ctx->user = user in
> io_uring_create(), right?

That's just for the accounting, it's the:

ctx->creds = get_current_cred();

>> Sounds like you'd like an option for having multiple personalities
>> within a single ring?
> 
> I'm not sure anymore, I wasn't aware of the above.
> 
>> I think it would be better to have a ring per personality instead.
> 
> We could do that. I guess we could use per user rings for path based
> operations and a single ring for fd based operations.
> 
>> One thing we could do to make this more lightweight
>> is to have rings that are associated, so that we can share a lot of the
>> backend processing between them.
> 
> My current idea is to use the ring fd and pass it to our main epoll loop.
> 
> Can you be more specific about how an api for associated rings could
> look like?

The API would be the exact same, there would just be some way to
associate rings when you create them. Probably a new field in struct
io_uring_params (and an associated flag), which would tell io_uring that
two separate rings are really the same "user". This would allow io_uring
to use the same io-wq workqueues, for example, etc.

This depends on the fact that you can setup the rings with the right
personalities, that they would be known upfront. From your description,
I'm not so sure that's the case? If not, then we would indeed need
something that can pass in the credentials on a per-command basis. Not
sure what that would look like.
Stefan Metzmacher Jan. 8, 2020, 5:04 p.m. UTC | #5
Am 08.01.20 um 17:40 schrieb Jens Axboe:
> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>> Hi Jens,
>>>>
>>>>> This works just like openat(2), except it can be performed async. For
>>>>> the normal case of a non-blocking path lookup this will complete
>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>> async context.
>>>>
>>>> Did you already thought about the credentials being used for the async
>>>> open? The application could call setuid() and similar calls to change
>>>> the credentials of the userspace process/threads. In order for
>>>> applications like samba to use this async openat, it would be required
>>>> to specify the credentials for each open, as we have to multiplex
>>>> requests from multiple user sessions in one process.
>>>>
>>>> This applies to non-fd based syscall. Also for an async connect
>>>> to a unix domain socket.
>>>>
>>>> Do you have comments on this?
>>>
>>> The open works like any of the other commands, it inherits the
>>> credentials that the ring was setup with. Same with the memory context,
>>> file table, etc. There's currently no way to have multiple personalities
>>> within a single ring.
>>
>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>> io_uring_create(), right?
> 
> That's just for the accounting, it's the:
> 
> ctx->creds = get_current_cred();

Ok, I just looked at an old checkout.

In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
the async processing. Does a non-blocking openat also use ctx->creds?

>>> Sounds like you'd like an option for having multiple personalities
>>> within a single ring?
>>
>> I'm not sure anymore, I wasn't aware of the above.
>>
>>> I think it would be better to have a ring per personality instead.
>>
>> We could do that. I guess we could use per user rings for path based
>> operations and a single ring for fd based operations.
>>
>>> One thing we could do to make this more lightweight
>>> is to have rings that are associated, so that we can share a lot of the
>>> backend processing between them.
>>
>> My current idea is to use the ring fd and pass it to our main epoll loop.
>>
>> Can you be more specific about how an api for associated rings could
>> look like?
> 
> The API would be the exact same, there would just be some way to
> associate rings when you create them. Probably a new field in struct
> io_uring_params (and an associated flag), which would tell io_uring that
> two separate rings are really the same "user". This would allow io_uring
> to use the same io-wq workqueues, for example, etc.

Ok, this would be just for better performance / better usage of
resources, right?

> This depends on the fact that you can setup the rings with the right
> personalities, that they would be known upfront. From your description,
> I'm not so sure that's the case? If not, then we would indeed need
> something that can pass in the credentials on a per-command basis. Not
> sure what that would look like.

We know the credentials and using a ring per user should be ok.

metze
Jens Axboe Jan. 8, 2020, 10:53 p.m. UTC | #6
On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
> Am 08.01.20 um 17:40 schrieb Jens Axboe:
>> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>>> Hi Jens,
>>>>>
>>>>>> This works just like openat(2), except it can be performed async. For
>>>>>> the normal case of a non-blocking path lookup this will complete
>>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>>> async context.
>>>>>
>>>>> Did you already thought about the credentials being used for the async
>>>>> open? The application could call setuid() and similar calls to change
>>>>> the credentials of the userspace process/threads. In order for
>>>>> applications like samba to use this async openat, it would be required
>>>>> to specify the credentials for each open, as we have to multiplex
>>>>> requests from multiple user sessions in one process.
>>>>>
>>>>> This applies to non-fd based syscall. Also for an async connect
>>>>> to a unix domain socket.
>>>>>
>>>>> Do you have comments on this?
>>>>
>>>> The open works like any of the other commands, it inherits the
>>>> credentials that the ring was setup with. Same with the memory context,
>>>> file table, etc. There's currently no way to have multiple personalities
>>>> within a single ring.
>>>
>>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>>> io_uring_create(), right?
>>
>> That's just for the accounting, it's the:
>>
>> ctx->creds = get_current_cred();
> 
> Ok, I just looked at an old checkout.
> 
> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
> the async processing. Does a non-blocking openat also use ctx->creds?

There's basically two sets here - one set is in the ring, and the other
is the identity that the async thread (briefly) assumes if we have to go
async. Right now they are the same thing, and hence we don't need to
play any tricks off the system call submitting SQEs to assume any other
identity than the one we have.

>>>> Sounds like you'd like an option for having multiple personalities
>>>> within a single ring?
>>>
>>> I'm not sure anymore, I wasn't aware of the above.
>>>
>>>> I think it would be better to have a ring per personality instead.
>>>
>>> We could do that. I guess we could use per user rings for path based
>>> operations and a single ring for fd based operations.
>>>
>>>> One thing we could do to make this more lightweight
>>>> is to have rings that are associated, so that we can share a lot of the
>>>> backend processing between them.
>>>
>>> My current idea is to use the ring fd and pass it to our main epoll loop.
>>>
>>> Can you be more specific about how an api for associated rings could
>>> look like?
>>
>> The API would be the exact same, there would just be some way to
>> associate rings when you create them. Probably a new field in struct
>> io_uring_params (and an associated flag), which would tell io_uring that
>> two separate rings are really the same "user". This would allow io_uring
>> to use the same io-wq workqueues, for example, etc.
> 
> Ok, this would be just for better performance / better usage of
> resources, right?

Exactly

>> This depends on the fact that you can setup the rings with the right
>> personalities, that they would be known upfront. From your description,
>> I'm not so sure that's the case? If not, then we would indeed need
>> something that can pass in the credentials on a per-command basis. Not
>> sure what that would look like.
> 
> We know the credentials and using a ring per user should be ok.

Sounds good!
Stefan Metzmacher Jan. 8, 2020, 11:03 p.m. UTC | #7
Am 08.01.20 um 23:53 schrieb Jens Axboe:
> On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
>> Am 08.01.20 um 17:40 schrieb Jens Axboe:
>>> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>>>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>>> This works just like openat(2), except it can be performed async. For
>>>>>>> the normal case of a non-blocking path lookup this will complete
>>>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>>>> async context.
>>>>>>
>>>>>> Did you already thought about the credentials being used for the async
>>>>>> open? The application could call setuid() and similar calls to change
>>>>>> the credentials of the userspace process/threads. In order for
>>>>>> applications like samba to use this async openat, it would be required
>>>>>> to specify the credentials for each open, as we have to multiplex
>>>>>> requests from multiple user sessions in one process.
>>>>>>
>>>>>> This applies to non-fd based syscall. Also for an async connect
>>>>>> to a unix domain socket.
>>>>>>
>>>>>> Do you have comments on this?
>>>>>
>>>>> The open works like any of the other commands, it inherits the
>>>>> credentials that the ring was setup with. Same with the memory context,
>>>>> file table, etc. There's currently no way to have multiple personalities
>>>>> within a single ring.
>>>>
>>>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>>>> io_uring_create(), right?
>>>
>>> That's just for the accounting, it's the:
>>>
>>> ctx->creds = get_current_cred();
>>
>> Ok, I just looked at an old checkout.
>>
>> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
>> the async processing. Does a non-blocking openat also use ctx->creds?
> 
> There's basically two sets here - one set is in the ring, and the other
> is the identity that the async thread (briefly) assumes if we have to go
> async. Right now they are the same thing, and hence we don't need to
> play any tricks off the system call submitting SQEs to assume any other
> identity than the one we have.

I see two places that use it: io_sq_thread() and
io_wq_create()->io_worker_handle_work() both call override_creds().

But aren't non-blocking syscalls executed in the context of the thread
calling io_uring_enter()->io_submit_sqes()?
I only see some magic around ctx->sqo_mm for that case, but ctx->creds
doesn't seem to be used in that case. And my design would require that.

metze
Jens Axboe Jan. 8, 2020, 11:05 p.m. UTC | #8
On 1/8/20 4:03 PM, Stefan Metzmacher wrote:
> Am 08.01.20 um 23:53 schrieb Jens Axboe:
>> On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
>>> Am 08.01.20 um 17:40 schrieb Jens Axboe:
>>>> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>>>>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>>>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>>> This works just like openat(2), except it can be performed async. For
>>>>>>>> the normal case of a non-blocking path lookup this will complete
>>>>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>>>>> async context.
>>>>>>>
>>>>>>> Did you already thought about the credentials being used for the async
>>>>>>> open? The application could call setuid() and similar calls to change
>>>>>>> the credentials of the userspace process/threads. In order for
>>>>>>> applications like samba to use this async openat, it would be required
>>>>>>> to specify the credentials for each open, as we have to multiplex
>>>>>>> requests from multiple user sessions in one process.
>>>>>>>
>>>>>>> This applies to non-fd based syscall. Also for an async connect
>>>>>>> to a unix domain socket.
>>>>>>>
>>>>>>> Do you have comments on this?
>>>>>>
>>>>>> The open works like any of the other commands, it inherits the
>>>>>> credentials that the ring was setup with. Same with the memory context,
>>>>>> file table, etc. There's currently no way to have multiple personalities
>>>>>> within a single ring.
>>>>>
>>>>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>>>>> io_uring_create(), right?
>>>>
>>>> That's just for the accounting, it's the:
>>>>
>>>> ctx->creds = get_current_cred();
>>>
>>> Ok, I just looked at an old checkout.
>>>
>>> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
>>> the async processing. Does a non-blocking openat also use ctx->creds?
>>
>> There's basically two sets here - one set is in the ring, and the other
>> is the identity that the async thread (briefly) assumes if we have to go
>> async. Right now they are the same thing, and hence we don't need to
>> play any tricks off the system call submitting SQEs to assume any other
>> identity than the one we have.
> 
> I see two cases using it io_sq_thread() and
> io_wq_create()->io_worker_handle_work() call override_creds().
> 
> But aren't non-blocking syscall executed in the context of the thread
> calling io_uring_enter()->io_submit_sqes()?
> In only see some magic around ctx->sqo_mm for that case, but ctx->creds
> doesn't seem to be used in that case. And my design would require that.

For now, the sq thread (which is used if you use IORING_SETUP_SQPOLL)
currently requires fixed files, so it can't be used with open at the
moment anyway. But if/when enabled, it'll assume the same credentials
as the async context and syscall path.
Stefan Metzmacher Jan. 8, 2020, 11:11 p.m. UTC | #9
Am 09.01.20 um 00:05 schrieb Jens Axboe:
> On 1/8/20 4:03 PM, Stefan Metzmacher wrote:
>> Am 08.01.20 um 23:53 schrieb Jens Axboe:
>>> On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
>>>> Am 08.01.20 um 17:40 schrieb Jens Axboe:
>>>>> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>>>>>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>>>>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>>>>>> Hi Jens,
>>>>>>>>
>>>>>>>>> This works just like openat(2), except it can be performed async. For
>>>>>>>>> the normal case of a non-blocking path lookup this will complete
>>>>>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>>>>>> async context.
>>>>>>>>
>>>>>>>> Did you already thought about the credentials being used for the async
>>>>>>>> open? The application could call setuid() and similar calls to change
>>>>>>>> the credentials of the userspace process/threads. In order for
>>>>>>>> applications like samba to use this async openat, it would be required
>>>>>>>> to specify the credentials for each open, as we have to multiplex
>>>>>>>> requests from multiple user sessions in one process.
>>>>>>>>
>>>>>>>> This applies to non-fd based syscall. Also for an async connect
>>>>>>>> to a unix domain socket.
>>>>>>>>
>>>>>>>> Do you have comments on this?
>>>>>>>
>>>>>>> The open works like any of the other commands, it inherits the
>>>>>>> credentials that the ring was setup with. Same with the memory context,
>>>>>>> file table, etc. There's currently no way to have multiple personalities
>>>>>>> within a single ring.
>>>>>>
>>>>>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>>>>>> io_uring_create(), right?
>>>>>
>>>>> That's just for the accounting, it's the:
>>>>>
>>>>> ctx->creds = get_current_cred();
>>>>
>>>> Ok, I just looked at an old checkout.
>>>>
>>>> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
>>>> the async processing. Does a non-blocking openat also use ctx->creds?
>>>
>>> There's basically two sets here - one set is in the ring, and the other
>>> is the identity that the async thread (briefly) assumes if we have to go
>>> async. Right now they are the same thing, and hence we don't need to
>>> play any tricks off the system call submitting SQEs to assume any other
>>> identity than the one we have.
>>
>> I see two cases using it io_sq_thread() and
>> io_wq_create()->io_worker_handle_work() call override_creds().
>>
>> But aren't non-blocking syscall executed in the context of the thread
>> calling io_uring_enter()->io_submit_sqes()?
>> In only see some magic around ctx->sqo_mm for that case, but ctx->creds
>> doesn't seem to be used in that case. And my design would require that.
> 
> For now, the sq thread (which is used if you use IORING_SETUP_SQPOLL)
> currently requires fixed files, so it can't be used with open at the
> moment anyway. But if/when enabled, it'll assume the same credentials
> as the async context and syscall path.

I'm sorry, but I'm still unsure we're talking about the same thing
(or maybe I'm missing some basics here).

My understanding of io_uring_enter() is that it will execute as many
non-blocking calls as it can without switching to any other kernel thread.

And my fear is that openat will use get_current_cred() instead of
ctx->creds.

Am I missing something?

Thanks!
metze
Jens Axboe Jan. 8, 2020, 11:22 p.m. UTC | #10
On 1/8/20 4:11 PM, Stefan Metzmacher wrote:
> Am 09.01.20 um 00:05 schrieb Jens Axboe:
>> On 1/8/20 4:03 PM, Stefan Metzmacher wrote:
>>> Am 08.01.20 um 23:53 schrieb Jens Axboe:
>>>> On 1/8/20 10:04 AM, Stefan Metzmacher wrote:
>>>>> Am 08.01.20 um 17:40 schrieb Jens Axboe:
>>>>>> On 1/8/20 9:32 AM, Stefan Metzmacher wrote:
>>>>>>> Am 08.01.20 um 17:20 schrieb Jens Axboe:
>>>>>>>> On 1/8/20 6:05 AM, Stefan Metzmacher wrote:
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>>> This works just like openat(2), except it can be performed async. For
>>>>>>>>>> the normal case of a non-blocking path lookup this will complete
>>>>>>>>>> inline. If we have to do IO to perform the open, it'll be done from
>>>>>>>>>> async context.
>>>>>>>>>
>>>>>>>>> Did you already thought about the credentials being used for the async
>>>>>>>>> open? The application could call setuid() and similar calls to change
>>>>>>>>> the credentials of the userspace process/threads. In order for
>>>>>>>>> applications like samba to use this async openat, it would be required
>>>>>>>>> to specify the credentials for each open, as we have to multiplex
>>>>>>>>> requests from multiple user sessions in one process.
>>>>>>>>>
>>>>>>>>> This applies to non-fd based syscall. Also for an async connect
>>>>>>>>> to a unix domain socket.
>>>>>>>>>
>>>>>>>>> Do you have comments on this?
>>>>>>>>
>>>>>>>> The open works like any of the other commands, it inherits the
>>>>>>>> credentials that the ring was setup with. Same with the memory context,
>>>>>>>> file table, etc. There's currently no way to have multiple personalities
>>>>>>>> within a single ring.
>>>>>>>
>>>>>>> Ah, it's user = get_uid(current_user()); and ctx->user = user in
>>>>>>> io_uring_create(), right?
>>>>>>
>>>>>> That's just for the accounting, it's the:
>>>>>>
>>>>>> ctx->creds = get_current_cred();
>>>>>
>>>>> Ok, I just looked at an old checkout.
>>>>>
>>>>> In kernel-dk-block/for-5.6/io_uring-vfs I see this only used in
>>>>> the async processing. Does a non-blocking openat also use ctx->creds?
>>>>
>>>> There's basically two sets here - one set is in the ring, and the other
>>>> is the identity that the async thread (briefly) assumes if we have to go
>>>> async. Right now they are the same thing, and hence we don't need to
>>>> play any tricks off the system call submitting SQEs to assume any other
>>>> identity than the one we have.
>>>
>>> I see two cases using it io_sq_thread() and
>>> io_wq_create()->io_worker_handle_work() call override_creds().
>>>
>>> But aren't non-blocking syscall executed in the context of the thread
>>> calling io_uring_enter()->io_submit_sqes()?
>>> In only see some magic around ctx->sqo_mm for that case, but ctx->creds
>>> doesn't seem to be used in that case. And my design would require that.
>>
>> For now, the sq thread (which is used if you use IORING_SETUP_SQPOLL)
>> currently requires fixed files, so it can't be used with open at the
>> moment anyway. But if/when enabled, it'll assume the same credentials
>> as the async context and syscall path.
> 
> I'm sorry, but I'm still unsure we're talking about the same thing
> (or maybe I'm missing some basics here).
> 
> My understanding of the io_uring_enter() is that it will execute as much
> non-blocking calls as it can without switching to any other kernel thread.

Correct, any SQE that we can do without switching, we will.

> And my fear is that openat will use get_current_cred() instead of
> ctx->creds.

OK, I think I follow your concern. So you'd like to set up the rings from
a _different_ user, and then later on use them to submit SQEs on behalf of
a specific user. So sort of the same as our initial discussion, except
the mapping would be static. The difference being that you might set up
the ring from a different user than the user that would be submitting IO
on it?

If so, then we do need something to support that, probably an
IORING_REGISTER_CREDS or similar. This would allow you to replace the
creds you currently have in ctx->creds with whatever new one.

> I'm I missing something?

I think we're talking about the same thing, just different views of it :-)
Stefan Metzmacher Jan. 9, 2020, 10:40 a.m. UTC | #11
>> I'm sorry, but I'm still unsure we're talking about the same thing
>> (or maybe I'm missing some basics here).
>>
>> My understanding of the io_uring_enter() is that it will execute as much
>> non-blocking calls as it can without switching to any other kernel thread.
> 
> Correct, any SQE that we can do without switching, we will.
> 
>> And my fear is that openat will use get_current_cred() instead of
>> ctx->creds.
> 
> OK, I think I follow your concern. So you'd like to setup the rings from
> a _different_ user, and then later on use it for submission for SQEs that
> a specific user. So sort of the same as our initial discussion, except
> the mapping would be static. The difference being that you might setup
> the ring from a different user than the user that would be submitting IO
> on it?

Our current (much simplified here) flow is this:

  # we start as root
  seteuid(0);setegid(0);setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # Start an openat2 on behalf of user555
  openat2()
  # we unbecome the user again and run as root
  seteuid(0);setegid(0); setgroups()...
  ...
  # we become the user444 and
  # create our desired credential token
  seteuid(444); setegid(444); setgroups()...
  # Start an openat2 on behalf of user444
  openat2()
  # we unbecome the user again and run as root
  seteuid(0);setegid(0); setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # Start an openat2 on behalf of user555
  openat2()
  # we unbecome the user again and run as root
  seteuid(0);setegid(0); setgroups()...

It means we have to do about 7 syscalls in order
to open a file on behalf of a user.
(In reality we cache things and avoid set*id()
calls most of the time, but I want to demonstrate the
simplified design here)
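
One step of that flow could look roughly like the following in C. This is a hedged sketch, not Samba's actual code: open_as_user() is a hypothetical helper, plain openat(2) stands in for openat2(), and the setgroups() calls are left out so the sketch also runs without CAP_SETGID.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` with the effective IDs (uid, gid),
 * then restore the caller's previous effective IDs.
 * Returns the new fd, or -1 on failure.
 */
int open_as_user(uid_t uid, gid_t gid, const char *path, int flags)
{
    uid_t saved_uid = geteuid();
    gid_t saved_gid = getegid();
    int fd;

    assert(path != NULL);

    /* Become the user: group first, because once the euid is dropped
     * the process may no longer be allowed to change its egid. */
    if (setegid(gid) != 0)
        return -1;
    if (seteuid(uid) != 0) {
        if (setegid(saved_gid) != 0) {
            /* Nothing more can be done here; report failure below. */
        }
        return -1;
    }

    /* The actual work: a single open on behalf of the user. */
    fd = openat(AT_FDCWD, path, flags);

    /* Unbecome: restore the euid first so restoring the egid is permitted. */
    if (seteuid(saved_uid) != 0 || setegid(saved_gid) != 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    return fd;
}
```

The sketch issues five syscalls per open; adding a setgroups() call on each side of the openat() would bring it to the seven mentioned above.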

With io_uring I'd like to use a flow like this:

  # we start as root
  seteuid(0);setegid(0);setgroups()...
  ...
  # we become the user444 and
  # create our desired credential token
  seteuid(444); setegid(444); setgroups()...
  # we snapshot the credentials to the new ring for user444
  ring444 = io_uring_setup()
  # we unbecome the user again and run as root
  seteuid(0);setegid(0);setgroups()...
  ...
  # we become the user555 and
  # create our desired credential token
  seteuid(555); setegid(555); setgroups()...
  # we snapshot the credentials to the new ring for user555
  ring555 = io_uring_setup()
  # we unbecome the user again and run as root
  seteuid(0);setegid(0);setgroups()...
  ...
  # Start an openat2 on behalf of user555
  io_uring_enter(ring555, OP_OPENAT2...)
  ...
  # Start an openat2 on behalf of user444
  io_uring_enter(ring444, OP_OPENAT2...)
  ...
  # Start an openat2 on behalf of user555
  io_uring_enter(ring555, OP_OPENAT2...)

So instead of constantly doing 7 syscalls per open,
we would be down to just at most one. And I would assume
that io_uring_enter() would do the temporary credential switch
for me also in the non-blocking case.

> If so, then we do need something to support that, probably an
> IORING_REGISTER_CREDS or similar. This would allow you to replace the
> creds you currently have in ctx->creds with whatever new one.

I don't want to change ctx->creds, but I want it to be used consistently.

What I think is missing is something like this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 32aee149f652..55dbb154915a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6359,10 +6359,27 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
                struct mm_struct *cur_mm;

                mutex_lock(&ctx->uring_lock);
+               if (current->mm != ctx->sqo_mm) {
+                       // TODO: something like this...
+                       restore_mm = current->mm;
+                       use_mm(ctx->sqo_mm);
+               }
                /* already have mm, so io_submit_sqes() won't try to grab it */
                cur_mm = ctx->sqo_mm;
+               if (current_cred() != ctx->creds) {
+                       // TODO: something like this...
+                       restore_cred = override_creds(ctx->creds);
+               }
                submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
                                           &cur_mm, false);
+               if (restore_cred != NULL) {
+                       revert_creds(restore_cred);
+               }
+               if (restore_mm != NULL) {
+                       // TODO: something like this...
+                       unuse_mm(ctx->sqo_mm);
+                       use_mm(restore_mm);
+               }
                mutex_unlock(&ctx->uring_lock);

                if (submitted != to_submit)

I'm not sure if current->mm is needed, I just added it for completeness
and as a hint that io_op_defs[req->opcode].needs_mm is there and a
needs_creds could also be added (if it helps with performance).

Is it possible to trigger a change of current->mm from userspace?

An IORING_REGISTER_CREDS would only be useful if it's possible to
register a set of credentials and then use per io_uring_sqe credentials.
That would also be fine for me, but I'm not sure it's needed for now.

Apart from IORING_REGISTER_CREDS I think a change like the one
above is needed in order to avoid potential security problems.
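
The save/override/restore shape of the change above can be modelled in
plain user-space C. This is only a sketch: every model_* name is
hypothetical, and the real override_creds()/revert_creds() act on the
current task and a refcounted struct cred rather than on the simplified
structs used here.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; none of this is kernel API. */
struct model_cred { unsigned int euid; };
struct model_task { const struct model_cred *cred; };
struct model_ring { const struct model_cred *creds; };	/* snapshot from setup */

/* Install 'next' on the task and return the previous creds. */
const struct model_cred *model_override_creds(struct model_task *t,
					      const struct model_cred *next)
{
	const struct model_cred *old = t->cred;

	t->cred = next;
	return old;
}

/* Put the saved creds back on the task. */
void model_revert_creds(struct model_task *t, const struct model_cred *old)
{
	t->cred = old;
}

/*
 * Submit under the ring's creds and return the euid the "operation" saw.
 * The restore pointer starts out NULL so the revert is skipped when no
 * override happened.
 */
unsigned int model_submit(struct model_task *t, const struct model_ring *ring)
{
	const struct model_cred *restore = NULL;
	unsigned int seen;

	if (t->cred != ring->creds)
		restore = model_override_creds(t, ring->creds);

	seen = t->cred->euid;	/* stands in for io_submit_sqes() */

	if (restore != NULL)
		model_revert_creds(t, restore);
	return seen;
}
```

In this model a task running as root that submits on a ring created as
user 555 has the operation run with euid 555, and is back to its own
creds once model_submit() returns.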

>> Am I missing something?
> 
> I think we're talking about the same thing, just different views of it :-)

I hope it's clear from my side now :-)

Thanks!
metze
Jens Axboe Jan. 9, 2020, 9:31 p.m. UTC | #12
On 1/9/20 3:40 AM, Stefan Metzmacher wrote:
>>> I'm sorry, but I'm still unsure we're talking about the same thing
>>> (or maybe I'm missing some basics here).
>>>
>>> My understanding of the io_uring_enter() is that it will execute as much
>>> non-blocking calls as it can without switching to any other kernel thread.
>>
>> Correct, any SQE that we can do without switching, we will.
>>
>>> And my fear is that openat will use get_current_cred() instead of
>>> ctx->creds.
>>
>> OK, I think I follow your concern. So you'd like to setup the rings from
>> a _different_ user, and then later on use it for submission for SQEs that
>> a specific user. So sort of the same as our initial discussion, except
>> the mapping would be static. The difference being that you might setup
>> the ring from a different user than the user that would be submitting IO
>> on it?
> 
> Our current (much simplified here) flow is this:
> 
>   # we start as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # Start an openat2 on behalf of user555
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
>   ...
>   # we become the user444 and
>   # create our desired credential token
>   seteuid(444); setegid(444); setgroups()...
>   # Start an openat2 on behalf of user444
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # Start an openat2 on behalf of user555
>   openat2()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0); setgroups()...
> 
> It means we have to do about 7 syscalls in order
> to open a file on behalf of a user.
> (In reality we cache things and avoid set*id()
> calls most of the time, but I want to demonstrate the
> simplified design here)
> 
> With io_uring I'd like to use a flow like this:
> 
>   # we start as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user444 and
>   # create our desired credential token
>   seteuid(444); setegid(444); setgroups()...
>   # we snapshot the credentials to the new ring for user444
>   ring444 = io_uring_setup()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # we become the user555 and
>   # create our desired credential token
>   seteuid(555); setegid(555); setgroups()...
>   # we snapshot the credentials to the new ring for user555
>   ring555 = io_uring_setup()
>   # we unbecome the user again and run as root
>   seteuid(0);setegid(0);setgroups()...
>   ...
>   # Start an openat2 on behalf of user555
>   io_uring_enter(ring555, OP_OPENAT2...)
>   ...
>   # Start an openat2 on behalf of user444
>   io_uring_enter(ring444, OP_OPENAT2...)
>   ...
>   # Start an openat2 on behalf of user555
>   io_uring_enter(ring555, OP_OPENAT2...)
> 
> So instead of constantly doing 7 syscalls per open,
> we would be down to just at most one. And I would assume
> that io_uring_enter() would do the temporary credential switch
> for me also in the non-blocking case.

OK, thanks for spelling the use case out, makes it easier to understand
what you need in terms of what we currently can't do.

>> If so, then we do need something to support that, probably an
>> IORING_REGISTER_CREDS or similar. This would allow you to replace the
>> creds you currently have in ctx->creds with whatever new one.
> 
> I don't want to change ctx->creds, but I want it to be used consistently.
> 
> What I think is missing is something like this:
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 32aee149f652..55dbb154915a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6359,10 +6359,27 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>                 struct mm_struct *cur_mm;
> 
>                 mutex_lock(&ctx->uring_lock);
> +               if (current->mm != ctx->sqo_mm) {
> +                       // TODO: something like this...
> +                       restore_mm = current->mm;
> +                       use_mm(ctx->sqo_mm);
> +               }
>                 /* already have mm, so io_submit_sqes() won't try to grab it */
>                 cur_mm = ctx->sqo_mm;
> +               if (current_cred() != ctx->creds) {
> +                       // TODO: something like this...
> +                       restore_cred = override_creds(ctx->creds);
> +               }
>                 submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
>                                            &cur_mm, false);
> +               if (restore_cred != NULL) {
> +                       revert_creds(restore_cred);
> +               }
> +               if (restore_mm != NULL) {
> +                       // TODO: something like this...
> +                       unuse_mm(ctx->sqo_mm);
> +                       use_mm(restore_mm);
> +               }
>                 mutex_unlock(&ctx->uring_lock);
> 
>                 if (submitted != to_submit)
> 
> I'm not sure if current->mm is needed, I just added it for completeness
> and as hint that io_op_defs[req->opcode].needs_mm is there and a
> needs_creds could also be added (if it helps with performance)
> 
> Is it possible to trigger a change of current->mm from userspace?
> 
> An IORING_REGISTER_CREDS would only be useful if it's possible to
> register a set of credentials and then use per io_uring_sqe credentials.
> That would also be fine for me, but I'm not sure it's needed for now.

I think it'd be a cleaner way of doing the same thing as your patch
does. It seems a little odd to do this by default (having the ring
change personalities depending on who's using it), but from an opt-in
point of view, I think it makes more sense.

That would make the IORING_REGISTER_ call something like
IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
ring would just assume the identity of the task that's calling
io_uring_enter().

Note that this also has to be passed through to the io-wq handler, as
the mappings there are currently static as well.
Stefan Metzmacher Jan. 16, 2020, 10:42 p.m. UTC | #13
Am 09.01.20 um 22:31 schrieb Jens Axboe:
> On 1/9/20 3:40 AM, Stefan Metzmacher wrote:
>>>> I'm sorry, but I'm still unsure we're talking about the same thing
>>>> (or maybe I'm missing some basics here).
>>>>
>>>> My understanding of the io_uring_enter() is that it will execute as much
>>>> non-blocking calls as it can without switching to any other kernel thread.
>>>
>>> Correct, any SQE that we can do without switching, we will.
>>>
>>>> And my fear is that openat will use get_current_cred() instead of
>>>> ctx->creds.
>>>
>>> OK, I think I follow your concern. So you'd like to setup the rings from
>>> a _different_ user, and then later on use it for submission for SQEs that
>>> a specific user. So sort of the same as our initial discussion, except
>>> the mapping would be static. The difference being that you might setup
>>> the ring from a different user than the user that would be submitting IO
>>> on it?
>>
>> Our current (much simplified here) flow is this:
>>
>>   # we start as root
>>   seteuid(0);setegid(0);setgroups()...
>>   ...
>>   # we become the user555 and
>>   # create our desired credential token
>>   seteuid(555); setegid(555); setgroups()...
>>   # Start an openat2 on behalf of user555
>>   openat2()
>>   # we unbecome the user again and run as root
>>   seteuid(0);setegid(0); setgroups()...
>>   ...
>>   # we become the user444 and
>>   # create our desired credential token
>>   seteuid(444); setegid(444); setgroups()...
>>   # Start an openat2 on behalf of user444
>>   openat2()
>>   # we unbecome the user again and run as root
>>   seteuid(0);setegid(0); setgroups()...
>>   ...
>>   # we become the user555 and
>>   # create our desired credential token
>>   seteuid(555); setegid(555); setgroups()...
>>   # Start an openat2 on behalf of user555
>>   openat2()
>>   # we unbecome the user again and run as root
>>   seteuid(0);setegid(0); setgroups()...
>>
>> It means we have to do about 7 syscalls in order
>> to open a file on behalf of a user.
>> (In reality we cache things and avoid set*id()
>> calls most of the time, but I want to demonstrate the
>> simplified design here)
>>
>> With io_uring I'd like to use a flow like this:
>>
>>   # we start as root
>>   seteuid(0);setegid(0);setgroups()...
>>   ...
>>   # we become the user444 and
>>   # create our desired credential token
>>   seteuid(444); setegid(444); setgroups()...
>>   # we snapshot the credentials to the new ring for user444
>>   ring444 = io_uring_setup()
>>   # we unbecome the user again and run as root
>>   seteuid(0);setegid(0);setgroups()...
>>   ...
>>   # we become the user555 and
>>   # create our desired credential token
>>   seteuid(555); setegid(555); setgroups()...
>>   # we snapshot the credentials to the new ring for user555
>>   ring555 = io_uring_setup()
>>   # we unbecome the user again and run as root
>>   seteuid(0);setegid(0);setgroups()...
>>   ...
>>   # Start an openat2 on behalf of user555
>>   io_uring_enter(ring555, OP_OPENAT2...)
>>   ...
>>   # Start an openat2 on behalf of user444
>>   io_uring_enter(ring444, OP_OPENAT2...)
>>   ...
>>   # Start an openat2 on behalf of user555
>>   io_uring_enter(ring555, OP_OPENAT2...)
>>
>> So instead of constantly doing 7 syscalls per open,
>> we would be down to just at most one. And I would assume
>> that io_uring_enter() would do the temporary credential switch
>> for me also in the non-blocking case.
> 
> OK, thanks for spelling the use case out, makes it easier to understand
> what you need in terms of what we currently can't do.
> 
>>> If so, then we do need something to support that, probably an
>>> IORING_REGISTER_CREDS or similar. This would allow you to replace the
>>> creds you currently have in ctx->creds with whatever new one.
>>
>> I don't want to change ctx->creds, but I want it to be used consistently.
>>
>> What I think is missing is something like this:
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 32aee149f652..55dbb154915a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -6359,10 +6359,27 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>>                 struct mm_struct *cur_mm;
>>
>>                 mutex_lock(&ctx->uring_lock);
>> +               if (current->mm != ctx->sqo_mm) {
>> +                       // TODO: something like this...
>> +                       restore_mm = current->mm;
>> +                       use_mm(ctx->sqo_mm);
>> +               }
>>                 /* already have mm, so io_submit_sqes() won't try to grab it */
>>                 cur_mm = ctx->sqo_mm;
>> +               if (current_cred() != ctx->creds) {
>> +                       // TODO: something like this...
>> +                       restore_cred = override_creds(ctx->creds);
>> +               }
>>                 submitted = io_submit_sqes(ctx, to_submit, f.file, fd,
>>                                            &cur_mm, false);
>> +               if (restore_cred != NULL) {
>> +                       revert_creds(restore_cred);
>> +               }
>> +               if (restore_mm != NULL) {
>> +                       // TODO: something like this...
>> +                       unuse_mm(ctx->sqo_mm);
>> +                       use_mm(restore_mm);
>> +               }
>>                 mutex_unlock(&ctx->uring_lock);
>>
>>                 if (submitted != to_submit)
>>
>> I'm not sure if current->mm is needed, I just added it for completeness
>> and as hint that io_op_defs[req->opcode].needs_mm is there and a
>> needs_creds could also be added (if it helps with performance)
>>
>> Is it possible to trigger a change of current->mm from userspace?
>>
>> An IORING_REGISTER_CREDS would only be useful if it's possible to
>> register a set of credentials and then use per io_uring_sqe credentials.
>> That would also be fine for me, but I'm not sure it's needed for now.
> 
> I think it'd be a cleaner way of doing the same thing as your patch
> does. It seems a little odd to do this by default (having the ring
> change personalities depending on who's using it), but from an opt-in
> point of view, I think it makes more sense.
> 
> That would make the IORING_REGISTER_ call something like
> IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
> ring would just assume the identity of the task that's calling
> io_uring_enter().
> 
> Note that this also has to be passed through to the io-wq handler, as
> the mappings there are currently static as well.

What's the next step here?

I think the current state is a security problem!

The inline execution either needs to change the creds temporarily
or io_uring_enter() needs a general check that the current creds match
the creds of the ring and return -EPERM or something similar.
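
The second option, a general check at submit time, could look roughly
like the user-space sketch below. The struct and function names are
hypothetical, and comparing only euid and egid is an assumption made to
keep the example short; a real credential comparison would have more to
cover (supplementary groups, capabilities and so on).

```c
#include <errno.h>

/* Hypothetical, heavily simplified credential record. */
struct model_cred {
	unsigned int euid;
	unsigned int egid;
};

/*
 * Model of a check done on entry to the submit path: refuse the
 * submission unless the caller's creds match the ones the ring was
 * set up with. Returns 0 on match, -EPERM otherwise.
 */
int model_enter_check(const struct model_cred *caller,
		      const struct model_cred *ring_creds)
{
	if (caller->euid != ring_creds->euid ||
	    caller->egid != ring_creds->egid)
		return -EPERM;
	return 0;
}
```

Compared with the temporary override, this variant never runs an
operation under creds other than the caller's own, at the price of
rejecting the ring-per-user flow described earlier unless the caller
switches identity before each submit.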

Thanks!
metze
Jens Axboe Jan. 17, 2020, 12:16 a.m. UTC | #14
On 1/16/20 3:42 PM, Stefan Metzmacher wrote:
>>> I'm not sure if current->mm is needed, I just added it for completeness
>>> and as hint that io_op_defs[req->opcode].needs_mm is there and a
>>> needs_creds could also be added (if it helps with performance)
>>>
>>> Is it possible to trigger a change of current->mm from userspace?
>>>
>>> An IORING_REGISTER_CREDS would only be useful if it's possible to
>>> register a set of credentials and then use per io_uring_sqe credentials.
>>> That would also be fine for me, but I'm not sure it's needed for now.
>>
>> I think it'd be a cleaner way of doing the same thing as your patch
>> does. It seems a little odd to do this by default (having the ring
>> change personalities depending on who's using it), but from an opt-in
>> point of view, I think it makes more sense.
>>
>> That would make the IORING_REGISTER_ call something like
>> IORING_REGISTER_ADOPT_OWNER or something like that, meaning that the
>> ring would just assume the identify of the task that's calling
>> io_uring_enter().
>>
>> Note that this also has to be passed through to the io-wq handler, as
>> the mappings there are currently static as well.
> 
> What's the next step here?

Not sure, need to find some time to work on this!

> I think the current state is a security problem!
> 
> The inline execution either needs to change the creds temporarily
> or io_uring_enter() needs a general check that the current creds match
> the creds of the ring and return -EPERM or something similar.

Hmm, if you transfer the fd to someone else, you also give them access
to your credentials etc. We could make that -EPERM, if the owner of the
ring isn't the one invoking the submit. But that doesn't really help the
SQPOLL case, which simply consumes SQE entries. There can be no checking
there.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1822bf9aba12..53ff67ab5c4b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -70,6 +70,8 @@ 
 #include <linux/sizes.h>
 #include <linux/hugetlb.h>
 #include <linux/highmem.h>
+#include <linux/namei.h>
+#include <linux/fsnotify.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -353,6 +355,15 @@  struct io_sr_msg {
 	int				msg_flags;
 };
 
+struct io_open {
+	struct file			*file;
+	int				dfd;
+	umode_t				mode;
+	const char __user		*fname;
+	struct filename			*filename;
+	int				flags;
+};
+
 struct io_async_connect {
 	struct sockaddr_storage		address;
 };
@@ -371,12 +382,17 @@  struct io_async_rw {
 	ssize_t				size;
 };
 
+struct io_async_open {
+	struct filename			*filename;
+};
+
 struct io_async_ctx {
 	union {
 		struct io_async_rw	rw;
 		struct io_async_msghdr	msg;
 		struct io_async_connect	connect;
 		struct io_timeout_data	timeout;
+		struct io_async_open	open;
 	};
 };
 
@@ -397,6 +413,7 @@  struct io_kiocb {
 		struct io_timeout	timeout;
 		struct io_connect	connect;
 		struct io_sr_msg	sr_msg;
+		struct io_open		open;
 	};
 
 	struct io_async_ctx		*io;
@@ -2135,6 +2152,79 @@  static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt,
 	return 0;
 }
 
+static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	int ret;
+
+	if (sqe->ioprio || sqe->buf_index)
+		return -EINVAL;
+
+	req->open.dfd = READ_ONCE(sqe->fd);
+	req->open.mode = READ_ONCE(sqe->len);
+	req->open.fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	req->open.flags = READ_ONCE(sqe->open_flags);
+
+	req->open.filename = getname(req->open.fname);
+	if (IS_ERR(req->open.filename)) {
+		ret = PTR_ERR(req->open.filename);
+		req->open.filename = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
+static void io_openat_async(struct io_wq_work **workptr)
+{
+	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
+	struct filename *filename = req->open.filename;
+
+	io_wq_submit_work(workptr);
+	putname(filename);
+}
+
+static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt,
+		     bool force_nonblock)
+{
+	struct open_flags op;
+	struct open_how how;
+	struct file *file;
+	int ret;
+
+	how = build_open_how(req->open.flags, req->open.mode);
+	ret = build_open_flags(&how, &op);
+	if (ret)
+		goto err;
+	if (force_nonblock)
+		op.lookup_flags |= LOOKUP_NONBLOCK;
+
+	ret = get_unused_fd_flags(how.flags);
+	if (ret < 0)
+		goto err;
+
+	file = do_filp_open(req->open.dfd, req->open.filename, &op);
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
+		if (ret == -EAGAIN) {
+			req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
+			req->work.func = io_openat_async;
+			return -EAGAIN;
+		}
+	} else {
+		fsnotify_open(file);
+		fd_install(ret, file);
+	}
+err:
+	if (!io_wq_current_is_worker())
+		putname(req->open.filename);
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_cqring_add_event(req, ret);
+	io_put_req_find_next(req, nxt);
+	return 0;
+}
+
 static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -3160,6 +3250,9 @@  static int io_req_defer_prep(struct io_kiocb *req,
 	case IORING_OP_FALLOCATE:
 		ret = io_fallocate_prep(req, sqe);
 		break;
+	case IORING_OP_OPENAT:
+		ret = io_openat_prep(req, sqe);
+		break;
 	default:
 		printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
 				req->opcode);
@@ -3322,6 +3415,14 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 		ret = io_fallocate(req, nxt, force_nonblock);
 		break;
+	case IORING_OP_OPENAT:
+		if (sqe) {
+			ret = io_openat_prep(req, sqe);
+			if (ret)
+				break;
+		}
+		ret = io_openat(req, nxt, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -3403,7 +3504,7 @@  static bool io_req_op_valid(int op)
 	return op >= IORING_OP_NOP && op < IORING_OP_LAST;
 }
 
-static int io_req_needs_file(struct io_kiocb *req)
+static int io_req_needs_file(struct io_kiocb *req, int fd)
 {
 	switch (req->opcode) {
 	case IORING_OP_NOP:
@@ -3413,6 +3514,8 @@  static int io_req_needs_file(struct io_kiocb *req)
 	case IORING_OP_ASYNC_CANCEL:
 	case IORING_OP_LINK_TIMEOUT:
 		return 0;
+	case IORING_OP_OPENAT:
+		return fd != -1;
 	default:
 		if (io_req_op_valid(req->opcode))
 			return 1;
@@ -3442,7 +3545,7 @@  static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 	if (flags & IOSQE_IO_DRAIN)
 		req->flags |= REQ_F_IO_DRAIN;
 
-	ret = io_req_needs_file(req);
+	ret = io_req_needs_file(req, fd);
 	if (ret <= 0)
 		return ret;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bdbe2b130179..02af580754ce 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -34,6 +34,7 @@  struct io_uring_sqe {
 		__u32		timeout_flags;
 		__u32		accept_flags;
 		__u32		cancel_flags;
+		__u32		open_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -77,6 +78,7 @@  enum {
 	IORING_OP_LINK_TIMEOUT,
 	IORING_OP_CONNECT,
 	IORING_OP_FALLOCATE,
+	IORING_OP_OPENAT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
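
For readers who want to see how the SQE fields line up with what
io_openat_prep() reads (fd as the directory fd, addr as the pathname
pointer, len as the mode, open_flags as the open flags, with ioprio and
buf_index left at zero), here is a small user-space sketch. It uses a
trimmed-down, hypothetical stand-in for struct io_uring_sqe rather than
the real UAPI layout, and MODEL_OP_OPENAT is a placeholder value, not
the real opcode number.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in holding only the fields io_openat_prep() looks
 * at. The real definition is in include/uapi/linux/io_uring.h.
 */
struct model_sqe {
	uint8_t  opcode;
	uint16_t ioprio;	/* must stay 0, else -EINVAL */
	int32_t  fd;		/* becomes req->open.dfd */
	uint64_t addr;		/* user pointer to the pathname */
	uint32_t len;		/* becomes req->open.mode */
	uint32_t open_flags;	/* becomes req->open.flags */
	uint64_t user_data;	/* handed back in the completion */
	uint16_t buf_index;	/* must stay 0, else -EINVAL */
};

enum { MODEL_OP_OPENAT = 1 };	/* placeholder opcode for this sketch */

/* Fill an SQE the way the prep handler in the patch expects it. */
void model_prep_openat(struct model_sqe *sqe, int dfd, const char *path,
		       int flags, mode_t mode, uint64_t user_data)
{
	memset(sqe, 0, sizeof(*sqe));	/* clears ioprio and buf_index */
	sqe->opcode = MODEL_OP_OPENAT;
	sqe->fd = dfd;
	sqe->addr = (uint64_t)(uintptr_t)path;
	sqe->len = (uint32_t)mode;
	sqe->open_flags = (uint32_t)flags;
	sqe->user_data = user_data;
}
```

Zeroing the whole entry first matters because the prep handler in the
patch rejects any request with a non-zero ioprio or buf_index.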