mbox series

[RFC,0/9] Launching processes with io_uring

Message ID 20241209234316.4132786-1-krisman@suse.de (mailing list archive)
Headers show
Series Launching processes with io_uring | expand

Message

Gabriel Krisman Bertazi Dec. 9, 2024, 11:43 p.m. UTC
During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
fork and exec new processes through io_uring [1].  The goal, according
to him, was to have a very efficient mechanism to quickly execute tasks,
eliminating the multiple roundtrips to userspace required to fork,
perform multiple $PATH lookup and finally execve.  In addition, he
mentioned this would allow for a more simple implementation of
preparatory tasks, such as file redirection configuration, and handling
of stuff like posix_spawn_file_actions_t.

This RFC revives his original patchset.  I fixed all the pending issues
I found with task submission, including the issue blocking the work at
the time, a kernel corruption after a few spawns, converted the execve
command into execveat* variant, cleaned up the code and surely
introduced a few bugs of my own along the way.  At this point, I made it
an RFC because I have a few outstanding questions about the design, in
particular whether the CLONE context would be better implemented as a
special io-wq case to avoid the exposure of io_issue_sqe and
duplication of the dispatching logic.

I'm also providing the liburing support in a separate patchset,
including a testcase that exemplifies the $PATH lookup mechanism
proposed by Josh.

Thanks,

[1]  https://lwn.net/Articles/908268/

Gabriel Krisman Bertazi (6):
  io_uring: Drop __io_req_find_next_prep
  io_uring: Expose failed request helper in internal header
  kernel/fork: Don't inherit PF_USER_WORKER from parent
  fs/exec: Expose do_execveat symbol
  io_uring: Let commands run with current credentials
  io_uring: Let ->issue know if it was called from spawn thread

Josh Triplett (3):
  kernel/fork: Add helper to fork from io_uring
  io_uring: Introduce IORING_OP_CLONE
  io_uring: Introduce IORING_OP_EXEC command

 fs/exec.c                      |   2 +-
 include/linux/binfmts.h        |   5 +
 include/linux/io_uring_types.h |   3 +
 include/linux/sched/task.h     |   1 +
 include/uapi/linux/io_uring.h  |   3 +
 io_uring/Makefile              |   2 +-
 io_uring/io_uring.c            |  27 ++---
 io_uring/io_uring.h            |   8 ++
 io_uring/opdef.c               |  18 +++
 io_uring/opdef.h               |   2 +
 io_uring/spawn.c               | 195 +++++++++++++++++++++++++++++++++
 io_uring/spawn.h               |  13 +++
 kernel/fork.c                  |  21 ++++
 13 files changed, 279 insertions(+), 21 deletions(-)
 create mode 100644 io_uring/spawn.c
 create mode 100644 io_uring/spawn.h

Comments

Josh Triplett Dec. 10, 2024, 9:10 p.m. UTC | #1
On Mon, Dec 09, 2024 at 06:43:02PM -0500, Gabriel Krisman Bertazi wrote:
> During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
> fork and exec new processes through io_uring [1].  The goal, according
> to him, was to have a very efficient mechanism to quickly execute tasks,
> eliminating the multiple roundtrips to userspace required to fork,
> perform multiple $PATH lookup and finally execve.  In addition, he
> mentioned this would allow for a more simple implementation of
> preparatory tasks, such as file redirection configuration, and handling
> of stuff like posix_spawn_file_actions_t.
> 
> This RFC revives his original patchset.  I fixed all the pending issues
> I found with task submission, including the issue blocking the work at
> the time, a kernel corruption after a few spawns, converted the execve
> command into execveat* variant, cleaned up the code and surely
> introduced a few bugs of my own along the way.  At this point, I made it
> an RFC because I have a few outstanding questions about the design, in
> particular whether the CLONE context would be better implemented as a
> special io-wq case to avoid the exposure of io_issue_sqe and
> duplication of the dispatching logic.

Thank you for updating and debugging this! Much appreciated.
Pavel Begunkov Dec. 11, 2024, 2:02 p.m. UTC | #2
On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
> fork and exec new processes through io_uring [1].  The goal, according
> to him, was to have a very efficient mechanism to quickly execute tasks,
> eliminating the multiple roundtrips to userspace required to fork,
> perform multiple $PATH lookup and finally execve.  In addition, he
> mentioned this would allow for a more simple implementation of
> preparatory tasks, such as file redirection configuration, and handling
> of stuff like posix_spawn_file_actions_t.
> 
> This RFC revives his original patchset.  I fixed all the pending issues
> I found with task submission, including the issue blocking the work at
> the time, a kernel corruption after a few spawns, converted the execve
> command into execveat* variant, cleaned up the code and surely
> introduced a few bugs of my own along the way.  At this point, I made it
> an RFC because I have a few outstanding questions about the design, in
> particular whether the CLONE context would be better implemented as a
> special io-wq case to avoid the exposure of io_issue_sqe and
> duplication of the dispatching logic.
> 
> I'm also providing the liburing support in a separate patchset,
> including a testcase that exemplifies the $PATH lookup mechanism
> proposed by Josh.

Sorry to say but the series is rather concerning.

1) It creates a special path that tries to mimick the core
path, but not without a bunch of troubles and in quite a
special way.

2) There would be a special set of ops that can only be run
from that special path.

3) And I don't believe that path can ever be allowed to run
anything but these ops from (2) and maybe a very limited subset
of normal ops like nop requests but no read/write/send/etc. (?)

4) And it all requires links, which already a bad sign for
a bunch of reasons.

At this point it raises a question why it even needs io_uring
infra? I don't think it's really helping you. E.g. why not do it
as a list of operation in a custom format instead of links? That
can be run by a single io_uring request or can even be a normal
syscall.

struct clone_op ops = { { CLONE },
         { SET_CRED, cred_id }, ...,
         { EXEC, path }};

Makes me wonder about a different ways of handling. E.g. why should
it be run in the created task context (apart from final exec)? Can
requests be run as normal by the original task, each will take the
half created and not yet launched task as a parameter (in some form),
modify it, and the final exec would launch it?
Josh Triplett Dec. 11, 2024, 5:34 p.m. UTC | #3
On Wed, Dec 11, 2024 at 02:02:14PM +0000, Pavel Begunkov wrote:
> 1) It creates a special path that tries to mimick the core
> path, but not without a bunch of troubles and in quite a
> special way.
> 
> 2) There would be a special set of ops that can only be run
> from that special path.

The goal would be for the exec op to work just fine from the normal
path, too, for processes that want to do the equivalent of "do several
syscalls then exec to replace myself", rather than doing a fork/exec.
The current implementation defers supporting exec on a non-clone ring,
but I'd expect that limitation to be lifted in the future.

> 3) And I don't believe that path can ever be allowed to run
> anything but these ops from (2) and maybe a very limited subset
> of normal ops like nop requests but no read/write/send/etc. (?)

I would ideally expect it to be able to run almost *any* op, in the
context of the new process: write, send, open, accept, connect,
unlinkat, FIXED_FD_INSTALL, ring messaging, ...

> 4) And it all requires links, which already a bad sign for
> a bunch of reasons.

In theory you don't *have* to have everything linked for a batch of
operations like this, as long as it's clear what to run in the new task.

> At this point it raises a question why it even needs io_uring
> infra? I don't think it's really helping you. E.g. why not do it
> as a list of operation in a custom format instead of links?

Because, as mentioned above, the intention *is* to support almost any
io_uring operation, not just a tiny subset.
Gabriel Krisman Bertazi Dec. 13, 2024, 8:13 p.m. UTC | #4
Hi Pavel,

Pavel Begunkov <asml.silence@gmail.com> writes:
> On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:

> Sorry to say but the series is rather concerning.
>
> 1) It creates a special path that tries to mimick the core
> path, but not without a bunch of troubles and in quite a
> special way.

I fully agree this is one of the main problem with the series.  I'm
interested in how we can merge this implementation into the existing
io_uring paths.  My idea, which I hinted in the cover letter, is to have
a flavor of io-wq that executes one linked sequence and then terminates.
When a work is queued there, the newly spawned worker thread will live
only until the end of that link.  This wq is only used to execute the
link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
determine how it is created.  This allows the user to create a detached
file descriptor table in the worker thread, for instance.

It'd allows us to reuse the dispatching infrastructure of io-wq, hide
io_uring internals from the OP_CLONE implementation, and
enable, if I understand correctly, the workarounds to execute
task_works.  We'd need to ensure nothing from the link gets
executed outside of this context.

> 2) There would be a special set of ops that can only be run
> from that special path.

There are problems with cancellations and timeouts, that I'd expect to
be more solvable when reusing the io-wq code.  But this task is
executing from a cloned context, so we have a copy of the parent
context, and share the same memory map.  It should be safe to do IO to
open file descriptors, wake futexes and pretty much anything that
doesn't touch io_uring itself.  There are oddities, like the fact the fd
table is split from the parent task while the io_uring direct
descriptors are shared.  That needs to be handled with more sane
semantics.

> At this point it raises a question why it even needs io_uring
> infra? I don't think it's really helping you. E.g. why not do it
> as a list of operation in a custom format instead of links? That
> can be run by a single io_uring request or can even be a normal
> syscall.
>
> struct clone_op ops = { { CLONE },
>         { SET_CRED, cred_id }, ...,
>         { EXEC, path }};
>
>
> Makes me wonder about a different ways of handling. E.g. why should
> it be run in the created task context (apart from final exec)? Can
> requests be run as normal by the original task, each will take the
> half created and not yet launched task as a parameter (in some form),
> modify it, and the final exec would launch it?

A single operation would be a very complex operation doing many things
at once , and much less flexible.  This approach is flexible: you
can combine any (in theory) io_uring operation to obtain the desired
behavior.
Pavel Begunkov Dec. 17, 2024, 4:10 p.m. UTC | #5
On 12/13/24 20:13, Gabriel Krisman Bertazi wrote:
> 
> Hi Pavel,
> 
> Pavel Begunkov <asml.silence@gmail.com> writes:
>> On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> 
>> Sorry to say but the series is rather concerning.
>>
>> 1) It creates a special path that tries to mimick the core
>> path, but not without a bunch of troubles and in quite a
>> special way.
> 
> I fully agree this is one of the main problem with the series.  I'm
> interested in how we can merge this implementation into the existing
> io_uring paths.  My idea, which I hinted in the cover letter, is to have
> a flavor of io-wq that executes one linked sequence and then terminates.
> When a work is queued there, the newly spawned worker thread will live
> only until the end of that link.  This wq is only used to execute the
> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
> determine how it is created.  This allows the user to create a detached
> file descriptor table in the worker thread, for instance.
> 
> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
> io_uring internals from the OP_CLONE implementation, and
> enable, if I understand correctly, the workarounds to execute
> task_works.  We'd need to ensure nothing from the link gets
> executed outside of this context.

One problem with io-wq is that it's not guaranteed that it's able to
serve all types of requests. Though it's limited to multishots atm,
which you might not need, but the situation might change. And there
is no guarantee that the request is completed by the time it returns
from ->issue(), it might even change hands from inside the callback
via task_work or by any other mean.

It also sounds like you want the cloned task to be a normal
io_uring submmiter in terms of infra even though it can't
initiate a syscall, which also sounds a bit like an SQPOLL task.

And do we really need to execute everything from the new task
context, or ops can take a task as an argument and run whenever
while final exec could be special cased inside the callback?


>> 2) There would be a special set of ops that can only be run
>> from that special path.
> 
> There are problems with cancellations and timeouts, that I'd expect to
> be more solvable when reusing the io-wq code.  But this task is
> executing from a cloned context, so we have a copy of the parent
> context, and share the same memory map.  It should be safe to do IO to
> open file descriptors, wake futexes and pretty much anything that
> doesn't touch io_uring itself.  There are oddities, like the fact the fd
> table is split from the parent task while the io_uring direct
> descriptors are shared.  That needs to be handled with more sane
> semantics.
> 
>> At this point it raises a question why it even needs io_uring
>> infra? I don't think it's really helping you. E.g. why not do it
>> as a list of operation in a custom format instead of links? That
>> can be run by a single io_uring request or can even be a normal
>> syscall.
>>
>> struct clone_op ops = { { CLONE },
>>          { SET_CRED, cred_id }, ...,
>>          { EXEC, path }};
>>
>>
>> Makes me wonder about a different ways of handling. E.g. why should
>> it be run in the created task context (apart from final exec)? Can
>> requests be run as normal by the original task, each will take the
>> half created and not yet launched task as a parameter (in some form),
>> modify it, and the final exec would launch it?
> 
> A single operation would be a very complex operation doing many things
> at once , and much less flexible.  This approach is flexible: you
> can combine any (in theory) io_uring operation to obtain the desired
> behavior.

Ok. And links are not flexible enough for it either. Think of
error handling, passing results from one request to another and
more complex relations. Unless chains are supposed to be very
short and simple, it'd need to be able to return back to user
space (the one issuing requests) for error handling.
Gabriel Krisman Bertazi Dec. 30, 2024, 11:38 p.m. UTC | #6
Pavel Begunkov <asml.silence@gmail.com> writes:

Hi Pavel,

Sorry for the delay. I took the chance to stay offline for a while on
the christmas week.

>> I fully agree this is one of the main problem with the series.  I'm
>> interested in how we can merge this implementation into the existing
>> io_uring paths.  My idea, which I hinted in the cover letter, is to have
>> a flavor of io-wq that executes one linked sequence and then terminates.
>> When a work is queued there, the newly spawned worker thread will live
>> only until the end of that link.  This wq is only used to execute the
>> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
>> determine how it is created.  This allows the user to create a detached
>> file descriptor table in the worker thread, for instance.
>> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
>> io_uring internals from the OP_CLONE implementation, and
>> enable, if I understand correctly, the workarounds to execute
>> task_works.  We'd need to ensure nothing from the link gets
>> executed outside of this context.
>
> One problem with io-wq is that it's not guaranteed that it's able to
> serve all types of requests. Though it's limited to multishots atm,
> which you might not need, but the situation might change. And there
> is no guarantee that the request is completed by the time it returns
> from ->issue(), it might even change hands from inside the callback
> via task_work or by any other mean.

Multishot is the least of my concerns for this feature, tbh.  I don't
see how it could be useful in the context of spawning a new thread, so
in terms of finding sane semantics, we could just reject them at
submission time if linked from a CLONE.
>
> It also sounds like you want the cloned task to be a normal
> io_uring submmiter in terms of infra even though it can't
> initiate a syscall, which also sounds a bit like an SQPOLL task.
>
> And do we really need to execute everything from the new task
> context, or ops can take a task as an argument and run whenever
> while final exec could be special cased inside the callback?

Wouldn't this be similar to the original design of the io-wq/sqpoll, which
attempted to impersonate the submitter task and resulted in some issues?
Executing directly from the new task is much simpler than trying to do the
operations on the context of another thread.

>>> requests be run as normal by the original task, each will take the
>>> half created and not yet launched task as a parameter (in some form),
>>> modify it, and the final exec would launch it?
>> A single operation would be a very complex operation doing many things
>> at once , and much less flexible.  This approach is flexible: you
>> can combine any (in theory) io_uring operation to obtain the desired
>> behavior.
>
> Ok. And links are not flexible enough for it either. Think of
> error handling, passing results from one request to another and
> more complex relations. Unless chains are supposed to be very
> short and simple, it'd need to be able to return back to user
> space (the one issuing requests) for error handling.

We are posting the completions to the submitter ring.  If a request
fails, we kill the context, but the user is notified of what operation
failed and need to resubmit the entire link with a new spawn.

We could avoid links by letting the spawned task linger, and provide a
way for the user to submit more operations to be executed by this
specific context.  The new task exists until an op to kill the worker is
issued or when the execve command executes.  This would allow the user
to keep multiple partially initialized contexts around for quick
userspace thread dispatching.  we could provide a mechanism to clone
these pre-initialized tasks.

Perhaps it is a stupid idea, but not new - i've seen it discussed
before: I'm thinking of a workload like a thread scheduler in userspace
or a network server.  It could keep a partially initialized worker
thread that never and, when needed, the main task would duplicate it with
another uring command and make the copy return to userspace, mitigating
the cost of copying and reinitializing the task.
Pavel Begunkov Dec. 31, 2024, 2:35 p.m. UTC | #7
On 12/30/24 23:38, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
> 
> Hi Pavel,
> 
> Sorry for the delay. I took the chance to stay offline for a while on
> the christmas week.

No worries at all

>>> I fully agree this is one of the main problem with the series.  I'm
>>> interested in how we can merge this implementation into the existing
>>> io_uring paths.  My idea, which I hinted in the cover letter, is to have
>>> a flavor of io-wq that executes one linked sequence and then terminates.
>>> When a work is queued there, the newly spawned worker thread will live
>>> only until the end of that link.  This wq is only used to execute the
>>> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
>>> determine how it is created.  This allows the user to create a detached
>>> file descriptor table in the worker thread, for instance.
>>> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
>>> io_uring internals from the OP_CLONE implementation, and
>>> enable, if I understand correctly, the workarounds to execute
>>> task_works.  We'd need to ensure nothing from the link gets
>>> executed outside of this context.
>>
>> One problem with io-wq is that it's not guaranteed that it's able to
>> serve all types of requests. Though it's limited to multishots atm,
>> which you might not need, but the situation might change. And there
>> is no guarantee that the request is completed by the time it returns
>> from ->issue(), it might even change hands from inside the callback
>> via task_work or by any other mean.
> 
> Multishot is the least of my concerns for this feature, tbh.  I don't
> see how it could be useful in the context of spawning a new thread, so
> in terms of finding sane semantics, we could just reject them at
> submission time if linked from a CLONE.

Right, for multishot that is, but the point here is that io-wq
is an internal detail, and it might get to a point where relying
solely on iowq way of execution would fence you off operations
that you care about. Regardless, it would need to follow common
rules like request refcounting and lifetime

>> It also sounds like you want the cloned task to be a normal
>> io_uring submmiter in terms of infra even though it can't
>> initiate a syscall, which also sounds a bit like an SQPOLL task.
>>
>> And do we really need to execute everything from the new task
>> context, or ops can take a task as an argument and run whenever
>> while final exec could be special cased inside the callback?
> 
> Wouldn't this be similar to the original design of the io-wq/sqpoll, which
> attempted to impersonate the submitter task and resulted in some issues?

The problem there was that it was a kthread, which are not
prepared to run what is basically a random syscall path, and
no amount of whack-a-mole'ing with likes of kthread_use_mm()
was ever enough.

That's the same reason why I'm saying that unless the the new
task is completely initialised and has all resources as far as
the kernel execution goes, there would be no way to run a random
io_uring request from it. "Initialised" would mean here having
->mm and all other task resources to anything valid.

> Executing directly from the new task is much simpler than trying to do the
> operations on the context of another thread.

I agree, if that's about executing e.g. a read request, but if
I'd doubt it if it's about setting a ns to a new not yet running
task.

>>>> requests be run as normal by the original task, each will take the
>>>> half created and not yet launched task as a parameter (in some form),
>>>> modify it, and the final exec would launch it?
>>> A single operation would be a very complex operation doing many things
>>> at once , and much less flexible.  This approach is flexible: you
>>> can combine any (in theory) io_uring operation to obtain the desired
>>> behavior.
>>
>> Ok. And links are not flexible enough for it either. Think of
>> error handling, passing results from one request to another and
>> more complex relations. Unless chains are supposed to be very
>> short and simple, it'd need to be able to return back to user
>> space (the one issuing requests) for error handling.
> 
> We are posting the completions to the submitter ring.  If a request
> fails, we kill the context, but the user is notified of what operation
> failed and need to resubmit the entire link with a new spawn.

That's fine in case of a short simple chain, but there is a
talk of random requests. Let's say a read op fails and you
need to redo it, that's not only expensive but might even be
impossible. E.g. it already consumed data from a pipe / socket.

But the biggest problem is flexibility, any non-trivial relation
between request would need some user processing. Take Josh's
example of linking 2+ exec requests, which instead of relying
on some special status of exec requests can be well implemented
by returning the control back to user, and that would be much
more flexible at that.

> We could avoid links by letting the spawned task linger, and provide a
> way for the user to submit more operations to be executed by this
> specific context.  The new task exists until an op to kill the worker is
> issued or when the execve command executes.  This would allow the user
> to keep multiple partially initialized contexts around for quick
> userspace thread dispatching.  we could provide a mechanism to clone
> these pre-initialized tasks.
> 
> Perhaps it is a stupid idea, but not new - i've seen it discussed
> before: I'm thinking of a workload like a thread scheduler in userspace
> or a network server.  It could keep a partially initialized worker
> thread that never and, when needed, the main task would duplicate it with
> another uring command and make the copy return to userspace, mitigating
> the cost of copying and reinitializing the task.
>