[PATCHSET,v3,0/4] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK

Message ID 20201214191323.173773-1-axboe@kernel.dk

Jens Axboe Dec. 14, 2020, 7:13 p.m. UTC
Hi,

Wanted to throw out what the current state of this is, as we keep
getting closer to something palatable.

This time I've included the io_uring change too. I've tested this through
both openat2, and through io_uring as well.

I'm pretty happy with this at this point. The core change is very simple,
and the users end up being trivial too.

Also available here:

https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup

Changes since v2:

- Simplify the LOOKUP_NONBLOCK to _just_ cover LOOKUP_RCU and
  lookup_fast(), as per Linus's suggestion. I verified fast path
  operations are indeed just that, and it avoids having to mess with
  the inode lock and mnt_want_write() completely.

- Make O_TMPFILE return -EAGAIN for LOOKUP_NONBLOCK.

- Improve/add a few comments.

- Folded what was left of the open part of LOOKUP_NONBLOCK into the
  main patch.

Comments

Linus Torvalds Dec. 15, 2020, 3:06 a.m. UTC | #1
On Mon, Dec 14, 2020 at 11:13 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I'm pretty happy with this at this point. The core change is very simple,
> and the users end up being trivial too.

It does look very simple.

It strikes me that io_statx would be another user of this. But it
obviously depends on what the actual io_uring users do..

             Linus
Jens Axboe Dec. 15, 2020, 3:18 a.m. UTC | #2
On 12/14/20 8:06 PM, Linus Torvalds wrote:
> On Mon, Dec 14, 2020 at 11:13 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I'm pretty happy with this at this point. The core change is very simple,
>> and the users end up being trivial too.
> 
> It does look very simple.

Agree, hard to imagine it being much simpler than this.

> It strikes me that io_statx would be another user of this. But it
> obviously depends on what the actual io_uring users do..

Indeed, I did mention that in my email from the other thread earlier
today, and I think it's potentially an even bigger deal than nonblock
open. Lots of workloads are very stat intensive. So I did do that:

https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup

it's the top three patches there. Basically all local file systems are
fine with AT_STATX_NONBLOCK just mapping to LOOKUP_NONBLOCK, while
various (mostly networked) file systems like to do revalidation and
other kinds of communication as part of getattr. Hence the second patch
there is needed to add some:

if (query_flags & AT_STATX_NONBLOCK)
	return -EAGAIN;

constructs in there to avoid it in the nonblock path. Outside of that,
it's very trivial and just works. I didn't include it in this posting to
avoid detracting from the core work, but I definitely think we should be
doing that as well.
Al Viro Dec. 15, 2020, 6:11 a.m. UTC | #3
On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> Hi,
> 
> Wanted to throw out what the current state of this is, as we keep
> getting closer to something palatable.
> 
> This time I've included the io_uring change too. I've tested this through
> both openat2, and through io_uring as well.
> 
> I'm pretty happy with this at this point. The core change is very simple,
> and the users end up being trivial too.

I'll review tomorrow morning (sorry, half-asleep right now)
Jens Axboe Dec. 15, 2020, 3:29 p.m. UTC | #4
On 12/14/20 11:11 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Wanted to throw out what the current state of this is, as we keep
>> getting closer to something palatable.
>>
>> This time I've included the io_uring change too. I've tested this through
>> both openat2, and through io_uring as well.
>>
>> I'm pretty happy with this at this point. The core change is very simple,
>> and the users end up being trivial too.
> 
> I'll review tomorrow morning (sorry, half-asleep right now)

Thanks Al, appreciate it.
Al Viro Jan. 4, 2021, 5:31 a.m. UTC | #5
On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> Hi,
> 
> Wanted to throw out what the current state of this is, as we keep
> getting closer to something palatable.
> 
> This time I've included the io_uring change too. I've tested this through
> both openat2, and through io_uring as well.
> 
> I'm pretty happy with this at this point. The core change is very simple,
> and the users end up being trivial too.
> 
> Also available here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup

OK, pushed with modifications into vfs.git #work.namei

Changes: dropped a couple of pointless pieces in open_last_lookups()/do_open(),
moved O_TMPFILE rejection into build_open_flags() (i.e. in the third of your
commits).  And no io_uring stuff in there - your #4 is absent.

I've not put it into #for-next yet; yell if you see any problems with that
branch, or it'll end up there ;-)
Jens Axboe Jan. 4, 2021, 2:43 p.m. UTC | #6
On 1/3/21 10:31 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Wanted to throw out what the current state of this is, as we keep
>> getting closer to something palatable.
>>
>> This time I've included the io_uring change too. I've tested this through
>> both openat2, and through io_uring as well.
>>
>> I'm pretty happy with this at this point. The core change is very simple,
>> and the users end up being trivial too.
>>
>> Also available here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup
> 
> OK, pushed with modifications into vfs.git #work.namei
> 
> Changes: dropped a couple of pointless pieces in open_last_lookups()/do_open(),
> moved O_TMPFILE rejection into build_open_flags() (i.e. in the third of your
> commits).  And no io_uring stuff in there - your #4 is absent.
> 
> I've not put it into #for-next yet; yell if you see any problems with that
> branch, or it'll end up there ;-)

Thanks Al - but you picked out of v3, not v4. Not that there are huge
changes between the two, from the posting of v4:

- Rename LOOKUP_NONBLOCK -> LOOKUP_CACHED, and ditto for the RESOLVE_
  flag. This better explains what the feature does, making it more
  self-explanatory in terms of both code readability and the user-visible
  part.

- Remove dead LOOKUP_NONBLOCK check after we've dropped LOOKUP_RCU
  already, spotted by Al.

- Add O_TMPFILE to the checks upfront, so we can drop the checking in
  do_tmpfile().

and it sounds like you did the last two when merging yourself. I do like
LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the externally
self-documenting feature of it. What do you think?

Here's the v4 posting, fwiw:

https://lore.kernel.org/linux-fsdevel/20201217161911.743222-1-axboe@kernel.dk/
Al Viro Jan. 4, 2021, 4:54 p.m. UTC | #7
On Mon, Jan 04, 2021 at 07:43:17AM -0700, Jens Axboe wrote:

> > I've not put it into #for-next yet; yell if you see any problems with that
> > branch, or it'll end up there ;-)
> 
> Thanks Al - but you picked out of v3, not v4. Not that there are huge
> changes between the two, from the posting of v4:
> 
> - Rename LOOKUP_NONBLOCK -> LOOKUP_CACHED, and ditto for the RESOLVE_
>   flag. This better explains what the feature does, making it more self
>   explanatory in terms of both code readability and for the user visible
>   part.
> 
> - Remove dead LOOKUP_NONBLOCK check after we've dropped LOOKUP_RCU
>   already, spotted by Al.
> 
> - Add O_TMPFILE to the checks upfront, so we can drop the checking in
>   do_tmpfile().
> 
> and it sounds like you did the last two when merging yourself.

Yes - back when I'd posted that review.

> I do like
> LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the externally
> self-documenting feature of it. What do you think?

Agreed, especially since _NONBLOCK would confuse users into assuming
that the operation is actually non-blocking...

> Here's the v4 posting, fwiw:
> 
> https://lore.kernel.org/linux-fsdevel/20201217161911.743222-1-axboe@kernel.dk/

Sorry, picked from the local branch that sat around since mid-December ;-/
Fixed.  Another change: ..._child part in unlazy_child() is misleading -
it might as well be used for .. traversal, where dentry is usually the
_parent_ of nd->path.dentry.  The real constraint here is that dentry/seq pair
had been valid next position at some point during the RCU walk.  Renamed to
try_to_unlazy_next(), (hopefully) fixed the comment...

Updated variant force-pushed.
Jens Axboe Jan. 4, 2021, 5:03 p.m. UTC | #8
On 1/4/21 9:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2021 at 07:43:17AM -0700, Jens Axboe wrote:
> 
>>> I've not put it into #for-next yet; yell if you see any problems with that
>>> branch, or it'll end up there ;-)
>>
>> Thanks Al - but you picked out of v3, not v4. Not that there are huge
>> changes between the two, from the posting of v4:
>>
>> - Rename LOOKUP_NONBLOCK -> LOOKUP_CACHED, and ditto for the RESOLVE_
>>   flag. This better explains what the feature does, making it more self
>>   explanatory in terms of both code readability and for the user visible
>>   part.
>>
>> - Remove dead LOOKUP_NONBLOCK check after we've dropped LOOKUP_RCU
>>   already, spotted by Al.
>>
>> - Add O_TMPFILE to the checks upfront, so we can drop the checking in
>>   do_tmpfile().
>>
>> and it sounds like you did the last two when merging yourself.
> 
> Yes - back when I'd posted that review.

Gotcha

>> I do like
>> LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the externally
>> self-documenting feature of it. What do you think?
> 
> Agreed, especially since _NONBLOCK would confuse users into assumption
> that operation is actually non-blocking...
> 
>> Here's the v4 posting, fwiw:
>>
>> https://lore.kernel.org/linux-fsdevel/20201217161911.743222-1-axboe@kernel.dk/
> 
> Sorry, picked from the local branch that sat around since Mid-December ;-/
> Fixed.  Another change: ..._child part in unlazy_child() is misleading -
> it might as well be used for .. traversal, where dentry is usually the
> _parent_ of nd->path.dentry.  The real constraint here is that dentry/seq pair
> had been valid next position at some point during the RCU walk.  Renamed to
> try_to_unlazy_next(), (hopefully) fixed the comment...
> 
> Updated variant force-pushed.

All good, looks good to me from a quick look. A diff between the two
bases seems sane. I'll pull it in and rebase on top of that, and re-run
the testing just to make sure that no ill effects snuck in there with
the changes.
Jens Axboe Feb. 14, 2021, 4:38 p.m. UTC | #9
On 2/13/21 3:08 PM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Hi,
>>
>> Wanted to throw out what the current state of this is, as we keep
>> getting closer to something palatable.
>>
>> This time I've included the io_uring change too. I've tested this through
>> both openat2, and through io_uring as well.
>>
>> I'm pretty happy with this at this point. The core change is very simple,
>> and the users end up being trivial too.
>>
>> Also available here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup
>>
>> Changes since v2:
>>
>> - Simplify the LOOKUP_NONBLOCK to _just_ cover LOOKUP_RCU and
>>   lookup_fast(), as per Linus's suggestion. I verified fast path
>>   operations are indeed just that, and it avoids having to mess with
>>   the inode lock and mnt_want_write() completely.
>>
>> - Make O_TMPFILE return -EAGAIN for LOOKUP_NONBLOCK.
>>
>> - Improve/add a few comments.
>>
>> - Folded what was left of the open part of LOOKUP_NONBLOCK into the
>>   main patch.
> 
> I have to ask.  Are you doing work to verify that performing
> path traversals and opening of files yields the same results
> when passed to io_uring as it does when performed by the caller?
> 
> Looking at v5.11-rc7 it appears I can arbitrarily access the
> io-wq thread in proc by traversing "/proc/thread-self/".

Yes that looks like a bug, it needs similar treatment to /proc/self:

diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index a553273fbd41..e8ca19371a36 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -17,6 +17,13 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 	pid_t pid = task_pid_nr_ns(current, ns);
 	char *name;
 
+	/*
+	 * Not currently supported. Once we can inherit all of struct pid,
+	 * we can allow this.
+	 */
+	if (current->flags & PF_KTHREAD)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (!pid)
 		return ERR_PTR(-ENOENT);
 	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);

as was done in this commit:

commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Nov 13 16:47:52 2020 -0700

    proc: don't allow async path resolution of /proc/self components

> Similarly it looks like opening of "/dev/tty" fails to
> return the tty of the caller but instead fails because
> io-wq threads don't have a tty.
> 
> 
> There are a non-trivial number of places where current is used in the
> kernel both in path traversal and in opening files, and I may be blind
> but I have not see any work to find all of those places and make certain
> they are safe when called from io_uring.  That worries me as that using
> a kernel thread instead of a user thread could potential lead to
> privilege escalation.

I've got a patch queued up for 5.12 that clears ->fs and ->files for the
thread if not explicitly inherited, and I'm working on similarly
proactively catching these cases that could potentially be problematic.
Linus Torvalds Feb. 14, 2021, 8:30 p.m. UTC | #10
On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> > Similarly it looks like opening of "/dev/tty" fails to
> > return the tty of the caller but instead fails because
> > io-wq threads don't have a tty.
>
> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
> thread if not explicitly inherited, and I'm working on similarly
> proactively catching these cases that could potentially be problematic.

Well, the /dev/tty case still needs fixing somehow.

Opening /dev/tty actually depends on current->signal, and if it is
NULL it will fall back on the first VT console instead (I think).

I wonder if it should do the same thing /proc/self does..

           Linus
Al Viro Feb. 14, 2021, 9:24 p.m. UTC | #11
On Sun, Feb 14, 2021 at 12:30:07PM -0800, Linus Torvalds wrote:
> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > > Similarly it looks like opening of "/dev/tty" fails to
> > > return the tty of the caller but instead fails because
> > > io-wq threads don't have a tty.
> >
> > I've got a patch queued up for 5.12 that clears ->fs and ->files for the
> > thread if not explicitly inherited, and I'm working on similarly
> > proactively catching these cases that could potentially be problematic.
> 
> Well, the /dev/tty case still needs fixing somehow.
> 
> Opening /dev/tty actually depends on current->signal, and if it is
> NULL it will fall back on the first VT console instead (I think).
> 
> I wonder if it should do the same thing /proc/self does..

	I still think that entire "offload pathname resolution to helper
threads" thing is bollocks.  ->fs, ->files and ->signal are still nowhere
near enough - look at /proc/net, for example.  Or netns-sensitive parts
of sysfs, for that matter...  And that's not going into the really weird
crap where the opener is very special and assumed to be doing all IO.
Eric W. Biederman Feb. 15, 2021, 5:56 p.m. UTC | #12
Jens Axboe <axboe@kernel.dk> writes:

> On 2/13/21 3:08 PM, Eric W. Biederman wrote:
>> 
>> I have to ask.  Are you doing work to verify that performing
>> path traversals and opening of files yields the same results
>> when passed to io_uring as it does when performed by the caller?
>> 
>> Looking at v5.11-rc7 it appears I can arbitrarily access the
>> io-wq thread in proc by traversing "/proc/thread-self/".
>
> Yes that looks like a bug, it needs similar treatment to /proc/self:
Agreed.


> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index a553273fbd41..e8ca19371a36 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -17,6 +17,13 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
>  	pid_t pid = task_pid_nr_ns(current, ns);
>  	char *name;
>  
> +	/*
> +	 * Not currently supported. Once we can inherit all of struct pid,
> +	 * we can allow this.
> +	 */
> +	if (current->flags & PF_KTHREAD)
> +		return ERR_PTR(-EOPNOTSUPP);

I suspect simply testing for PF_IO_WORKER is a better test.  It is
the delegation to the io_worker, not the fact that it is a kernel thread,
that causes the problem.

I have a memory of that point being made when the smack_privileged and
tomoyo_kernel_service tests, and how to fix them, were being discussed.

>  	if (!pid)
>  		return ERR_PTR(-ENOENT);
>  	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
>
> as was done in this commit:
>
> commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Fri Nov 13 16:47:52 2020 -0700
>
>     proc: don't allow async path resolution of /proc/self components

I suspect that would be better as PF_IO_WORKER as well.

>> Similarly it looks like opening of "/dev/tty" fails to
>> return the tty of the caller but instead fails because
>> io-wq threads don't have a tty.
>> 
>> 
>> There are a non-trivial number of places where current is used in the
>> kernel both in path traversal and in opening files, and I may be blind
>> but I have not seen any work to find all of those places and make certain
>> they are safe when called from io_uring.  That worries me, as using
>> a kernel thread instead of a user thread could potentially lead to
>> privilege escalation.
>
> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
> thread if not explicitly inherited, and I'm working on similarly
> proactively catching these cases that could potentially be
> problematic.

Any pointers or is this all in a private tree for now?

It is difficult to follow because many of the fixes have not CC'd the
reporters or even the maintainers of the subsystems who have been
affected by this kind of issue.

Eric
Eric W. Biederman Feb. 15, 2021, 6:07 p.m. UTC | #13
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> > Similarly it looks like opening of "/dev/tty" fails to
>> > return the tty of the caller but instead fails because
>> > io-wq threads don't have a tty.
>>
>> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
>> thread if not explicitly inherited, and I'm working on similarly
>> proactively catching these cases that could potentially be problematic.
>
> Well, the /dev/tty case still needs fixing somehow.
>
> Opening /dev/tty actually depends on current->signal, and if it is
> NULL it will fall back on the first VT console instead (I think).
>
> I wonder if it should do the same thing /proc/self does..

Would there be any downside of making the io-wq kernel threads be per
process instead of per user?

I can see a lower probability of a thread already existing.  Are there
other downsides I am missing?

The upside would be that all of the issues of have we copied enough
should go away, as the io-wq thread would then behave like another user
space thread.  To handle posix setresuid() and friends it looks like
current_cred would need to be copied but I can't think of anything else.

Right I am with Al and I don't have any idea how many special cases we
need to play whack-a-mole with.

Eric
Jens Axboe Feb. 15, 2021, 6:24 p.m. UTC | #14
On 2/15/21 11:07 AM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
>> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> Similarly it looks like opening of "/dev/tty" fails to
>>>> return the tty of the caller but instead fails because
>>>> io-wq threads don't have a tty.
>>>
>>> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
>>> thread if not explicitly inherited, and I'm working on similarly
>>> proactively catching these cases that could potentially be problematic.
>>
>> Well, the /dev/tty case still needs fixing somehow.
>>
>> Opening /dev/tty actually depends on current->signal, and if it is
>> NULL it will fall back on the first VT console instead (I think).
>>
>> I wonder if it should do the same thing /proc/self does..
> 
> Would there be any downside of making the io-wq kernel threads be per
> process instead of per user?
> 
> I can see a lower probability of a thread already existing.  Are there
> other downsides I am missing?
> 
> The upside would be that all of the issues of have we copied enough
> should go away, as the io-wq thread would then behave like another user
> space thread.  To handle posix setresuid() and friends it looks like
> current_cred would need to be copied but I can't think of anything else.

I really like that idea. Do we currently have a way of creating a thread
internally, akin to what would happen if the same task did pthread_create?
That'd ensure that we have everything we need, without actively needing to
map the request types, or find future issues of "we also need this bit".
It'd work fine for the 'need new worker' case too, if one goes to sleep.
We'd just 'fork' off that child.

Would require some restructuring of io-wq, but at the end of it, it'd
be a simpler solution.
Jens Axboe Feb. 15, 2021, 9:09 p.m. UTC | #15
On 2/15/21 11:24 AM, Jens Axboe wrote:
> On 2/15/21 11:07 AM, Eric W. Biederman wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>
>>> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>>> Similarly it looks like opening of "/dev/tty" fails to
>>>>> return the tty of the caller but instead fails because
>>>>> io-wq threads don't have a tty.
>>>>
>>>> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
>>>> thread if not explicitly inherited, and I'm working on similarly
>>>> proactively catching these cases that could potentially be problematic.
>>>
>>> Well, the /dev/tty case still needs fixing somehow.
>>>
>>> Opening /dev/tty actually depends on current->signal, and if it is
>>> NULL it will fall back on the first VT console instead (I think).
>>>
>>> I wonder if it should do the same thing /proc/self does..
>>
>> Would there be any downside of making the io-wq kernel threads be per
>> process instead of per user?
>>
>> I can see a lower probability of a thread already existing.  Are there
>> other downsides I am missing?
>>
>> The upside would be that all of the issues of have we copied enough
>> should go away, as the io-wq thread would then behave like another user
>> space thread.  To handle posix setresuid() and friends it looks like
>> current_cred would need to be copied but I can't think of anything else.
> 
> I really like that idea. Do we currently have a way of creating a thread
> internally, akin to what would happen if the same task did pthread_create?
> That'd ensure that we have everything we need, without actively needing to
> map the request types, or find future issues of "we also need this bit".
> It'd work fine for the 'need new worker' case too, if one goes to sleep.
> We'd just 'fork' off that child.
> 
> Would require some restructuring of io-wq, but at the end of it, it'd
> be a simpler solution.

I was intrigued enough that I tried to wire this up. If we can pull this
off, then it would take a great weight off my shoulders as there would
be no more worries about identity.

Here's a branch that's got a set of patches that actually work, though
it's a bit of a hack in spots. Notes:

- Forked worker initially crashed, since it's an actual user thread and
  bombed on deref of kernel structures. Expectedly. That's what the
  horrible kernel_clone_args->io_wq hack is working around for now.
  Obviously not the final solution, but helped move things along so
  I could actually test this.

- Shared io-wq helpers need indexing for task, right now this isn't
  done. But that's not hard to do.

- Idle thread reaping isn't done yet, so they persist until the
  context goes away.

- task_work fallback needs a bit of love. Currently we fallback to
  the io-wq manager thread for handling that, but a) manager is gone,
  and b) the new workers are now threads and go away as well when
  the original task goes away. None of the three fallback sites need
  task context, so likely solution here is just punt it to system_wq.
  Not the hot path, obviously, we're exiting.

- Personality registration is broken, it's just Good Enough to compile.

Probably a few more items that escape me right now. As long as you
don't hit the fallback cases, it appears to work fine for me. And
the diffstat is pretty good too:

 fs/io-wq.c                 | 418 +++++++++++--------------------------
 fs/io-wq.h                 |  10 +-
 fs/io_uring.c              | 314 +++-------------------------
 fs/proc/self.c             |   7 -
 fs/proc/thread_self.c      |   7 -
 include/linux/io_uring.h   |  19 --
 include/linux/sched.h      |   3 +
 include/linux/sched/task.h |   1 +
 kernel/fork.c              |   2 +
 9 files changed, 161 insertions(+), 620 deletions(-)

as it gets rid of _all_ the 'grab this or that piece' that we're
tracking.

WIP series here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker
Eric W. Biederman Feb. 15, 2021, 10:41 p.m. UTC | #16
Jens Axboe <axboe@kernel.dk> writes:

> On 2/15/21 11:24 AM, Jens Axboe wrote:
>> On 2/15/21 11:07 AM, Eric W. Biederman wrote:
>>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>>
>>>> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>>> Similarly it looks like opening of "/dev/tty" fails to
>>>>>> return the tty of the caller but instead fails because
>>>>>> io-wq threads don't have a tty.
>>>>>
>>>>> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
>>>>> thread if not explicitly inherited, and I'm working on similarly
>>>>> proactively catching these cases that could potentially be problematic.
>>>>
>>>> Well, the /dev/tty case still needs fixing somehow.
>>>>
>>>> Opening /dev/tty actually depends on current->signal, and if it is
>>>> NULL it will fall back on the first VT console instead (I think).
>>>>
>>>> I wonder if it should do the same thing /proc/self does..
>>>
>>> Would there be any downside of making the io-wq kernel threads be per
>>> process instead of per user?
>>>
>>> I can see a lower probability of a thread already existing.  Are there
>>> other downsides I am missing?
>>>
>>> The upside would be that all of the issues of have we copied enough
>>> should go away, as the io-wq thread would then behave like another user
>>> space thread.  To handle posix setresuid() and friends it looks like
>>> current_cred would need to be copied but I can't think of anything else.
>> 
>> I really like that idea. Do we currently have a way of creating a thread
>> internally, akin to what would happen if the same task did pthread_create?
>> That'd ensure that we have everything we need, without actively needing to
>> map the request types, or find future issues of "we also need this bit".
>> It'd work fine for the 'need new worker' case too, if one goes to sleep.
>> We'd just 'fork' off that child.
>> 
>> Would require some restructuring of io-wq, but at the end of it, it'd
>> be a simpler solution.
>
> I was intrigued enough that I tried to wire this up. If we can pull this
> off, then it would take a great weight off my shoulders as there would
> be no more worries on identity.
>
> Here's a branch that's got a set of patches that actually work, though
> it's a bit of a hack in spots. Notes:
>
> - Forked worker initially crashed, since it's an actual user thread and
>   bombed on deref of kernel structures. Expectedly. That's what the
>   horrible kernel_clone_args->io_wq hack is working around for now.
>   Obviously not the final solution, but helped move things along so
>   I could actually test this.
>
> - Shared io-wq helpers need indexing for task, right now this isn't
>   done. But that's not hard to do.
>
> - Idle thread reaping isn't done yet, so they persist until the
>   context goes away.
>
> - task_work fallback needs a bit of love. Currently we fallback to
>   the io-wq manager thread for handling that, but a) manager is gone,
>   and b) the new workers are now threads and go away as well when
>   the original task goes away. None of the three fallback sites need
>   task context, so likely solution here is just punt it to system_wq.
>   Not the hot path, obviously, we're exiting.
>
> - Personality registration is broken, it's just Good Enough to compile.
>
> Probably a few more items that escape me right now. As long as you
> don't hit the fallback cases, it appears to work fine for me. And
> the diffstat is pretty good too:
>
>  fs/io-wq.c                 | 418 +++++++++++--------------------------
>  fs/io-wq.h                 |  10 +-
>  fs/io_uring.c              | 314 +++-------------------------
>  fs/proc/self.c             |   7 -
>  fs/proc/thread_self.c      |   7 -
>  include/linux/io_uring.h   |  19 --
>  include/linux/sched.h      |   3 +
>  include/linux/sched/task.h |   1 +
>  kernel/fork.c              |   2 +
>  9 files changed, 161 insertions(+), 620 deletions(-)
>
> as it gets rid of _all_ the 'grab this or that piece' that we're
> tracking.
>
> WIP series here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker

I took a quick look through the code and in general it seems reasonable.

Can the io_uring_task_cancel in begin_new_exec go away as well?
Today it happens after de_thread and so presumably all of the io_uring
threads are already gone.

Eric
Jens Axboe Feb. 16, 2021, 2:41 a.m. UTC | #17
On 2/15/21 3:41 PM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 2/15/21 11:24 AM, Jens Axboe wrote:
>>> On 2/15/21 11:07 AM, Eric W. Biederman wrote:
>>>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>>>
>>>>> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>>> Similarly it looks like opening of "/dev/tty" fails to
>>>>>>> return the tty of the caller but instead fails because
>>>>>>> io-wq threads don't have a tty.
>>>>>>
>>>>>> I've got a patch queued up for 5.12 that clears ->fs and ->files for the
>>>>>> thread if not explicitly inherited, and I'm working on similarly
>>>>>> proactively catching these cases that could potentially be problematic.
>>>>>
>>>>> Well, the /dev/tty case still needs fixing somehow.
>>>>>
>>>>> Opening /dev/tty actually depends on current->signal, and if it is
>>>>> NULL it will fall back on the first VT console instead (I think).
>>>>>
>>>>> I wonder if it should do the same thing /proc/self does..
>>>>
>>>> Would there be any downside of making the io-wq kernel threads be per
>>>> process instead of per user?
>>>>
>>>> I can see a lower probability of a thread already existing.  Are there
>>>> other downsides I am missing?
>>>>
>>>> The upside would be that all of the issues of have we copied enough
>>>> should go away, as the io-wq thread would then behave like another user
>>>> space thread.  To handle posix setresuid() and friends it looks like
>>>> current_cred would need to be copied but I can't think of anything else.
>>>
>>> I really like that idea. Do we currently have a way of creating a thread
>>> internally, akin to what would happen if the same task did pthread_create?
>>> That'd ensure that we have everything we need, without actively needing to
>>> map the request types, or find future issues of "we also need this bit".
>>> It'd work fine for the 'need new worker' case too, if one goes to sleep.
>>> We'd just 'fork' off that child.
>>>
>>> Would require some restructuring of io-wq, but at the end of it, it'd
>>> be a simpler solution.
>>
>> I was intrigued enough that I tried to wire this up. If we can pull this
>> off, then it would take a great weight off my shoulders as there would
>> be no more worries on identity.
>>
>> Here's a branch that's got a set of patches that actually work, though
>> it's a bit of a hack in spots. Notes:
>>
>> - Forked worker initially crashed, since it's an actual user thread and
>>   bombed on deref of kernel structures. Expectedly. That's what the
>>   horrible kernel_clone_args->io_wq hack is working around for now.
>>   Obviously not the final solution, but helped move things along so
>>   I could actually test this.
>>
>> - Shared io-wq helpers need indexing for task, right now this isn't
>>   done. But that's not hard to do.
>>
>> - Idle thread reaping isn't done yet, so they persist until the
>>   context goes away.
>>
>> - task_work fallback needs a bit of love. Currently we fallback to
>>   the io-wq manager thread for handling that, but a) manager is gone,
>>   and b) the new workers are now threads and go away as well when
>>   the original task goes away. None of the three fallback sites need
>>   task context, so likely solution here is just punt it to system_wq.
>>   Not the hot path, obviously, we're exiting.
>>
>> - Personality registration is broken, it's just Good Enough to compile.
>>
>> Probably a few more items that escape me right now. As long as you
>> don't hit the fallback cases, it appears to work fine for me. And
>> the diffstat is pretty good too:
>>
>>  fs/io-wq.c                 | 418 +++++++++++--------------------------
>>  fs/io-wq.h                 |  10 +-
>>  fs/io_uring.c              | 314 +++-------------------------
>>  fs/proc/self.c             |   7 -
>>  fs/proc/thread_self.c      |   7 -
>>  include/linux/io_uring.h   |  19 --
>>  include/linux/sched.h      |   3 +
>>  include/linux/sched/task.h |   1 +
>>  kernel/fork.c              |   2 +
>>  9 files changed, 161 insertions(+), 620 deletions(-)
>>
>> as it gets rid of _all_ the 'grab this or that piece' that we're
>> tracking.
>>
>> WIP series here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker
> 
> I took a quick look through the code and in general it seems reasonable.

Great, thanks for checking.

> Can the io_uring_task_cancel in begin_new_exec go away as well?
> Today it happens after de_thread and so presumably all of the io_uring
> threads are already gone.

I don't think so; async offload is only the slow path of async. We can
also have poll-based waiting, among other completion modes. If we want to
guarantee that no requests escape across exec, then we'll need to retain
that.
Jens Axboe Feb. 17, 2021, 1:18 a.m. UTC | #18
On 2/15/21 7:41 PM, Jens Axboe wrote:
> On 2/15/21 3:41 PM, Eric W. Biederman wrote:
>> [...]
>> I took a quick look through the code and in general it seems reasonable.
> 
> Great, thanks for checking.

Cleaner series here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2

One question, since I'm a bit stumped. The topmost debug patch:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e

any idea what is going on here? For some reason, it only happens for
the 'manager' thread. That one doesn't do any work by itself, it's just
tasked with forking a new worker, if we need one.
Jens Axboe Feb. 17, 2021, 1:26 a.m. UTC | #19
On 2/16/21 6:18 PM, Jens Axboe wrote:
> On 2/15/21 7:41 PM, Jens Axboe wrote:
>> [...]
> 
> Cleaner series here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2
> 
> One question, since I'm a bit stumped. The very top most debug patch:
> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e
> 
> any idea what is going on here? For some reason, it only happens for
> the 'manager' thread. That one doesn't do any work by itself, it's just
> tasked with forking a new worker, if we need one.

Seems to trigger for all cases with a pthread in the app. This reproduces
it:


#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <liburing.h>

/* Setting up a ring from a pthread is enough to trigger it; the ring
 * is deliberately left alive (and leaked) while the thread sleeps. */
static void *fn(void *data)
{
	struct io_uring ring;

	io_uring_queue_init(1, &ring, 0);
	sleep(1);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t t;
	void *ret;

	pthread_create(&t, NULL, fn, NULL);
	pthread_join(t, &ret);

	return 0;
}
Jens Axboe Feb. 17, 2021, 3:11 a.m. UTC | #20
On 2/16/21 6:26 PM, Jens Axboe wrote:
> On 2/16/21 6:18 PM, Jens Axboe wrote:
>> On 2/15/21 7:41 PM, Jens Axboe wrote:
>>> [...]
>>
>> Cleaner series here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2
>>
>> One question, since I'm a bit stumped. The very top most debug patch:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e
>>
>> any idea what is going on here? For some reason, it only happens for
>> the 'manager' thread. That one doesn't do any work by itself, it's just
>> tasked with forking a new worker, if we need one.
> 
> Seems to trigger for all cases with a pthread in the app. This reproduces
> it:

Never mind, it was me being an idiot. I had a case in the manager thread
that returned 0 instead of calling do_exit()...