Message ID: 20201214191323.173773-1-axboe@kernel.dk (mailing list archive)
Series: fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK
On Mon, Dec 14, 2020 at 11:13 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I'm pretty happy with this at this point. The core change is very simple,
> and the users end up being trivial too.

It does look very simple.

It strikes me that io_statx would be another user of this. But it
obviously depends on what the actual io_uring users do..

              Linus
On 12/14/20 8:06 PM, Linus Torvalds wrote:
> On Mon, Dec 14, 2020 at 11:13 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I'm pretty happy with this at this point. The core change is very simple,
>> and the users end up being trivial too.
>
> It does look very simple.

Agree, hard to imagine it being much simpler than this.

> It strikes me that io_statx would be another user of this. But it
> obviously depends on what the actual io_uring users do..

Indeed, I did mention that in my email from the other thread earlier
today, and I think it's potentially an even bigger deal than nonblock
open. Lots of workloads are very stat intensive.

So I did do that:

https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup

it's the top three patches there. Basically all local file systems are
fine with AT_STATX_NONBLOCK just mapping to LOOKUP_NONBLOCK, while
various (mostly networked) file systems like to do revalidation and
other kinds of communication as part of getattr. Hence the second patch
there is needed, to add some

	if (query_flags & AT_STATX_NONBLOCK)
		return -EAGAIN;

constructs to avoid that in the nonblock path. Outside of that, it's
very trivial and just works.

I didn't include it in this posting to avoid detracting from the core
work, but I definitely think we should be doing that as well.
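For illustration, a guard of that shape in a networked filesystem's
->getattr() might look like the sketch below. Only the two-line flag
check mirrors the patches described above; the function name and the
surrounding body are hypothetical:

	/*
	 * Sketch: a networked filesystem's ->getattr() that may need to
	 * talk to a server bails out early when the caller asked for a
	 * nonblocking (cached-only) stat, so the blocking work gets
	 * retried from a context that is allowed to sleep.
	 */
	static int netfs_getattr(const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
	{
		if (query_flags & AT_STATX_NONBLOCK)
			return -EAGAIN;	/* punt back for a blocking retry */

		/* ... revalidate with the server, then fill *stat ... */
		return 0;
	}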
On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> Hi,
>
> Wanted to throw out what the current state of this is, as we keep
> getting closer to something palatable.
>
> This time I've included the io_uring change too. I've tested this through
> both openat2, and through io_uring as well.
>
> I'm pretty happy with this at this point. The core change is very simple,
> and the users end up being trivial too.

I'll review tomorrow morning (sorry, half-asleep right now)
On 12/14/20 11:11 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> [...]
> I'll review tomorrow morning (sorry, half-asleep right now)

Thanks Al, appreciate it.
On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> Hi,
>
> Wanted to throw out what the current state of this is, as we keep
> getting closer to something palatable.
> [...]
> Also available here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=nonblock-path-lookup

OK, pushed with modifications into vfs.git #work.namei

Changes: dropped a couple of pointless pieces in
open_last_lookups()/do_open(), moved O_TMPFILE rejection into
build_open_flags() (i.e. into the third of your commits). And no
io_uring stuff in there - your #4 is absent.

I've not put it into #for-next yet; yell if you see any problems with
that branch, or it'll end up there ;-)
On 1/3/21 10:31 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:20PM -0700, Jens Axboe wrote:
> [...]
> OK, pushed with modifications into vfs.git #work.namei
>
> Changes: dropped a couple of pointless pieces in
> open_last_lookups()/do_open(), moved O_TMPFILE rejection into
> build_open_flags() (i.e. into the third of your commits). And no
> io_uring stuff in there - your #4 is absent.
>
> I've not put it into #for-next yet; yell if you see any problems with
> that branch, or it'll end up there ;-)

Thanks Al - but you picked out of v3, not v4. Not that there are huge
changes between the two; from the posting of v4:

- Rename LOOKUP_NONBLOCK -> LOOKUP_CACHED, and ditto for the RESOLVE_
  flag. This better explains what the feature does, making it more
  self-explanatory in terms of both code readability and the user
  visible part.

- Remove dead LOOKUP_NONBLOCK check after we've dropped LOOKUP_RCU
  already, spotted by Al.

- Add O_TMPFILE to the checks upfront, so we can drop the checking in
  do_tmpfile().

and it sounds like you did the last two when merging yourself. I do
like LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the
externally self-documenting feature of it. What do you think?

Here's the v4 posting, fwiw:

https://lore.kernel.org/linux-fsdevel/20201217161911.743222-1-axboe@kernel.dk/
On Mon, Jan 04, 2021 at 07:43:17AM -0700, Jens Axboe wrote:
>> I've not put it into #for-next yet; yell if you see any problems with
>> that branch, or it'll end up there ;-)
>
> Thanks Al - but you picked out of v3, not v4. Not that there are huge
> changes between the two; from the posting of v4:
>
> - Rename LOOKUP_NONBLOCK -> LOOKUP_CACHED, and ditto for the RESOLVE_
>   flag. [...]
>
> - Remove dead LOOKUP_NONBLOCK check after we've dropped LOOKUP_RCU
>   already, spotted by Al.
>
> - Add O_TMPFILE to the checks upfront, so we can drop the checking in
>   do_tmpfile().
>
> and it sounds like you did the last two when merging yourself.

Yes - back when I'd posted that review.

> I do like LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the
> externally self-documenting feature of it. What do you think?

Agreed, especially since _NONBLOCK would confuse users into the
assumption that the operation is actually non-blocking...

> Here's the v4 posting, fwiw:
>
> https://lore.kernel.org/linux-fsdevel/20201217161911.743222-1-axboe@kernel.dk/

Sorry, picked from the local branch that had sat around since
mid-December ;-/ Fixed. Another change: the ..._child part in
unlazy_child() is misleading - it might as well be used for ..
traversal, where dentry is usually the _parent_ of nd->path.dentry.
The real constraint here is that the dentry/seq pair had been a valid
next position at some point during the RCU walk. Renamed to
try_to_unlazy_next(), (hopefully) fixed the comment...

Updated variant force-pushed.
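As a userspace illustration of what RESOLVE_CACHED buys, the pattern
would be roughly the sketch below - try a cached-only resolution first
and fall back to a normal (possibly blocking) open on -EAGAIN, which is
the same dance io_uring does internally with its inline attempt and
async punt. This assumes the RESOLVE_CACHED definition the series adds
to linux/openat2.h; there was no glibc wrapper for openat2() at the
time, so the raw syscall is used:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <errno.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/openat2.h>

	static int open_cached_first(const char *path)
	{
		struct open_how how;
		int fd;

		memset(&how, 0, sizeof(how));
		how.flags = O_RDONLY;
		how.resolve = RESOLVE_CACHED;	/* -EAGAIN instead of blocking */

		fd = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
		if (fd < 0 && errno == EAGAIN)
			fd = open(path, O_RDONLY);	/* slow path, may block */
		return fd;
	}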
On 1/4/21 9:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2021 at 07:43:17AM -0700, Jens Axboe wrote:
> [...]
>> and it sounds like you did the last two when merging yourself.
>
> Yes - back when I'd posted that review.

Gotcha

>> I do like LOOKUP_CACHED better than LOOKUP_NONBLOCK, mostly for the
>> externally self-documenting feature of it. What do you think?
>
> Agreed, especially since _NONBLOCK would confuse users into the
> assumption that the operation is actually non-blocking...
>
> Sorry, picked from the local branch that had sat around since
> mid-December ;-/ Fixed. Another change: the ..._child part in
> unlazy_child() is misleading [...] Renamed to try_to_unlazy_next(),
> (hopefully) fixed the comment...
>
> Updated variant force-pushed.

All good, looks good to me from a quick look. A diff between the two
bases seems sane. I'll pull it in and rebase on top of that, and re-run
the testing just to make sure that no ill effects snuck in there with
the changes.
On 2/13/21 3:08 PM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> Hi,
>>
>> Wanted to throw out what the current state of this is, as we keep
>> getting closer to something palatable.
>> [...]
>> Changes since v2:
>>
>> - Simplify the LOOKUP_NONBLOCK to _just_ cover LOOKUP_RCU and
>>   lookup_fast(), as per Linus's suggestion. I verified fast path
>>   operations are indeed just that, and it avoids having to mess with
>>   the inode lock and mnt_want_write() completely.
>>
>> - Make O_TMPFILE return -EAGAIN for LOOKUP_NONBLOCK.
>>
>> - Improve/add a few comments.
>>
>> - Folded what was left of the open part of LOOKUP_NONBLOCK into the
>>   main patch.
>
> I have to ask. Are you doing work to verify that performing
> path traversals and opening of files yields the same results
> when passed to io_uring as it does when performed by the caller?
>
> Looking at v5.11-rc7 it appears I can arbitrarily access the
> io-wq thread in proc by traversing "/proc/thread-self/".

Yes, that looks like a bug; it needs similar treatment to /proc/self:

diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index a553273fbd41..e8ca19371a36 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -17,6 +17,13 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 	pid_t pid = task_pid_nr_ns(current, ns);
 	char *name;
 
+	/*
+	 * Not currently supported. Once we can inherit all of struct pid,
+	 * we can allow this.
+	 */
+	if (current->flags & PF_KTHREAD)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (!pid)
 		return ERR_PTR(-ENOENT);
 	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);

as was done in this commit:

commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Nov 13 16:47:52 2020 -0700

    proc: don't allow async path resolution of /proc/self components

> Similarly it looks like opening of "/dev/tty" fails to
> return the tty of the caller but instead fails because
> io-wq threads don't have a tty.
>
> There are a non-trivial number of places where current is used in the
> kernel, both in path traversal and in opening files, and I may be
> blind but I have not seen any work to find all of those places and
> make certain they are safe when called from io_uring. That worries
> me, as using a kernel thread instead of a user thread could
> potentially lead to privilege escalation.

I've got a patch queued up for 5.12 that clears ->fs and ->files for
the thread if not explicitly inherited, and I'm working on similarly
proactively catching these cases that could potentially be problematic.
On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> > Similarly it looks like opening of "/dev/tty" fails to
> > return the tty of the caller but instead fails because
> > io-wq threads don't have a tty.
>
> I've got a patch queued up for 5.12 that clears ->fs and ->files for
> the thread if not explicitly inherited, and I'm working on similarly
> proactively catching these cases that could potentially be problematic.

Well, the /dev/tty case still needs fixing somehow.

Opening /dev/tty actually depends on current->signal, and if it is
NULL it will fall back on the first VT console instead (I think).

I wonder if it should do the same thing /proc/self does..

              Linus
On Sun, Feb 14, 2021 at 12:30:07PM -0800, Linus Torvalds wrote:
> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
> [...]
> Well, the /dev/tty case still needs fixing somehow.
>
> Opening /dev/tty actually depends on current->signal, and if it is
> NULL it will fall back on the first VT console instead (I think).
>
> I wonder if it should do the same thing /proc/self does..

I still think that entire "offload pathname resolution to helper
threads" thing is bollocks. ->fs, ->files and ->signal are still
nowhere near enough - look at /proc/net, for example. Or
netns-sensitive parts of sysfs, for that matter... And that's not
going into the really weird crap where the opener is very special and
assumed to be doing all IO.
Jens Axboe <axboe@kernel.dk> writes:

> On 2/13/21 3:08 PM, Eric W. Biederman wrote:
>>
>> I have to ask. Are you doing work to verify that performing
>> path traversals and opening of files yields the same results
>> when passed to io_uring as it does when performed by the caller?
>>
>> Looking at v5.11-rc7 it appears I can arbitrarily access the
>> io-wq thread in proc by traversing "/proc/thread-self/".
>
> Yes, that looks like a bug; it needs similar treatment to /proc/self:

Agreed.

> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> [...]
> +	/*
> +	 * Not currently supported. Once we can inherit all of struct pid,
> +	 * we can allow this.
> +	 */
> +	if (current->flags & PF_KTHREAD)
> +		return ERR_PTR(-EOPNOTSUPP);

I suspect simply testing for PF_IO_WORKER is a better test. As it is
the delegation to the io_worker, not the fact that it is a kernel
thread, that causes a problem. I have a memory of that point being
made when the smack_privileged test and the tomoyo_kernel_service
test, and how to fix them, were being discussed.

>  	if (!pid)
>  		return ERR_PTR(-ENOENT);
>  	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
>
> as was done in this commit:
>
> commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Fri Nov 13 16:47:52 2020 -0700
>
>     proc: don't allow async path resolution of /proc/self components

I suspect that would be better as PF_IO_WORKER as well.

>> Similarly it looks like opening of "/dev/tty" fails to
>> return the tty of the caller but instead fails because
>> io-wq threads don't have a tty.
>>
>> There are a non-trivial number of places where current is used in the
>> kernel, both in path traversal and in opening files, and I may be
>> blind but I have not seen any work to find all of those places and
>> make certain they are safe when called from io_uring. That worries
>> me, as using a kernel thread instead of a user thread could
>> potentially lead to privilege escalation.
>
> I've got a patch queued up for 5.12 that clears ->fs and ->files for
> the thread if not explicitly inherited, and I'm working on similarly
> proactively catching these cases that could potentially be
> problematic.

Any pointers, or is this all in a private tree for now? It is
difficult to follow, because many of the fixes have not CC'd the
reporters or even the maintainers of the subsystems who have been
affected by this kind of issue.

Eric
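Concretely, Eric's suggestion would narrow the guard in the patch above
to something like:

	/*
	 * Reject only io_uring's async offload workers (PF_IO_WORKER),
	 * rather than all kernel threads (PF_KTHREAD).
	 */
	if (current->flags & PF_IO_WORKER)
		return ERR_PTR(-EOPNOTSUPP);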
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Feb 14, 2021 at 8:38 AM Jens Axboe <axboe@kernel.dk> wrote:
> [...]
> Well, the /dev/tty case still needs fixing somehow.
>
> Opening /dev/tty actually depends on current->signal, and if it is
> NULL it will fall back on the first VT console instead (I think).
>
> I wonder if it should do the same thing /proc/self does..

Would there be any downside of making the io-wq kernel threads be per
process instead of per user?

I can see a lower probability of a thread already existing. Are there
other downsides I am missing?

The upside would be that all of the issues of "have we copied enough"
should go away, as the io-wq thread would then behave like another
user space thread. To handle posix setresuid() and friends it looks
like current_cred would need to be copied, but I can't think of
anything else.

Right now I am with Al, and I don't have any idea how many special
cases we need to play whack-a-mole with.

Eric
On 2/15/21 11:07 AM, Eric W. Biederman wrote:
> [...]
> Would there be any downside of making the io-wq kernel threads be per
> process instead of per user?
>
> I can see a lower probability of a thread already existing. Are there
> other downsides I am missing?
>
> The upside would be that all of the issues of "have we copied enough"
> should go away, as the io-wq thread would then behave like another
> user space thread. To handle posix setresuid() and friends it looks
> like current_cred would need to be copied, but I can't think of
> anything else.

I really like that idea. Do we currently have a way of creating a
thread internally, akin to what would happen if the same task did
pthread_create? That'd ensure that we have everything we need, without
actively needing to map the request types, or find future issues of
"we also need this bit". It'd work fine for the 'need new worker' case
too, if one goes to sleep. We'd just 'fork' off that child.

Would require some restructuring of io-wq, but at the end of it, it'd
be a simpler solution.
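For context, "akin to what would happen if the same task did
pthread_create" means roughly the following sharing set at the clone()
level - this is what glibc's pthread_create requests (minus the TLS and
TID-bookkeeping flags), shown only to illustrate what such a worker
would inherit for free, not as the eventual kernel API:

	/*
	 * The sharing semantics pthread_create() asks clone() for. A
	 * worker spawned like this shares the mm, fs info, fd table,
	 * signal handlers and thread group of the submitting task, so
	 * "have we copied enough" stops being a question.
	 */
	const unsigned long pthread_like_flags =
		CLONE_VM      |		/* share address space */
		CLONE_FS      |		/* share cwd, root, umask */
		CLONE_FILES   |		/* share file descriptor table */
		CLONE_SIGHAND |		/* share signal handlers */
		CLONE_THREAD  |		/* join the caller's thread group */
		CLONE_SYSVSEM;		/* share sysvsem undo lists */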
On 2/15/21 11:24 AM, Jens Axboe wrote:
> On 2/15/21 11:07 AM, Eric W. Biederman wrote:
> [...]
>> The upside would be that all of the issues of "have we copied enough"
>> should go away, as the io-wq thread would then behave like another
>> user space thread.
>
> I really like that idea. [...]
>
> Would require some restructuring of io-wq, but at the end of it, it'd
> be a simpler solution.

I was intrigued enough that I tried to wire this up. If we can pull
this off, then it would take a great weight off my shoulders, as there
would be no more worries on identity.

Here's a branch that's got a set of patches that actually work, though
it's a bit of a hack in spots. Notes:

- Forked worker initially crashed, since it's an actual user thread
  and bombed on deref of kernel structures. Expectedly. That's what
  the horrible kernel_clone_args->io_wq hack is working around for
  now. Obviously not the final solution, but helped move things along
  so I could actually test this.

- Shared io-wq helpers need indexing for task; right now this isn't
  done. But that's not hard to do.

- Idle thread reaping isn't done yet, so they persist until the
  context goes away.

- task_work fallback needs a bit of love. Currently we fall back to
  the io-wq manager thread for handling that, but a) the manager is
  gone, and b) the new workers are now threads and go away as well
  when the original task goes away. None of the three fallback sites
  need task context, so the likely solution here is just to punt it to
  system_wq. Not the hot path, obviously, we're exiting.

- Personality registration is broken, it's just Good Enough to
  compile.

Probably a few more items that escape me right now. As long as you
don't hit the fallback cases, it appears to work fine for me.
And the diffstat is pretty good too:

 fs/io-wq.c                 | 418 +++++++++++--------------------------
 fs/io-wq.h                 |  10 +-
 fs/io_uring.c              | 314 +++------------------------
 fs/proc/self.c             |   7 -
 fs/proc/thread_self.c      |   7 -
 include/linux/io_uring.h   |  19 --
 include/linux/sched.h      |   3 +
 include/linux/sched/task.h |   1 +
 kernel/fork.c              |   2 +
 9 files changed, 161 insertions(+), 620 deletions(-)

as it gets rid of _all_ the 'grab this or that piece' that we're
tracking.

WIP series here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker
Jens Axboe <axboe@kernel.dk> writes:

> I was intrigued enough that I tried to wire this up. If we can pull
> this off, then it would take a great weight off my shoulders, as
> there would be no more worries on identity.
>
> Here's a branch that's got a set of patches that actually work,
> though it's a bit of a hack in spots.
> [...]
> WIP series here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker

I took a quick look through the code and in general it seems
reasonable.

Can the io_uring_task_cancel in begin_new_exec go away as well? Today
it happens after de_thread, and so presumably all of the io_uring
threads are already gone.

Eric
On 2/15/21 3:41 PM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> [...]
> I took a quick look through the code and in general it seems
> reasonable.

Great, thanks for checking.

> Can the io_uring_task_cancel in begin_new_exec go away as well? Today
> it happens after de_thread, and so presumably all of the io_uring
> threads are already gone.

I don't think so, async offload is only the slower path of async. We
can have poll based waiting, and various others. If we want to
guarantee that no requests escape across exec, then we'll need to
retain that.
On 2/15/21 7:41 PM, Jens Axboe wrote:
> On 2/15/21 3:41 PM, Eric W. Biederman wrote:
> [...]
>> I took a quick look through the code and in general it seems
>> reasonable.
>
> Great, thanks for checking.

Cleaner series here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2

One question, since I'm a bit stumped. The very topmost debug patch:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e

any idea what is going on here? For some reason, it only happens for
the 'manager' thread. That one doesn't do any work by itself, it's
just tasked with forking a new worker, if we need one.
On 2/16/21 6:18 PM, Jens Axboe wrote:
> On 2/15/21 7:41 PM, Jens Axboe wrote:
> [...]
> Cleaner series here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-worker.v2
>
> One question, since I'm a bit stumped. The very topmost debug patch:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e
>
> any idea what is going on here? For some reason, it only happens for
> the 'manager' thread. That one doesn't do any work by itself, it's
> just tasked with forking a new worker, if we need one.

Seems to trigger for all cases with a pthread in the app. This
reproduces it:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <liburing.h>

/* Set up an io_uring from a pthread rather than the main thread, then
 * let the thread exit with the ring still around.
 */
static void *fn(void *data)
{
	struct io_uring ring;

	io_uring_queue_init(1, &ring, 0);
	sleep(1);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t t;
	void *ret;

	pthread_create(&t, NULL, fn, NULL);
	pthread_join(t, &ret);
	return 0;
}
On 2/16/21 6:26 PM, Jens Axboe wrote:
> On 2/16/21 6:18 PM, Jens Axboe wrote:
> [...]
>> One question, since I'm a bit stumped. The very topmost debug patch:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-worker.v2&id=8a422f030b9630d16d5ec1ff97842a265f88485e
>>
>> any idea what is going on here? For some reason, it only happens for
>> the 'manager' thread. That one doesn't do any work by itself, it's
>> just tasked with forking a new worker, if we need one.
>
> Seems to trigger for all cases with a pthread in the app. This
> reproduces it:
> [...]

Never mind, it was me being an idiot. I had a case in the manager
thread that did return 0 instead of do_exit()...