Message ID | 4293faaf-8279-77e2-8b1a-aff765416980@I-love.SAKURA.ne.jp (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2] 9p/trans_fd: perform read/write with TIF_SIGPENDING set | expand |
On Samstag, 27. August 2022 08:11:48 CEST Tetsuo Handa wrote: > syzbot is reporting hung task at p9_fd_close() [1], for p9_mux_poll_stop() > from p9_conn_destroy() from p9_fd_close() is failing to interrupt already > started kernel_read() from p9_fd_read() from p9_read_work() and/or > kernel_write() from p9_fd_write() from p9_write_work() requests. > > Since p9_socket_open() sets O_NONBLOCK flag, p9_mux_poll_stop() does not > need to interrupt kernel_{read,write}(). However, since p9_fd_open() does > not set O_NONBLOCK flag, but pipe blocks unless signal is pending, > p9_mux_poll_stop() needs to interrupt kernel_{read,write}() when the file > descriptor refers to a pipe. In other words, pipe file descriptor needs > to be handled as if socket file descriptor. We somehow need to interrupt > kernel_{read,write}() on pipes. > > If we can tolerate "possibility of breaking userspace program by setting > O_NONBLOCK flag on userspace-supplied file descriptors" and "possibility > of race window that userspace program clears O_NONBLOCK flag between after > automatically setting O_NONBLOCK flag and before calling > kernel_{read,write}()", we could automatically set O_NONBLOCK flag > immediately before calling kernel_{read,write}(). > > A different approach, which this patch is doing, is to surround > kernel_{read,write}() with set_thread_flag(TIF_SIGPENDING) and > recalc_sigpending(). This might be ugly and bit costly, but does not > touch userspace-supplied file descriptors. So the intention in this alternative approach is to allow user space apps still being able to perform blocking I/O, while at the same time making the kernel thread interruptible to fix this hung task issue, correct? > Link: https://syzkaller.appspot.com/bug?extid=8b41a1365f1106fd0f33 [1] > Reported-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com> > --- > Although syzbot tested that this patch solves hung task problem, syzbot > cannot verify that this patch will not break functionality of p9 users. > Please test before applying this patch. > > net/9p/trans_fd.c | 16 ++++++++++------ > 1 file changed, 10 insertions(+), 6 deletions(-) > > diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c > index e758978b44be..e2f4e3245a80 100644 > --- a/net/9p/trans_fd.c > +++ b/net/9p/trans_fd.c > @@ -256,11 +256,13 @@ static int p9_fd_read(struct p9_client *client, void > *v, int len) if (!ts) > return -EREMOTEIO; > > - if (!(ts->rd->f_flags & O_NONBLOCK)) > - p9_debug(P9_DEBUG_ERROR, "blocking read ...\n"); > - > pos = ts->rd->f_pos; > + /* Force non-blocking read() even without O_NONBLOCK. */ > + set_thread_flag(TIF_SIGPENDING); > ret = kernel_read(ts->rd, v, len, &pos); > + spin_lock_irq(¤t->sighand->siglock); > + recalc_sigpending(); > + spin_unlock_irq(¤t->sighand->siglock); Is the recalc_sigpending() block here actually needed? The TIF_SIGPENDING flag is already cleared by net/9p/client.c, no? > if (ret <= 0 && ret != -ERESTARTSYS && ret != -EAGAIN) > client->status = Disconnected; > return ret; > @@ -423,10 +425,12 @@ static int p9_fd_write(struct p9_client *client, void > *v, int len) if (!ts) > return -EREMOTEIO; > > - if (!(ts->wr->f_flags & O_NONBLOCK)) > - p9_debug(P9_DEBUG_ERROR, "blocking write ...\n"); > - > + /* Force non-blocking write() even without O_NONBLOCK. */ > + set_thread_flag(TIF_SIGPENDING); > ret = kernel_write(ts->wr, v, len, &ts->wr->f_pos); > + spin_lock_irq(¤t->sighand->siglock); > + recalc_sigpending(); > + spin_unlock_irq(¤t->sighand->siglock); > if (ret <= 0 && ret != -ERESTARTSYS && ret != -EAGAIN) > client->status = Disconnected; > return ret;
On 2022/09/02 0:23, Christian Schoenebeck wrote: > So the intention in this alternative approach is to allow user space apps > still being able to perform blocking I/O, while at the same time making the > kernel thread interruptible to fix this hung task issue, correct? Making the kernel thread "non-blocking" (rather than "interruptible") in order not to be blocked on I/O on pipes. Since kernel threads by default do not receive signals, being "interruptible" or "killable" does not help (except for silencing khungtaskd warning). Being "non-blocking" like I/O on sockets helps. >> --- a/net/9p/trans_fd.c >> +++ b/net/9p/trans_fd.c >> @@ -256,11 +256,13 @@ static int p9_fd_read(struct p9_client *client, void >> *v, int len) if (!ts) >> return -EREMOTEIO; >> >> - if (!(ts->rd->f_flags & O_NONBLOCK)) >> - p9_debug(P9_DEBUG_ERROR, "blocking read ...\n"); >> - >> pos = ts->rd->f_pos; >> + /* Force non-blocking read() even without O_NONBLOCK. */ >> + set_thread_flag(TIF_SIGPENDING); >> ret = kernel_read(ts->rd, v, len, &pos); >> + spin_lock_irq(¤t->sighand->siglock); >> + recalc_sigpending(); >> + spin_unlock_irq(¤t->sighand->siglock); > > Is the recalc_sigpending() block here actually needed? The TIF_SIGPENDING flag > is already cleared by net/9p/client.c, no? This is actually needed. The thread which processes this function is a kernel workqueue thread which is supposed to process other functions (which might call "interruptible" functions even if signals are not received by default). The thread which currently clearing the TIF_SIGPENDING flag is a user process (which are calling "killable" functions from syscall context but effectively "uninterruptible" due to clearing the TIF_SIGPENDING flag and retrying). By the way, clearing the TIF_SIGPENDING flag before retrying "killable" functions (like p9_client_rpc() does) is very bad and needs to be avoided...
Thanks for the patch and sorry for the slow reply v1 vs v2: my take is that I think v1 is easier to understand, and if you pass a fd to be used as kernel end for 9p you shouldn't also be messing with it so it's fair game to make it O_NONBLOCK -- we're reading and writing to these things, the fds shouldn't be used by the caller after the mount syscall. Is there any reason you spent time working on v2, or is that just theorical for not messing with userland fd ? unless there's any reason I'll try to find time to test v1 and queue it for 6.1 Tetsuo Handa wrote on Fri, Sep 02, 2022 at 07:25:30AM +0900: > On 2022/09/02 0:23, Christian Schoenebeck wrote: > > So the intention in this alternative approach is to allow user space apps > > still being able to perform blocking I/O, while at the same time making the > > kernel thread interruptible to fix this hung task issue, correct? > > Making the kernel thread "non-blocking" (rather than "interruptible") in order > not to be blocked on I/O on pipes. > > Since kernel threads by default do not receive signals, being "interruptible" > or "killable" does not help (except for silencing khungtaskd warning). Being > "non-blocking" like I/O on sockets helps. I'm still not 100% sure I understand the root of the deadlock, but I can agree the worker thread shouldn't block. We seem to check for EAGAIN where kernel_read/write end up being called and there's a poll for scheduling so it -should- work, but I assume this hasn't been tested much and might take a bit of time to test. > The thread which currently clearing the TIF_SIGPENDING flag is a user process > (which are calling "killable" functions from syscall context but effectively > "uninterruptible" due to clearing the TIF_SIGPENDING flag and retrying). > By the way, clearing the TIF_SIGPENDING flag before retrying "killable" functions > (like p9_client_rpc() does) is very bad and needs to be avoided... Yes, I really wish we could make this go away. I started work to make the cancel (flush) asynchronous, but it introduced a regression I never had (and still don't have) time to figure out... If you have motivation to take over, the patches I sent are here: https://lore.kernel.org/all/20181217110111.GB17466@nautica/T/ (unfortunately some refactoring happened and they no longer apply, but the logic should be mostly sane) Thanks, -- Dominique
On 2022/09/04 8:39, Dominique Martinet wrote: > Is there any reason you spent time working on v2, or is that just > theorical for not messing with userland fd ? Just theoretical for not messing with userland fd, for programs generated by fuzzers might use fds passed to the mount() syscall. I imagined that syzbot again reports this problem when it started playing with fcntl(). For robustness, not messing with userland fd is the better. ;-) > > unless there's any reason I'll try to find time to test v1 and queue it > for 6.1 OK. > We seem to check for EAGAIN where kernel_read/write end up being called > and there's a poll for scheduling so it -should- work, but I assume this > hasn't been tested much and might take a bit of time to test. Right. Since the I/O in kernel side is poll based multiplexing, forcing non-blocking I/O -should- work. (But I can't test e.g. changes in CPU time usage because I don't have environment to test. I assume that poll based multiplexing saves us from doing busy looping.) We are currently checking for ERESTARTSYS and EAGAIN. The former is for non-socket fds which do not have O_NONBLOCK flag, and the latter is for socket fds which have O_NONBLOCK flag. If we enforce O_NONBLOCK flag, the former will become redundant. I think we can remove the former check after you tested that setting O_NONBLOCK flag on non-socket fds does not break.
Tetsuo Handa wrote on Sun, Sep 04, 2022 at 09:27:22AM +0900: > On 2022/09/04 8:39, Dominique Martinet wrote: > > Is there any reason you spent time working on v2, or is that just > > theorical for not messing with userland fd ? > > Just theoretical for not messing with userland fd, for programs generated > by fuzzers might use fds passed to the mount() syscall. I imagined that > syzbot again reports this problem when it started playing with fcntl(). > > For robustness, not messing with userland fd is the better. ;-) By the way digging this back made me think about this a bit again. My opinion hasn't really changed that if you want to shoot yourself in the foot I don't think we're crossing any priviledge boundary here, but we could probably prevent it by saying the mount call with close that fd and somehow steal it? (drop the fget, close_fd after get_file perhaps?) That should address your concern about robustess and syzbot will no longer be able to play with fcntl on "our" end of the pipe. I think it's fair to say that once you pass it to the kernel all bets are off, so closing it for the userspace application could make sense, and the mount already survives when short processes do the mount call and immediately exit so it's not like we need that fd to be open... What do you think? (either way would be for 6.2, the patch is already good enough as is for me) -- Dominique
On 2022/10/07 10:40, Dominique Martinet wrote: > Tetsuo Handa wrote on Sun, Sep 04, 2022 at 09:27:22AM +0900: >> On 2022/09/04 8:39, Dominique Martinet wrote: >>> Is there any reason you spent time working on v2, or is that just >>> theorical for not messing with userland fd ? >> >> Just theoretical for not messing with userland fd, for programs generated >> by fuzzers might use fds passed to the mount() syscall. I imagined that >> syzbot again reports this problem when it started playing with fcntl(). >> >> For robustness, not messing with userland fd is the better. ;-) > > By the way digging this back made me think about this a bit again. > My opinion hasn't really changed that if you want to shoot yourself in > the foot I don't think we're crossing any priviledge boundary here, but > we could probably prevent it by saying the mount call with close that fd > and somehow steal it? (drop the fget, close_fd after get_file perhaps?) > > That should address your concern about robustess and syzbot will no > longer be able to play with fcntl on "our" end of the pipe. I think it's > fair to say that once you pass it to the kernel all bets are off, so > closing it for the userspace application could make sense, and the mount > already survives when short processes do the mount call and immediately > exit so it's not like we need that fd to be open... > > > What do you think? I found that pipe is using alloc_file_clone() which allocates "struct file" instead of just incrementing "struct file"->f_count. Then, can we add EXPORT_SYMBOL_GPL(alloc_file_clone) to fs/file_table.c and use it like struct file *f; ts->rd = fget(rfd); if (!ts->rd) goto out_free_ts; if (!(ts->rd->f_mode & FMODE_READ)) goto out_put_rd; f = alloc_file_clone(ts->rd, ts->rd->f_flags | O_NONBLOCK, ts->rd->f_op); if (IS_ERR(f)) goto out_put_rd; fput(ts->rd); ts->rd = f; ts->wr = fget(wfd); if (!ts->wr) goto out_put_rd; if (!(ts->wr->f_mode & FMODE_WRITE)) goto out_put_wr; f = alloc_file_clone(ts->wr, ts->wr->f_flags | O_NONBLOCK, ts->wr->f_op); if (IS_ERR(f)) goto out_put_wr; fput(ts->wr); ts->wr = f; from p9_fd_open() for cloning "struct file" with O_NONBLOCK flag added? Just an idea. I don't know whether alloc_file_clone() arguments are correct...
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index e758978b44be..e2f4e3245a80 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -256,11 +256,13 @@ static int p9_fd_read(struct p9_client *client, void *v, int len) if (!ts) return -EREMOTEIO; - if (!(ts->rd->f_flags & O_NONBLOCK)) - p9_debug(P9_DEBUG_ERROR, "blocking read ...\n"); - pos = ts->rd->f_pos; + /* Force non-blocking read() even without O_NONBLOCK. */ + set_thread_flag(TIF_SIGPENDING); ret = kernel_read(ts->rd, v, len, &pos); + spin_lock_irq(¤t->sighand->siglock); + recalc_sigpending(); + spin_unlock_irq(¤t->sighand->siglock); if (ret <= 0 && ret != -ERESTARTSYS && ret != -EAGAIN) client->status = Disconnected; return ret; @@ -423,10 +425,12 @@ static int p9_fd_write(struct p9_client *client, void *v, int len) if (!ts) return -EREMOTEIO; - if (!(ts->wr->f_flags & O_NONBLOCK)) - p9_debug(P9_DEBUG_ERROR, "blocking write ...\n"); - + /* Force non-blocking write() even without O_NONBLOCK. */ + set_thread_flag(TIF_SIGPENDING); ret = kernel_write(ts->wr, v, len, &ts->wr->f_pos); + spin_lock_irq(¤t->sighand->siglock); + recalc_sigpending(); + spin_unlock_irq(¤t->sighand->siglock); if (ret <= 0 && ret != -ERESTARTSYS && ret != -EAGAIN) client->status = Disconnected; return ret;