Message ID: 8e3894e3-2609-4233-83df-1633fba7d4dd@kernel.dk
State: New
Series: io_uring: run normal task_work AFTER local work
On 9/18/24 19:03, Jens Axboe wrote:
> io_cqring_wait() doesn't run normal task_work after the local work, and
> it's the only location to do it in that order. Normally this doesn't
> matter, except if:
>
> 1) The ring is setup with DEFER_TASKRUN
> 2) The local work item may generate normal task_work
>
> For condition 2, this can happen when closing a file and it's the final
> put of that file, for example. This can cause stalls where a task is
> waiting to make progress, but there's nothing else that will wake it up.

TIF_NOTIFY_SIGNAL from normal task_work should prevent the task from
sleeping until it processes task works; that should make the waiting loop
do another iteration and get to the task work execution again (if it
continues to sleep). I don't understand how the patch works, but if it's
legit it sounds like we have a bigger problem, e.g. what if someone else
queues up a work item right after that tw execution block?

> Link: https://github.com/axboe/liburing/issues/1235
> Cc: stable@vger.kernel.org
> Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1aca501efaf6..d6a2cd351525 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2568,9 +2568,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  	 * If we got woken because of task_work being processed, run it
>  	 * now rather than let the caller do another wait loop.
>  	 */
> -	io_run_task_work();
>  	if (!llist_empty(&ctx->work_llist))
>  		io_run_local_work(ctx, nr_wait);
> +	io_run_task_work();
>
>  	/*
>  	 * Non-local task_work will be run on exit to userspace, but
On 9/19/24 4:22 AM, Pavel Begunkov wrote:
> On 9/18/24 19:03, Jens Axboe wrote:
>> io_cqring_wait() doesn't run normal task_work after the local work, and
>> it's the only location to do it in that order. Normally this doesn't
>> matter, except if:
>>
>> 1) The ring is setup with DEFER_TASKRUN
>> 2) The local work item may generate normal task_work
>>
>> For condition 2, this can happen when closing a file and it's the final
>> put of that file, for example. This can cause stalls where a task is
>> waiting to make progress, but there's nothing else that will wake it up.
>
> TIF_NOTIFY_SIGNAL from normal task_work should prevent the task from
> sleeping until it processes task works; that should make the waiting
> loop do another iteration and get to the task work execution again (if
> it continues to sleep). I don't understand how the patch works, but if
> it's legit it sounds like we have a bigger problem, e.g. what if someone
> else queues up a work item right after that tw execution block?

It's not TIF_NOTIFY_SIGNAL - for that case it would've been fine, it
would've just meant another loop around for waiting. As the likelihood of
deferred task_work generating normal task_work is infinitely higher than
the other way around, I do think re-ordering makes sense regardless.

The final fput will use TIF_NOTIFY_RESUME, as it should not be something
that interrupts the task; it just needs to get done eventually when the
task exits to userspace. But for this case, obviously that's a bit more
problematic.

We can also do something like the below, which should fix it as well and
may be a better approach. At least as it currently stands,
TIF_NOTIFY_RESUME and TIF_NOTIFY_SIGNAL are the two signaling mechanisms
for that, hence checking for pending task_work and ensuring our task_work
run handles both should be saner. I'd still swap the ordering of the
task_work runs, however.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 75f0087183e5..56097627eafc 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2472,7 +2472,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return 1;
 	if (unlikely(!llist_empty(&ctx->work_llist)))
 		return 1;
-	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
+	if (unlikely(task_work_pending(current)))
 		return 1;
 	if (unlikely(task_sigpending(current)))
 		return -EINTR;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9d70b2cf7b1e..2fbf0ea9c171 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -308,15 +308,17 @@ static inline int io_run_task_work(void)
 	 */
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
 		clear_notify_signal();
+
+	if (test_thread_flag(TIF_NOTIFY_RESUME)) {
+		__set_current_state(TASK_RUNNING);
+		resume_user_mode_work(NULL);
+	}
+
 	/*
 	 * PF_IO_WORKER never returns to userspace, so check here if we have
 	 * notify work that needs processing.
 	 */
 	if (current->flags & PF_IO_WORKER) {
-		if (test_thread_flag(TIF_NOTIFY_RESUME)) {
-			__set_current_state(TASK_RUNNING);
-			resume_user_mode_work(NULL);
-		}
 		if (current->io_uring) {
 			unsigned int count = 0;
> [...]
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 75f0087183e5..56097627eafc 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2472,7 +2472,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  		return 1;
>  	if (unlikely(!llist_empty(&ctx->work_llist)))
>  		return 1;
> -	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
> +	if (unlikely(task_work_pending(current)))
>  		return 1;
>  	if (unlikely(task_sigpending(current)))
>  		return -EINTR;
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index 9d70b2cf7b1e..2fbf0ea9c171 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -308,15 +308,17 @@ static inline int io_run_task_work(void)
>  	 */
>  	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>  		clear_notify_signal();
> +
> +	if (test_thread_flag(TIF_NOTIFY_RESUME)) {
> +		__set_current_state(TASK_RUNNING);
> +		resume_user_mode_work(NULL);
> +	}
> +
>  	/*
>  	 * PF_IO_WORKER never returns to userspace, so check here if we have
>  	 * notify work that needs processing.
>  	 */
>  	if (current->flags & PF_IO_WORKER) {
> -		if (test_thread_flag(TIF_NOTIFY_RESUME)) {
> -			__set_current_state(TASK_RUNNING);
> -			resume_user_mode_work(NULL);
> -		}
>  		if (current->io_uring) {
>  			unsigned int count = 0;

Can confirm that this patch also fixes the issue on my end (both with the
reordering of the task_work and without it).

Also found a different way to trigger the issue that does not misuse
IOSQE_IO_LINK. Do three sends with IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK
followed by a close with IOSQE_CQE_SKIP_SUCCESS on a ring with
IORING_SETUP_DEFER_TASKRUN.

I confirmed that that test case also first breaks on
846072f16eed3b3fb4e59b677f3ed8afb8509b89 and is fixed by either of the
two patches you sent. Not sure if that's a preferable test case compared
to the weirder earlier one.
You can find it below as a patch to the existing test case in the
liburing repo:

diff --git a/test/linked-defer-close.c b/test/linked-defer-close.c
index 4be96b3..f9ef6eb 100644
--- a/test/linked-defer-close.c
+++ b/test/linked-defer-close.c
@@ -88,6 +88,7 @@ int main(int argc, char *argv[])
 	struct sockaddr_in saddr;
 	char *msg1 = "message number 1\n";
 	char *msg2 = "message number 2\n";
+	char *msg3 = "message number 3\n";
 	int val, send_fd, ret, sockfd;
 	struct sigaction act[2] = { };
 	struct thread_data td;
@@ -182,17 +183,22 @@ int main(int argc, char *argv[])
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_send(sqe, send_fd, msg1, strlen(msg1), 0);
 			sqe->user_data = IS_SEND;
-			sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_send(sqe, send_fd, msg2, strlen(msg2), 0);
 			sqe->user_data = IS_SEND2;
 			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+			sqe = io_uring_get_sqe(&ring);
+			io_uring_prep_send(sqe, send_fd, msg3, strlen(msg3), 0);
+			sqe->user_data = IS_SEND2;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_close(sqe, send_fd);
 			sqe->user_data = IS_CLOSE;
-			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
 			break;
 		case IS_SEND:
 		case IS_SEND2:
On 9/19/24 10:47 AM, Jan Hendrik Farr wrote:
>> [...]
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> [...]
>
> Can confirm that this patch also fixes the issue on my end (both with
> the reordering of the task_work and without it).

Great, thanks for testing! Sent out a v2. No need to test it unless you
absolutely want to ;-)

> Also found a different way to trigger the issue that does not misuse
> IOSQE_IO_LINK. Do three sends with IOSQE_CQE_SKIP_SUCCESS |
> IOSQE_IO_LINK followed by a close with IOSQE_CQE_SKIP_SUCCESS on a ring
> with IORING_SETUP_DEFER_TASKRUN.
>
> I confirmed that that test case also first breaks on
> 846072f16eed3b3fb4e59b677f3ed8afb8509b89 and is fixed by either of the
> two patches you sent.
>
> Not sure if that's a preferable test case compared to the weirder
> earlier one. You can find it below as a patch to the existing test case
> in the liburing repo:

I think that's an improvement, just because it doesn't rely on a weird
usage of IOSQE_IO_LINK. And it looks good to me - do you want me to
commit this directly, or do you want to send a "proper" patch (or github
PR) to retain the proper attribution to you?
On 19 12:06:20, Jens Axboe wrote:
> On 9/19/24 10:47 AM, Jan Hendrik Farr wrote:
> >> [...]
> >
> > Can confirm that this patch also fixes the issue on my end (both with
> > the reordering of the task_work and without it).
>
> Great, thanks for testing! Sent out a v2. No need to test it unless you
> absolutely want to ;-)
>
> [...]
>
> I think that's an improvement, just because it doesn't rely on a weird
> usage of IOSQE_IO_LINK. And it looks good to me - do you want me to
> commit this directly, or do you want to send a "proper" patch (or github
> PR) to retain the proper attribution to you?

Sent the PR with one minor change (adjusted the user data for the third
send).
On 9/19/24 12:31 PM, Jan Hendrik Farr wrote:
> On 19 12:06:20, Jens Axboe wrote:
> [...]
>
> Sent the PR with one minor change (adjusted the user data for the third
> send).

And pulled, thanks.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1aca501efaf6..d6a2cd351525 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2568,9 +2568,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 	 * If we got woken because of task_work being processed, run it
 	 * now rather than let the caller do another wait loop.
 	 */
-	io_run_task_work();
 	if (!llist_empty(&ctx->work_llist))
 		io_run_local_work(ctx, nr_wait);
+	io_run_task_work();
 
 	/*
 	 * Non-local task_work will be run on exit to userspace, but
io_cqring_wait() doesn't run normal task_work after the local work, and
it's the only location to do it in that order. Normally this doesn't
matter, except if:

1) The ring is setup with DEFER_TASKRUN
2) The local work item may generate normal task_work

For condition 2, this can happen when closing a file and it's the final
put of that file, for example. This can cause stalls where a task is
waiting to make progress, but there's nothing else that will wake it up.

Link: https://github.com/axboe/liburing/issues/1235
Cc: stable@vger.kernel.org
Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---