Message ID: 8e3894e3-2609-4233-83df-1633fba7d4dd@kernel.dk
State: New
Series: io_uring: run normal task_work AFTER local work
On 9/18/24 19:03, Jens Axboe wrote:
> io_cqring_wait() doesn't run normal task_work after the local work, and
> it's the only location to do it in that order. Normally this doesn't
> matter, except if:
>
> 1) The ring is setup with DEFER_TASKRUN
> 2) The local work item may generate normal task_work
>
> For condition 2, this can happen when closing a file and it's the final
> put of that file, for example. This can cause stalls where a task is
> waiting to make progress, but there's nothing else that will wake it up.

TIF_NOTIFY_SIGNAL from normal task_work should prevent the task from
sleeping until it processes task works; that should make the waiting loop
do another iteration and get to the task work execution again (if it
continues to sleep). I don't understand how the patch works, but if it's
legit it sounds like we have a bigger problem, e.g. what if someone else
queues up a work item right after that tw execution block?

> Link: https://github.com/axboe/liburing/issues/1235
> Cc: stable@vger.kernel.org
> Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1aca501efaf6..d6a2cd351525 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2568,9 +2568,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  	 * If we got woken because of task_work being processed, run it
>  	 * now rather than let the caller do another wait loop.
>  	 */
> -	io_run_task_work();
>  	if (!llist_empty(&ctx->work_llist))
>  		io_run_local_work(ctx, nr_wait);
> +	io_run_task_work();
>
>  	/*
>  	 * Non-local task_work will be run on exit to userspace, but
On 9/19/24 4:22 AM, Pavel Begunkov wrote:
> On 9/18/24 19:03, Jens Axboe wrote:
>> io_cqring_wait() doesn't run normal task_work after the local work, and
>> it's the only location to do it in that order. Normally this doesn't
>> matter, except if:
>>
>> 1) The ring is setup with DEFER_TASKRUN
>> 2) The local work item may generate normal task_work
>>
>> For condition 2, this can happen when closing a file and it's the final
>> put of that file, for example. This can cause stalls where a task is
>> waiting to make progress, but there's nothing else that will wake it up.
>
> TIF_NOTIFY_SIGNAL from normal task_work should prevent the task from
> sleeping until it processes task works; that should make the waiting
> loop do another iteration and get to the task work execution again (if
> it continues to sleep). I don't understand how the patch works, but if
> it's legit it sounds like we have a bigger problem, e.g. what if someone
> else queues up a work item right after that tw execution block?

It's not TIF_NOTIFY_SIGNAL - for that case it would've been fine, it
would've just meant another loop around for waiting. As the likelihood of
deferred task_work generating normal task_work is infinitely higher than
the other way around, I do think re-ordering makes sense regardless.

The final fput will use TIF_NOTIFY_RESUME, as it should not be something
that interrupts the task; it just needs to get done eventually when the
task exits to userspace. But for this case, obviously that's a bit more
problematic.

We can also do something like the below, which should fix it as well and
may be a better approach. At least as it currently stands,
TIF_NOTIFY_RESUME and TIF_NOTIFY_SIGNAL are the two signaling mechanisms
for that, hence checking for pending task_work and ensuring our task_work
run handles both should be saner. I'd still swap the ordering of the
task_work runs, however.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 75f0087183e5..56097627eafc 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2472,7 +2472,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return 1;
 	if (unlikely(!llist_empty(&ctx->work_llist)))
 		return 1;
-	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
+	if (unlikely(task_work_pending(current)))
 		return 1;
 	if (unlikely(task_sigpending(current)))
 		return -EINTR;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9d70b2cf7b1e..2fbf0ea9c171 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -308,15 +308,17 @@ static inline int io_run_task_work(void)
 	 */
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
 		clear_notify_signal();
+
+	if (test_thread_flag(TIF_NOTIFY_RESUME)) {
+		__set_current_state(TASK_RUNNING);
+		resume_user_mode_work(NULL);
+	}
+
 	/*
 	 * PF_IO_WORKER never returns to userspace, so check here if we have
 	 * notify work that needs processing.
 	 */
 	if (current->flags & PF_IO_WORKER) {
-		if (test_thread_flag(TIF_NOTIFY_RESUME)) {
-			__set_current_state(TASK_RUNNING);
-			resume_user_mode_work(NULL);
-		}
 		if (current->io_uring) {
 			unsigned int count = 0;
> [...]
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 75f0087183e5..56097627eafc 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2472,7 +2472,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  		return 1;
>  	if (unlikely(!llist_empty(&ctx->work_llist)))
>  		return 1;
> -	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
> +	if (unlikely(task_work_pending(current)))
>  		return 1;
>  	if (unlikely(task_sigpending(current)))
>  		return -EINTR;
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index 9d70b2cf7b1e..2fbf0ea9c171 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -308,15 +308,17 @@ static inline int io_run_task_work(void)
>  	 */
>  	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>  		clear_notify_signal();
> +
> +	if (test_thread_flag(TIF_NOTIFY_RESUME)) {
> +		__set_current_state(TASK_RUNNING);
> +		resume_user_mode_work(NULL);
> +	}
> +
>  	/*
>  	 * PF_IO_WORKER never returns to userspace, so check here if we have
>  	 * notify work that needs processing.
>  	 */
>  	if (current->flags & PF_IO_WORKER) {
> -		if (test_thread_flag(TIF_NOTIFY_RESUME)) {
> -			__set_current_state(TASK_RUNNING);
> -			resume_user_mode_work(NULL);
> -		}
>  		if (current->io_uring) {
>  			unsigned int count = 0;

Can confirm that this patch also fixes the issue on my end (both with the
reordering of the task_work and without it).

Also found a different way to trigger the issue that does not misuse
IOSQE_IO_LINK. Do three sends with IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK
followed by a close with IOSQE_CQE_SKIP_SUCCESS on a ring with
IORING_SETUP_DEFER_TASKRUN.

I confirmed that that test case also first breaks on
846072f16eed3b3fb4e59b677f3ed8afb8509b89 and is fixed by either of the
two patches you sent. Not sure if that's a preferable test case compared
to the weirder earlier one.
You can find it below as a patch to the existing test case in the
liburing repo:

diff --git a/test/linked-defer-close.c b/test/linked-defer-close.c
index 4be96b3..f9ef6eb 100644
--- a/test/linked-defer-close.c
+++ b/test/linked-defer-close.c
@@ -88,6 +88,7 @@ int main(int argc, char *argv[])
 	struct sockaddr_in saddr;
 	char *msg1 = "message number 1\n";
 	char *msg2 = "message number 2\n";
+	char *msg3 = "message number 3\n";
 	int val, send_fd, ret, sockfd;
 	struct sigaction act[2] = { };
 	struct thread_data td;
@@ -182,17 +183,22 @@ int main(int argc, char *argv[])
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_send(sqe, send_fd, msg1, strlen(msg1), 0);
 			sqe->user_data = IS_SEND;
-			sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_send(sqe, send_fd, msg2, strlen(msg2), 0);
 			sqe->user_data = IS_SEND2;
 			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+			sqe = io_uring_get_sqe(&ring);
+			io_uring_prep_send(sqe, send_fd, msg3, strlen(msg3), 0);
+			sqe->user_data = IS_SEND2;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+
 			sqe = io_uring_get_sqe(&ring);
 			io_uring_prep_close(sqe, send_fd);
 			sqe->user_data = IS_CLOSE;
-			sqe->flags = IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;
+			sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
 			break;
 		case IS_SEND:
 		case IS_SEND2:
On 9/19/24 10:47 AM, Jan Hendrik Farr wrote:
>> [...]
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> [...]
>
> Can confirm that this patch also fixes the issue on my end (both with
> the reordering of the task_work and without it).

Great, thanks for testing! Sent out a v2. No need to test it unless you
absolutely want to ;-)

> Also found a different way to trigger the issue that does not misuse
> IOSQE_IO_LINK. Do three sends with IOSQE_CQE_SKIP_SUCCESS |
> IOSQE_IO_LINK followed by a close with IOSQE_CQE_SKIP_SUCCESS on a ring
> with IORING_SETUP_DEFER_TASKRUN.
>
> I confirmed that that test case also first breaks on
> 846072f16eed3b3fb4e59b677f3ed8afb8509b89 and is fixed by either of the
> two patches you sent.
>
> Not sure if that's a preferable test case compared to the weirder
> earlier one. You can find it below as a patch to the existing test case
> in the liburing repo:

I think that's an improvement, just because it doesn't rely on a weird
usage of IOSQE_IO_LINK. And it looks good to me - do you want me to
commit this directly, or do you want to send a "proper" patch (or github
PR) to retain the proper attribution to you?
On 19 12:06:20, Jens Axboe wrote:
> On 9/19/24 10:47 AM, Jan Hendrik Farr wrote:
> >> [...]
> >
> > Can confirm that this patch also fixes the issue on my end (both with
> > the reordering of the task_work and without it).
>
> Great, thanks for testing! Sent out a v2. No need to test it unless you
> absolutely want to ;-)
>
> [...]
>
> I think that's an improvement, just because it doesn't rely on a weird
> usage of IOSQE_IO_LINK. And it looks good to me - do you want me to
> commit this directly, or do you want to send a "proper" patch (or github
> PR) to retain the proper attribution to you?

Sent the PR with one minor change (adjusted the user data for the third
send).
On 9/19/24 12:31 PM, Jan Hendrik Farr wrote:
> On 19 12:06:20, Jens Axboe wrote:
> [...]
>
> Sent the PR with one minor change (adjusted the user data for the third
> send).

And pulled, thanks.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1aca501efaf6..d6a2cd351525 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2568,9 +2568,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 	 * If we got woken because of task_work being processed, run it
 	 * now rather than let the caller do another wait loop.
 	 */
-	io_run_task_work();
 	if (!llist_empty(&ctx->work_llist))
 		io_run_local_work(ctx, nr_wait);
+	io_run_task_work();
 
 	/*
 	 * Non-local task_work will be run on exit to userspace, but
io_cqring_wait() doesn't run normal task_work after the local work, and
it's the only location to do it in that order. Normally this doesn't
matter, except if:

1) The ring is setup with DEFER_TASKRUN
2) The local work item may generate normal task_work

For condition 2, this can happen when closing a file and it's the final
put of that file, for example. This can cause stalls where a task is
waiting to make progress, but there's nothing else that will wake it up.

Link: https://github.com/axboe/liburing/issues/1235
Cc: stable@vger.kernel.org
Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---