
net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL

Message ID 20240315100159.3898944-1-s.hauer@pengutronix.de (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 941 this patch: 941
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: edumazet@google.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 958 this patch: 958
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 24 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-15--18-00 (tests: 905)

Commit Message

Sascha Hauer March 15, 2024, 10:01 a.m. UTC
It can happen that a socket sends the remaining data at close() time.
With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
current task. This flag has been set in io_req_normal_work_add() by
calling task_work_add().

This patch replaces signal_pending() with task_sigpending(), thus ignoring
the TIF_NOTIFY_SIGNAL flag.

A discussion of this issue can be found at
https://lore.kernel.org/20231010141932.GD3114228@pengutronix.de

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
---
 net/core/stream.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Jens Axboe March 15, 2024, 4:43 p.m. UTC | #1
On 3/15/24 4:01 AM, Sascha Hauer wrote:
> It can happen that a socket sends the remaining data at close() time.
> With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> current task. This flag has been set in io_req_normal_work_add() by
> calling task_work_add().
> 
> This patch replaces signal_pending() with task_sigpending(), thus ignoring
> the TIF_NOTIFY_SIGNAL flag.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

Should probably also flag this for stable, 5.10+ I think, as
task_sigpending() got backported that far.
Pavel Begunkov March 15, 2024, 5:02 p.m. UTC | #2
On 3/15/24 10:01, Sascha Hauer wrote:
> It can happen that a socket sends the remaining data at close() time.
> With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> current task. This flag has been set in io_req_normal_work_add() by
> calling task_work_add().

The entire idea of task_work is to interrupt syscalls and let io_uring
do its job; otherwise it couldn't free the resources it might be holding,
and could even block the syscall forever.

I'm not that sure about connect / close (are they not restartable?),
but it doesn't seem to be a good idea for sk_stream_wait_memory(),
which is the normal TCP blocking send path. I'm thinking of some kinds
of cases with a local TCP socket pair, the tx queue is full as well
and the rx queue of the other end, and io_uring has to run to receive
the data.

If interruptions are not welcome you can use different io_uring flags,
see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.

Maybe I'm missing something, why not restart your syscall?


> This patch replaces signal_pending() with task_sigpending(), thus ignoring
> the TIF_NOTIFY_SIGNAL flag.
> 
> A discussion of this issue can be found at
> https://lore.kernel.org/20231010141932.GD3114228@pengutronix.de
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> ---
>   net/core/stream.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/stream.c b/net/core/stream.c
> index 96fbcb9bbb30a..e9e17b48e0122 100644
> --- a/net/core/stream.c
> +++ b/net/core/stream.c
> @@ -67,7 +67,7 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
>   			return -EPIPE;
>   		if (!*timeo_p)
>   			return -EAGAIN;
> -		if (signal_pending(tsk))
> +		if (task_sigpending(tsk))
>   			return sock_intr_errno(*timeo_p);
>   
>   		add_wait_queue(sk_sleep(sk), &wait);
> @@ -103,7 +103,7 @@ void sk_stream_wait_close(struct sock *sk, long timeout)
>   		do {
>   			if (sk_wait_event(sk, &timeout, !sk_stream_closing(sk), &wait))
>   				break;
> -		} while (!signal_pending(current) && timeout);
> +		} while (!task_sigpending(current) && timeout);
>   
>   		remove_wait_queue(sk_sleep(sk), &wait);
>   	}
> @@ -134,7 +134,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
>   			goto do_error;
>   		if (!*timeo_p)
>   			goto do_eagain;
> -		if (signal_pending(current))
> +		if (task_sigpending(current))
>   			goto do_interrupted;
>   		sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>   		if (sk_stream_memory_free(sk) && !vm_wait)
Sascha Hauer March 18, 2024, 12:03 p.m. UTC | #3
Apologies, I have sent the wrong mail. Here is the mail I really wanted
to send, with answers to some of the questions Paolo raised the last
time I sent it.

-----------------------------------8<------------------------------

From 566bb198546423c024cdebc50d0aade7ed638a40 Mon Sep 17 00:00:00 2001
From: Sascha Hauer <s.hauer@pengutronix.de>
Date: Mon, 23 Oct 2023 14:13:46 +0200
Subject: [PATCH v2] net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL

It can happen that a socket sends the remaining data at close() time.
With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
current task. This flag has been set in io_req_normal_work_add() by
calling task_work_add().

It seems signal_pending() is too broad, so this patch replaces it with
task_sigpending(), thus ignoring the TIF_NOTIFY_SIGNAL flag.

A discussion of this issue can be found at
https://lore.kernel.org/20231010141932.GD3114228@pengutronix.de

Suggested-by: Jens Axboe <axboe@kernel.dk>
Fixes: 12db8b690010c ("entry: Add support for TIF_NOTIFY_SIGNAL")
Link: https://lore.kernel.org/r/20231023121346.4098160-1-s.hauer@pengutronix.de
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
---

Changes since v1:
- only replace signal_pending() with task_sigpending() where we need it,
  in sk_stream_wait_memory()

I'd like to pick up the discussion on this patch as it is still needed for our
use case. Paolo Abeni raised some concerns about this patch for which I didn't
have good answers. I am referencing them here again with an attempt to answer
them. Jens, maybe you also have a few words here.

Paolo raised some concerns in
https://lore.kernel.org/all/e1e15554bfa5cfc8048d6074eedbc83c4d912c98.camel@redhat.com/:

> To be more explicit: why this will not cause user-space driven
> connect() from missing relevant events?

Note I dropped the hunk in sk_stream_wait_connect() and
sk_stream_wait_close() in this version.
Userspace-driven signals are still caught by task_sigpending(), which
tests for TIF_SIGPENDING. signal_pending() additionally checks
TIF_NOTIFY_SIGNAL, which is used exclusively by task_work_add() to add
work to a task.

> Why this is needed only here and not in all the many others event loops waiting
> for signals?

There might be other cases too, but this one particularly bites us. I guess in
other places the work can be picked up again after being interrupted by a signal,
but here it can't.

> Why can't the issue be addressed in any place more closely tied to the
> scenario, e.g. io_uring or ktls?

At least I don't have an idea how this could be done. All components seem
to do the right thing on their own, but the pieces do not seem to fit
together for this particular case.

Jens, please correct me if I am wrong or if you have something to add, this whole
thing is outside my comfort zone, but I'd be very happy if we could solve this
issue somehow.

Thanks,
  Sascha

 net/core/stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/stream.c b/net/core/stream.c
index b16dfa568a2d5..55d90251205a6 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -134,7 +134,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 			goto do_error;
 		if (!*timeo_p)
 			goto do_eagain;
-		if (signal_pending(current))
+		if (task_sigpending(current))
 			goto do_interrupted;
 		sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 		if (sk_stream_memory_free(sk) && !vm_wait)
Sascha Hauer March 18, 2024, 12:10 p.m. UTC | #4
On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
> On 3/15/24 10:01, Sascha Hauer wrote:
> > It can happen that a socket sends the remaining data at close() time.
> > With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> > out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> > current task. This flag has been set in io_req_normal_work_add() by
> > calling task_work_add().
> 
> The entire idea of task_work is to interrupt syscalls and let io_uring
> do its job, otherwise it wouldn't free resources it might be holding,
> and even potentially forever block the syscall.
> 
> I'm not that sure about connect / close (are they not restartable?),
> but it doesn't seem to be a good idea for sk_stream_wait_memory(),
> which is the normal TCP blocking send path. I'm thinking of some kinds
> of cases with a local TCP socket pair, the tx queue is full as well
> and the rx queue of the other end, and io_uring has to run to receive
> the data.
> 
> If interruptions are not welcome you can use different io_uring flags,
> see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.

I tried with different combinations of these flags. For example
IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
makes the issue less likely, but nevertheless it still happens.

However, according to their documentation these flags provide optimization
hints to the kernel; things should still work without them, right?

> 
> Maybe I'm missing something, why not restart your syscall?

The problem comes with TLS. Normally, with synchronous encryption, all
data on a socket is written during write(). When asynchronous
encryption comes into play, not all data is written during write();
the remaining data is instead written at close() time.

Here is my call stack when things go awry:

[  325.560946] tls_push_sg: tcp_sendmsg_locked returned -512
[  325.566371] CPU: 1 PID: 305 Comm: webserver_libur Not tainted 6.8.0-rc6-00022-g932acd9c444b-dirty #248
[  325.575684] Hardware name: NXP i.MX8MPlus EVK board (DT)
[  325.580997] Call trace:
[  325.583444]  dump_backtrace+0x90/0xe8
[  325.587122]  show_stack+0x18/0x24
[  325.590444]  dump_stack_lvl+0x48/0x60
[  325.594114]  dump_stack+0x18/0x24
[  325.597432]  tls_push_sg+0xfc/0x22c
[  325.600930]  tls_tx_records+0x114/0x1cc
[  325.604772]  tls_sw_release_resources_tx+0x3c/0x140
[  325.609658]  tls_sk_proto_close+0x2b0/0x3ac
[  325.613846]  inet_release+0x4c/0x9c
[  325.617341]  __sock_release+0x40/0xb4
[  325.621007]  sock_close+0x18/0x28
[  325.624328]  __fput+0x70/0x2bc
[  325.627386]  ____fput+0x10/0x1c
[  325.630531]  task_work_run+0x74/0xcc
[  325.634113]  do_notify_resume+0x22c/0x1310
[  325.638220]  el0_svc+0xa4/0xb4
[  325.641279]  el0t_64_sync_handler+0x120/0x12c
[  325.645643]  el0t_64_sync+0x190/0x194

As said, TLS is sending remaining data at close() time in tls_push_sg().
Here sk_stream_wait_memory() gets interrupted and returns -ERESTARTSYS.
There's no way to restart this operation, the socket is about to be
closed and won't accept data anymore.

Sascha
Pavel Begunkov March 18, 2024, 1:19 p.m. UTC | #5
On 3/18/24 12:10, Sascha Hauer wrote:
> On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
>> On 3/15/24 10:01, Sascha Hauer wrote:
>>> It can happen that a socket sends the remaining data at close() time.
>>> With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
>>> out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
>>> current task. This flag has been set in io_req_normal_work_add() by
>>> calling task_work_add().
>>
>> The entire idea of task_work is to interrupt syscalls and let io_uring
>> do its job, otherwise it wouldn't free resources it might be holding,
>> and even potentially forever block the syscall.
>>
>> I'm not that sure about connect / close (are they not restartable?),
>> but it doesn't seem to be a good idea for sk_stream_wait_memory(),
>> which is the normal TCP blocking send path. I'm thinking of some kinds
>> of cases with a local TCP socket pair, the tx queue is full as well
>> and the rx queue of the other end, and io_uring has to run to receive
>> the data.

There is another case: let's say the IO is done via io-wq
(io_uring's worker thread pool) and hits the waiting. Now the
request can't get cancelled, which is done by interrupting the
task with TIF_NOTIFY_SIGNAL. User-requested cancellations are
one thing, but we'd need to check whether io_uring can ever be
closed in this case.


>> If interruptions are not welcome you can use different io_uring flags,
>> see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.
> 
> I tried with different combinations of these flags. For example
> IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
> makes the issue less likely, but nevertheless it still happens.
> 
> However, reading the documentation of these flags, they shall provide
> hints to the kernel for optimizations, but it should work without these
> flags, right?

That's true, and I guess there are other cases as well, like
io-wq and perhaps even a stray fput.


>> Maybe I'm missing something, why not restart your syscall?
> 
> The problem comes with TLS. Normally with synchronous encryption all
> data on a socket is written during write(). When asynchronous
> encryption comes into play, then not all data is written during write(),
> but instead the remaining data is written at close() time.

Was it considered to do the final cleanup in a workqueue
and only then finalise the release?


> Here is my call stack when things go awry:
> 
> [  325.560946] tls_push_sg: tcp_sendmsg_locked returned -512
> [  325.566371] CPU: 1 PID: 305 Comm: webserver_libur Not tainted 6.8.0-rc6-00022-g932acd9c444b-dirty #248
> [  325.575684] Hardware name: NXP i.MX8MPlus EVK board (DT)
> [  325.580997] Call trace:
> [  325.583444]  dump_backtrace+0x90/0xe8
> [  325.587122]  show_stack+0x18/0x24
> [  325.590444]  dump_stack_lvl+0x48/0x60
> [  325.594114]  dump_stack+0x18/0x24
> [  325.597432]  tls_push_sg+0xfc/0x22c
> [  325.600930]  tls_tx_records+0x114/0x1cc
> [  325.604772]  tls_sw_release_resources_tx+0x3c/0x140
> [  325.609658]  tls_sk_proto_close+0x2b0/0x3ac
> [  325.613846]  inet_release+0x4c/0x9c
> [  325.617341]  __sock_release+0x40/0xb4
> [  325.621007]  sock_close+0x18/0x28
> [  325.624328]  __fput+0x70/0x2bc
> [  325.627386]  ____fput+0x10/0x1c
> [  325.630531]  task_work_run+0x74/0xcc
> [  325.634113]  do_notify_resume+0x22c/0x1310
> [  325.638220]  el0_svc+0xa4/0xb4
> [  325.641279]  el0t_64_sync_handler+0x120/0x12c
> [  325.645643]  el0t_64_sync+0x190/0x194
> 
> As said, TLS is sending remaining data at close() time in tls_push_sg().
> Here sk_stream_wait_memory() gets interrupted and returns -ERESTARTSYS.
> There's no way to restart this operation, the socket is about to be
> closed and won't accept data anymore.
Sascha Hauer March 19, 2024, 10:50 a.m. UTC | #6
On Mon, Mar 18, 2024 at 01:19:19PM +0000, Pavel Begunkov wrote:
> On 3/18/24 12:10, Sascha Hauer wrote:
> > On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
> > > On 3/15/24 10:01, Sascha Hauer wrote:
> > > > It can happen that a socket sends the remaining data at close() time.
> > > > With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> > > > out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> > > > current task. This flag has been set in io_req_normal_work_add() by
> > > > calling task_work_add().
> > > 
> > > The entire idea of task_work is to interrupt syscalls and let io_uring
> > > do its job, otherwise it wouldn't free resources it might be holding,
> > > and even potentially forever block the syscall.
> > > 
> > > I'm not that sure about connect / close (are they not restartable?),
> > > but it doesn't seem to be a good idea for sk_stream_wait_memory(),
> > > which is the normal TCP blocking send path. I'm thinking of some kinds
> > > of cases with a local TCP socket pair, the tx queue is full as well
> > > and the rx queue of the other end, and io_uring has to run to receive
> > > the data.
> 
> There is another case, let's say the IO is done via io-wq
> (io_uring's worker thread pool) and hits the waiting. Now the
> request can't get cancelled, which is done by interrupting the
> task with TIF_NOTIFY_SIGNAL. User requested request cancellations
> is one thing, but we'd need to check if io_uring can ever be closed
> in this case.
> 
> 
> > > If interruptions are not welcome you can use different io_uring flags,
> > > see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.
> > 
> > I tried with different combinations of these flags. For example
> > IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
> > makes the issue less likely, but nevertheless it still happens.
> > 
> > However, reading the documentation of these flags, they shall provide
> > hints to the kernel for optimizations, but it should work without these
> > flags, right?
> 
> That's true, and I guess there are other cases as well, like
> io-wq and perhaps even a stray fput.
> 
> 
> > > Maybe I'm missing something, why not restart your syscall?
> > 
> > The problem comes with TLS. Normally with synchronous encryption all
> > data on a socket is written during write(). When asynchronous
> > encryption comes into play, then not all data is written during write(),
> > but instead the remaining data is written at close() time.
> 
> Was it considered to do the final cleanup in workqueue
> and only then finalising the release?

No, but I don't really understand what you mean here. Could you
elaborate?

Sascha
Paolo Abeni March 19, 2024, 12:30 p.m. UTC | #7
On Mon, 2024-03-18 at 13:03 +0100, Sascha Hauer wrote:
> Apologies, I have sent the wrong mail. Here is the mail I really wanted
> to send, with answers to some of the questions Paolo raised the last
> time I sent it.
> 
> -----------------------------------8<------------------------------
> 
> > From 566bb198546423c024cdebc50d0aade7ed638a40 Mon Sep 17 00:00:00 2001
> From: Sascha Hauer <s.hauer@pengutronix.de>
> Date: Mon, 23 Oct 2023 14:13:46 +0200
> Subject: [PATCH v2] net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL
> 
> It can happen that a socket sends the remaining data at close() time.
> With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> current task. This flag has been set in io_req_normal_work_add() by
> calling task_work_add().
> 
> It seems signal_pending() is too broad, so this patch replaces it with
> task_sigpending(), thus ignoring the TIF_NOTIFY_SIGNAL flag.
> 
> A discussion of this issue can be found at
> https://lore.kernel.org/20231010141932.GD3114228@pengutronix.de
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Fixes: 12db8b690010c ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Link: https://lore.kernel.org/r/20231023121346.4098160-1-s.hauer@pengutronix.de
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> ---
> 
> Changes since v1:
> - only replace signal_pending() with task_sigpending() where we need it,
>   in sk_stream_wait_memory()
> 
> I'd like to pick up the discussion on this patch as it is still needed for our
> usecase. Paolo Abeni raised some concerns about this patch for which I didn't have
> good answers. I am referencing them here again with an attempts to answer them.
> Jens, maybe you also have a few words here.
> 
> Paolo raised some concerns in
> https://lore.kernel.org/all/e1e15554bfa5cfc8048d6074eedbc83c4d912c98.camel@redhat.com/:
> 
> > To be more explicit: why this will not cause user-space driven
> > connect() from missing relevant events?
> 
> Note I dropped the hunk in sk_stream_wait_connect() and
> sk_stream_wait_close() in this version.
> Userspace driven signals are still catched with task_sigpending() which
> tests for TIF_SIGPENDING. signal_pending() will additionally check for
> TIF_NOTIFY_SIGNAL which is exclusively used by task_work_add() to add
> work to a task.

It looks like even e.g. livepatch would set TIF_NOTIFY_SIGNAL, and
ignoring it could break livepatch for any code waiting e.g. in
tcp_sendmsg()?!?

This change looks scary to me.

I think what Pavel is suggesting is to refactor the KTLS code to ensure
all the writes are completed before releasing the last socket
reference.

I would second such suggestion.

If really nothing else works, and this change is the only option, try
to obtain an ack from kernel/signal.c maintainers.

Thanks,

Paolo
Pavel Begunkov March 19, 2024, 1:55 p.m. UTC | #8
On 3/19/24 10:50, Sascha Hauer wrote:
> On Mon, Mar 18, 2024 at 01:19:19PM +0000, Pavel Begunkov wrote:
>> On 3/18/24 12:10, Sascha Hauer wrote:
>>> On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
>>>> On 3/15/24 10:01, Sascha Hauer wrote:
>>>>> It can happen that a socket sends the remaining data at close() time.
>>>>> With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
>>>>> out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
>>>>> current task. This flag has been set in io_req_normal_work_add() by
>>>>> calling task_work_add().
>>>>
>>>> The entire idea of task_work is to interrupt syscalls and let io_uring
>>>> do its job, otherwise it wouldn't free resources it might be holding,
>>>> and even potentially forever block the syscall.
>>>>
>>>> I'm not that sure about connect / close (are they not restartable?),
>>>> but it doesn't seem to be a good idea for sk_stream_wait_memory(),
>>>> which is the normal TCP blocking send path. I'm thinking of some kinds
>>>> of cases with a local TCP socket pair, the tx queue is full as well
>>>> and the rx queue of the other end, and io_uring has to run to receive
>>>> the data.
>>
>> There is another case, let's say the IO is done via io-wq
>> (io_uring's worker thread pool) and hits the waiting. Now the
>> request can't get cancelled, which is done by interrupting the
>> task with TIF_NOTIFY_SIGNAL. User requested request cancellations
>> is one thing, but we'd need to check if io_uring can ever be closed
>> in this case.
>>
>>
>>>> If interruptions are not welcome you can use different io_uring flags,
>>>> see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.
>>>
>>> I tried with different combinations of these flags. For example
>>> IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
>>> makes the issue less likely, but nevertheless it still happens.
>>>
>>> However, reading the documentation of these flags, they shall provide
>>> hints to the kernel for optimizations, but it should work without these
>>> flags, right?
>>
>> That's true, and I guess there are other cases as well, like
>> io-wq and perhaps even a stray fput.
>>
>>
>>>> Maybe I'm missing something, why not restart your syscall?
>>>
>>> The problem comes with TLS. Normally with synchronous encryption all
>>> data on a socket is written during write(). When asynchronous
>>> encryption comes into play, then not all data is written during write(),
>>> but instead the remaining data is written at close() time.
>>
>> Was it considered to do the final cleanup in workqueue
>> and only then finalising the release?
> 
> No, but I don't really understand what you mean here. Could you
> elaborate?

The suggestion is: instead of executing the release and that final
flush from the context you're in, namely the userspace task,
you can spin up a kernel task (kernel tasks don't receive signals)
and execute it from there.

void deferred_release_fn(struct work_struct *work)
{
	do_release();
	...
}

struct work_struct work;
INIT_WORK(&work, deferred_release_fn);
queue_work(system_unbound_wq, &work);


There is a catch. Even though close() is not obliged to close
the file/socket immediately, it is still not nice when you
drop the final ref but the port and other bits are not released
until some time later. So you might want to wait for that
deferred release to complete before returning to
userspace.

I'm assuming it's fine to run it from a kernel task, since
IIRC fput() might delay the release to one anyway, but it's better
to ask the net maintainers. In theory it shouldn't need the
mm, fs, etc. that the user task would hold.
Sascha Hauer March 19, 2024, 3:08 p.m. UTC | #9
On Tue, Mar 19, 2024 at 01:55:21PM +0000, Pavel Begunkov wrote:
> On 3/19/24 10:50, Sascha Hauer wrote:
> > On Mon, Mar 18, 2024 at 01:19:19PM +0000, Pavel Begunkov wrote:
> > > On 3/18/24 12:10, Sascha Hauer wrote:
> > > > On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
> > > > > On 3/15/24 10:01, Sascha Hauer wrote:
> > > > > > It can happen that a socket sends the remaining data at close() time.
> > > > > > With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
> > > > > > out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
> > > > > > current task. This flag has been set in io_req_normal_work_add() by
> > > > > > calling task_work_add().
> > > > > 
> > > > > The entire idea of task_work is to interrupt syscalls and let io_uring
> > > > > do its job, otherwise it wouldn't free resources it might be holding,
> > > > > and even potentially forever block the syscall.
> > > > > 
> > > > > I'm not that sure about connect / close (are they not restartable?),
> > > > > but it doesn't seem to be a good idea for sk_stream_wait_memory(),
> > > > > which is the normal TCP blocking send path. I'm thinking of some kinds
> > > > > of cases with a local TCP socket pair, the tx queue is full as well
> > > > > and the rx queue of the other end, and io_uring has to run to receive
> > > > > the data.
> > > 
> > > There is another case, let's say the IO is done via io-wq
> > > (io_uring's worker thread pool) and hits the waiting. Now the
> > > request can't get cancelled, which is done by interrupting the
> > > task with TIF_NOTIFY_SIGNAL. User requested request cancellations
> > > is one thing, but we'd need to check if io_uring can ever be closed
> > > in this case.
> > > 
> > > 
> > > > > If interruptions are not welcome you can use different io_uring flags,
> > > > > see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.
> > > > 
> > > > I tried with different combinations of these flags. For example
> > > > IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
> > > > makes the issue less likely, but nevertheless it still happens.
> > > > 
> > > > However, reading the documentation of these flags, they shall provide
> > > > hints to the kernel for optimizations, but it should work without these
> > > > flags, right?
> > > 
> > > That's true, and I guess there are other cases as well, like
> > > io-wq and perhaps even a stray fput.
> > > 
> > > 
> > > > > Maybe I'm missing something, why not restart your syscall?
> > > > 
> > > > The problem comes with TLS. Normally with synchronous encryption all
> > > > data on a socket is written during write(). When asynchronous
> > > > encryption comes into play, then not all data is written during write(),
> > > > but instead the remaining data is written at close() time.
> > > 
> > > Was it considered to do the final cleanup in workqueue
> > > and only then finalising the release?
> > 
> > No, but I don't really understand what you mean here. Could you
> > elaborate?
> 
> The suggestion is instead of executing the release and that final
> flush off of the context you're in, namely userspace task,
> you can spin up a kernel task (they're not getting any signals)
> and execute it from there.
> 
> void deferred_release_fn(struct work_struct *work)
> {
> 	do_release();
> 	...
> }
> 
> struct work_struct work;
> INIT_WORK(&work, deferred_release_fn);
> queue_work(system_unbound_wq, &work);
> 
> 
> There is a catch. Even though close() is not obliged to close
> the file / socket immediately, but it still not nice when you
> drop the final ref but port and other bits are not released
> until some time after. So, you might want to wait for that
> deferred release to complete before returning to the
> userspace.
> 
> I'm assuming it's fine to run it by a kernel task since
> IIRC fput might delay release to it anyway, but it's better
> to ask net maintainers. In theory it shouldn't need
> mm,fs,etc that user task would hold.

Ok, I'll have a look into it. Thanks for your input.

Sascha

Patch

diff --git a/net/core/stream.c b/net/core/stream.c
index 96fbcb9bbb30a..e9e17b48e0122 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -67,7 +67,7 @@  int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 			return -EPIPE;
 		if (!*timeo_p)
 			return -EAGAIN;
-		if (signal_pending(tsk))
+		if (task_sigpending(tsk))
 			return sock_intr_errno(*timeo_p);
 
 		add_wait_queue(sk_sleep(sk), &wait);
@@ -103,7 +103,7 @@  void sk_stream_wait_close(struct sock *sk, long timeout)
 		do {
 			if (sk_wait_event(sk, &timeout, !sk_stream_closing(sk), &wait))
 				break;
-		} while (!signal_pending(current) && timeout);
+		} while (!task_sigpending(current) && timeout);
 
 		remove_wait_queue(sk_sleep(sk), &wait);
 	}
@@ -134,7 +134,7 @@  int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 			goto do_error;
 		if (!*timeo_p)
 			goto do_eagain;
-		if (signal_pending(current))
+		if (task_sigpending(current))
 			goto do_interrupted;
 		sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 		if (sk_stream_memory_free(sk) && !vm_wait)