diff mbox series

[2/2] coredump: Allow coredumps to pipes to work with io_uring

Message ID 87mtd3rals.fsf_-_@email.froward.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series coredump: Allow io_uring using apps to dump to pipes. | expand

Commit Message

Eric W. Biederman July 20, 2022, 4:51 p.m. UTC
Now that io_uring like everything else stops for coredumps in
get_signal the code can once again allow any interruptible
condition after coredump_wait to interrupt the coredump.

Clear TIF_NOTIFY_SIGNAL after coredump_wait, to guarantee that
anything that sets TIF_NOTIFY_SIGNAL before coredump_wait completed
won't cause the coredumps to interrupted.

With all of the other threads in the process stopped io_uring doesn't
call task_work_add on the thread running do_coredump.  Combined with
the clearing of TIF_NOTIFY_SIGNAL this allows processes that use
io_uring to coredump through pipes.

Restore dump_interrupted to be a simple call to signal_pending
effectively reverting commit 06af8679449d ("coredump: Limit what can
interrupt coredumps").  At this point only SIGKILL delivered to the
coredumping thread should be able to cause signal_pending to return
true.

A nice followup would be to find a reliable race free way to modify
task_work_add and probably set_notify_signal to skip setting
TIF_NOTIFY_SIGNAL once it is clear a task will no longer process
signals and other interruptible conditions.  That would allow
TIF_NOTIFY_SIGNAL to be cleared where TIF_SIGPENDING is cleared in
coredump_zap_process.

To be as certain as possible that this works, I tested this with
commit 1d5f5ea7cb7d ("io-wq: remove worker to owner tw dependency")
reverted.  Which means that not only is TIF_NOTIFY_SIGNAL prevented
from stopping coredumps to pipes, the sequence of stopping threads to
participate in the coredump avoids deadlocks that were possible
previously.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Olivier Langlois Aug. 22, 2022, 9:16 p.m. UTC | #1
On Wed, 2022-07-20 at 11:51 -0500, Eric W. Biederman wrote:
> 
> Now that io_uring like everything else stops for coredumps in
> get_signal the code can once again allow any interruptible
> condition after coredump_wait to interrupt the coredump.
> 
> Clear TIF_NOTIFY_SIGNAL after coredump_wait, to guarantee that
> anything that sets TIF_NOTIFY_SIGNAL before coredump_wait completed
> won't cause the coredumps to interrupted.
> 
> With all of the other threads in the process stopped io_uring doesn't
> call task_work_add on the thread running do_coredump.  Combined with
> the clearing of TIF_NOTIFY_SIGNAL this allows processes that use
> io_uring to coredump through pipes.
> 
> Restore dump_interrupted to be a simple call to signal_pending
> effectively reverting commit 06af8679449d ("coredump: Limit what can
> interrupt coredumps").  At this point only SIGKILL delivered to the
> coredumping thread should be able to cause signal_pending to return
> true.
> 
> A nice followup would be to find a reliable race free way to modify
> task_work_add and probably set_notify_signal to skip setting
> TIF_NOTIFY_SIGNAL once it is clear a task will no longer process
> signals and other interruptible conditions.  That would allow
> TIF_NOTIFY_SIGNAL to be cleared where TIF_SIGPENDING is cleared in
> coredump_zap_process.
> 
> To be as certain as possible that this works, I tested this with
> commit 1d5f5ea7cb7d ("io-wq: remove worker to owner tw dependency")
> reverted.  Which means that not only is TIF_NOTIFY_SIGNAL prevented
> from stopping coredumps to pipes, the sequence of stopping threads to
> participate in the coredump avoids deadlocks that were possible
> previously.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
Hi Eric,

What is stopping the task calling do_coredump() to be interrupted and
call task_work_add() from the interrupt context?

This is precisely what I was experiencing last summer when I did work
on this issue.

My understanding of how async I/O works with io_uring is that the task
is added to a wait queue without being put to sleep and when the
io_uring callback is called from the interrupt context, task_work_add()
is called so that the next time io_uring syscall is invoked, pending
work is processed to complete the I/O.

So if:

1. io_uring request is initiated AND the task is in a wait queue
2. do_coredump() is called before the I/O is completed

IMHO, this is how you end up having task_work_add() called while the
coredump is generated.

So far, the only way that I have found making sure that this was not
happening was to cancel every pending io_uring requests before writing
the coredump by calling io_uring_task_cancel():

--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -43,7 +43,7 @@
 #include <linux/timekeeping.h>
 #include <linux/sysctl.h>
 #include <linux/elf.h>
-
+#include <linux/io_uring.h>
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
 #include <asm/tlb.h>
@@ -561,6 +561,8 @@
 		need_suid_safe = true;
 	}
 
+	io_uring_task_cancel();
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;


Greetings,
Olivier Langlois Aug. 23, 2022, 3:35 a.m. UTC | #2
On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
> 
> What is stopping the task calling do_coredump() to be interrupted and
> call task_work_add() from the interrupt context?
> 
> This is precisely what I was experiencing last summer when I did work
> on this issue.
> 
> My understanding of how async I/O works with io_uring is that the
> task
> is added to a wait queue without being put to sleep and when the
> io_uring callback is called from the interrupt context,
> task_work_add()
> is called so that the next time io_uring syscall is invoked, pending
> work is processed to complete the I/O.
> 
> So if:
> 
> 1. io_uring request is initiated AND the task is in a wait queue
> 2. do_coredump() is called before the I/O is completed
> 
> IMHO, this is how you end up having task_work_add() called while the
> coredump is generated.
> 
I forgot to add that I have experienced the issue with TCP/IP I/O.

I suspect that with a TCP socket, the race condition window is much
larger than if it was disk I/O and this might make it easier to
reproduce the issue this way...
Eric W. Biederman Aug. 23, 2022, 6:22 p.m. UTC | #3
Olivier Langlois <olivier@trillion01.com> writes:

> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>> 
>> What is stopping the task calling do_coredump() to be interrupted and
>> call task_work_add() from the interrupt context?
>> 
>> This is precisely what I was experiencing last summer when I did work
>> on this issue.
>> 
>> My understanding of how async I/O works with io_uring is that the
>> task
>> is added to a wait queue without being put to sleep and when the
>> io_uring callback is called from the interrupt context,
>> task_work_add()
>> is called so that the next time io_uring syscall is invoked, pending
>> work is processed to complete the I/O.
>> 
>> So if:
>> 
>> 1. io_uring request is initiated AND the task is in a wait queue
>> 2. do_coredump() is called before the I/O is completed
>> 
>> IMHO, this is how you end up having task_work_add() called while the
>> coredump is generated.
>> 
> I forgot to add that I have experienced the issue with TCP/IP I/O.
>
> I suspect that with a TCP socket, the race condition window is much
> larger than if it was disk I/O and this might make it easier to
> reproduce the issue this way...

I was under the apparently mistaken impression that the io_uring
task_work_add only comes from the io_uring userspace helper threads.
Those are definitely suppressed by my change.

Do you have any idea in the code where io_uring code is being called in
an interrupt context?  I would really like to trace that code path so I
have a better grasp on what is happening.

If task_work_add is being called from interrupt context then something
additional from what I have proposed certainly needs to be done.

Eric
Jens Axboe Aug. 23, 2022, 6:27 p.m. UTC | #4
On 8/23/22 12:22 PM, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>
>>> What is stopping the task calling do_coredump() to be interrupted and
>>> call task_work_add() from the interrupt context?
>>>
>>> This is precisely what I was experiencing last summer when I did work
>>> on this issue.
>>>
>>> My understanding of how async I/O works with io_uring is that the
>>> task
>>> is added to a wait queue without being put to sleep and when the
>>> io_uring callback is called from the interrupt context,
>>> task_work_add()
>>> is called so that the next time io_uring syscall is invoked, pending
>>> work is processed to complete the I/O.
>>>
>>> So if:
>>>
>>> 1. io_uring request is initiated AND the task is in a wait queue
>>> 2. do_coredump() is called before the I/O is completed
>>>
>>> IMHO, this is how you end up having task_work_add() called while the
>>> coredump is generated.
>>>
>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>
>> I suspect that with a TCP socket, the race condition window is much
>> larger than if it was disk I/O and this might make it easier to
>> reproduce the issue this way...
> 
> I was under the apparently mistaken impression that the io_uring
> task_work_add only comes from the io_uring userspace helper threads.
> Those are definitely suppressed by my change.
> 
> Do you have any idea in the code where io_uring code is being called in
> an interrupt context?  I would really like to trace that code path so I
> have a better grasp on what is happening.
> 
> If task_work_add is being called from interrupt context then something
> additional from what I have proposed certainly needs to be done.

task_work may come from the helper threads, but generally it does not.
One example would be doing a read from a socket. There's no data there,
poll is armed to trigger a retry. When we get the poll notification that
there's now data to be read, then we kick that off with task_work. Since
it's from the poll handler, it can trigger from interrupt context. See
the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
io_req_task_work_add() -> task_work_add().

It can also happen for regular IRQ based reads from regular files, where
the completion is actually done via task_work added from the potentially
IRQ based completion path.
Eric W. Biederman Aug. 24, 2022, 3:11 p.m. UTC | #5
Jens Axboe <axboe@kernel.dk> writes:

> On 8/23/22 12:22 PM, Eric W. Biederman wrote:
>> Olivier Langlois <olivier@trillion01.com> writes:
>> 
>>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>>
>>>> What is stopping the task calling do_coredump() to be interrupted and
>>>> call task_work_add() from the interrupt context?
>>>>
>>>> This is precisely what I was experiencing last summer when I did work
>>>> on this issue.
>>>>
>>>> My understanding of how async I/O works with io_uring is that the
>>>> task
>>>> is added to a wait queue without being put to sleep and when the
>>>> io_uring callback is called from the interrupt context,
>>>> task_work_add()
>>>> is called so that the next time io_uring syscall is invoked, pending
>>>> work is processed to complete the I/O.
>>>>
>>>> So if:
>>>>
>>>> 1. io_uring request is initiated AND the task is in a wait queue
>>>> 2. do_coredump() is called before the I/O is completed
>>>>
>>>> IMHO, this is how you end up having task_work_add() called while the
>>>> coredump is generated.
>>>>
>>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>>
>>> I suspect that with a TCP socket, the race condition window is much
>>> larger than if it was disk I/O and this might make it easier to
>>> reproduce the issue this way...
>> 
>> I was under the apparently mistaken impression that the io_uring
>> task_work_add only comes from the io_uring userspace helper threads.
>> Those are definitely suppressed by my change.
>> 
>> Do you have any idea in the code where io_uring code is being called in
>> an interrupt context?  I would really like to trace that code path so I
>> have a better grasp on what is happening.
>> 
>> If task_work_add is being called from interrupt context then something
>> additional from what I have proposed certainly needs to be done.
>
> task_work may come from the helper threads, but generally it does not.
> One example would be doing a read from a socket. There's no data there,
> poll is armed to trigger a retry. When we get the poll notification that
> there's now data to be read, then we kick that off with task_work. Since
> it's from the poll handler, it can trigger from interrupt context. See
> the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
> io_req_task_work_add() -> task_work_add().

But that is a task_work to the helper thread correct?

> It can also happen for regular IRQ based reads from regular files, where
> the completion is actually done via task_work added from the potentially
> IRQ based completion path.

I can see that.

Which leaves me with the question do these task_work's directly wake up
the thread that submitted the I/O request?   Or is there likely to be
something for an I/O thread to do before an ordinary program thread is
notified.

I am asking because it is only the case of notifying ordinary program
threads that is interesting in the case of a coredump.  As I understand
it a data to read notification would typically be handled by the I/O
uring worker thread to trigger reading the data before letting userspace
know everything it asked to be done is complete.

Eric
Jens Axboe Aug. 24, 2022, 3:51 p.m. UTC | #6
On 8/24/22 9:11 AM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 8/23/22 12:22 PM, Eric W. Biederman wrote:
>>> Olivier Langlois <olivier@trillion01.com> writes:
>>>
>>>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>>>
>>>>> What is stopping the task calling do_coredump() to be interrupted and
>>>>> call task_work_add() from the interrupt context?
>>>>>
>>>>> This is precisely what I was experiencing last summer when I did work
>>>>> on this issue.
>>>>>
>>>>> My understanding of how async I/O works with io_uring is that the
>>>>> task
>>>>> is added to a wait queue without being put to sleep and when the
>>>>> io_uring callback is called from the interrupt context,
>>>>> task_work_add()
>>>>> is called so that the next time io_uring syscall is invoked, pending
>>>>> work is processed to complete the I/O.
>>>>>
>>>>> So if:
>>>>>
>>>>> 1. io_uring request is initiated AND the task is in a wait queue
>>>>> 2. do_coredump() is called before the I/O is completed
>>>>>
>>>>> IMHO, this is how you end up having task_work_add() called while the
>>>>> coredump is generated.
>>>>>
>>>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>>>
>>>> I suspect that with a TCP socket, the race condition window is much
>>>> larger than if it was disk I/O and this might make it easier to
>>>> reproduce the issue this way...
>>>
>>> I was under the apparently mistaken impression that the io_uring
>>> task_work_add only comes from the io_uring userspace helper threads.
>>> Those are definitely suppressed by my change.
>>>
>>> Do you have any idea in the code where io_uring code is being called in
>>> an interrupt context?  I would really like to trace that code path so I
>>> have a better grasp on what is happening.
>>>
>>> If task_work_add is being called from interrupt context then something
>>> additional from what I have proposed certainly needs to be done.
>>
>> task_work may come from the helper threads, but generally it does not.
>> One example would be doing a read from a socket. There's no data there,
>> poll is armed to trigger a retry. When we get the poll notification that
>> there's now data to be read, then we kick that off with task_work. Since
>> it's from the poll handler, it can trigger from interrupt context. See
>> the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
>> io_req_task_work_add() -> task_work_add().
> 
> But that is a task_work to the helper thread correct?

No, it goes to the task that originally allocated/issued the request.
Which would be the original task, unless the request was originally
marked as going straight to a helper (IOSQE_ASYNC). For most cases,
it'll be the original task, not a helper thread.

>> It can also happen for regular IRQ based reads from regular files, where
>> the completion is actually done via task_work added from the potentially
>> IRQ based completion path.
> 
> I can see that.
> 
> Which leaves me with the question do these task_work's directly wake up
> the thread that submitted the I/O request?   Or is there likely to be
> something for an I/O thread to do before an ordinary program thread is
> notified.
> 
> I am asking because it is only the case of notifying ordinary program
> threads that is interesting in the case of a coredump.  As I understand
> it a data to read notification would typically be handled by the I/O
> uring worker thread to trigger reading the data before letting userspace
> know everything it asked to be done is complete.

By default, it'll go back to the original task. If something is
pollable, then there's no helper thread doing the request. An issue
attempt is performed by the original task, there's no data/space there,
poll is armed to trigger a retry. Retry notification will then queue
task_work with the original task to retry.

Generally for io_uring, helper threads are a slow path and aren't used
unless we have no other options. For example, if someone has a network
IO backend and there's helper thread activity, then that's generally a
sign that something is wrong. This isn't super relevant to this
particular topic, just highlighting that normally you would not expect
to see much (if any) io-wq activity at all.
diff mbox series

Patch

diff --git a/fs/coredump.c b/fs/coredump.c
index 67dda77c500f..c06594f56cbb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -476,7 +476,7 @@  static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return fatal_signal_pending(current) || freezing(current);
+	return signal_pending(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)
@@ -589,6 +589,9 @@  void do_coredump(const kernel_siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
+	/* Don't break out of interruptible sleeps */
+	clear_notify_signal();
+
 	ispipe = format_corename(&cn, &cprm, &argv, &argc);
 
 	if (ispipe) {