
coredump: Limit what can interrupt coredumps

Message ID 87sg1p4h0g.fsf_-_@disp2133 (mailing list archive)
State New, archived
Series coredump: Limit what can interrupt coredumps

Commit Message

Eric W. Biederman June 10, 2021, 8:11 p.m. UTC
Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.

Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>    set_notify_signal()

The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals.   Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL, this hack
in dump_interrupted() in the coredump code no longer works.
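
For reference, the mechanics behind steps 2 and 3 look roughly like
this (paraphrased and simplified from the 5.13-era kernel/task_work.c
and include/linux/sched/signal.h):

	/* task_work_add(task, work, TWA_SIGNAL) ends up calling: */
	static inline void set_notify_signal(struct task_struct *task)
	{
		if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) &&
		    !wake_up_state(task, TASK_INTERRUPTIBLE))
			kick_process(task);
	}

	/* ...and since 12db8b690010, signal_pending() reports that flag: */
	static inline int signal_pending(struct task_struct *p)
	{
		if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
			return 1;
		return task_sigpending(p);
	}

so the signal_pending() test in dump_interrupted() now trips on
io_uring task_work as well as on real signals.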

Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports.  This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.

The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending.  So it should
not develop surprising wake-up reasons either.

v2: Don't remove the now unnecessary code in prepare_signal.

Cc: stable@vger.kernel.org
Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Linus Torvalds June 10, 2021, 9:04 p.m. UTC | #1
On Thu, Jun 10, 2021 at 1:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.

Thanks, applied.

And lots of thanks to Olivier who kept debugging the odd coredump
behavior he saw.

            Linus
Olivier Langlois June 12, 2021, 2:36 p.m. UTC | #2
On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
> 
> Olivier Langlois has been struggling with coredumps being incompletely
> written in
> processes using io_uring.
> 
> Olivier Langlois <olivier@trillion01.com> writes:
> > io_uring is a big user of task_work and any event that io_uring made
> > a
> > task waiting for that occurs during the core dump generation will
> > generate a TIF_NOTIFY_SIGNAL.
> > 
> > Here are the detailed steps of the problem:
> > 1. io_uring calls vfs_poll() to install a task to a file wait queue
> >    with io_async_wake() as the wakeup function cb from
> > io_arm_poll_handler()
> > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> >    set_notify_signal()
> 
> The coredump code deliberately supports being interrupted by SIGKILL,
> and depends upon prepare_signal to filter out all other signals.   Now
> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL, this hack
> in dump_interrupted() in the coredump code no longer works.
> 
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.
> 
> The filesystem code that the coredump code uses already limits
> itself to only aborting on fatal_signal_pending.  So it should
> not develop surprising wake-up reasons either.
> 
> v2: Don't remove the now unnecessary code in prepare_signal.
> 
> Cc: stable@vger.kernel.org
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Olivier Langlois <olivier@trillion01.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/coredump.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 2868e3e171ae..c3d8fc14b993 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>          * but then we need to teach dump_write() to restart and clear
>          * TIF_SIGPENDING.
>          */
> -       return signal_pending(current);
> +       return fatal_signal_pending(current) || freezing(current);
>  }
>  
>  static void wait_for_dump_helpers(struct file *file)

Tested-by: Olivier Langlois <olivier@trillion01.com>
Jens Axboe June 12, 2021, 4:26 p.m. UTC | #3
On 6/12/21 8:36 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
>>
>> Olivier Langlois has been struggling with coredumps being incompletely
>> written in
>> processes using io_uring.
>>
>> Olivier Langlois <olivier@trillion01.com> writes:
>>> io_uring is a big user of task_work and any event that io_uring made
>>> a
>>> task waiting for that occurs during the core dump generation will
>>> generate a TIF_NOTIFY_SIGNAL.
>>>
>>> Here are the detailed steps of the problem:
>>> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>>>    with io_async_wake() as the wakeup function cb from
>>> io_arm_poll_handler()
>>> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
>>> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>>>    set_notify_signal()
>>
>> The coredump code deliberately supports being interrupted by SIGKILL,
>> and depends upon prepare_signal to filter out all other signals.   Now
>> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL, this hack
>> in dump_interrupted() in the coredump code no longer works.
>>
>> Make the coredump code more robust by explicitly testing for all of
>> the wakeup conditions the coredump code supports.  This prevents
>> new wakeup conditions from breaking the coredump code, as well
>> as fixing the current issue.
>>
>> The filesystem code that the coredump code uses already limits
>> itself to only aborting on fatal_signal_pending.  So it should
>> not develop surprising wake-up reasons either.
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>> Reported-by: Olivier Langlois <olivier@trillion01.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/coredump.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 2868e3e171ae..c3d8fc14b993 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>          * but then we need to teach dump_write() to restart and clear
>>          * TIF_SIGPENDING.
>>          */
>> -       return signal_pending(current);
>> +       return fatal_signal_pending(current) || freezing(current);
>>  }
>>  
>>  static void wait_for_dump_helpers(struct file *file)
> 
> Tested-by: Olivier Langlois <olivier@trillion01.com>

Thanks Olivier and Eric for taking care of this. I've been mostly
offline for more than a week, back at it next week.
Oleg Nesterov June 14, 2021, 2:10 p.m. UTC | #4
Eric, et al, sorry for the delay, I didn't read emails for several days.

On 06/10, Eric W. Biederman wrote:
>
> v2: Don't remove the now unnecessary code in prepare_signal.

No, that code is still needed. Otherwise any fatal signal will be
turned into SIGKILL.

> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>  	 * but then we need to teach dump_write() to restart and clear
>  	 * TIF_SIGPENDING.
>  	 */
> -	return signal_pending(current);
> +	return fatal_signal_pending(current) || freezing(current);
>  }


Well yes, this is what the comment says.

But note that there is another reason why dump_interrupted() returns true
if signal_pending(): it assumes that __dump_emit()->__kernel_write() may
fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
perhaps something else...
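
For example, pipe_write()'s wait loop (fs/pipe.c, paraphrased) bails
out with -ERESTARTSYS on any pending "signal", which now includes
TIF_NOTIFY_SIGNAL:

	/* fs/pipe.c:pipe_write(), write-wait loop, paraphrased */
	for (;;) {
		/* ... copy what currently fits into the pipe buffers ... */
		if (signal_pending(current)) {
			if (!ret)
				ret = -ERESTARTSYS;
			break;	/* short or failed core-dump write */
		}
		/* ... otherwise sleep until a reader drains the pipe ... */
	}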

That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
dumping threads?

Or perhaps we should change __dump_emit() to clear signal_pending() and
restart __kernel_write() if it fails or returns a short write.

Otherwise the change above doesn't look like a full fix to me.

Oleg.
Eric W. Biederman June 14, 2021, 4:37 p.m. UTC | #5
Oleg Nesterov <oleg@redhat.com> writes:

> Eric, et al, sorry for delay, I didn't read emails several days.
>
> On 06/10, Eric W. Biederman wrote:
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>
> No, that code is still needed. Otherwise any fatal signal will be
> turned into SIGKILL.
>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(): it assumes that __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...

The pipe_write case is a good case to consider.  In general filesystems
are only allowed to stop if fatal_signal_pending.  It is an ancient
unix/posix thing.

> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?

I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.
I don't know what it has to do with signals.

> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.

Given that the code needs to handle pipes that seems a reasonable thing
to do.  Note: all of the blocking of new signals in prepare_signal
is still in place, so only a SIGKILL can come in and set TIF_SIGPENDING.
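
That blocking is the check at the top of prepare_signal()
(kernel/signal.c, paraphrased):

	if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
		if (!(signal->flags & SIGNAL_GROUP_EXIT))
			return sig == SIGKILL;	/* dumping: only SIGKILL passes */
		/* the process is in the middle of dying, nothing to do */
	}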

So I would say that the current fix is correct (and safe to backport).
But still requires some magic in prepare_signal until we do more.

I don't understand the logic of adding work to non-io_uring threads
with task_work_add well enough to understand why that happens in the
first place.

There are a lot of silly corners here.  Yes please let's keep on
digging.

Eric
Oleg Nesterov June 14, 2021, 4:59 p.m. UTC | #6
On 06/14, Eric W. Biederman wrote:
>
> I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
> least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.

No, no, no ;)

I think that, for example, freezer should be changed to use
set_notify_signal() rather than fake_signal_wake_up(). Livepatch.
And probably it will have more users.
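
For comparison, the current freezer wakeup (kernel/freezer.c,
paraphrased) has to fake a pending signal, which is exactly what
set_notify_signal() could replace:

	static void fake_signal_wake_up(struct task_struct *p)
	{
		unsigned long flags;

		if (lock_task_sighand(p, &flags)) {
			signal_wake_up(p, 0);	/* sets TIF_SIGPENDING */
			unlock_task_sighand(p, &flags);
		}
	}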

> I don't understand the logic of adding work to non-io_uring threads
> with task_work_add well enough to understand why that happens in the
> first place.

Same here.

Oleg.
Eric W. Biederman June 15, 2021, 10:08 p.m. UTC | #7
Oleg Nesterov <oleg@redhat.com> writes:

>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(): it assumes that __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...
>
> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?
>
> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.
>
> Otherwise the change above doesn't look like a full fix to me.

Agreed.  The coredump to a pipe will still be short.  That needs
something additional.

The problem Olivier Langlois <olivier@trillion01.com> reported was
core dumps coming up short because TIF_NOTIFY_SIGNAL was being
set during a core dump.

We can see this with pipe_write returning -ERESTARTSYS
on a full pipe when signal_pending(), which includes TIF_NOTIFY_SIGNAL,
is true.

Looking further if the thread that is core dumping initiated
any io_uring work then io_ring_exit_work will use task_work_add
to request that thread clean up its io_uring state.

Perhaps we can put a big comment in dump_emit and if we
get back -ERESTARTSYS run tracehook_notify_signal.  I am not
seeing any locks held at that point in the coredump, so it
should be safe.  The coredump is run inside of file_start_write
which is the only potential complication.
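
Roughly something like this (an untested sketch of the idea, not a
patch; a full version would also need to restart short writes):

	/* fs/coredump.c:__dump_emit(), hypothetical restart variant */
	static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
	{
		struct file *file = cprm->file;
		loff_t pos = file->f_pos;
		ssize_t n;

		if (cprm->written + nr > cprm->limit)
			return 0;
	again:
		if (dump_interrupted())
			return 0;
		n = __kernel_write(file, addr, nr, &pos);
		if (n == -ERESTARTSYS) {
			/* woken by TIF_NOTIFY_SIGNAL: run the queued
			 * task_work, then retry the write */
			tracehook_notify_signal();
			goto again;
		}
		if (n != nr)
			return 0;
		file->f_pos = pos;
		cprm->written += n;
		cprm->pos += n;
		return 1;
	}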



The code flow is complicated but it looks like the entire
point of the exercise is to call io_uring_del_task_file
on the originating thread.  I suppose that keeps the
locking of the xarray in io_uring_task simple.


Hmm.   All of this comes from io_uring_release.
How do we get to io_uring_release?  The coredump should
be catching everything in exit_mm before exit_files?

Confused and hopeful someone can explain to me what is going on,
and perhaps simplify it.

Eric
Olivier Langlois June 16, 2021, 7:23 p.m. UTC | #8
On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
> > >          * but then we need to teach dump_write() to restart and
> > > clear
> > >          * TIF_SIGPENDING.
> > >          */
> > > -       return signal_pending(current);
> > > +       return fatal_signal_pending(current) ||
> > > freezing(current);
> > >  }
> > 
> > 
> > Well yes, this is what the comment says.
> > 
> > But note that there is another reason why dump_interrupted()
> > returns true
> > if signal_pending(): it assumes that __dump_emit()-
> > >__kernel_write() may
> > fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> > nfs,
> > perhaps something else...
> > 
> > That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
> > clear
> > TIF_NOTIFY_SIGNAL as well and we should change io-uring to not
> > abuse the
> > dumping threads?
> > 
> > Or perhaps we should change __dump_emit() to clear signal_pending()
> > and
> > restart __kernel_write() if it fails or returns a short write.
> > 
> > Otherwise the change above doesn't look like a full fix to me.
> 
> Agreed.  The coredump to a pipe will still be short.  That needs
> something additional.
> 
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
> 
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
> is true.
> 
Eric,

I redid my test but this time instead of dumping directly into a file,
I did let the coredump be piped to the systemd coredump module and the
coredump generation isn't working as expected when piping.

So your code review conclusions are correct.
Eric W. Biederman June 16, 2021, 8 p.m. UTC | #9
Olivier Langlois <olivier@trillion01.com> writes:

> I redid my test but this time instead of dumping directly into a file,
> I did let the coredump be piped to the systemd coredump module and the
> coredump generation isn't working as expected when piping.
>
> So your code review conclusions are correct.

Thank you for confirming that.

Do you know how your test program is using io_uring?

I have been trying to put the pieces together on what io_uring is doing
that stops the coredump.  The fact that it takes a little while before
it kills the coredump is a little puzzling.  The code looks like all of
the io_uring operations should have been canceled before the coredump
starts.

Thanks,
Eric
Olivier Langlois June 18, 2021, 8:05 p.m. UTC | #10
On Wed, 2021-06-16 at 15:00 -0500, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
> > I redid my test but this time instead of dumping directly into a
> > file,
> > I did let the coredump be piped to the systemd coredump module and
> > the
> > coredump generation isn't working as expected when piping.
> > 
> > So your code review conclusions are correct.
> 
> Thank you for confirming that.
> 
> Do you know how your test program is using io_uring?
> 
> I have been trying to put the pieces together on what io_uring is
> doing
> that stops the coredump.  The fact that it takes a little while
> before
> it kills the coredump is a little puzzling.  The code looks like all
> of
> the io_uring operations should have been canceled before the coredump
> starts.
> 
> 
With a very simple setup, I guess that this could easily be
reproducible. Make a TCP connection with a server that is streaming
non-stop data and enter a loop where you keep initiating async
IORING_OP_READ operations on your TCP fd.

Once you have that, manually sending a SIGSEGV is a sure-fire way to
stumble into the problem. This is how I am testing the patches.
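
With liburing, the core of that test looks roughly like this (a
hypothetical sketch, assuming `ring` and `sockfd` are already set up;
error handling omitted):

	/* keep an async read armed on a TCP fd carrying a constant stream */
	char buf[4096];
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	for (;;) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, sockfd, buf, sizeof(buf), 0);
		io_uring_submit(&ring);		/* often arms the fast poll path */
		io_uring_wait_cqe(&ring, &cqe);	/* SIGSEGV here => truncated dump */
		io_uring_cqe_seen(&ring, cqe);
	}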

IRL, it is possible to call io_uring_enter() to submit operations and
return from the syscall without waiting on all events to have
completed. Once the process is back in userspace, if it stumbles into a
bug that triggers a coredump, any remaining pending I/O operations can
set TIF_NOTIFY_SIGNAL while the coredump is generated.

I have read the part of your previous email where you share the result
of your ongoing investigation. I didn't comment as the definitive
references in io_uring matters are Jens and Pavel but I am going to
share my opinion on the matter.

I think that you put your finger on the code cleaning up the
io_uring instance with regard to pending operations. It seems to be
io_uring_release() which is probably called from exit_files() which
happens to be after the call to exit_mm().

At first, I did entertain the idea of considering if it could be
possible to duplicate some of the operations performed by
io_uring_release() related to the infamous TIF_NOTIFY_SIGNAL setting
into io_uring_files_cancel() which is called before exit_mm().

but the idea is useless as it is not the other threads of the group
that are causing the TIF_NOTIFY_SIGNAL problem. It is the thread
calling do_coredump() which is done by the signal handling code even
before that thread enters do_exit() and start to be cleaned up. That
thread when it enters do_coredump() is still fully loaded and
operational in terms of io_uring functionality.

I guess that this io_uring cancel-all-pending-operations hook would
have to be called from do_coredump or from get_signal() but if that is
the way to go, I feel that this is a major enough change that I wouldn't
dare go there without the blessing of the maintainers involved....
Olivier Langlois Aug. 5, 2021, 1:06 p.m. UTC | #11
On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
> > >          * but then we need to teach dump_write() to restart and
> > > clear
> > >          * TIF_SIGPENDING.
> > >          */
> > > -       return signal_pending(current);
> > > +       return fatal_signal_pending(current) || freezing(current);
> > >  }
> > 
> > 
> > Well yes, this is what the comment says.
> > 
> > But note that there is another reason why dump_interrupted() returns
> > true
> > if signal_pending(): it assumes that __dump_emit()->__kernel_write()
> > may
> > fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> > nfs,
> > perhaps something else...
> > 
> > That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
> > clear
> > TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
> > the
> > dumping threads?
> > 
> > Or perhaps we should change __dump_emit() to clear signal_pending()
> > and
> > restart __kernel_write() if it fails or returns a short write.
> > 
> > Otherwise the change above doesn't look like a full fix to me.
> 
> Agreed.  The coredump to a pipe will still be short.  That needs
> something additional.
> 
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
> 
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
> is true.
> 
> Looking further if the thread that is core dumping initiated
> any io_uring work then io_ring_exit_work will use task_work_add
> to request that thread clean up its io_uring state.
> 
> Perhaps we can put a big comment in dump_emit and if we
> get back -ERESTARTSYS run tracehook_notify_signal.  I am not
> seeing any locks held at that point in the coredump, so it
> should be safe.  The coredump is run inside of file_start_write
> which is the only potential complication.
> 
> 
> 
> The code flow is complicated but it looks like the entire
> point of the exercise is to call io_uring_del_task_file
> on the originating thread.  I suppose that keeps the
> locking of the xarray in io_uring_task simple.
> 
> 
> Hmm.   All of this comes from io_uring_release.
> How do we get to io_uring_release?  The coredump should
> be catching everything in exit_mm before exit_files?
> 
> Confused and hopeful someone can explain to me what is going on,
> and perhaps simplify it.
> 
> Eric

Hi all,

I didn't forgot about this remaining issue and I have kept thinking
about it on and off.

I did try the following on 5.12.19:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_files_cancel(current->files);
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;
Tony Battersby Aug. 10, 2021, 9:48 p.m. UTC | #12
On 8/5/21 9:06 AM, Olivier Langlois wrote:
> On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>>>          * but then we need to teach dump_write() to restart and
>>>> clear
>>>>          * TIF_SIGPENDING.
>>>>          */
>>>> -       return signal_pending(current);
>>>> +       return fatal_signal_pending(current) || freezing(current);
>>>>  }
>>> Well yes, this is what the comment says.
>>>
>>> But note that there is another reason why dump_interrupted() returns
>>> true
>>> if signal_pending(): it assumes that __dump_emit()->__kernel_write()
>>> may
>>> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
>>> nfs,
>>> perhaps something else...
>>>
>>> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
>>> clear
>>> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
>>> the
>>> dumping threads?
>>>
>>> Or perhaps we should change __dump_emit() to clear signal_pending()
>>> and
>>> restart __kernel_write() if it fails or returns a short write.
>>>
>>> Otherwise the change above doesn't look like a full fix to me.
>> Agreed.  The coredump to a pipe will still be short.  That needs
>> something additional.
>>
>> The problem Olivier Langlois <olivier@trillion01.com> reported was
>> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
>> set during a core dump.
>>
>> We can see this with pipe_write returning -ERESTARTSYS
>> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
>> is true.
>>
>> Looking further if the thread that is core dumping initiated
>> any io_uring work then io_ring_exit_work will use task_work_add
>> to request that thread clean up its io_uring state.
>>
>> Perhaps we can put a big comment in dump_emit and if we
>> get back -ERESTARTSYS run tracehook_notify_signal.  I am not
>> seeing any locks held at that point in the coredump, so it
>> should be safe.  The coredump is run inside of file_start_write
>> which is the only potential complication.
>>
>>
>>
>> The code flow is complicated but it looks like the entire
>> point of the exercise is to call io_uring_del_task_file
>> on the originating thread.  I suppose that keeps the
>> locking of the xarray in io_uring_task simple.
>>
>>
>> Hmm.   All of this comes from io_uring_release.
>> How do we get to io_uring_release?  The coredump should
>> be catching everything in exit_mm before exit_files?
>>
>> Confused and hopeful someone can explain to me what is going on,
>> and perhaps simplify it.
>>
>> Eric
> Hi all,
>
> I didn't forgot about this remaining issue and I have kept thinking
> about it on and off.
>
> I did try the following on 5.12.19:
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..614fe7a54c1a 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		need_suid_safe = true;
>  	}
>  
> +	io_uring_files_cancel(current->files);
> +
>  	retval = coredump_wait(siginfo->si_signo, &core_state);
>  	if (retval < 0)
>  		goto fail_creds;
> --
> 2.32.0
>
> with my current understanding, io_uring_files_cancel is supposed to
> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>
> I must report that in my testing with generating a core dump through a
> pipe with the modif above, I still get truncated core dumps.
>
> systemd is having a weird error:
> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
> process
>
> and nothing is captured
>
> so I have replaced it with a very simple shell:
> $ cat /proc/sys/kernel/core_pattern 
> |/home/lano1106/bin/pipe_core.sh %e %p
>
> ~/bin $ cat pipe_core.sh 
> #!/bin/sh
>
> cat > /home/lano1106/core/core.$1.$2
>
> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
> expected core file size >= 24129536, found: 61440
>
> I conclude from my attempt that maybe io_uring_files_cancel is not 100%
> cleaning everything that it should clean.
>
>
>
I just ran into this problem also - coredumps from an io_uring program
to a pipe are truncated.  But I am using kernel 5.10.57, which does NOT
have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or
commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). 
Kernel 5.4 works though, so I bisected the problem to commit
f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
properly") in kernel 5.9.  Note that my io_uring program uses only async
buffered reads, which may be why this particular commit makes a
difference to my program.

My io_uring program is a multi-purpose long-running program with many
threads.  Most threads don't use io_uring but a few of them do. 
Normally, my core dumps are piped to a program so that they can be
compressed before being written to disk, but I can also test writing the
core dumps directly to disk.  This is what I have found:

*) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
coredump, the core file is written correctly, whether it is written to
disk or piped to a program, even if another thread is using io_uring at
the same time.

*) Unpatched 5.10.57: if a thread that uses io_uring triggers a
coredump, the core file is truncated, whether written directly to disk
or piped to a program.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
triggers a coredump, and the core is written directly to disk, then it
is written correctly.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
triggers a coredump, and the core is piped to a program, then it is
truncated.

*) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
whether written directly to disk or piped to a program.

Tony Battersby
Cybernetics
Olivier Langlois Aug. 11, 2021, 8:47 p.m. UTC | #13
On Tue, 2021-08-10 at 17:48 -0400, Tony Battersby wrote:
> > 
> I just ran into this problem also - coredumps from an io_uring
> program
> to a pipe are truncated.  But I am using kernel 5.10.57, which does
> NOT
> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> or
> commit 06af8679449d ("coredump: Limit what can interrupt
> coredumps"). 
> Kernel 5.4 works though, so I bisected the problem to commit
> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> properly") in kernel 5.9.  Note that my io_uring program uses only
> async
> buffered reads, which may be why this particular commit makes a
> difference to my program.
> 
> My io_uring program is a multi-purpose long-running program with many
> threads.  Most threads don't use io_uring but a few of them do. 
> Normally, my core dumps are piped to a program so that they can be
> compressed before being written to disk, but I can also test writing
> the
> core dumps directly to disk.  This is what I have found:
> 
> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
> a
> coredump, the core file is written correctly, whether it is written
> to
> disk or piped to a program, even if another thread is using io_uring
> at
> the same time.
> 
> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> coredump, the core file is truncated, whether written directly to
> disk
> or piped to a program.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is written directly to disk, then
> it
> is written correctly.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is piped to a program, then it is
> truncated.
> 
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.
> 
> Tony Battersby
> Cybernetics
> 
Tony,

these are super interesting details. I'm leaving for a few days so I will
not be able to look into it until I am back but here is my
interpretation of your findings:

f38c7e3abfba makes it more likely that your task ends up in a fd read
wait queue. Previously the io_uring req queuing was failing and
returning EAGAIN. Now it ends up using io_uring fast poll.

When the core dump gets written through a pipe, pipe_write must block
waiting for some event. If the task gets woken up by the io_uring wait
queue entry instead, it must somehow make pipe_write fail.

So the problem must be a mix of TIF_NOTIFY_SIGNAL and the fact that
io_uring wait queue entries aren't cleaned up while doing the core
dump.

I have a new modif to try out. I'll hopefully be able to submit a patch
to fix that once I come back (I cannot do it now or else, I'll never
leave ;-))
Jens Axboe Aug. 12, 2021, 1:55 a.m. UTC | #14
On 8/10/21 3:48 PM, Tony Battersby wrote:
> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>> On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
>>> Oleg Nesterov <oleg@redhat.com> writes:
>>>
>>>>> --- a/fs/coredump.c
>>>>> +++ b/fs/coredump.c
>>>>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>>>>          * but then we need to teach dump_write() to restart and
>>>>> clear
>>>>>          * TIF_SIGPENDING.
>>>>>          */
>>>>> -       return signal_pending(current);
>>>>> +       return fatal_signal_pending(current) || freezing(current);
>>>>>  }
>>>> Well yes, this is what the comment says.
>>>>
>>>> But note that there is another reason why dump_interrupted() returns
>>>> true
>>>> if signal_pending(): it assumes that __dump_emit()->__kernel_write()
>>>> may
>>>> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
>>>> nfs,
>>>> perhaps something else...
>>>>
>>>> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
>>>> clear
>>>> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
>>>> the
>>>> dumping threads?
>>>>
>>>> Or perhaps we should change __dump_emit() to clear signal_pending()
>>>> and
>>>> restart __kernel_write() if it fails or returns a short write.
>>>>
>>>> Otherwise the change above doesn't look like a full fix to me.
>>> Agreed.  The coredump to a pipe will still be short.  That needs
>>> something additional.
>>>
>>> The problem Olivier Langlois <olivier@trillion01.com> reported was
>>> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
>>> set during a core dump.
>>>
>>> We can see this with pipe_write returning -ERESTARTSYS
>>> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
>>> is true.
>>>
>>> Looking further if the thread that is core dumping initiated
>>> any io_uring work then io_ring_exit_work will use task_work_add
>>> to request that thread clean up its io_uring state.
>>>
>>> Perhaps we can put a big comment in dump_emit and if we
>>> get back -ERESTARTSYS run tracehook_notify_signal.  I am not
>>> seeing any locks held at that point in the coredump, so it
>>> should be safe.  The coredump is run inside of file_start_write
>>> which is the only potential complication.
>>>
>>>
>>>
>>> The code flow is complicated but it looks like the entire
>>> point of the exercise is to call io_uring_del_task_file
>>> on the originating thread.  I suppose that keeps the
>>> locking of the xarray in io_uring_task simple.
>>>
>>>
>>> Hmm.   All of this comes from io_uring_release.
>>> How do we get to io_uring_release?  The coredump should
>>> be catching everything in exit_mm before exit_files?
>>>
>>> Confused and hopeful someone can explain to me what is going on,
>>> and perhaps simplify it.
>>>
>>> Eric
>> Hi all,
>>
>> I didn't forgot about this remaining issue and I have kept thinking
>> about it on and off.
>>
>> I did try the following on 5.12.19:
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 07afb5ddb1c4..614fe7a54c1a 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/path.h>
>>  #include <linux/timekeeping.h>
>> +#include <linux/io_uring.h>
>>  
>>  #include <linux/uaccess.h>
>>  #include <asm/mmu_context.h>
>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  		need_suid_safe = true;
>>  	}
>>  
>> +	io_uring_files_cancel(current->files);
>> +
>>  	retval = coredump_wait(siginfo->si_signo, &core_state);
>>  	if (retval < 0)
>>  		goto fail_creds;
>> --
>> 2.32.0
>>
>> with my current understanding, io_uring_files_cancel is supposed to
>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>
>> I must report that in my testing with generating a core dump through a
>> pipe with the modif above, I still get truncated core dumps.
>>
>> systemd is having a weird error:
>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>> process
>>
>> and nothing is captured
>>
>> so I have replaced it with a very simple shell:
>> $ cat /proc/sys/kernel/core_pattern 
>> |/home/lano1106/bin/pipe_core.sh %e %p
>>
>> ~/bin $ cat pipe_core.sh 
>> #!/bin/sh
>>
>> cat > /home/lano1106/core/core.$1.$2
>>
>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>> expected core file size >= 24129536, found: 61440
>>
>> I conclude from my attempt that maybe io_uring_files_cancel is not 100%
>> cleaning everything that it should clean.
>>
>>
>>
> I just ran into this problem also - coredumps from an io_uring program
> to a pipe are truncated.  But I am using kernel 5.10.57, which does NOT
> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or
> commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). 
> Kernel 5.4 works though, so I bisected the problem to commit
> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> properly") in kernel 5.9.  Note that my io_uring program uses only async
> buffered reads, which may be why this particular commit makes a
> difference to my program.
> 
> My io_uring program is a multi-purpose long-running program with many
> threads.  Most threads don't use io_uring but a few of them do. 
> Normally, my core dumps are piped to a program so that they can be
> compressed before being written to disk, but I can also test writing the
> core dumps directly to disk.  This is what I have found:
> 
> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
> coredump, the core file is written correctly, whether it is written to
> disk or piped to a program, even if another thread is using io_uring at
> the same time.
> 
> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> coredump, the core file is truncated, whether written directly to disk
> or piped to a program.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is written directly to disk, then it
> is written correctly.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is piped to a program, then it is
> truncated.
> 
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.

That is very interesting. Like Olivier mentioned, it's not that actual
commit, but rather the change of behavior implemented by it. Before that
commit, we'd hit the async workers more often, whereas after we do the
correct retry method where it's driven by the wakeup when the page is
unlocked. This is purely speculation, but perhaps the fact that the
process changes state potentially mid dump is why the dump ends up being
truncated?

I'd love to dive into this and try and figure it out. Absent a test
case, at least the above gives me an idea of what to try out. I'll see
if it makes it easier for me to create a case that does result in a
truncated core dump.
Tony Battersby Aug. 12, 2021, 1:53 p.m. UTC | #15
On 8/11/21 9:55 PM, Jens Axboe wrote:
>
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it. Before that
> commit, we'd hit the async workers more often, whereas after we do the
> correct retry method where it's driven by the wakeup when the page is
> unlocked. This is purely speculation, but perhaps the fact that the
> process changes state potentially mid dump is why the dump ends up being
> truncated?
>
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.
>
If it helps, a "good" coredump from my program is about 350 MB
compressed down to about 7 MB by bzip2.  A truncated coredump varies in
size from about 60 KB to about 2 MB before compression.  The program
that receives the coredump uses bzip2 to compress the data before
writing it to disk.

Tony
Olivier Langlois Aug. 15, 2021, 8:42 p.m. UTC | #16
On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
> On 8/10/21 3:48 PM, Tony Battersby wrote:
> > On 8/5/21 9:06 AM, Olivier Langlois wrote:
> > > 
> > > Hi all,
> > > 
> > > I didn't forgot about this remaining issue and I have kept thinking
> > > about it on and off.
> > > 
> > > I did try the following on 5.12.19:
> > > 
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > index 07afb5ddb1c4..614fe7a54c1a 100644
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -41,6 +41,7 @@
> > >  #include <linux/fs.h>
> > >  #include <linux/path.h>
> > >  #include <linux/timekeeping.h>
> > > +#include <linux/io_uring.h>
> > >  
> > >  #include <linux/uaccess.h>
> > >  #include <asm/mmu_context.h>
> > > @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
> > > *siginfo)
> > >                 need_suid_safe = true;
> > >         }
> > >  
> > > +       io_uring_files_cancel(current->files);
> > > +
> > >         retval = coredump_wait(siginfo->si_signo, &core_state);
> > >         if (retval < 0)
> > >                 goto fail_creds;
> > > --
> > > 2.32.0
> > > 
> > > with my current understanding, io_uring_files_cancel is supposed to
> > > cancel everything that might set the TIF_NOTIFY_SIGNAL.
> > > 
> > > I must report that in my testing with generating a core dump
> > > through a
> > > pipe with the modif above, I still get truncated core dumps.
> > > 
> > > systemd is having a weird error:
> > > [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
> > > process
> > > 
> > > and nothing is captured
> > > 
> > > so I have replaced it with a very simple shell:
> > > $ cat /proc/sys/kernel/core_pattern 
> > > |/home/lano1106/bin/pipe_core.sh %e %p
> > > 
> > > ~/bin $ cat pipe_core.sh 
> > > #!/bin/sh
> > > 
> > > cat > /home/lano1106/core/core.$1.$2
> > > 
> > > BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
> > > expected core file size >= 24129536, found: 61440
> > > 
> > > I conclude from my attempt that maybe io_uring_files_cancel is not
> > > 100%
> > > cleaning everything that it should clean.
> > > 
> > > 
> > > 
> > I just ran into this problem also - coredumps from an io_uring
> > program
> > to a pipe are truncated.  But I am using kernel 5.10.57, which does
> > NOT
> > have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> > or
> > commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
> > Kernel 5.4 works though, so I bisected the problem to commit
> > f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> > properly") in kernel 5.9.  Note that my io_uring program uses only
> > async
> > buffered reads, which may be why this particular commit makes a
> > difference to my program.
> > 
> > My io_uring program is a multi-purpose long-running program with many
> > threads.  Most threads don't use io_uring but a few of them do. 
> > Normally, my core dumps are piped to a program so that they can be
> > compressed before being written to disk, but I can also test writing
> > the
> > core dumps directly to disk.  This is what I have found:
> > 
> > *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
> > a
> > coredump, the core file is written correctly, whether it is written
> > to
> > disk or piped to a program, even if another thread is using io_uring
> > at
> > the same time.
> > 
> > *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> > coredump, the core file is truncated, whether written directly to
> > disk
> > or piped to a program.
> > 
> > *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> > triggers a coredump, and the core is written directly to disk, then
> > it
> > is written correctly.
> > 
> > *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> > triggers a coredump, and the core is piped to a program, then it is
> > truncated.
> > 
> > *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> > whether written directly to disk or piped to a program.
> 
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it. Before
> that
> commit, we'd hit the async workers more often, whereas after we do the
> correct retry method where it's driven by the wakeup when the page is
> unlocked. This is purely speculation, but perhaps the fact that the
> process changes state potentially mid dump is why the dump ends up
> being
> truncated?
> 
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.
> 
Jens,

When I first encountered the issue, the very first thing that I
tried was to create a simple test program that would synthesize the
problem.

After a few time-consuming failed attempts, I just gave up the idea and
simply settled on my prod program, which showcases the problem
systematically every time that I kill the process with a SEGV signal.

In a nutshell, all the program does is to issue read operations with
io_uring on a TCP socket on which there is a constant data stream.

Now that I have a better understanding of what is going on, I think
that one way to reproduce the problem consistently could be
along these lines:

1. Create a pipe
2. fork a child
3. Initiate a read operation on the pipe with io_uring from the child
4. Let the parent kill its child with a core dump generating signal.
5. Write something in the pipe from the parent so that the io_uring
read operation completes while the core dump is generated.
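
Sketched out with liburing (hypothetical and untested; error handling
omitted), those five steps could look like:

	#include <liburing.h>
	#include <signal.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		int pfd[2];
		pid_t pid;

		pipe(pfd);				/* 1. create a pipe */
		pid = fork();				/* 2. fork a child */
		if (pid == 0) {
			struct io_uring ring;
			struct io_uring_sqe *sqe;
			struct io_uring_cqe *cqe;
			char buf[64];

			io_uring_queue_init(8, &ring, 0);
			sqe = io_uring_get_sqe(&ring);
			/* 3. initiate a read on the pipe from the child */
			io_uring_prep_read(sqe, pfd[0], buf, sizeof(buf), 0);
			io_uring_submit(&ring);
			io_uring_wait_cqe(&ring, &cqe);
			return 0;
		}
		sleep(1);			/* let the child arm the read */
		kill(pid, SIGSEGV);		/* 4. kill it with a dumping signal */
		sleep(1);			/* the dump should now be in progress */
		write(pfd[1], "x", 1);		/* 5. complete the read mid-dump */
		return 0;
	}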

I guess that I'll end up doing that if I cannot fix the issue with my
current setup but here is what I have attempted so far:

1. Call io_uring_files_cancel from do_coredump
2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
returning from io_uring_files_cancel
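
i.e., for attempt #2, roughly this (hypothetical, against 5.12.19):

	/* in do_coredump(), before coredump_wait() */
	io_uring_files_cancel(current->files);
	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
		tracehook_notify_signal();	/* clears the flag, runs task_work */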

Those attempts didn't work but lurking in the io_uring dev mailing list
is starting to pay off. I thought that I did reach the bottom of the
rabbit hole in my journey of understanding io_uring but the recent
patch set sent by Hao Xu

https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t

made me realize that I still haven't assimilated all the small io_uring
nuances...

Here is my feedback. From my casual io_uring code reader point of view,
it is not 100% obvious what the difference is between
io_uring_files_cancel and io_uring_task_cancel

It seems like io_uring_files_cancel is cancelling polls only if they
have the REQ_F_INFLIGHT flag set.

I have no idea what an inflight request means and why someone would
want to call io_uring_files_cancel over io_uring_task_cancel.

I guess that if I were to meditate on the question for a few hours, I
would at some point have some illumination strike me, but I believe that
it could be a good idea to document those concepts in the code to
help casual readers...

Bottomline, I now understand that io_uring_files_cancel does not cancel
all the requests. Therefore, without fully understanding what I am
doing, I am going to replace my call to io_uring_files_cancel from
do_coredump with io_uring_task_cancel and see if this finally fixes the
issue for good.

What I am trying to do is to cancel pending io_uring requests to make
sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.

Maybe another solution would simply be to modify __dump_emit to make it
resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
suggested.

or maybe do both...

Not sure which approach is best. If someone has an opinion, I would be
curious to hear it.

Greetings,
Pavel Begunkov Aug. 16, 2021, 1:02 p.m. UTC | #17
On 8/15/21 9:42 PM, Olivier Langlois wrote:
[...]
> When I first encountered the issue, the very first thing that I
> tried was to create a simple test program that would synthesize the
> problem.
> 
> After a few time-consuming failed attempts, I just gave up the idea and
> simply settled on my prod program, which showcases the problem
> systematically every time that I kill the process with a SEGV signal.
> 
> In a nutshell, all the program does is to issue read operations with
> io_uring on a TCP socket on which there is a constant data stream.
> 
> Now that I have a better understanding of what is going on, I think
> that one way to reproduce the problem consistently could be
> along these lines:
> 
> 1. Create a pipe
> 2. fork a child
> 3. Initiate a read operation on the pipe with io_uring from the child
> 4. Let the parent kill its child with a core dump generating signal.
> 5. Write something in the pipe from the parent so that the io_uring
> read operation completes while the core dump is generated.
> 
> I guess that I'll end up doing that if I cannot fix the issue with my
> current setup but here is what I have attempted so far:
> 
> 1. Call io_uring_files_cancel from do_coredump
> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
> returning from io_uring_files_cancel
> 
> Those attempts didn't work but lurking in the io_uring dev mailing list
> is starting to pay off. I thought that I did reach the bottom of the
> rabbit hole in my journey of understanding io_uring but the recent
> patch set sent by Hao Xu
> 
> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
> 
> made me realize that I still haven't assimilated all the small io_uring
> nuances...
> 
> Here is my feedback. From my casual io_uring code reader point of view,
> it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel

As you mentioned, io_uring_task_cancel() cancels and waits for all
requests submitted by current task, used in exec() and SQPOLL because
of potential races.

io_uring_task_cancel() cancels only selected ones and


io_uring_files_cancel()
cancels and waits only some specific requests that we absolutely have
to, e.g. in 5.15 it'll be only requests referencing the ring itself.
It's used on normal task exit.

io_uring_task_cancel() cancels and waits all requests submitted by
current task, used on exec() because of races.



As you mentioned 

> 
> It seems like io_uring_files_cancel is cancelling polls only if they
> have the REQ_F_INFLIGHT flag set.
> 
> I have no idea what an inflight request means and why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
> 
> I guess that if I were to meditate on the question for a few hours, I
> would at some point have some illumination strike me, but I believe that
> it could be a good idea to document those concepts in the code to
> help casual readers...
> 
> Bottomline, I now understand that io_uring_files_cancel does not cancel
> all the requests. Therefore, without fully understanding what I am
> doing, I am going to replace my call to io_uring_files_cancel from
> do_coredump with io_uring_task_cancel and see if this finally fixes the
> issue for good.
> 
> What I am trying to do is to cancel pending io_uring requests to make
> sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.
> 
> Maybe another solution would simply be to modify __dump_emit to make it
> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
> suggested.
> 
> or maybe do both...
> 
> Not sure which approach is best. If someone has an opinion, I would be
> curious to hear it.
> 
> Greetings,
> 
>
Pavel Begunkov Aug. 16, 2021, 1:06 p.m. UTC | #18
On 8/16/21 2:02 PM, Pavel Begunkov wrote:
> On 8/15/21 9:42 PM, Olivier Langlois wrote:
> [...]
>> When I first encountered the issue, the very first thing that I
>> tried was to create a simple test program that would synthesize the
>> problem.
>>
>> After a few time-consuming failed attempts, I just gave up the idea and
>> simply settled on my prod program, which showcases the problem
>> systematically every time that I kill the process with a SEGV signal.
>>
>> In a nutshell, all the program does is to issue read operations with
>> io_uring on a TCP socket on which there is a constant data stream.
>>
>> Now that I have a better understanding of what is going on, I think
>> that one way to reproduce the problem consistently could be
>> along these lines:
>>
>> 1. Create a pipe
>> 2. fork a child
>> 3. Initiate a read operation on the pipe with io_uring from the child
>> 4. Let the parent kill its child with a core dump generating signal.
>> 5. Write something in the pipe from the parent so that the io_uring
>> read operation completes while the core dump is generated.
>>
>> I guess that I'll end up doing that if I cannot fix the issue with my
>> current setup but here is what I have attempted so far:
>>
>> 1. Call io_uring_files_cancel from do_coredump
>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>> returning from io_uring_files_cancel
>>
>> Those attempts didn't work but lurking in the io_uring dev mailing list
>> is starting to pay off. I thought that I did reach the bottom of the
>> rabbit hole in my journey of understanding io_uring but the recent
>> patch set sent by Hao Xu
>>
>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>
>> made me realize that I still haven't assimilated all the small io_uring
>> nuances...
>>
>> Here is my feedback. From my casual io_uring code reader point of view,
>> it is not 100% obvious what the difference is between
>> io_uring_files_cancel and io_uring_task_cancel
> 
> As you mentioned, io_uring_task_cancel() cancels and waits for all
> requests submitted by current task, used in exec() and SQPOLL because
> of potential races.

Apologies for this draft rumbling...

As you mentioned, io_uring_task_cancel() cancels and waits for all
requests submitted by current task, used in exec() and SQPOLL because
of potential races.

io_uring_files_cancel() cancels only selected ones, e.g. in 5.15
it will be only requests operating on io_uring, and it is used during
normal exit.

Agree that the names may not be too descriptive.
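
The wrappers themselves are thin (include/linux/io_uring.h, 5.12/5.13
era, paraphrased):

	static inline void io_uring_files_cancel(struct files_struct *files)
	{
		if (current->io_uring)
			__io_uring_files_cancel(files);	/* only "inflight" requests */
	}

	static inline void io_uring_task_cancel(void)
	{
		if (current->io_uring)
			__io_uring_task_cancel();	/* everything the task submitted */
	}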

>>
>> It seems like io_uring_files_cancel is cancelling polls only if they
>> have the REQ_F_INFLIGHT flag set.
>>
>> I have no idea what an inflight request means and why someone would
>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>
>> I guess that if I were to meditate on the question for a few hours, I
>> would at some point have some illumination strike me, but I believe that
>> it could be a good idea to document those concepts in the code to
>> help casual readers...
>>
>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>> all the requests. Therefore, without fully understanding what I am
>> doing, I am going to replace my call to io_uring_files_cancel from
>> do_coredump with io_uring_task_cancel and see if this finally fixes the
>> issue for good.
>>
>> What I am trying to do is to cancel pending io_uring requests to make
>> sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.
>>
>> Maybe another solution would simply be to modify __dump_emit to make it
>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>> suggested.
>>
>> or maybe do both...
>>
>> Not sure which approach is best. If someone has an opinion, I would be
>> curious to hear it.
>>
>> Greetings,
>>
>>
>
Jens Axboe Aug. 17, 2021, 6:15 p.m. UTC | #19
On 8/15/21 2:42 PM, Olivier Langlois wrote:
> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>> about it on and off.
>>>>
>>>> I did try the following on 5.12.19:
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -41,6 +41,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/path.h>
>>>>  #include <linux/timekeeping.h>
>>>> +#include <linux/io_uring.h>
>>>>  
>>>>  #include <linux/uaccess.h>
>>>>  #include <asm/mmu_context.h>
>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>> *siginfo)
>>>>                 need_suid_safe = true;
>>>>         }
>>>>  
>>>> +       io_uring_files_cancel(current->files);
>>>> +
>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>         if (retval < 0)
>>>>                 goto fail_creds;
>>>> --
>>>> 2.32.0
>>>>
>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>
>>>> I must report that in my testing with generating a core dump
>>>> through a
>>>> pipe with the modif above, I still get truncated core dumps.
>>>>
>>>> systemd reports a weird error:
>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such process
>>>>
>>>> and nothing is captured
>>>>
>>>> so I have replaced it with a very simple shell script:
>>>> $ cat /proc/sys/kernel/core_pattern 
>>>> |/home/lano1106/bin/pipe_core.sh %e %p
>>>>
>>>> ~/bin $ cat pipe_core.sh 
>>>> #!/bin/sh
>>>>
>>>> cat > /home/lano1106/core/core.$1.$2
>>>>
>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>> expected core file size >= 24129536, found: 61440
>>>>
>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>> cleaning everything that it should clean.
>>>>
>>>>
>>>>
>>> I just ran into this problem also - coredumps from an io_uring program
>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does NOT
>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or
>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>> Kernel 5.4 works though, so I bisected the problem to commit
>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>> properly") in kernel 5.9.  Note that my io_uring program uses only async
>>> buffered reads, which may be why this particular commit makes a
>>> difference to my program.
>>>
>>> My io_uring program is a multi-purpose long-running program with many
>>> threads.  Most threads don't use io_uring but a few of them do.
>>> Normally, my core dumps are piped to a program so that they can be
>>> compressed before being written to disk, but I can also test writing the
>>> core dumps directly to disk.  This is what I have found:
>>>
>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
>>> coredump, the core file is written correctly, whether it is written to
>>> disk or piped to a program, even if another thread is using io_uring at
>>> the same time.
>>>
>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>> coredump, the core file is truncated, whether written directly to disk
>>> or piped to a program.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is written directly to disk, then it
>>> is written correctly.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is piped to a program, then it is
>>> truncated.
>>>
>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>> whether written directly to disk or piped to a program.
>>
>> That is very interesting. Like Olivier mentioned, it's not the actual
>> commit, but rather the change of behavior implemented by it. Before that
>> commit, we'd hit the async workers more often, whereas after it we do
>> the correct retry method, driven by the wakeup when the page is
>> unlocked. This is purely speculation, but perhaps the fact that the
>> process potentially changes state mid-dump is why the dump ends up
>> being truncated?
>>
>> I'd love to dive into this and try and figure it out. Absent a test
>> case, at least the above gives me an idea of what to try out. I'll see
>> if it makes it easier for me to create a case that does result in a
>> truncated core dump.
>>
> Jens,
> 
> When I first encountered the issue, the very first thing I tried was
> to create a simple test program that would synthesize the problem.
> 
> After a few time-consuming failed attempts, I just gave up on the idea
> and simply settled on my prod program, which systematically showcases
> the problem every time I kill the process with a SEGV signal.
> 
> In a nutshell, all the program does is to issue read operations with
> io_uring on a TCP socket on which there is a constant data stream.
> 
> Now that I have a better understanding of what is going on, I think
> that one way to reproduce the problem consistently could be along
> these lines (a minimal sketch appears after this message):
> 
> 1. Create a pipe.
> 2. Fork a child.
> 3. Initiate a read operation on the pipe with io_uring from the child.
> 4. Let the parent kill its child with a core-dump-generating signal.
> 5. Write something into the pipe from the parent so that the io_uring
> read operation completes while the core dump is generated.
> 
> I guess that I'll end up doing that if I cannot fix the issue with my
> current setup but here is what I have attempted so far:
> 
> 1. Call io_uring_files_cancel from do_coredump
> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
> returning from io_uring_files_cancel
> 
> Those attempts didn't work, but lurking on the io_uring dev mailing list
> is starting to pay off. I thought that I had reached the bottom of the
> rabbit hole in my journey of understanding io_uring, but the recent
> patch set sent by Hao Xu
> 
> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
> 
> made me realize that I still haven't assimilated all the small io_uring
> nuances...
> 
> Here is my feedback. From my point of view as a casual reader of the
> io_uring code, it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel.
> 
> It seems like io_uring_files_cancel is cancelling polls only if they
> have the REQ_F_INFLIGHT flag set.
> 
> I have no idea what an inflight request means, or why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
> 
> I guess that if I were to meditate on the question for a few hours,
> illumination would strike me at some point, but I believe it would be
> a good idea to document those concepts in the code to help casual
> readers...
> 
> Bottom line: I now understand that io_uring_files_cancel does not cancel
> all the requests. Therefore, without fully understanding what I am
> doing, I am going to replace my call to io_uring_files_cancel in
> do_coredump with io_uring_task_cancel and see if this finally fixes the
> issue for good.
> 
> What I am trying to do is cancel pending io_uring requests to make
> sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.
> 
> Maybe another solution would simply be to modify __dump_emit to make it
> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
> suggested.
> 
> or maybe do both...
> 
> Not sure which approach is best. If someone has an opinion, I would be
> curious to hear it.

It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
signal_pending() and cause an interruption of the core dump. Just out of
curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
set to some piped process, can you try and set it to 'core' and see if
that eliminates the truncation of the core dumps for your case?
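
For reference, a minimal C sketch of Olivier's five-step repro idea
above, assuming liburing is available. This is hypothetical and
untested, and the sleep/kill/write ordering only approximates the
required timing:

#include <liburing.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	struct io_uring ring;
	int pipefd[2];
	char buf[64];
	pid_t pid;

	if (pipe(pipefd) < 0)			/* step 1: create a pipe */
		exit(1);

	pid = fork();				/* step 2: fork a child */
	if (pid == 0) {
		struct io_uring_sqe *sqe;

		/* step 3: arm an io_uring read on the pipe */
		io_uring_queue_init(8, &ring, 0);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, pipefd[0], buf, sizeof(buf), 0);
		io_uring_submit(&ring);
		pause();			/* wait to be killed */
		exit(0);
	}

	sleep(1);				/* let the child arm the read */
	kill(pid, SIGSEGV);			/* step 4: core-dumping signal */
	write(pipefd[1], "x", 1);		/* step 5: complete the read
						 * while the dump runs */
	waitpid(pid, NULL, 0);
	return 0;
}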
Jens Axboe Aug. 17, 2021, 6:24 p.m. UTC | #20
On 8/17/21 12:15 PM, Jens Axboe wrote:
> [...]
> 
> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
> signal_pending() and cause an interruption of the core dump. Just out of
> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
> set to some piped process, can you try and set it to 'core' and see if
> that eliminates the truncation of the core dumps for your case?

And assuming that works, then I suspect this one would fix your issue
even with a piped core dump:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..852737a9ccbf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	};
 
 	audit_core_dumps(siginfo->si_signo);
+	io_uring_task_cancel();
 
 	binfmt = mm->binfmt;
 	if (!binfmt || !binfmt->core_dump)
Tony Battersby Aug. 17, 2021, 7:29 p.m. UTC | #21
On 8/17/21 2:24 PM, Jens Axboe wrote:
> On 8/17/21 12:15 PM, Jens Axboe wrote:
>> [...]
>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>> signal_pending() and cause an interruption of the core dump. Just out of
>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>> set to some piped process, can you try and set it to 'core' and see if
>> that eliminates the truncation of the core dumps for your case?
> And assuming that works, then I suspect this one would fix your issue
> even with a piped core dump:
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..852737a9ccbf 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	};
>  
>  	audit_core_dumps(siginfo->si_signo);
> +	io_uring_task_cancel();
>  
>  	binfmt = mm->binfmt;
>  	if (!binfmt || !binfmt->core_dump)
>
FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
with my io_uring program.  The coredump locked up even when writing the
core file directly to disk; the zombie process could not be killed with
"kill -9".  Unfortunately I can't test with newer kernels without
spending some time on it, and I am too busy with other stuff right now.

My io_uring program does async buffered reads
(io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
(no filesystem).  One thread submits I/Os while another thread calls
io_uring_wait_cqe() and processes the completions.  To trigger the
coredump, I added an intentional abort() in the thread that submits I/Os
after running for a second.

Tony Battersby
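
For concreteness, a rough sketch of the two-thread pattern Tony
describes. Hypothetical and untested; the device path, queue depth and
buffer size are made up, and error handling is mostly omitted:

#include <liburing.h>
#include <pthread.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>

static struct io_uring ring;
static char buf[4096];

/* Completion thread: waits for and retires CQEs. */
static void *reaper(void *arg)
{
	struct io_uring_cqe *cqe;

	for (;;) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
	return NULL;
}

int main(void)
{
	int fd = open("/dev/sdb1", O_RDONLY);	/* raw partition, no fs */
	time_t start = time(NULL);
	pthread_t t;

	if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0)
		return 1;
	pthread_create(&t, NULL, reaper, NULL);

	/* Submission loop: async buffered reads until the abort(). */
	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (sqe) {
			io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
			io_uring_submit(&ring);
		}
		if (time(NULL) - start > 1)
			abort();		/* intentional core dump */
	}
}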
Jens Axboe Aug. 17, 2021, 7:59 p.m. UTC | #22
On 8/17/21 1:29 PM, Tony Battersby wrote:
> [...]
> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
> with my io_uring program.  The coredump locked up even when writing the
> core file directly to disk; the zombie process could not be killed with
> "kill -9".  Unfortunately I can't test with newer kernels without
> spending some time on it, and I am too busy with other stuff right now.

That sounds like 5.10-stable is missing some of the cancelation
backports, and your setup makes the cancelation stall because of that.
Need to go over the 5.11/5.12/5.13 fixes and ensure that we've got
everything we need for those stable versions, particularly 5.10.

> My io_uring program does async buffered reads
> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
> (no filesystem).  One thread submits I/Os while another thread calls
> io_uring_wait_cqe() and processes the completions.  To trigger the
> coredump, I added an intentional abort() in the thread that submits I/Os
> after running for a second.

OK, so that one is also using task_work for the retry based async
buffered reads, so it makes sense.

Maybe a temporary work-around is to use 06af8679449d and eliminate the
pipe based coredump?
Jens Axboe Aug. 17, 2021, 9:28 p.m. UTC | #23
On 8/17/21 1:59 PM, Jens Axboe wrote:
> On 8/17/21 1:29 PM, Tony Battersby wrote:
>> [...]
>> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
>> with my io_uring program.  The coredump locked up even when writing the
>> core file directly to disk; the zombie process could not be killed with
>> "kill -9".  Unfortunately I can't test with newer kernels without
>> spending some time on it, and I am too busy with other stuff right now.
> 
> That sounds like 5.10-stable is missing some of the cancelation
> backports, and your setup makes the cancelation stall because of that.
> Need to go over the 5.11/5.12/5.13 fixes and ensure that we've got
> everything we need for those stable versions, particularly 5.10.
> 
>> My io_uring program does async buffered reads
>> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
>> (no filesystem).  One thread submits I/Os while another thread calls
>> io_uring_wait_cqe() and processes the completions.  To trigger the
>> coredump, I added an intentional abort() in the thread that submits I/Os
>> after running for a second.
> 
> OK, so that one is also using task_work for the retry based async
> buffered reads, so it makes sense.
> 
> Maybe a temporary work-around is to use 06af8679449d and eliminate the
> pipe based coredump?

Another approach - don't allow TWA_SIGNAL task_work to get queued if
PF_SIGNALED has been set on the task. This is similar to how we reject
task_work_add() on process exit, and the callers must be able to handle
that already.

Can you test this one on top of your 5.10-stable?


diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..ca7c1ee44ada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+		tracehook_notify_signal();
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..1ab28904adc4 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TIF_NOTIFY_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
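
For context on "the callers must be able to handle that already": a
hypothetical caller-side sketch of the fallback pattern this relies on.
The function names below are illustrative, not real kernel APIs:

/* Queue work on a target task; if task_work_add() refuses (task is
 * exiting, or with the patch above also dumping core), fall back to
 * running or canceling the work from a safe context instead. */
static void queue_work_on_task(struct task_struct *tsk,
			       struct callback_head *work)
{
	if (task_work_add(tsk, work, TWA_SIGNAL))
		run_work_fallback(work);	/* illustrative fallback */
}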
Tony Battersby Aug. 17, 2021, 9:39 p.m. UTC | #24
On 8/17/21 5:28 PM, Jens Axboe wrote:
>
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
>
> Can you test this one on top of your 5.10-stable?
> [...]
Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.

Tony Battersby
Jens Axboe Aug. 17, 2021, 10:05 p.m. UTC | #25
On 8/17/21 3:39 PM, Tony Battersby wrote:
> [...]
> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.

Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
untested...

diff --git a/fs/coredump.c b/fs/coredump.c
index c6acfc694f65..9e899ce67589 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TWA_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
+		task_work_run();
+		spin_lock_irq(&current->sighand->siglock);
+		current->jobctl &= ~JOBCTL_TASK_WORK;
+		recalc_sigpending();
+		spin_unlock_irq(&current->sighand->siglock);
+	}
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8d6e1217c451..93b3f262eb4a 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TWA_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
Jens Axboe Aug. 18, 2021, 2:57 a.m. UTC | #26
On 8/17/21 3:28 PM, Jens Axboe wrote:
> On 8/17/21 1:59 PM, Jens Axboe wrote:
>> On 8/17/21 1:29 PM, Tony Battersby wrote:
>>> On 8/17/21 2:24 PM, Jens Axboe wrote:
>>>> On 8/17/21 12:15 PM, Jens Axboe wrote:
>>>>> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>>>>>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>>>>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>>>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>>>>>> about it on and off.
>>>>>>>>>
>>>>>>>>> I did try the following on 5.12.19:
>>>>>>>>>
>>>>>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>>>>>> --- a/fs/coredump.c
>>>>>>>>> +++ b/fs/coredump.c
>>>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>>>  #include <linux/fs.h>
>>>>>>>>>  #include <linux/path.h>
>>>>>>>>>  #include <linux/timekeeping.h>
>>>>>>>>> +#include <linux/io_uring.h>
>>>>>>>>>  
>>>>>>>>>  #include <linux/uaccess.h>
>>>>>>>>>  #include <asm/mmu_context.h>
>>>>>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>>>>>> *siginfo)
>>>>>>>>>                 need_suid_safe = true;
>>>>>>>>>         }
>>>>>>>>>  
>>>>>>>>> +       io_uring_files_cancel(current->files);
>>>>>>>>> +
>>>>>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>>>>>         if (retval < 0)
>>>>>>>>>                 goto fail_creds;
>>>>>>>>> --
>>>>>>>>> 2.32.0
>>>>>>>>>
>>>>>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>>>>>
>>>>>>>>> I must report that in my testing with generating a core dump
>>>>>>>>> through a
>>>>>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>>>>>
>>>>>>>>> systemd is having a weird error:
>>>>>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>>>>>> process
>>>>>>>>>
>>>>>>>>> and nothing is captured
>>>>>>>>>
>>>>>>>>> so I have replaced it with a very simple shell:
>>>>>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>>>>> ~/bin $ cat pipe_core.sh 
>>>>>>>>> #!/bin/sh
>>>>>>>>>
>>>>>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>>>>>
>>>>>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>>>>>> expected core file size >= 24129536, found: 61440
>>>>>>>>>
>>>>>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>>>>>> 100%
>>>>>>>>> cleaning everything that it should clean.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I just ran into this problem also - coredumps from an io_uring
>>>>>>>> program
>>>>>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>>>>>> NOT
>>>>>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>>>>>> or
>>>>>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>>>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>>>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>>>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>>>>>> async
>>>>>>>> buffered reads, which may be why this particular commit makes a
>>>>>>>> difference to my program.
>>>>>>>>
>>>>>>>> My io_uring program is a multi-purpose long-running program with many
>>>>>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>>>>>> Normally, my core dumps are piped to a program so that they can be
>>>>>>>> compressed before being written to disk, but I can also test writing
>>>>>>>> the
>>>>>>>> core dumps directly to disk.  This is what I have found:
>>>>>>>>
>>>>>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>>>>>> a
>>>>>>>> coredump, the core file is written correctly, whether it is written
>>>>>>>> to
>>>>>>>> disk or piped to a program, even if another thread is using io_uring
>>>>>>>> at
>>>>>>>> the same time.
>>>>>>>>
>>>>>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>>>>>> coredump, the core file is truncated, whether written directly to
>>>>>>>> disk
>>>>>>>> or piped to a program.
>>>>>>>>
>>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>>> triggers a coredump, and the core is written directly to disk, then
>>>>>>>> it
>>>>>>>> is written correctly.
>>>>>>>>
>>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>>> triggers a coredump, and the core is piped to a program, then it is
>>>>>>>> truncated.
>>>>>>>>
>>>>>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>>>>>> whether written directly to disk or piped to a program.
>>>>>>> That is very interesting. Like Olivier mentioned, it's not that
>>>>>>> particular commit as such, but rather the change of behavior it
>>>>>>> implements. Before that commit, we'd hit the async workers more
>>>>>>> often, whereas after it we use the correct retry method, driven by
>>>>>>> the wakeup when the page is unlocked. This is purely speculation,
>>>>>>> but perhaps the fact that the process potentially changes state
>>>>>>> mid-dump is why the dump ends up being truncated?
>>>>>>>
>>>>>>> I'd love to dive into this and try and figure it out. Absent a test
>>>>>>> case, at least the above gives me an idea of what to try out. I'll see
>>>>>>> if it makes it easier for me to create a case that does result in a
>>>>>>> truncated core dump.
>>>>>>>
>>>>>> Jens,
>>>>>>
>>>>>> When I first encountered the issue, the very first thing that I
>>>>>> tried was to create a simple test program that would synthesize the
>>>>>> problem.
>>>>>>
>>>>>> After a few time-consuming failed attempts, I gave up on that idea and
>>>>>> simply settled for my production program, which systematically showcases
>>>>>> the problem every time I kill the process with a SEGV signal.
>>>>>>
>>>>>> In a nutshell, all the program does is issue read operations with
>>>>>> io_uring on a TCP socket carrying a constant data stream.
>>>>>>
>>>>>> Now that I have a better understanding of what is going on, I think
>>>>>> that one way to reproduce the problem consistently could be along
>>>>>> these lines (a code sketch follows the list):
>>>>>>
>>>>>> 1. Create a pipe
>>>>>> 2. fork a child
>>>>>> 3. Initiate a read operation on the pipe with io_uring from the child
>>>>>> 4. Let the parent kill its child with a core-dump-generating signal.
>>>>>> 5. Write something in the pipe from the parent so that the io_uring
>>>>>> read operation completes while the core dump is generated.
>>>>>>
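>>>>>> Something along these lines, as an untested sketch (it assumes
>>>>>> liburing is available; this is only illustrative, not my actual
>>>>>> test program):
>>>>>>
>>>>>> #include <liburing.h>
>>>>>> #include <signal.h>
>>>>>> #include <sys/wait.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>> 	struct io_uring ring;
>>>>>> 	struct io_uring_cqe *cqe;
>>>>>> 	struct io_uring_sqe *sqe;
>>>>>> 	static char buf[64];
>>>>>> 	int pipefd[2];
>>>>>> 	pid_t pid;
>>>>>>
>>>>>> 	if (pipe(pipefd) < 0)
>>>>>> 		return 1;
>>>>>>
>>>>>> 	pid = fork();
>>>>>> 	if (pid == 0) {
>>>>>> 		/* child: arm an async read on the pipe, then block */
>>>>>> 		io_uring_queue_init(8, &ring, 0);
>>>>>> 		sqe = io_uring_get_sqe(&ring);
>>>>>> 		io_uring_prep_read(sqe, pipefd[0], buf, sizeof(buf), 0);
>>>>>> 		io_uring_submit(&ring);
>>>>>> 		io_uring_wait_cqe(&ring, &cqe);
>>>>>> 		return 0;
>>>>>> 	}
>>>>>>
>>>>>> 	/* parent: let the child arm its poll, make it dump core,
>>>>>> 	 * then complete the read while the dump is being written */
>>>>>> 	sleep(1);
>>>>>> 	kill(pid, SIGSEGV);
>>>>>> 	write(pipefd[1], "x", 1);
>>>>>> 	waitpid(pid, NULL, 0);
>>>>>> 	return 0;
>>>>>> }
>>>>>>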
>>>>>> I guess that I'll end up doing that if I cannot fix the issue with my
>>>>>> current setup, but here is what I have attempted so far:
>>>>>>
>>>>>> 1. Call io_uring_files_cancel from do_coredump
>>>>>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>>>>>> returning from io_uring_files_cancel
>>>>>>
>>>>>> Those attempts didn't work, but lurking on the io_uring dev mailing list
>>>>>> is starting to pay off. I thought that I had reached the bottom of the
>>>>>> rabbit hole in my journey of understanding io_uring, but the recent
>>>>>> patch set sent by Hao Xu
>>>>>>
>>>>>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>>>>>
>>>>>> made me realize that I still haven't assimilated all the small io_uring
>>>>>> nuances...
>>>>>>
>>>>>> Here is my feedback. From a casual io_uring code reader's point of view,
>>>>>> it is not 100% obvious what the difference is between
>>>>>> io_uring_files_cancel and io_uring_task_cancel.
>>>>>>
>>>>>> It seems like io_uring_files_cancel is cancelling polls only if they
>>>>>> have the REQ_F_INFLIGHT flag set.
>>>>>>
>>>>>> I have no idea what an inflight request means and why someone would
>>>>>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>>>>>
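>>>>>> For what it is worth, both entry points appear to be thin wrappers
>>>>>> (paraphrased from include/linux/io_uring.h as I remember it, so treat
>>>>>> this as a sketch rather than verbatim kernel code):
>>>>>>
>>>>>> static inline void io_uring_task_cancel(void)
>>>>>> {
>>>>>> 	if (current->io_uring)
>>>>>> 		__io_uring_task_cancel();	/* cancels all requests */
>>>>>> }
>>>>>>
>>>>>> static inline void io_uring_files_cancel(struct files_struct *files)
>>>>>> {
>>>>>> 	if (current->io_uring)
>>>>>> 		__io_uring_files_cancel(files);	/* REQ_F_INFLIGHT only */
>>>>>> }
>>>>>>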
>>>>>> I guess that if I were to meditate on the question for a few hours,
>>>>>> illumination would strike me at some point, but I believe it would be a
>>>>>> good idea to document those concepts in the code to help casual
>>>>>> readers...
>>>>>>
>>>>>> Bottom line, I now understand that io_uring_files_cancel does not cancel
>>>>>> all the requests. Therefore, without fully understanding what I am
>>>>>> doing, I am going to replace my call to io_uring_files_cancel in
>>>>>> do_coredump with io_uring_task_cancel and see if this finally fixes the
>>>>>> issue for good.
>>>>>>
>>>>>> What I am trying to do is cancel pending io_uring requests to make
>>>>>> sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.
>>>>>>
>>>>>> Maybe another solution would simply be to modify __dump_emit to make it
>>>>>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>>>>>> suggested.
>>>>>>
>>>>>> or maybe do both...
>>>>>>
>>>>>> Not sure which approach is best. If someone has an opinion, I would be
>>>>>> curious to hear it.
>>>>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>>>>> signal_pending() and cause an interruption of the core dump. Just out of
>>>>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>>>>> set to some piped process, can you try and set it to 'core' and see if
>>>>> that eliminates the truncation of the core dumps for your case?
>>>> And assuming that works, then I suspect this one would fix your issue
>>>> even with a piped core dump:
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..852737a9ccbf 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -41,6 +41,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/path.h>
>>>>  #include <linux/timekeeping.h>
>>>> +#include <linux/io_uring.h>
>>>>  
>>>>  #include <linux/uaccess.h>
>>>>  #include <asm/mmu_context.h>
>>>> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>>  	};
>>>>  
>>>>  	audit_core_dumps(siginfo->si_signo);
>>>> +	io_uring_task_cancel();
>>>>  
>>>>  	binfmt = mm->binfmt;
>>>>  	if (!binfmt || !binfmt->core_dump)
>>>>
>>> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
>>> with my io_uring program.  The coredump locked up even when writing the
>>> core file directly to disk; the zombie process could not be killed with
>>> "kill -9".  Unfortunately I can't test with newer kernels without
>>> spending some time on it, and I am too busy with other stuff right now.
>>
>> That sounds like 5.10-stable is missing some of the cancelation
>> backports, and your setup makes the cancelation stall because of that.
>> Need to go over the 5.11/5.12/5.13 fixes and ensure that we've got everything
>> we need for those stable versions, particularly 5.10.
>>
>>> My io_uring program does async buffered reads
>>> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
>>> (no filesystem).  One thread submits I/Os while another thread calls
>>> io_uring_wait_cqe() and processes the completions.  To trigger the
>>> coredump, I added an intentional abort() in the thread that submits I/Os
>>> after running for a second.
>>
>> OK, so that one is also using task_work for the retry based async
>> buffered reads, so it makes sense.
>>
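>> For reference, the crux of that retry path is the page-unlock wakeup
>> queueing task_work with TWA_SIGNAL - paraphrased rather than verbatim
>> 5.10 code:
>>
>> 	/* page unlocked: punt the retry back to task context */
>> 	init_task_work(&req->task_work, io_req_task_submit);
>> 	ret = task_work_add(tsk, &req->task_work, TWA_SIGNAL);
>>
>> and it's that TWA_SIGNAL notification that ends up interrupting the
>> dump.
>>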
>> Maybe a temporary work-around is to use 06af8679449d and eliminate the
>> pipe-based coredump?
> 
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
> 
> Can you test this one on top of your 5.10-stable?
> 
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..ca7c1ee44ada 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>  
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +		tracehook_notify_signal();
> +
>  	audit_core_dumps(siginfo->si_signo);
>  
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..1ab28904adc4 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>  
> 

Olivier, I sent a 5.10 version for Nathan, any chance you can test this
one for the current kernels? Basically this one should work for 5.11+,
and the later 5.10 version is just for 5.10. I'm going to send it out
separately for review.

I do think this is the right solution, barring a tweak maybe on testing
notify == TWA_SIGNAL first before digging into the task struct. But the
principle is sound, and it'll work for other users of TWA_SIGNAL as
well. None right now as far as I can tell, but the live patching is
switching to TIF_NOTIFY_SIGNAL as well which will also cause issues with
coredumps potentially.
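
Something like the below for that tweak - untested, it's the same check
the patch adds, just reordered so the cheap comparison on 'notify' runs
before we poke at task->flags:

		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
			return -ESRCH;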
Jens Axboe Aug. 18, 2021, 2:58 a.m. UTC | #27
> Olivier, I sent a 5.10 version for Nathan, any chance you can test this
                                     ^^^^^^

Tony of course, my apologies.
Tony Battersby Aug. 18, 2021, 2:37 p.m. UTC | #28
On 8/17/21 6:05 PM, Jens Axboe wrote:
> On 8/17/21 3:39 PM, Tony Battersby wrote:
>> On 8/17/21 5:28 PM, Jens Axboe wrote:
>>> Another approach - don't allow TWA_SIGNAL task_work to get queued if
>>> PF_SIGNALED has been set on the task. This is similar to how we reject
>>> task_work_add() on process exit, and the callers must be able to handle
>>> that already.
>>>
>>> Can you test this one on top of your 5.10-stable?
>>>
>>>
>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>> index 07afb5ddb1c4..ca7c1ee44ada 100644
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>  		.mm_flags = mm->flags,
>>>  	};
>>>  
>>> +	/*
>>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>>> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
>>> +	 * if any was queued before that.
>>> +	 */
>>> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>>> +		tracehook_notify_signal();
>>> +
>>>  	audit_core_dumps(siginfo->si_signo);
>>>  
>>>  	binfmt = mm->binfmt;
>>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>>> index 1698fbe6f0e1..1ab28904adc4 100644
>>> --- a/kernel/task_work.c
>>> +++ b/kernel/task_work.c
>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>  		head = READ_ONCE(task->task_works);
>>>  		if (unlikely(head == &work_exited))
>>>  			return -ESRCH;
>>> +		/*
>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>> +		 * a core dump in progress, reject them.
>>> +		 */
>>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>>> +			return -ESRCH;
>>>  		work->next = head;
>>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>>  
>>>
>> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.
> Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
> untested...
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index c6acfc694f65..9e899ce67589 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>  
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TWA_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
> +		task_work_run();
> +		spin_lock_irq(&current->sighand->siglock);
> +		current->jobctl &= ~JOBCTL_TASK_WORK;
> +		recalc_sigpending();
> +		spin_unlock_irq(&current->sighand->siglock);
> +	}
> +
>  	audit_core_dumps(siginfo->si_signo);
>  
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 8d6e1217c451..93b3f262eb4a 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TWA_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>  
>
Tested with 5.10.59 + backport 06af8679449d + the patch above.  That
fixes it for me.  I tested a couple of variations to make sure.

Thanks!

Tested-by: Tony Battersby <tonyb@cybernetics.com>
Jens Axboe Aug. 18, 2021, 2:46 p.m. UTC | #29
On 8/18/21 8:37 AM, Tony Battersby wrote:
> On 8/17/21 6:05 PM, Jens Axboe wrote:
>> On 8/17/21 3:39 PM, Tony Battersby wrote:
>>> On 8/17/21 5:28 PM, Jens Axboe wrote:
>>>> Another approach - don't allow TWA_SIGNAL task_work to get queued if
>>>> PF_SIGNALED has been set on the task. This is similar to how we reject
>>>> task_work_add() on process exit, and the callers must be able to handle
>>>> that already.
>>>>
>>>> Can you test this one on top of your 5.10-stable?
>>>>
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..ca7c1ee44ada 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>>  		.mm_flags = mm->flags,
>>>>  	};
>>>>  
>>>> +	/*
>>>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>>>> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
>>>> +	 * if any was queued before that.
>>>> +	 */
>>>> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>>>> +		tracehook_notify_signal();
>>>> +
>>>>  	audit_core_dumps(siginfo->si_signo);
>>>>  
>>>>  	binfmt = mm->binfmt;
>>>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>>>> index 1698fbe6f0e1..1ab28904adc4 100644
>>>> --- a/kernel/task_work.c
>>>> +++ b/kernel/task_work.c
>>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>>  		head = READ_ONCE(task->task_works);
>>>>  		if (unlikely(head == &work_exited))
>>>>  			return -ESRCH;
>>>> +		/*
>>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>>> +		 * a core dump in progress, reject them.
>>>> +		 */
>>>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>>>> +			return -ESRCH;
>>>>  		work->next = head;
>>>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>>>  
>>>>
>>> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.
>> Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
>> untested...
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index c6acfc694f65..9e899ce67589 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  		.mm_flags = mm->flags,
>>  	};
>>  
>> +	/*
>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>> +	 * been set, ensure that we flush any pending TWA_SIGNAL work
>> +	 * if any was queued before that.
>> +	 */
>> +	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
>> +		task_work_run();
>> +		spin_lock_irq(&current->sighand->siglock);
>> +		current->jobctl &= ~JOBCTL_TASK_WORK;
>> +		recalc_sigpending();
>> +		spin_unlock_irq(&current->sighand->siglock);
>> +	}
>> +
>>  	audit_core_dumps(siginfo->si_signo);
>>  
>>  	binfmt = mm->binfmt;
>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>> index 8d6e1217c451..93b3f262eb4a 100644
>> --- a/kernel/task_work.c
>> +++ b/kernel/task_work.c
>> @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>  		head = READ_ONCE(task->task_works);
>>  		if (unlikely(head == &work_exited))
>>  			return -ESRCH;
>> +		/*
>> +		 * TWA_SIGNAL notifications will interfere with
>> +		 * a core dump in progress, reject them.
>> +		 */
>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>> +			return -ESRCH;
>>  		work->next = head;
>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>  
>>
> Tested with 5.10.59 + backport 06af8679449d + the patch above.  That
> fixes it for me.  I tested a couple of variations to make sure.
> 
> Thanks!
> 
> Tested-by: Tony Battersby <tonyb@cybernetics.com>

Great, thanks for testing! The 5.10 version is a bit uglier due to how
TWA_SIGNAL used to work, but it's the most straightforward backport of
the other version I sent.
Olivier Langlois Aug. 21, 2021, 9:48 a.m. UTC | #30
On Tue, 2021-08-17 at 12:15 -0600, Jens Axboe wrote:
> 
> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger
> some
> signal_pending() and cause an interruption of the core dump. Just out
> of
> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
> set to some piped process, can you try and set it to 'core' and see
> if
> that eliminates the truncation of the core dumps for your case?
> 

/proc/sys/kernel/core_pattern is set to:
|/home/lano1106/bin/pipe_core.sh %e %p

It normally points to the systemd coredump module. I have pointed it to
a simpler program for debugging purposes.

When core_pattern points to a local file, core dump files are just
fine. That was the whole point of 

commit 06af8679449d ("coredump: Limit what can interrupt coredumps")

I have been distracted by other things this week, but my last attempt to
fix this problem appears to be successful. I will send out a small
patch set shortly...
Olivier Langlois Aug. 21, 2021, 9:52 a.m. UTC | #31
On Tue, 2021-08-17 at 12:24 -0600, Jens Axboe wrote:
> 
> And assuming that works, then I suspect this one would fix your issue
> even with a piped core dump:
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..852737a9ccbf 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>         };
>  
>         audit_core_dumps(siginfo->si_signo);
> +       io_uring_task_cancel();
>  
>         binfmt = mm->binfmt;
>         if (!binfmt || !binfmt->core_dump)
> 
That is what my patch is doing. The function call is inserted at a
different place... I am not sure if one location is better than the
other, or if it matters at all, but there is an extra change required to
make it work...

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_task_cancel();
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;
Olivier Langlois Aug. 21, 2021, 10:08 a.m. UTC | #32
On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
> 
> Olivier, I sent a 5.10 version for Nathan, any chance you can test
> this
> one for the current kernels? Basically this one should work for
> 5.11+,
> and the later 5.10 version is just for 5.10. I'm going to send it out
> separately for review.
> 
> I do think this is the right solution, barring a tweak maybe on
> testing
> notify == TWA_SIGNAL first before digging into the task struct. But
> the
> principle is sound, and it'll work for other users of TWA_SIGNAL as
> well. None right now as far as I can tell, but the live patching is
> switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
> with
> coredumps potentially.
> 
Ok, I am going to give it a shot. This solution is probably superior to
the previous attempt as it does not inject io_uring dependency into the
coredump module.

The small extra change that I alluded to in my previous reply will
still be relevant even if we go with your patch...

I'll come back soon with your patch testing result and my small extra
change that I keep teasing about.

Greetings,
Olivier Langlois Aug. 21, 2021, 4:47 p.m. UTC | #33
On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote:
> On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
> > 
> > Olivier, I sent a 5.10 version for Nathan, any chance you can test
> > this
> > one for the current kernels? Basically this one should work for
> > 5.11+,
> > and the later 5.10 version is just for 5.10. I'm going to send it
> > out
> > separately for review.
> > 
> > I do think this is the right solution, barring a tweak maybe on
> > testing
> > notify == TWA_SIGNAL first before digging into the task struct. But
> > the
> > principle is sound, and it'll work for other users of TWA_SIGNAL as
> > well. None right now as far as I can tell, but the live patching is
> > switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
> > with
> > coredumps potentially.
> > 
> Ok, I am going to give it a shot. This solution is probably superior
> to
> the previous attempt as it does not inject io_uring dependency into
> the
> coredump module.
> 
> The small extra change that I alluded to in my previous reply will
> still be relevant even if we go with your patch...
> 
> I'll come back soon with your patch testing result and my small extra
> change that I keep teasing about.
> 
> Greetings,
> 
Jens,

your patch doesn't compile with 5.12+. AFAIK, the reason is that
JOBCTL_TASK_WORK is gone.

Wouldn't just a call to tracehook_notify_signal from do_coredump be
enough and backward compatible with every possible stable branch?
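
I.e., keep just the flush part from your patch - an untested sketch of
what I have in mind:

	/* flush a pending TIF_NOTIFY_SIGNAL before the dump starts */
	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
		tracehook_notify_signal();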

Greetings,
Jens Axboe Aug. 21, 2021, 4:51 p.m. UTC | #34
On 8/21/21 10:47 AM, Olivier Langlois wrote:
> On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote:
>> On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
>>>
>>> Olivier, I sent a 5.10 version for Nathan, any chance you can test
>>> this
>>> one for the current kernels? Basically this one should work for
>>> 5.11+,
>>> and the later 5.10 version is just for 5.10. I'm going to send it
>>> out
>>> separately for review.
>>>
>>> I do think this is the right solution, barring a tweak maybe on
>>> testing
>>> notify == TWA_SIGNAL first before digging into the task struct. But
>>> the
>>> principle is sound, and it'll work for other users of TWA_SIGNAL as
>>> well. None right now as far as I can tell, but the live patching is
>>> switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
>>> with
>>> coredumps potentially.
>>>
>> Ok, I am going to give it a shot. This solution is probably superior
>> to
>> the previous attempt as it does not inject io_uring dependency into
>> the
>> coredump module.
>>
>> The small extra change that I alluded to in my previous reply will
>> still be relevant even if we go with your patch...
>>
>> I'll come back soon with your patch testing result and my small extra
>> change that I keep teasing about.
>>
>> Greetings,
>>
> Jens,
> 
> your patch doesn't compile with 5.12+. AFAIK, the reason is that
> JOBCTL_TASK_WORK is gone.
> 
> Wouldn't just a call to tracehook_notify_signal from do_coredump be
> enough and backward compatible with every possible stable branch?

That version is just for 5.10, the first I posted is applicable to
5.11+
Olivier Langlois Aug. 21, 2021, 5:21 p.m. UTC | #35
On Sat, 2021-08-21 at 10:51 -0600, Jens Axboe wrote:
> On 8/21/21 10:47 AM, Olivier Langlois wrote:
> > Jens,
> > 
> > your patch doesn't compile with 5.12+. AFAIK, the reason is that
> > JOBCTL_TASK_WORK is gone.
> > 
> > Wouldn't just a call to tracehook_notify_signal from do_coredump be
>> enough and backward compatible with every possible stable branch?
> 
> That version is just for 5.10, the first I posted is applicable to
> 5.11+
> 
Ok, in that case, I can tell you that it partially works. There is a
small thing missing.

I have tested mine, which I shared in
https://lore.kernel.org/io-uring/87pmwt6biw.fsf@disp2133/T/#m3b51d44c12e39a06b82aac1d372df05312cff833
and, with another small patch added to it, it does work.

I will offer my patch later today.
diff mbox series

Patch

diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..c3d8fc14b993 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -519,7 +519,7 @@  static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return signal_pending(current);
+	return fatal_signal_pending(current) || freezing(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)