
[v3] binfmt_elf: Dump smaller VMAs first in ELF cores

Message ID: 036CD6AE-C560-4FC7-9B02-ADD08E380DC9@juniper.net
State: New
Series: [v3] binfmt_elf: Dump smaller VMAs first in ELF cores

Commit Message

Brian Mak Aug. 6, 2024, 6:16 p.m. UTC
Large cores may be truncated in some scenarios, such as daemons whose
stop timeouts are too short, or a lack of disk space. This impacts
debuggability with large core dumps, since critical information
necessary to form a usable backtrace, such as stacks and shared library
information, is omitted.

We attempted to figure out which VMAs are needed to create a useful
backtrace, and it turned out to be a non-trivial problem. Instead, we
try simply sorting the VMAs by size, which has the intended effect.

By sorting VMAs by dump size and dumping in that order, we have a
simple, yet effective heuristic.

Signed-off-by: Brian Mak <makb@juniper.net>
---

Hi all,

I still need to run the rr tests on this, per Kees Cook's suggestion,
and will update back once done. GDB and readelf show that this patch
works without issue, though.

Thanks,
Brian Mak

v3: Edited commit message to better convey that the alternative
    solution is non-trivial

    Moved sorting logic to fs/coredump.c so the VMA snapshot is sorted
    in place

    Above edits suggested by Eric Biederman <ebiederm@xmission.com>

v2: Edited commit message to include more reasoning for sorting VMAs
    
    Removed conditional VMA sorting with debugfs knob
    
    Above edits suggested by Eric Biederman <ebiederm@xmission.com>

 fs/coredump.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)


base-commit: eb5e56d1491297e0881c95824e2050b7c205f0d4

Comments

Linus Torvalds Aug. 6, 2024, 6:33 p.m. UTC | #1
On Tue, 6 Aug 2024 at 11:16, Brian Mak <makb@juniper.net> wrote:
>
> @@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
>                 cprm->vma_data_size += m->dump_size;
>         }
>
> +       sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
> +               cmp_vma_size, NULL);
> +
>         return true;
>  }

Hmm. Realistically we only dump core in ELF, and the order of the
segments shouldn't matter.

But I wonder if we should do this in the ->core_dump() function
itself, in case it would have mattered for other dump formats?

IOW, instead of being at the bottom of dump_vma_snapshot(), maybe the
sorting should be at the top of elf_core_dump()?

And yes, in practice I doubt we'll ever have other dump formats, and
no, a.out isn't doing some miraculous comeback either.

But I bet you didn't test elf_fdpic_core_dump() even if I bet it (a)
works and (b) nobody cares.

So moving it to the ELF side might be conceptually the right thing to do?

(Or is there some reason it needs to be done at snapshot time that I
just didn't fully appreciate?)

           Linus
Brian Mak Aug. 6, 2024, 7:24 p.m. UTC | #2
On Aug 6, 2024, at 11:33 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Hmm. Realistically we only dump core in ELF, and the order of the
> segments shouldn't matter.
> 
> But I wonder if we should do this in the ->core_dump() function
> itself, in case it would have mattered for other dump formats?
> 
> IOW, instead of being at the bottom of dump_vma_snapshot(), maybe the
> sorting should be at the top of elf_core_dump()?
> 
> And yes, in practice I doubt we'll ever have other dump formats, and
> no, a.out isn't doing some miraculous comeback either.
> 
> But I bet you didn't test elf_fdpic_core_dump() even if I bet it (a)
> works and (b) nobody cares.
> 
> So moving it to the ELF side might be conceptually the right thing to do?
> 
> (Or is there some reason it needs to be done at snapshot time that I
> just didn't fully appreciate?)

The main reason it was done at snapshot time was to make it so that it
works with other dump formats like ELF FDPIC (that being said, yes I
didn't explicitly test it). If another format is introduced and isn't
compatible with this type of reordering, then I think the introduction
of that format should be reconsidered for lack of flexibility, or if it
really must be introduced, then this logic can be changed at that time.

That being said, my opinion on this isn't too strong, so if you feel it
is better to have it in elf_core_dump, we can do that too. Since Eric
originally brought up placing it here, maybe he has his own opinions on
this as well.

Best,
Brian Mak
Kees Cook Aug. 7, 2024, 5:21 a.m. UTC | #3
On Tue, 06 Aug 2024 18:16:02 +0000, Brian Mak wrote:
> Large cores may be truncated in some scenarios, such as with daemons
> with stop timeouts that are not large enough or lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, are omitted.
> 
> We attempted to figure out which VMAs are needed to create a useful
> backtrace, and it turned out to be a non-trivial problem. Instead, we
> try simply sorting the VMAs by size, which has the intended effect.
> 
> [...]

While waiting on rr test validation, and since we're at the start of the
dev cycle, I figure let's get this into -next ASAP to see if anything
else pops out. We can drop/revise if there are problems. (And as always,
I will add any Acks/Reviews/etc that show up on the thread.)

Applied to for-next/execve, thanks!

[1/1] binfmt_elf: Dump smaller VMAs first in ELF cores
      https://git.kernel.org/kees/c/9c531dfdc1bc

Take care,
Eric W. Biederman Aug. 9, 2024, 2:39 p.m. UTC | #4
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 6 Aug 2024 at 11:16, Brian Mak <makb@juniper.net> wrote:
>>
>> @@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
>>                 cprm->vma_data_size += m->dump_size;
>>         }
>>
>> +       sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
>> +               cmp_vma_size, NULL);
>> +
>>         return true;
>>  }
>
> Hmm. Realistically we only dump core in ELF, and the order of the
> segments shouldn't matter.
>
> But I wonder if we should do this in the ->core_dump() function
> itself, in case it would have mattered for other dump formats?
>
> IOW, instead of being at the bottom of dump_vma_snapshot(), maybe the
> sorting should be at the top of elf_core_dump()?
>
> And yes, in practice I doubt we'll ever have other dump formats, and
> no, a.out isn't doing some miraculous comeback either.
>
> But I bet you didn't test elf_fdpic_core_dump() even if I bet it (a)
> works and (b) nobody cares.
>
> So moving it to the ELF side might be conceptually the right thing to do?
>
> (Or is there some reason it needs to be done at snapshot time that I
> just didn't fully appreciate?)

I asked him to perform this at snapshot time.  Plus it is obvious at
snapshot time that you can change the allocated array, while it is
not so obvious in the ->core_dump methods.

I would argue that the long term maintainable thing to do is to
merge elf_core_dump and elf_fdpic_core_dump and put all of the code
in fs/coredump.c

Performing the sort at snapshot time avoids introducing one extra reason
why the two implementations of ELF coredumping are different.

I did read through the elf fdpic code quickly and it looks like it
should just work no matter which order the vma's are dumped in.  Just
like the other elf coredump code does.




My practical concern is that someone has a coredump thing that walks
through the program headers and short circuits the walk because it knows
the program headers are all written in order.  But the only way to find
one of those is to just try it.

Eric
Linus Torvalds Aug. 9, 2024, 3:13 p.m. UTC | #5
On Fri, 9 Aug 2024 at 07:40, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I asked him to perform this at snapshot time.  Plus it is obvious at
> snapshot time that you can change the allocated array, while it is
> not so obvious in the ->core_dump methods.

Fair enough. The days when we supported a.out dumps are obviously long
long gone, and aren't coming back.

So I am not adamant that it has to be done in the dumper, and I
probably just have that historical "we have multiple different dumpers
with different rules" mindset that isn't really relevant any more.

> I would argue that the long term maintainable thing to do is to
> merge elf_core_dump and elf_fdpic_core_dump and put all of the code
> in fs/coredump.c

I wouldn't object. It's not like there's any foreseeable new core dump
format that we'd expect, and the indirection makes the code flow
harder to follow.

Not that most people look at this code a lot.

             Linus
Brian Mak Aug. 10, 2024, 12:52 a.m. UTC | #6
On Aug 6, 2024, at 10:21 PM, Kees Cook <kees@kernel.org> wrote:
> 
> On Tue, 06 Aug 2024 18:16:02 +0000, Brian Mak wrote:
>> Large cores may be truncated in some scenarios, such as with daemons
>> with stop timeouts that are not large enough or lack of disk space. This
>> impacts debuggability with large core dumps since critical information
>> necessary to form a usable backtrace, such as stacks and shared library
>> information, are omitted.
>> 
>> We attempted to figure out which VMAs are needed to create a useful
>> backtrace, and it turned out to be a non-trivial problem. Instead, we
>> try simply sorting the VMAs by size, which has the intended effect.
>> 
>> [...]
> 
> While waiting on rr test validation, and since we're at the start of the
> dev cycle, I figure let's get this into -next ASAP to see if anything
> else pops out. We can drop/revise if there are problems. (And as always,
> I will add any Acks/Reviews/etc that show up on the thread.)
> 
> Applied to for-next/execve, thanks!
> 
> [1/1] binfmt_elf: Dump smaller VMAs first in ELF cores
>      https://git.kernel.org/kees/c/9c531dfdc1bc

Thanks, Kees! And, thanks Linus + Eric for taking the time to comment on
this.

Regarding the rr tests, it was not an easy task to get the environment
set up to do this, but I did it and was able to run the tests. The rr
tests require a lot of kernel config options and there's no list
documenting what's needed anywhere...

All the tests pass except for the sioc and sioc-no-syscallbuf tests.
However, these test failures are due to an incompatibility with the
network adapter I'm using; the test seems to only work with older
network adapters. I've switched my virtualized network adapter twice
now, and each time, the test gets a bit further than the previous time.
I'll continue trying different network adapters until something hopefully
works. In any case, since this error isn't directly related to my
changes and the rest of the tests pass, then I think we can be pretty
confident that this change is not breaking rr.

Best,
Brian Mak

> Take care,
> 
> --
> Kees Cook
>
Kees Cook Aug. 10, 2024, 4:06 a.m. UTC | #7
On Sat, Aug 10, 2024 at 12:52:16AM +0000, Brian Mak wrote:
> On Aug 6, 2024, at 10:21 PM, Kees Cook <kees@kernel.org> wrote:
> > 
> > On Tue, 06 Aug 2024 18:16:02 +0000, Brian Mak wrote:
> >> Large cores may be truncated in some scenarios, such as with daemons
> >> with stop timeouts that are not large enough or lack of disk space. This
> >> impacts debuggability with large core dumps since critical information
> >> necessary to form a usable backtrace, such as stacks and shared library
> >> information, are omitted.
> >> 
> >> We attempted to figure out which VMAs are needed to create a useful
> >> backtrace, and it turned out to be a non-trivial problem. Instead, we
> >> try simply sorting the VMAs by size, which has the intended effect.
> >> 
> >> [...]
> > 
> > While waiting on rr test validation, and since we're at the start of the
> > dev cycle, I figure let's get this into -next ASAP to see if anything
> > else pops out. We can drop/revise if there are problems. (And as always,
> > I will add any Acks/Reviews/etc that show up on the thread.)
> > 
> > Applied to for-next/execve, thanks!
> > 
> > [1/1] binfmt_elf: Dump smaller VMAs first in ELF cores
> >      https://git.kernel.org/kees/c/9c531dfdc1bc
> 
> Thanks, Kees! And, thanks Linus + Eric for taking the time to comment on
> this.
> 
> Regarding the rr tests, it was not an easy task to get the environment
> set up to do this, but I did it and was able to run the tests. The rr
> tests require a lot of kernel config options and there's no list
> documenting what's needed anywhere...

Thanks for suffering through that!

> All the tests pass except for the sioc and sioc-no-syscallbuf tests.
> However, these test failures are due to an incompatibility with the
> network adapter I'm using. It seems that it only likes older network
> adapters. I've switched my virtualized network adapter twice now, and
> each time, the test gets a bit further than the previous time. Will
> continue trying different network adapters until something hopefully
> works. In any case, since this error isn't directly related to my
> changes and the rest of the tests pass, then I think we can be pretty
> confident that this change is not breaking rr.

Perfect! Okay, we'll keep our eyes open for any reports of breakage. :)

-Kees
Eric W. Biederman Aug. 10, 2024, 12:28 p.m. UTC | #8
Brian Mak <makb@juniper.net> writes:

> Large cores may be truncated in some scenarios, such as with daemons
> with stop timeouts that are not large enough or lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, are omitted.
>
> We attempted to figure out which VMAs are needed to create a useful
> backtrace, and it turned out to be a non-trivial problem. Instead, we
> try simply sorting the VMAs by size, which has the intended effect.
>
> By sorting VMAs by dump size and dumping in that order, we have a
> simple, yet effective heuristic.

To make finding the history easier I would include:
v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

As Kees has already picked this up this is quite possibly silly.
But *shrug* that was when I was out.


> Signed-off-by: Brian Mak <makb@juniper.net>
> ---
>
> Hi all,
>
> Still need to run rr tests on this, per Kees Cook's suggestion, will
> update back once done. GDB and readelf show that this patch works
> without issue though.
>
> Thanks,
> Brian Mak
>
> v3: Edited commit message to better convey alternative solution as
>     non-trivial
>
>     Moved sorting logic to fs/coredump.c to make it in place
>
>     Above edits suggested by Eric Biederman <ebiederm@xmission.com>
>
> v2: Edited commit message to include more reasoning for sorting VMAs
>     
>     Removed conditional VMA sorting with debugfs knob
>     
>     Above edits suggested by Eric Biederman <ebiederm@xmission.com>
>
>  fs/coredump.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 7f12ff6ad1d3..33c5ac53ab31 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -18,6 +18,7 @@
>  #include <linux/personality.h>
>  #include <linux/binfmts.h>
>  #include <linux/coredump.h>
> +#include <linux/sort.h>
>  #include <linux/sched/coredump.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/task_stack.h>
> @@ -1191,6 +1192,18 @@ static void free_vma_snapshot(struct coredump_params *cprm)
>  	}
>  }
>  
> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
> +{
> +	const struct core_vma_metadata *vma_meta_lhs = vma_meta_lhs_ptr;
> +	const struct core_vma_metadata *vma_meta_rhs = vma_meta_rhs_ptr;
> +
> +	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
> +		return -1;
> +	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * Under the mmap_lock, take a snapshot of relevant information about the task's
>   * VMAs.
> @@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
>  		cprm->vma_data_size += m->dump_size;
>  	}
>  
> +	sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
> +		cmp_vma_size, NULL);
> +
>  	return true;
>  }
>
> base-commit: eb5e56d1491297e0881c95824e2050b7c205f0d4
Kees Cook Aug. 12, 2024, 6:05 p.m. UTC | #9
On Sat, Aug 10, 2024 at 07:28:44AM -0500, Eric W. Biederman wrote:
> Brian Mak <makb@juniper.net> writes:
> 
> > Large cores may be truncated in some scenarios, such as with daemons
> > with stop timeouts that are not large enough or lack of disk space. This
> > impacts debuggability with large core dumps since critical information
> > necessary to form a usable backtrace, such as stacks and shared library
> > information, are omitted.
> >
> > We attempted to figure out which VMAs are needed to create a useful
> > backtrace, and it turned out to be a non-trivial problem. Instead, we
> > try simply sorting the VMAs by size, which has the intended effect.
> >
> > By sorting VMAs by dump size and dumping in that order, we have a
> > simple, yet effective heuristic.
> 
> To make finding the history easier I would include:
> v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
> v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> As Kees has already picked this up this is quite possibly silly.
> But *shrug* that was when I was out.

I've updated the trailers. Thanks for the review!

-Kees
Brian Mak Aug. 12, 2024, 6:21 p.m. UTC | #10
On Aug 12, 2024, at 11:05 AM, Kees Cook <kees@kernel.org> wrote:

> On Sat, Aug 10, 2024 at 07:28:44AM -0500, Eric W. Biederman wrote:
>> Brian Mak <makb@juniper.net> writes:
>> 
>>> Large cores may be truncated in some scenarios, such as with daemons
>>> with stop timeouts that are not large enough or lack of disk space. This
>>> impacts debuggability with large core dumps since critical information
>>> necessary to form a usable backtrace, such as stacks and shared library
>>> information, are omitted.
>>> 
>>> We attempted to figure out which VMAs are needed to create a useful
>>> backtrace, and it turned out to be a non-trivial problem. Instead, we
>>> try simply sorting the VMAs by size, which has the intended effect.
>>> 
>>> By sorting VMAs by dump size and dumping in that order, we have a
>>> simple, yet effective heuristic.
>> 
>> To make finding the history easier I would include:
>> v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
>> v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
>> 
>> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> As Kees has already picked this up this is quite possibly silly.
>> But *shrug* that was when I was out.
> 
> I've updated the trailers. Thanks for the review!

Hi Kees,

Thanks! I think you added it to the wrong commit though.

Please double check and update accordingly.

Regarding the sioc tests from earlier, I've reached a point where I
think I have a compatible virtual NIC (no more ioctl errors), but it's
giving me some mismatched registers error, causing the test to fail. I
can see this same test failure on a vanilla kernel with my setup, so
this is probably either some environment issue or a bug with rr or the
tests. Since all the other tests pass, I'm just going to leave it at
that.

Best,
Brian Mak
Kees Cook Aug. 12, 2024, 6:25 p.m. UTC | #11
On Mon, Aug 12, 2024 at 06:21:15PM +0000, Brian Mak wrote:
> On Aug 12, 2024, at 11:05 AM, Kees Cook <kees@kernel.org> wrote
> 
> > On Sat, Aug 10, 2024 at 07:28:44AM -0500, Eric W. Biederman wrote:
> >> Brian Mak <makb@juniper.net> writes:
> >> 
> >>> Large cores may be truncated in some scenarios, such as with daemons
> >>> with stop timeouts that are not large enough or lack of disk space. This
> >>> impacts debuggability with large core dumps since critical information
> >>> necessary to form a usable backtrace, such as stacks and shared library
> >>> information, are omitted.
> >>> 
> >>> We attempted to figure out which VMAs are needed to create a useful
> >>> backtrace, and it turned out to be a non-trivial problem. Instead, we
> >>> try simply sorting the VMAs by size, which has the intended effect.
> >>> 
> >>> By sorting VMAs by dump size and dumping in that order, we have a
> >>> simple, yet effective heuristic.
> >> 
> >> To make finding the history easier I would include:
> >> v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
> >> v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
> >> 
> >> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> 
> >> As Kees has already picked this up this is quite possibly silly.
> >> But *shrug* that was when I was out.
> > 
> > I've updated the trailers. Thanks for the review!
> 
> Hi Kees,
> 
> Thanks! I think you added it to the wrong commit though.

Ugh. Time for more coffee. Thanks; fixed. I need to update my "b4" -- it
was hanging doing the trailers update so I did it myself manually...
That'll teach me. ;)

> tests. Since all the other tests pass, I'm just going to leave it at
> that.

Yeah, I think you're good. Thank you for taking the time to test rr!

Patch

diff --git a/fs/coredump.c b/fs/coredump.c
index 7f12ff6ad1d3..33c5ac53ab31 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -18,6 +18,7 @@ 
 #include <linux/personality.h>
 #include <linux/binfmts.h>
 #include <linux/coredump.h>
+#include <linux/sort.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task_stack.h>
@@ -1191,6 +1192,18 @@  static void free_vma_snapshot(struct coredump_params *cprm)
 	}
 }
 
+static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
+{
+	const struct core_vma_metadata *vma_meta_lhs = vma_meta_lhs_ptr;
+	const struct core_vma_metadata *vma_meta_rhs = vma_meta_rhs_ptr;
+
+	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
+		return -1;
+	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
+		return 1;
+	return 0;
+}
+
 /*
  * Under the mmap_lock, take a snapshot of relevant information about the task's
  * VMAs.
@@ -1253,5 +1266,8 @@  static bool dump_vma_snapshot(struct coredump_params *cprm)
 		cprm->vma_data_size += m->dump_size;
 	}
 
+	sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
+		cmp_vma_size, NULL);
+
 	return true;
 }