
[RFC] binfmt_elf: Dump smaller VMAs first in ELF cores

Message ID CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net (mailing list archive)
State New
Series [RFC] binfmt_elf: Dump smaller VMAs first in ELF cores

Commit Message

Brian Mak July 31, 2024, 10:14 p.m. UTC
Large cores may be truncated in some scenarios, such as daemons with stop
timeouts that are too short, or a lack of disk space. This impacts
debuggability with large core dumps, since critical information necessary
to form a usable backtrace, such as stacks and shared library information,
is omitted. We can mitigate the impact of core dump truncation by dumping
smaller VMAs first, which are more likely to contain memory for stacks and
shared library information, thus allowing a usable backtrace to be formed.

We implement this by sorting the VMAs by dump size and dumping in that
order.

Signed-off-by: Brian Mak <makb@juniper.net>
---

Hi all,

My initial testing with a program that spawns several threads and allocates heap
memory shows that this patch does indeed prioritize information such as stacks,
which is crucial to forming a backtrace and debugging core dumps.
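
A minimal sketch of the kind of test program used for this check
(illustrative only; the thread count, allocation size, and file name are
arbitrary) looks roughly like:

  /*
   * Illustrative sketch: spawn a few threads, touch a large heap
   * allocation in each, then abort so the kernel writes an ELF core.
   * Build with "gcc -pthread crasher.c" and run with a large enough
   * "ulimit -c". The heap VMAs dwarf the stacks, so a core written
   * smallest-VMA-first should still contain the stacks even when
   * truncated.
   */
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define NTHREADS 4
  #define HEAP_SZ  (256UL * 1024 * 1024)

  static void *worker(void *arg)
  {
          char *buf = malloc(HEAP_SZ);

          if (buf)
                  memset(buf, 0xaa, HEAP_SZ);     /* fault the pages in */
          pause();                                /* park until the crash */
          return arg;
  }

  int main(void)
  {
          pthread_t tid[NTHREADS];

          for (int i = 0; i < NTHREADS; i++)
                  pthread_create(&tid[i], NULL, worker, NULL);
          sleep(1);       /* give the workers time to allocate */
          abort();        /* SIGABRT -> core dump */
  }

Truncating the resulting core (e.g. with a small RLIMIT_CORE) and loading
it in gdb is then a quick way to see whether the thread stacks survived.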

Requesting comments on the following:

Are there cases where this might not necessarily prioritize dumping VMAs
needed to obtain a usable backtrace?

Thanks,
Brian Mak

 fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)


base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d

Comments

Eric W. Biederman Aug. 1, 2024, 2:52 a.m. UTC | #1
Brian Mak <makb@juniper.net> writes:

> Large cores may be truncated in some scenarios, such as daemons with stop
> timeouts that are not large enough or lack of disk space. This impacts
> debuggability with large core dumps since critical information necessary to
> form a usable backtrace, such as stacks and shared library information, are
> omitted. We can mitigate the impact of core dump truncation by dumping
> smaller VMAs first, which may be more likely to contain memory for stacks
> and shared library information, thus allowing a usable backtrace to be
> formed.

This sounds theoretical.  Do you happen to have a description of a
motivating case?  A situation that bit someone and resulted in a core
file that wasn't usable?

A concrete situation would help us imagine what possible caveats there
are with sorting vmas this way.

The most common case I am aware of is distributions setting the core
file size to 0 (ulimit -c 0).

One practical concern with this approach is that I think the ELF
specification says that program headers should be written in memory
order.  So a comment on your testing to see if gdb or rr or any of
the other debuggers that read core dumps cares would be appreciated.
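
For reference, whether the PT_LOAD entries remain in ascending p_vaddr
order can be checked straight from the core's program header table; a
rough sketch of such a check (illustrative only, ELF64 cores, ignoring
the PN_XNUM case) could be:

  /* Print PT_LOAD headers of an ELF64 core and report whether they
   * appear in ascending virtual-address order. */
  #include <elf.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          FILE *f;
          Elf64_Ehdr eh;
          Elf64_Phdr *ph;
          Elf64_Addr prev = 0;
          int sorted = 1;

          if (argc != 2 || !(f = fopen(argv[1], "rb")))
                  return 1;
          if (fread(&eh, sizeof(eh), 1, f) != 1)
                  return 1;

          ph = calloc(eh.e_phnum, sizeof(*ph));
          if (!ph || fseek(f, eh.e_phoff, SEEK_SET) ||
              fread(ph, sizeof(*ph), eh.e_phnum, f) != eh.e_phnum)
                  return 1;

          for (int i = 0; i < eh.e_phnum; i++) {
                  if (ph[i].p_type != PT_LOAD)
                          continue;
                  printf("vaddr=0x%llx filesz=0x%llx\n",
                         (unsigned long long)ph[i].p_vaddr,
                         (unsigned long long)ph[i].p_filesz);
                  if (ph[i].p_vaddr < prev)
                          sorted = 0;
                  prev = ph[i].p_vaddr;
          }
          printf("PT_LOAD segments %s in memory order\n",
                 sorted ? "are" : "are NOT");
          free(ph);
          fclose(f);
          return 0;
  }

"readelf -l <core>" shows the same information, so either works for
spot-checking gdb or rr against the reordered cores.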


> We implement this by sorting the VMAs by dump size and dumping in that
> order.

Since your concern is about stacks, and the kernel has information about
stacks it might be worth using that information explicitly when sorting
vmas, instead of just assuming stacks will be small.

I expect the priorities would look something like jit generated
executable code segments, stacks, and then heap data.
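
A sketch of what such a flag-based ranking might look like, assuming the
vm_flags captured into core_vma_metadata by dump_vma_snapshot() are used
(illustrative only, not part of the patch, and the ranking itself is just
one reading of the priority list above):

  /* Rank VMAs: executable mappings (incl. JIT) first, then growable
   * stacks, then everything else; fall back to size within a rank. */
  static int vma_dump_priority(const struct core_vma_metadata *m)
  {
          if (m->flags & VM_EXEC)
                  return 0;
          if (m->flags & VM_GROWSDOWN)
                  return 1;
          return 2;
  }

  static int cmp_vma_priority(const void *a, const void *b)
  {
          const struct core_vma_metadata *lhs =
                  *(const struct core_vma_metadata **)a;
          const struct core_vma_metadata *rhs =
                  *(const struct core_vma_metadata **)b;
          int pl = vma_dump_priority(lhs), pr = vma_dump_priority(rhs);

          if (pl != pr)
                  return pl - pr;
          if (lhs->dump_size < rhs->dump_size)
                  return -1;
          return lhs->dump_size > rhs->dump_size;
  }

One caveat with a pure flag-based ranking: VM_GROWSDOWN only marks the
main thread's growable stack on most architectures, while pthread stacks
are ordinary anonymous mappings, which is part of what makes the plain
size heuristic attractive.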

I don't have enough information about what is causing your truncated core
dumps, so I can't guess what the actual problem is you are fighting,
so I could be wrong on priorities.

Though I do wonder if this might be a buggy interaction between
core dumps and something like signals, or io_uring.  If it is something
other than a shortage of storage space causing your truncated core
dumps I expect we should first debug why the coredumps are being
truncated rather than proceed directly to working around truncation.

Eric

> Signed-off-by: Brian Mak <makb@juniper.net>
> ---
>
> Hi all,
>
> My initial testing with a program that spawns several threads and allocates heap
> memory shows that this patch does indeed prioritize information such as stacks,
> which is crucial to forming a backtrace and debugging core dumps.
>
> Requesting for comments on the following:
>
> Are there cases where this might not necessarily prioritize dumping VMAs
> needed to obtain a usable backtrace?
>
> Thanks,
> Brian Mak
>
>  fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 62 insertions(+), 2 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 19fa49cd9907..d45240b0748d 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -13,6 +13,7 @@
>  #include <linux/module.h>
>  #include <linux/kernel.h>
>  #include <linux/fs.h>
> +#include <linux/debugfs.h>
>  #include <linux/log2.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> @@ -37,6 +38,7 @@
>  #include <linux/elf-randomize.h>
>  #include <linux/utsname.h>
>  #include <linux/coredump.h>
> +#include <linux/sort.h>
>  #include <linux/sched.h>
>  #include <linux/sched/coredump.h>
>  #include <linux/sched/task_stack.h>
> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>  	shdr4extnum->sh_info = segs;
>  }
>  
> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
> +{
> +	const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
> +		vma_meta_lhs_ptr;
> +	const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
> +		vma_meta_rhs_ptr;
> +
> +	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
> +		return -1;
> +	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
> +		return 1;
> +	return 0;
> +}
> +
> +static bool sort_elf_core_vmas = true;
> +
>  /*
>   * Actual dumper
>   *
> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	struct elf_shdr *shdr4extnum = NULL;
>  	Elf_Half e_phnum;
>  	elf_addr_t e_shoff;
> +	struct core_vma_metadata **sorted_vmas = NULL;
>  
>  	/*
>  	 * The number of segs are recored into ELF header as 16bit value.
> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>  		goto end_coredump;
>  
> +	/* Allocate memory to sort VMAs and sort if needed. */
> +	if (sort_elf_core_vmas)
> +		sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
> +
> +	if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
> +		for (i = 0; i < cprm->vma_count; i++)
> +			sorted_vmas[i] = cprm->vma_meta + i;
> +
> +		sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
> +	}
> +
>  	/* Write program headers for segments dump */
>  	for (i = 0; i < cprm->vma_count; i++) {
> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
> +		struct core_vma_metadata *meta;
>  		struct elf_phdr phdr;
>  
> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
> +			meta = cprm->vma_meta + i;
> +		else
> +			meta = sorted_vmas[i];
> +
>  		phdr.p_type = PT_LOAD;
>  		phdr.p_offset = offset;
>  		phdr.p_vaddr = meta->start;
> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	dump_skip_to(cprm, dataoff);
>  
>  	for (i = 0; i < cprm->vma_count; i++) {
> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
> +		struct core_vma_metadata *meta;
> +
> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
> +			meta = cprm->vma_meta + i;
> +		else
> +			meta = sorted_vmas[i];
>  
>  		if (!dump_user_range(cprm, meta->start, meta->dump_size))
>  			goto end_coredump;
> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>  end_coredump:
>  	free_note_info(&info);
>  	kfree(shdr4extnum);
> +	kvfree(sorted_vmas);
>  	kfree(phdr4note);
>  	return has_dumped;
>  }
>  
> +#ifdef CONFIG_DEBUG_FS
> +
> +static struct dentry *elf_core_debugfs;
> +
> +static int __init init_elf_core_debugfs(void)
> +{
> +	elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
> +	debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
> +	return 0;
> +}
> +
> +fs_initcall(init_elf_core_debugfs);
> +
> +#endif		/* CONFIG_DEBUG_FS */
> +
>  #endif		/* CONFIG_ELF_CORE */
>  
>  static int __init init_elf_binfmt(void)
> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>  {
>  	/* Remove the COFF and ELF loaders. */
>  	unregister_binfmt(&elf_format);
> +
> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
> +	debugfs_remove(elf_core_debugfs);
> +#endif
>  }
>  
>  core_initcall(init_elf_binfmt);
>
> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
Brian Mak Aug. 1, 2024, 5:58 p.m. UTC | #2
On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> Brian Mak <makb@juniper.net> writes:
> 
>> Large cores may be truncated in some scenarios, such as daemons with stop
>> timeouts that are not large enough or lack of disk space. This impacts
>> debuggability with large core dumps since critical information necessary to
>> form a usable backtrace, such as stacks and shared library information, are
>> omitted. We can mitigate the impact of core dump truncation by dumping
>> smaller VMAs first, which may be more likely to contain memory for stacks
>> and shared library information, thus allowing a usable backtrace to be
>> formed.
> 
> This sounds theoretical.  Do you happen to have a description of a
> motivating case?  A situtation that bit someone and resulted in a core
> file that wasn't usable?
> 
> A concrete situation would help us imagine what possible caveats there
> are with sorting vmas this way.
> 
> The most common case I am aware of is distributions setting the core
> file size to 0 (ulimit -c 0).

Hi Eric,

Thanks for taking the time to reply. We have hit these scenarios before in
practice where large cores are truncated, resulting in an unusable core.

At Juniper, we have some daemons that can consume a lot of memory and,
upon crash, can produce core dumps of several GB. While dumping, we've
encountered these two scenarios resulting in an unusable core:

1. Disk space is low at the time of core dump, resulting in a truncated
core once the disk is full.

2. A daemon has a TimeoutStopSec option configured in its systemd unit
file, which, while core dumping, can time out (triggering a SIGKILL) if
the core dump is too large and takes too long to write.

In both scenarios, we see that the core file is already several GB, and
still does not contain the information necessary to form a backtrace, thus
creating the need for this change. In the second scenario, we are unable to
increase the timeout option due to our recovery time objective
requirements.

> One practical concern with this approach is that I think the ELF
> specification says that program headers should be written in memory
> order.  So a comment on your testing to see if gdb or rr or any of
> the other debuggers that read core dumps cares would be appreciated.

I've already tested readelf and gdb on core dumps (truncated and whole)
with this patch, and they are able to read/use these core dumps in these
scenarios and produce a proper backtrace.

>> We implement this by sorting the VMAs by dump size and dumping in that
>> order.
> 
> Since your concern is about stacks, and the kernel has information about
> stacks it might be worth using that information explicitly when sorting
> vmas, instead of just assuming stacks will be small.

This was originally the approach that we explored, but we ultimately moved
away from it. We need more than just stacks to form a proper backtrace. I
didn't pin down exactly what it was that we needed, because the sorting
solution seemed cleaner than trying to enumerate each of the pieces we'd
need. At the very least, we need information about shared libraries
(.dynamic, etc.) and stacks, but my testing showed that we need a third
piece sitting in an anonymous R/W VMA, which is the point at which I
stopped exploring this path. I was having a difficult time narrowing down
what this last piece was.

> I expect the priorities would look something like jit generated
> executable code segments, stacks, and then heap data.
> 
> I don't have enough information what is causing your truncated core
> dumps, so I can't guess what the actual problem is your are fighting,
> so I could be wrong on priorities.
> 
> Though I do wonder if this might be a buggy interaction between
> core dumps and something like signals, or io_uring.  If it is something
> other than a shortage of storage space causing your truncated core
> dumps I expect we should first debug why the coredumps are being
> truncated rather than proceed directly to working around truncation.

I don't really see any feasible workaround for preventing truncation of
these core dumps. Our truncated cores are also not the result of any
bugs, but rather of a limitation.

Please let me know your thoughts!

Best,
Brian Mak

> Eric
> 
>> Signed-off-by: Brian Mak <makb@juniper.net>
>> ---
>> 
>> Hi all,
>> 
>> My initial testing with a program that spawns several threads and allocates heap
>> memory shows that this patch does indeed prioritize information such as stacks,
>> which is crucial to forming a backtrace and debugging core dumps.
>> 
>> Requesting for comments on the following:
>> 
>> Are there cases where this might not necessarily prioritize dumping VMAs
>> needed to obtain a usable backtrace?
>> 
>> Thanks,
>> Brian Mak
>> 
>> fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 62 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>> index 19fa49cd9907..d45240b0748d 100644
>> --- a/fs/binfmt_elf.c
>> +++ b/fs/binfmt_elf.c
>> @@ -13,6 +13,7 @@
>> #include <linux/module.h>
>> #include <linux/kernel.h>
>> #include <linux/fs.h>
>> +#include <linux/debugfs.h>
>> #include <linux/log2.h>
>> #include <linux/mm.h>
>> #include <linux/mman.h>
>> @@ -37,6 +38,7 @@
>> #include <linux/elf-randomize.h>
>> #include <linux/utsname.h>
>> #include <linux/coredump.h>
>> +#include <linux/sort.h>
>> #include <linux/sched.h>
>> #include <linux/sched/coredump.h>
>> #include <linux/sched/task_stack.h>
>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>      shdr4extnum->sh_info = segs;
>> }
>> 
>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>> +{
>> +     const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>> +             vma_meta_lhs_ptr;
>> +     const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>> +             vma_meta_rhs_ptr;
>> +
>> +     if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>> +             return -1;
>> +     if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>> +             return 1;
>> +     return 0;
>> +}
>> +
>> +static bool sort_elf_core_vmas = true;
>> +
>> /*
>>  * Actual dumper
>>  *
>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>      struct elf_shdr *shdr4extnum = NULL;
>>      Elf_Half e_phnum;
>>      elf_addr_t e_shoff;
>> +     struct core_vma_metadata **sorted_vmas = NULL;
>> 
>>      /*
>>       * The number of segs are recored into ELF header as 16bit value.
>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>      if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>              goto end_coredump;
>> 
>> +     /* Allocate memory to sort VMAs and sort if needed. */
>> +     if (sort_elf_core_vmas)
>> +             sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>> +
>> +     if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>> +             for (i = 0; i < cprm->vma_count; i++)
>> +                     sorted_vmas[i] = cprm->vma_meta + i;
>> +
>> +             sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>> +     }
>> +
>>      /* Write program headers for segments dump */
>>      for (i = 0; i < cprm->vma_count; i++) {
>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>> +             struct core_vma_metadata *meta;
>>              struct elf_phdr phdr;
>> 
>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>> +                     meta = cprm->vma_meta + i;
>> +             else
>> +                     meta = sorted_vmas[i];
>> +
>>              phdr.p_type = PT_LOAD;
>>              phdr.p_offset = offset;
>>              phdr.p_vaddr = meta->start;
>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>      dump_skip_to(cprm, dataoff);
>> 
>>      for (i = 0; i < cprm->vma_count; i++) {
>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>> +             struct core_vma_metadata *meta;
>> +
>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>> +                     meta = cprm->vma_meta + i;
>> +             else
>> +                     meta = sorted_vmas[i];
>> 
>>              if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>                      goto end_coredump;
>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>> end_coredump:
>>      free_note_info(&info);
>>      kfree(shdr4extnum);
>> +     kvfree(sorted_vmas);
>>      kfree(phdr4note);
>>      return has_dumped;
>> }
>> 
>> +#ifdef CONFIG_DEBUG_FS
>> +
>> +static struct dentry *elf_core_debugfs;
>> +
>> +static int __init init_elf_core_debugfs(void)
>> +{
>> +     elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>> +     debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>> +     return 0;
>> +}
>> +
>> +fs_initcall(init_elf_core_debugfs);
>> +
>> +#endif               /* CONFIG_DEBUG_FS */
>> +
>> #endif               /* CONFIG_ELF_CORE */
>> 
>> static int __init init_elf_binfmt(void)
>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>> {
>>      /* Remove the COFF and ELF loaders. */
>>      unregister_binfmt(&elf_format);
>> +
>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>> +     debugfs_remove(elf_core_debugfs);
>> +#endif
>> }
>> 
>> core_initcall(init_elf_binfmt);
>> 
>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
Eric W. Biederman Aug. 2, 2024, 4:16 p.m. UTC | #3
Brian Mak <makb@juniper.net> writes:

> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> Brian Mak <makb@juniper.net> writes:
>> 
>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>> timeouts that are not large enough or lack of disk space. This impacts
>>> debuggability with large core dumps since critical information necessary to
>>> form a usable backtrace, such as stacks and shared library information, are
>>> omitted. We can mitigate the impact of core dump truncation by dumping
>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>> and shared library information, thus allowing a usable backtrace to be
>>> formed.
>> 
>> This sounds theoretical.  Do you happen to have a description of a
>> motivating case?  A situtation that bit someone and resulted in a core
>> file that wasn't usable?
>> 
>> A concrete situation would help us imagine what possible caveats there
>> are with sorting vmas this way.
>> 
>> The most common case I am aware of is distributions setting the core
>> file size to 0 (ulimit -c 0).
>
> Hi Eric,
>
> Thanks for taking the time to reply. We have hit these scenarios before in
> practice where large cores are truncated, resulting in an unusable core.
>
> At Juniper, we have some daemons that can consume a lot of memory, where
> upon crash, can result in core dumps of several GBs. While dumping, we've
> encountered these two scenarios resulting in a unusable core:
>
> 1. Disk space is low at the time of core dump, resulting in a truncated
> core once the disk is full.
>
> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
> file, where upon core dumping, could timeout (triggering a SIGKILL) if the
> core dump is too large and is taking too long to dump.
>
> In both scenarios, we see that the core file is already several GB, and
> still does not contain the information necessary to form a backtrace, thus
> creating the need for this change. In the second scenario, we are unable to
> increase the timeout option due to our recovery time objective
> requirements.
>
>> One practical concern with this approach is that I think the ELF
>> specification says that program headers should be written in memory
>> order.  So a comment on your testing to see if gdb or rr or any of
>> the other debuggers that read core dumps cares would be appreciated.
>
> I've already tested readelf and gdb on core dumps (truncated and whole)
> with this patch and it is able to read/use these core dumps in these
> scenarios with a proper backtrace.
>
>>> We implement this by sorting the VMAs by dump size and dumping in that
>>> order.
>> 
>> Since your concern is about stacks, and the kernel has information about
>> stacks it might be worth using that information explicitly when sorting
>> vmas, instead of just assuming stacks will be small.
>
> This was originally the approach that we explored, but ultimately moved
> away from. We need more than just stacks to form a proper backtrace. I
> didn't narrow down exactly what it was that we needed because the sorting
> solution seemed to be cleaner than trying to narrow down each of these
> pieces that we'd need. At the very least, we need information about shared
> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
> third piece sitting in an anonymous R/W VMA, which is the point that I
> stopped exploring this path. I was having a difficult time narrowing down
> what this last piece was.
>
>> I expect the priorities would look something like jit generated
>> executable code segments, stacks, and then heap data.
>> 
>> I don't have enough information what is causing your truncated core
>> dumps, so I can't guess what the actual problem is your are fighting,
>> so I could be wrong on priorities.
>> 
>> Though I do wonder if this might be a buggy interaction between
>> core dumps and something like signals, or io_uring.  If it is something
>> other than a shortage of storage space causing your truncated core
>> dumps I expect we should first debug why the coredumps are being
>> truncated rather than proceed directly to working around truncation.
>
> I don't really see any feasible workarounds that can be done for preventing
> truncation of these core dumps. Our truncated cores are also not the result
> of any bugs, but rather a limitation.

Thanks, that clarifies things.

From a quality of implementation standpoint I regret that at least some
pause during coredumping is inevitable.  Ideally I would like to
minimize that pause, preserve the memory and have a separate kernel
thread perform the coredumping work.  That in principle would remove the
need for coredumps to be stopped when a SIGKILL is delivered and avoid the
issue with the systemd timeout.  Plus it would allow systemd to respawn
the process before the coredump was complete.  Getting there is in no
sense easy, and it would still leave the problem of not getting
the whole coredump when you are running out of disk space.

The explanation of the vma sort is good.  Sorting by size seems to make
a simple and very effective heuristic.  It would be nice if that
explanation appeared in the change description.

From a maintenance perspective it would be very nice to perform the vma
size sort unconditionally.   Can you remove the knob making the size
sort conditional?  If someone reports a regression we can add a knob
making the sort conditional.

We are in the middle of the merge window right now but I expect Kees
could take a simplified patch (aka no knob) after the merge window
closes and get it into linux-next.  Which should give plenty of time
to spot any regressions caused by sorting the vmas.

Eric


>
> Please let me know your thoughts!
>
> Best,
> Brian Mak
>
>> Eric
>> 
>>> Signed-off-by: Brian Mak <makb@juniper.net>
>>> ---
>>> 
>>> Hi all,
>>> 
>>> My initial testing with a program that spawns several threads and allocates heap
>>> memory shows that this patch does indeed prioritize information such as stacks,
>>> which is crucial to forming a backtrace and debugging core dumps.
>>> 
>>> Requesting for comments on the following:
>>> 
>>> Are there cases where this might not necessarily prioritize dumping VMAs
>>> needed to obtain a usable backtrace?
>>> 
>>> Thanks,
>>> Brian Mak
>>> 
>>> fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 62 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>>> index 19fa49cd9907..d45240b0748d 100644
>>> --- a/fs/binfmt_elf.c
>>> +++ b/fs/binfmt_elf.c
>>> @@ -13,6 +13,7 @@
>>> #include <linux/module.h>
>>> #include <linux/kernel.h>
>>> #include <linux/fs.h>
>>> +#include <linux/debugfs.h>
>>> #include <linux/log2.h>
>>> #include <linux/mm.h>
>>> #include <linux/mman.h>
>>> @@ -37,6 +38,7 @@
>>> #include <linux/elf-randomize.h>
>>> #include <linux/utsname.h>
>>> #include <linux/coredump.h>
>>> +#include <linux/sort.h>
>>> #include <linux/sched.h>
>>> #include <linux/sched/coredump.h>
>>> #include <linux/sched/task_stack.h>
>>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>>      shdr4extnum->sh_info = segs;
>>> }
>>> 
>>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>>> +{
>>> +     const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>>> +             vma_meta_lhs_ptr;
>>> +     const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>>> +             vma_meta_rhs_ptr;
>>> +
>>> +     if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>>> +             return -1;
>>> +     if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>>> +             return 1;
>>> +     return 0;
>>> +}
>>> +
>>> +static bool sort_elf_core_vmas = true;
>>> +
>>> /*
>>>  * Actual dumper
>>>  *
>>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>      struct elf_shdr *shdr4extnum = NULL;
>>>      Elf_Half e_phnum;
>>>      elf_addr_t e_shoff;
>>> +     struct core_vma_metadata **sorted_vmas = NULL;
>>> 
>>>      /*
>>>       * The number of segs are recored into ELF header as 16bit value.
>>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>      if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>>              goto end_coredump;
>>> 
>>> +     /* Allocate memory to sort VMAs and sort if needed. */
>>> +     if (sort_elf_core_vmas)
>>> +             sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>>> +
>>> +     if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>>> +             for (i = 0; i < cprm->vma_count; i++)
>>> +                     sorted_vmas[i] = cprm->vma_meta + i;
>>> +
>>> +             sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>>> +     }
>>> +
>>>      /* Write program headers for segments dump */
>>>      for (i = 0; i < cprm->vma_count; i++) {
>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>> +             struct core_vma_metadata *meta;
>>>              struct elf_phdr phdr;
>>> 
>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>> +                     meta = cprm->vma_meta + i;
>>> +             else
>>> +                     meta = sorted_vmas[i];
>>> +
>>>              phdr.p_type = PT_LOAD;
>>>              phdr.p_offset = offset;
>>>              phdr.p_vaddr = meta->start;
>>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>      dump_skip_to(cprm, dataoff);
>>> 
>>>      for (i = 0; i < cprm->vma_count; i++) {
>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>> +             struct core_vma_metadata *meta;
>>> +
>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>> +                     meta = cprm->vma_meta + i;
>>> +             else
>>> +                     meta = sorted_vmas[i];
>>> 
>>>              if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>>                      goto end_coredump;
>>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>>> end_coredump:
>>>      free_note_info(&info);
>>>      kfree(shdr4extnum);
>>> +     kvfree(sorted_vmas);
>>>      kfree(phdr4note);
>>>      return has_dumped;
>>> }
>>> 
>>> +#ifdef CONFIG_DEBUG_FS
>>> +
>>> +static struct dentry *elf_core_debugfs;
>>> +
>>> +static int __init init_elf_core_debugfs(void)
>>> +{
>>> +     elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>>> +     debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>>> +     return 0;
>>> +}
>>> +
>>> +fs_initcall(init_elf_core_debugfs);
>>> +
>>> +#endif               /* CONFIG_DEBUG_FS */
>>> +
>>> #endif               /* CONFIG_ELF_CORE */
>>> 
>>> static int __init init_elf_binfmt(void)
>>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>>> {
>>>      /* Remove the COFF and ELF loaders. */
>>>      unregister_binfmt(&elf_format);
>>> +
>>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>>> +     debugfs_remove(elf_core_debugfs);
>>> +#endif
>>> }
>>> 
>>> core_initcall(init_elf_binfmt);
>>> 
>>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
Brian Mak Aug. 2, 2024, 5:46 p.m. UTC | #4
On Aug 2, 2024, at 9:16 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> Brian Mak <makb@juniper.net> writes:
> 
>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> 
>>> Brian Mak <makb@juniper.net> writes:
>>> 
>>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>>> timeouts that are not large enough or lack of disk space. This impacts
>>>> debuggability with large core dumps since critical information necessary to
>>>> form a usable backtrace, such as stacks and shared library information, are
>>>> omitted. We can mitigate the impact of core dump truncation by dumping
>>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>>> and shared library information, thus allowing a usable backtrace to be
>>>> formed.
>>> 
>>> This sounds theoretical.  Do you happen to have a description of a
>>> motivating case?  A situtation that bit someone and resulted in a core
>>> file that wasn't usable?
>>> 
>>> A concrete situation would help us imagine what possible caveats there
>>> are with sorting vmas this way.
>>> 
>>> The most common case I am aware of is distributions setting the core
>>> file size to 0 (ulimit -c 0).
>> 
>> Hi Eric,
>> 
>> Thanks for taking the time to reply. We have hit these scenarios before in
>> practice where large cores are truncated, resulting in an unusable core.
>> 
>> At Juniper, we have some daemons that can consume a lot of memory, where
>> upon crash, can result in core dumps of several GBs. While dumping, we've
>> encountered these two scenarios resulting in a unusable core:
>> 
>> 1. Disk space is low at the time of core dump, resulting in a truncated
>> core once the disk is full.
>> 
>> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
>> file, where upon core dumping, could timeout (triggering a SIGKILL) if the
>> core dump is too large and is taking too long to dump.
>> 
>> In both scenarios, we see that the core file is already several GB, and
>> still does not contain the information necessary to form a backtrace, thus
>> creating the need for this change. In the second scenario, we are unable to
>> increase the timeout option due to our recovery time objective
>> requirements.
>> 
>>> One practical concern with this approach is that I think the ELF
>>> specification says that program headers should be written in memory
>>> order.  So a comment on your testing to see if gdb or rr or any of
>>> the other debuggers that read core dumps cares would be appreciated.
>> 
>> I've already tested readelf and gdb on core dumps (truncated and whole)
>> with this patch and it is able to read/use these core dumps in these
>> scenarios with a proper backtrace.
>> 
>>>> We implement this by sorting the VMAs by dump size and dumping in that
>>>> order.
>>> 
>>> Since your concern is about stacks, and the kernel has information about
>>> stacks it might be worth using that information explicitly when sorting
>>> vmas, instead of just assuming stacks will be small.
>> 
>> This was originally the approach that we explored, but ultimately moved
>> away from. We need more than just stacks to form a proper backtrace. I
>> didn't narrow down exactly what it was that we needed because the sorting
>> solution seemed to be cleaner than trying to narrow down each of these
>> pieces that we'd need. At the very least, we need information about shared
>> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
>> third piece sitting in an anonymous R/W VMA, which is the point that I
>> stopped exploring this path. I was having a difficult time narrowing down
>> what this last piece was.
>> 
>>> I expect the priorities would look something like jit generated
>>> executable code segments, stacks, and then heap data.
>>> 
>>> I don't have enough information what is causing your truncated core
>>> dumps, so I can't guess what the actual problem is your are fighting,
>>> so I could be wrong on priorities.
>>> 
>>> Though I do wonder if this might be a buggy interaction between
>>> core dumps and something like signals, or io_uring.  If it is something
>>> other than a shortage of storage space causing your truncated core
>>> dumps I expect we should first debug why the coredumps are being
>>> truncated rather than proceed directly to working around truncation.
>> 
>> I don't really see any feasible workarounds that can be done for preventing
>> truncation of these core dumps. Our truncated cores are also not the result
>> of any bugs, but rather a limitation.
> 
> Thanks that clarifies things.
> 
> From a quality of implementation standpoint I regret that at least some
> pause during coredumping is inevitable.  Ideally I would like to
> minimize that pause, preserve the memory and have a separate kernel
> thread perform the coredumping work.  That in principle would remove the
> need for coredumps to be stop when a SIGKILL is delievered and avoid the
> issue with the systemd timeout.  Plus it would allow systemd to respawn
> the process before the coredump was complete.  Getting there is in no
> sense easy, and it would still leave the problem of not getting
> the whole coredump when you are running out of disk space.

This is actually another approach that we thought about, but decided
against. The fact that it doesn't address running out of disk space is one
thing, but also, if systemd were to respawn the process before the coredump
completes, there would effectively be a doubling of that process' memory
usage. For applications that take up a significant chunk of available
memory on a system, effectively "duplicating" that memory consumption could
result in a system running out of memory.

With this approach, there are two options: to have systemd restart the
process or wait until core dumping is complete to restart. In the first
option, we would be risking system stability for debuggability in
applications with a large memory footprint. In the second option, we would
be back to where we started (either terminate the core dumping or miss
recovery time objectives).

> The explanation of the vma sort is good.  Sorting by size seems to make
> a simple and very effective heuristic.  It would nice if that
> explanation appeared in the change description.
> 
> From a maintenance perspective it would be very nice to perform the vma
> size sort unconditionally.   Can you remove the knob making the size
> sort conditional?  If someone reports a regression we can add a knob
> making the sort conditional.
> 
> We are in the middle of the merge window right now but I expect Kees
> could take a simplified patch (aka no knob) after the merge window
> closes and get it into linux-next.  Which should give plenty of time
> to spot any regressions caused by sorting the vmas.

Understood, I will send out a PATCH v2 with the knob removed. I should
note, though, that the conditional sorting actually gives a second
benefit: when available memory is low, allocating the sorted_vmas array
may fail, allowing a fallback to the original behavior without any
additional memory allocations. However, in the interest of
maintainability, it may be better to just forgo that benefit.

Thanks for looking this over!

Best,
Brian Mak

> Eric
> 
> 
>> 
>> Please let me know your thoughts!
>> 
>> Best,
>> Brian Mak
>> 
>>> Eric
>>> 
>>>> Signed-off-by: Brian Mak <makb@juniper.net>
>>>> ---
>>>> 
>>>> Hi all,
>>>> 
>>>> My initial testing with a program that spawns several threads and allocates heap
>>>> memory shows that this patch does indeed prioritize information such as stacks,
>>>> which is crucial to forming a backtrace and debugging core dumps.
>>>> 
>>>> Requesting for comments on the following:
>>>> 
>>>> Are there cases where this might not necessarily prioritize dumping VMAs
>>>> needed to obtain a usable backtrace?
>>>> 
>>>> Thanks,
>>>> Brian Mak
>>>> 
>>>> fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 62 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>>>> index 19fa49cd9907..d45240b0748d 100644
>>>> --- a/fs/binfmt_elf.c
>>>> +++ b/fs/binfmt_elf.c
>>>> @@ -13,6 +13,7 @@
>>>> #include <linux/module.h>
>>>> #include <linux/kernel.h>
>>>> #include <linux/fs.h>
>>>> +#include <linux/debugfs.h>
>>>> #include <linux/log2.h>
>>>> #include <linux/mm.h>
>>>> #include <linux/mman.h>
>>>> @@ -37,6 +38,7 @@
>>>> #include <linux/elf-randomize.h>
>>>> #include <linux/utsname.h>
>>>> #include <linux/coredump.h>
>>>> +#include <linux/sort.h>
>>>> #include <linux/sched.h>
>>>> #include <linux/sched/coredump.h>
>>>> #include <linux/sched/task_stack.h>
>>>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>>>     shdr4extnum->sh_info = segs;
>>>> }
>>>> 
>>>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>>>> +{
>>>> +     const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>>>> +             vma_meta_lhs_ptr;
>>>> +     const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>>>> +             vma_meta_rhs_ptr;
>>>> +
>>>> +     if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>>>> +             return -1;
>>>> +     if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>>>> +             return 1;
>>>> +     return 0;
>>>> +}
>>>> +
>>>> +static bool sort_elf_core_vmas = true;
>>>> +
>>>> /*
>>>> * Actual dumper
>>>> *
>>>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>     struct elf_shdr *shdr4extnum = NULL;
>>>>     Elf_Half e_phnum;
>>>>     elf_addr_t e_shoff;
>>>> +     struct core_vma_metadata **sorted_vmas = NULL;
>>>> 
>>>>     /*
>>>>      * The number of segs are recored into ELF header as 16bit value.
>>>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>     if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>>>             goto end_coredump;
>>>> 
>>>> +     /* Allocate memory to sort VMAs and sort if needed. */
>>>> +     if (sort_elf_core_vmas)
>>>> +             sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>>>> +
>>>> +     if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>>>> +             for (i = 0; i < cprm->vma_count; i++)
>>>> +                     sorted_vmas[i] = cprm->vma_meta + i;
>>>> +
>>>> +             sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>>>> +     }
>>>> +
>>>>     /* Write program headers for segments dump */
>>>>     for (i = 0; i < cprm->vma_count; i++) {
>>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>> +             struct core_vma_metadata *meta;
>>>>             struct elf_phdr phdr;
>>>> 
>>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>> +                     meta = cprm->vma_meta + i;
>>>> +             else
>>>> +                     meta = sorted_vmas[i];
>>>> +
>>>>             phdr.p_type = PT_LOAD;
>>>>             phdr.p_offset = offset;
>>>>             phdr.p_vaddr = meta->start;
>>>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>     dump_skip_to(cprm, dataoff);
>>>> 
>>>>     for (i = 0; i < cprm->vma_count; i++) {
>>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>> +             struct core_vma_metadata *meta;
>>>> +
>>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>> +                     meta = cprm->vma_meta + i;
>>>> +             else
>>>> +                     meta = sorted_vmas[i];
>>>> 
>>>>             if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>>>                     goto end_coredump;
>>>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>> end_coredump:
>>>>     free_note_info(&info);
>>>>     kfree(shdr4extnum);
>>>> +     kvfree(sorted_vmas);
>>>>     kfree(phdr4note);
>>>>     return has_dumped;
>>>> }
>>>> 
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +
>>>> +static struct dentry *elf_core_debugfs;
>>>> +
>>>> +static int __init init_elf_core_debugfs(void)
>>>> +{
>>>> +     elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>>>> +     debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>>>> +     return 0;
>>>> +}
>>>> +
>>>> +fs_initcall(init_elf_core_debugfs);
>>>> +
>>>> +#endif               /* CONFIG_DEBUG_FS */
>>>> +
>>>> #endif               /* CONFIG_ELF_CORE */
>>>> 
>>>> static int __init init_elf_binfmt(void)
>>>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>>>> {
>>>>     /* Remove the COFF and ELF loaders. */
>>>>     unregister_binfmt(&elf_format);
>>>> +
>>>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>>>> +     debugfs_remove(elf_core_debugfs);
>>>> +#endif
>>>> }
>>>> 
>>>> core_initcall(init_elf_binfmt);
>>>> 
>>>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
Eric W. Biederman Aug. 3, 2024, 3:08 a.m. UTC | #5
Brian Mak <makb@juniper.net> writes:

> On Aug 2, 2024, at 9:16 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> Brian Mak <makb@juniper.net> writes:
>> 
>>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> 
>>>> Brian Mak <makb@juniper.net> writes:
>>>> 
>>>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>>>> timeouts that are not large enough or lack of disk space. This impacts
>>>>> debuggability with large core dumps since critical information necessary to
>>>>> form a usable backtrace, such as stacks and shared library information, are
>>>>> omitted. We can mitigate the impact of core dump truncation by dumping
>>>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>>>> and shared library information, thus allowing a usable backtrace to be
>>>>> formed.
>>>> 
>>>> This sounds theoretical.  Do you happen to have a description of a
>>>> motivating case?  A situtation that bit someone and resulted in a core
>>>> file that wasn't usable?
>>>> 
>>>> A concrete situation would help us imagine what possible caveats there
>>>> are with sorting vmas this way.
>>>> 
>>>> The most common case I am aware of is distributions setting the core
>>>> file size to 0 (ulimit -c 0).
>>> 
>>> Hi Eric,
>>> 
>>> Thanks for taking the time to reply. We have hit these scenarios before in
>>> practice where large cores are truncated, resulting in an unusable core.
>>> 
>>> At Juniper, we have some daemons that can consume a lot of memory, where
>>> upon crash, can result in core dumps of several GBs. While dumping, we've
>>> encountered these two scenarios resulting in a unusable core:
>>> 
>>> 1. Disk space is low at the time of core dump, resulting in a truncated
>>> core once the disk is full.
>>> 
>>> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
>>> file, where upon core dumping, could timeout (triggering a SIGKILL) if the
>>> core dump is too large and is taking too long to dump.
>>> 
>>> In both scenarios, we see that the core file is already several GB, and
>>> still does not contain the information necessary to form a backtrace, thus
>>> creating the need for this change. In the second scenario, we are unable to
>>> increase the timeout option due to our recovery time objective
>>> requirements.
>>> 
>>>> One practical concern with this approach is that I think the ELF
>>>> specification says that program headers should be written in memory
>>>> order.  So a comment on your testing to see if gdb or rr or any of
>>>> the other debuggers that read core dumps cares would be appreciated.
>>> 
>>> I've already tested readelf and gdb on core dumps (truncated and whole)
>>> with this patch and it is able to read/use these core dumps in these
>>> scenarios with a proper backtrace.
>>> 
>>>>> We implement this by sorting the VMAs by dump size and dumping in that
>>>>> order.
>>>> 
>>>> Since your concern is about stacks, and the kernel has information about
>>>> stacks it might be worth using that information explicitly when sorting
>>>> vmas, instead of just assuming stacks will be small.
>>> 
>>> This was originally the approach that we explored, but ultimately moved
>>> away from. We need more than just stacks to form a proper backtrace. I
>>> didn't narrow down exactly what it was that we needed because the sorting
>>> solution seemed to be cleaner than trying to narrow down each of these
>>> pieces that we'd need. At the very least, we need information about shared
>>> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
>>> third piece sitting in an anonymous R/W VMA, which is the point that I
>>> stopped exploring this path. I was having a difficult time narrowing down
>>> what this last piece was.
>>> 
>>>> I expect the priorities would look something like jit generated
>>>> executable code segments, stacks, and then heap data.
>>>> 
>>>> I don't have enough information what is causing your truncated core
>>>> dumps, so I can't guess what the actual problem is your are fighting,
>>>> so I could be wrong on priorities.
>>>> 
>>>> Though I do wonder if this might be a buggy interaction between
>>>> core dumps and something like signals, or io_uring.  If it is something
>>>> other than a shortage of storage space causing your truncated core
>>>> dumps I expect we should first debug why the coredumps are being
>>>> truncated rather than proceed directly to working around truncation.
>>> 
>>> I don't really see any feasible workarounds that can be done for preventing
>>> truncation of these core dumps. Our truncated cores are also not the result
>>> of any bugs, but rather a limitation.
>> 
>> Thanks that clarifies things.
>> 
>> From a quality of implementation standpoint I regret that at least some
>> pause during coredumping is inevitable.  Ideally I would like to
>> minimize that pause, preserve the memory and have a separate kernel
>> thread perform the coredumping work.  That in principle would remove the
>> need for coredumps to be stop when a SIGKILL is delievered and avoid the
>> issue with the systemd timeout.  Plus it would allow systemd to respawn
>> the process before the coredump was complete.  Getting there is in no
>> sense easy, and it would still leave the problem of not getting
>> the whole coredump when you are running out of disk space.
>
> This is actually another approach that we thought about, but decided
> against. The fact that it doesn't address running out of disk space is one
> thing, but also, if systemd were to respawn the process before the coredump
> completes, there would effectively be a doubling of that process' memory
> usage. For applications that take up a significant chunk of available
> memory on a system, effectively "duplicating" that memory consumption could
> result in a system running out of memory.
>
> With this approach, there's two options: to have systemd restart the
> process or wait until core dumping is complete to restart. In the first
> option, we would be risking system stability for debuggability in
> applications with a large memory footprint. In the second option, we would
> be back to where we started (either terminate the core dumping or miss
> recovery time objectives).
>
>> The explanation of the vma sort is good.  Sorting by size seems to make
>> a simple and very effective heuristic.  It would nice if that
>> explanation appeared in the change description.
>> 
>> From a maintenance perspective it would be very nice to perform the vma
>> size sort unconditionally.   Can you remove the knob making the size
>> sort conditional?  If someone reports a regression we can add a knob
>> making the sort conditional.
>> 
>> We are in the middle of the merge window right now but I expect Kees
>> could take a simplified patch (aka no knob) after the merge window
>> closes and get it into linux-next.  Which should give plenty of time
>> to spot any regressions caused by sorting the vmas.
>
> Understood, will get a PATCH v2 sent out with the removal of the knob. I
> should note though that the conditional sorting actually gives a second
> benefit, which is that in the event there is low available memory,
> allocating the sorted_vmas array may fail, allowing for a fallback to
> original functionality without any additional memory allocations. However,
> in the interest of maintainability, it may be better to just forgo that
> benefit.

The entire array of vmas is initially allocated in
fs/coredump.c:dump_vma_snapshot(), to avoid the need to hold the
mmap_lock while writing the coredump.

Unless there is some technical reason I haven't thought of, you might as
well sort the vma array in place and avoid the copy to a new array.

The logic seems to apply to all coredumps, so I would recommend placing
the sorting in dump_vma_snapshot().  Then we don't have to worry about
the other implementations of elf coredumps (ELF FDPIC) getting out of
sync with the primary coredump implementation.
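
A minimal sketch of that variant (assuming fs/coredump.c pulls in
linux/sort.h and the comparator now takes the array elements directly
rather than pointers to them) might be:

  static int cmp_vma_size(const void *vma_meta_lhs_ptr,
                          const void *vma_meta_rhs_ptr)
  {
          const struct core_vma_metadata *vma_meta_lhs = vma_meta_lhs_ptr;
          const struct core_vma_metadata *vma_meta_rhs = vma_meta_rhs_ptr;

          if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
                  return -1;
          return vma_meta_lhs->dump_size > vma_meta_rhs->dump_size;
  }

  /* At the end of dump_vma_snapshot(), once cprm->vma_meta and
   * cprm->vma_count are filled in: */
  sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
       cmp_vma_size, NULL);

That keeps the smallest-first behavior for every coredump format without
the extra pointer array or fallback path.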

Eric

> Thanks for looking this over!
>
> Best,
> Brian Mak
>
>> Eric
>> 
>> 
>>> 
>>> Please let me know your thoughts!
>>> 
>>> Best,
>>> Brian Mak
>>> 
>>>> Eric
>>>> 
>>>>> Signed-off-by: Brian Mak <makb@juniper.net>
>>>>> ---
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> My initial testing with a program that spawns several threads and allocates heap
>>>>> memory shows that this patch does indeed prioritize information such as stacks,
>>>>> which is crucial to forming a backtrace and debugging core dumps.
>>>>> 
>>>>> Requesting for comments on the following:
>>>>> 
>>>>> Are there cases where this might not necessarily prioritize dumping VMAs
>>>>> needed to obtain a usable backtrace?
>>>>> 
>>>>> Thanks,
>>>>> Brian Mak
>>>>> 
>>>>> fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>> 1 file changed, 62 insertions(+), 2 deletions(-)
>>>>> 
>>>>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>>>>> index 19fa49cd9907..d45240b0748d 100644
>>>>> --- a/fs/binfmt_elf.c
>>>>> +++ b/fs/binfmt_elf.c
>>>>> @@ -13,6 +13,7 @@
>>>>> #include <linux/module.h>
>>>>> #include <linux/kernel.h>
>>>>> #include <linux/fs.h>
>>>>> +#include <linux/debugfs.h>
>>>>> #include <linux/log2.h>
>>>>> #include <linux/mm.h>
>>>>> #include <linux/mman.h>
>>>>> @@ -37,6 +38,7 @@
>>>>> #include <linux/elf-randomize.h>
>>>>> #include <linux/utsname.h>
>>>>> #include <linux/coredump.h>
>>>>> +#include <linux/sort.h>
>>>>> #include <linux/sched.h>
>>>>> #include <linux/sched/coredump.h>
>>>>> #include <linux/sched/task_stack.h>
>>>>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>>>>     shdr4extnum->sh_info = segs;
>>>>> }
>>>>> 
>>>>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>>>>> +{
>>>>> +     const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>>>>> +             vma_meta_lhs_ptr;
>>>>> +     const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>>>>> +             vma_meta_rhs_ptr;
>>>>> +
>>>>> +     if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>>>>> +             return -1;
>>>>> +     if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>>>>> +             return 1;
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static bool sort_elf_core_vmas = true;
>>>>> +
>>>>> /*
>>>>> * Actual dumper
>>>>> *
>>>>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>     struct elf_shdr *shdr4extnum = NULL;
>>>>>     Elf_Half e_phnum;
>>>>>     elf_addr_t e_shoff;
>>>>> +     struct core_vma_metadata **sorted_vmas = NULL;
>>>>> 
>>>>>     /*
>>>>>      * The number of segs are recored into ELF header as 16bit value.
>>>>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>     if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>>>>             goto end_coredump;
>>>>> 
>>>>> +     /* Allocate memory to sort VMAs and sort if needed. */
>>>>> +     if (sort_elf_core_vmas)
>>>>> +             sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>>>>> +
>>>>> +     if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>>>>> +             for (i = 0; i < cprm->vma_count; i++)
>>>>> +                     sorted_vmas[i] = cprm->vma_meta + i;
>>>>> +
>>>>> +             sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>>>>> +     }
>>>>> +
>>>>>     /* Write program headers for segments dump */
>>>>>     for (i = 0; i < cprm->vma_count; i++) {
>>>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +             struct core_vma_metadata *meta;
>>>>>             struct elf_phdr phdr;
>>>>> 
>>>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +                     meta = cprm->vma_meta + i;
>>>>> +             else
>>>>> +                     meta = sorted_vmas[i];
>>>>> +
>>>>>             phdr.p_type = PT_LOAD;
>>>>>             phdr.p_offset = offset;
>>>>>             phdr.p_vaddr = meta->start;
>>>>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>     dump_skip_to(cprm, dataoff);
>>>>> 
>>>>>     for (i = 0; i < cprm->vma_count; i++) {
>>>>> -             struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +             struct core_vma_metadata *meta;
>>>>> +
>>>>> +             if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +                     meta = cprm->vma_meta + i;
>>>>> +             else
>>>>> +                     meta = sorted_vmas[i];
>>>>> 
>>>>>             if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>>>>                     goto end_coredump;
>>>>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>> end_coredump:
>>>>>     free_note_info(&info);
>>>>>     kfree(shdr4extnum);
>>>>> +     kvfree(sorted_vmas);
>>>>>     kfree(phdr4note);
>>>>>     return has_dumped;
>>>>> }
>>>>> 
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +
>>>>> +static struct dentry *elf_core_debugfs;
>>>>> +
>>>>> +static int __init init_elf_core_debugfs(void)
>>>>> +{
>>>>> +     elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>>>>> +     debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +fs_initcall(init_elf_core_debugfs);
>>>>> +
>>>>> +#endif               /* CONFIG_DEBUG_FS */
>>>>> +
>>>>> #endif               /* CONFIG_ELF_CORE */
>>>>> 
>>>>> static int __init init_elf_binfmt(void)
>>>>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>>>>> {
>>>>>     /* Remove the COFF and ELF loaders. */
>>>>>     unregister_binfmt(&elf_format);
>>>>> +
>>>>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>>>>> +     debugfs_remove(elf_core_debugfs);
>>>>> +#endif
>>>>> }
>>>>> 
>>>>> core_initcall(init_elf_binfmt);
>>>>> 
>>>>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d
Kees Cook Aug. 5, 2024, 5:25 p.m. UTC | #6
On Thu, Aug 01, 2024 at 05:58:06PM +0000, Brian Mak wrote:
> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > One practical concern with this approach is that I think the ELF
> > specification says that program headers should be written in memory
> > order.  So a comment on your testing to see if gdb or rr or any of
> > the other debuggers that read core dumps cares would be appreciated.
> 
> I've already tested readelf and gdb on core dumps (truncated and whole)
> with this patch and it is able to read/use these core dumps in these
> scenarios with a proper backtrace.

Can you compare the "rr" selftest before/after the patch? They have been
the most sensitive to changes to ELF, ptrace, seccomp, etc, so I've
tried to double-check "user visible" changes with their tree. :)

> > Since your concern is about stacks, and the kernel has information about
> > stacks it might be worth using that information explicitly when sorting
> > vmas, instead of just assuming stacks will be small.
> 
> This was originally the approach that we explored, but ultimately moved
> away from. We need more than just stacks to form a proper backtrace. I
> didn't narrow down exactly what it was that we needed because the sorting
> solution seemed to be cleaner than trying to narrow down each of these
> pieces that we'd need. At the very least, we need information about shared
> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
> third piece sitting in an anonymous R/W VMA, which is the point that I
> stopped exploring this path. I was having a difficult time narrowing down
> what this last piece was.

And those VMAs weren't thread stacks?

> Please let me know your thoughts!

I echo all of Eric's comments, especially the "let's make this the
default if we can". My only bit of discomfort is with making this change
is that it falls into the "it happens to work" case, and we don't really
understand _why_ it works for you. :)

It does also feel like part of the overall problem is that systemd
doesn't have a way to know the process is crashing, and then creates the
truncation problem. (i.e. we're trying to use the kernel to work around
a visibility issue in userspace.)

All this said, if it doesn't create problems for gdb and rr, I would be
fine to give it a shot.

-Kees
Brian Mak Aug. 5, 2024, 6:44 p.m. UTC | #7
On Aug 5, 2024, at 10:25 AM, Kees Cook <kees@kernel.org> wrote:

> On Thu, Aug 01, 2024 at 05:58:06PM +0000, Brian Mak wrote:
>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> One practical concern with this approach is that I think the ELF
>>> specification says that program headers should be written in memory
>>> order.  So a comment on your testing to see if gdb or rr or any of
>>> the other debuggers that read core dumps cares would be appreciated.
>> 
>> I've already tested readelf and gdb on core dumps (truncated and whole)
>> with this patch and it is able to read/use these core dumps in these
>> scenarios with a proper backtrace.
> 
> Can you compare the "rr" selftest before/after the patch? They have been
> the most sensitive to changes to ELF, ptrace, seccomp, etc, so I've
> tried to double-check "user visible" changes with their tree. :)

Hi Kees,

Thanks for your reply!

Can you please give me some more information on these selftests?
What/where are they? I'm not too familiar with rr.

>>> Since your concern is about stacks, and the kernel has information about
>>> stacks it might be worth using that information explicitly when sorting
>>> vmas, instead of just assuming stacks will be small.
>> 
>> This was originally the approach that we explored, but ultimately moved
>> away from. We need more than just stacks to form a proper backtrace. I
>> didn't narrow down exactly what it was that we needed because the sorting
>> solution seemed to be cleaner than trying to narrow down each of these
>> pieces that we'd need. At the very least, we need information about shared
>> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
>> third piece sitting in an anonymous R/W VMA, which is the point that I
>> stopped exploring this path. I was having a difficult time narrowing down
>> what this last piece was.
> 
> And those VMAs weren't thread stacks?

Admittedly, I did do all of this exploration months ago, and only have
my notes to go off of here, but no, they should not have been thread
stacks since I had pulled all of them in during a "first pass".
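
For reference, a stack-aware ordering along the lines Eric suggested
might look roughly like the sketch below. This is hypothetical and not
what the patch does: it only finds the VMA holding the main thread's
stack via mm->start_stack, so thread stacks and the other pieces needed
for a backtrace would still have to be identified separately, which is
part of why I moved away from that approach.

/*
 * Hypothetical sketch, not part of this patch: put the VMA containing
 * the main thread's stack first, then fall back to ordering by
 * dump_size. Thread stacks and shared library data are not handled.
 */
static int cmp_vma_stack_first(const void *lhs_ptr, const void *rhs_ptr)
{
	const struct core_vma_metadata *lhs = *(const struct core_vma_metadata **)lhs_ptr;
	const struct core_vma_metadata *rhs = *(const struct core_vma_metadata **)rhs_ptr;
	unsigned long sp = current->mm->start_stack;
	bool lhs_stack = lhs->start <= sp && sp < lhs->end;
	bool rhs_stack = rhs->start <= sp && sp < rhs->end;

	if (lhs_stack != rhs_stack)
		return lhs_stack ? -1 : 1;
	if (lhs->dump_size < rhs->dump_size)
		return -1;
	if (lhs->dump_size > rhs->dump_size)
		return 1;
	return 0;
}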

>> Please let me know your thoughts!
> 
> I echo all of Eric's comments, especially the "let's make this the
> default if we can". My only bit of discomfort is with making this change
> is that it falls into the "it happens to work" case, and we don't really
> understand _why_ it works for you. :)

Yep, the "let's make this the default" change is already in v2. v3 will
be out shortly with the change to sort in place rather than in a second
copy of the VMA list.

> It does also feel like part of the overall problem is that systemd
> doesn't have a way to know the process is crashing, and then creates the
> truncation problem. (i.e. we're trying to use the kernel to work around
> a visibility issue in userspace.)

Even if systemd had visibility into the fact that a crash is happening,
there's not much systemd can do in some circumstances. In applications
with strict time to recovery limits, the process needs to restart within
a certain time limit. We run into a similar issue as the issue I raised
in my last reply on this thread: to keep the core dump intact and
recover, we either need to start up a new process while the old one is
core dumping, or wait until core dumping is complete to restart.

If we start up a new process while the old one is core dumping, we put
system stability at risk in applications with a large memory footprint,
since holding both processes' memory at once could run the system out
of memory. If
we wait until core dumping is complete to restart, we're in the same
scenario as before with the core being truncated or we miss recovery
time objectives by waiting too long.

For this reason, I wouldn't say we're using the kernel to work around a
visibility issue or that systemd is creating the truncation problem, but
rather that the issue exists due to limitations in how we're truncating
cores. That being said, there might be some use in this type of
visibility for others with less strict recovery time objectives or
applications with a lower memory footprint.

Best,
Brian Mak

> All this said, if it doesn't create problems for gdb and rr, I would be
> fine to give it a shot.
> 
> -Kees
> 
> --
> Kees Cook
Kees Cook Aug. 5, 2024, 8:34 p.m. UTC | #8
On Mon, Aug 05, 2024 at 06:44:44PM +0000, Brian Mak wrote:
> On Aug 5, 2024, at 10:25 AM, Kees Cook <kees@kernel.org> wrote:
> 
> > On Thu, Aug 01, 2024 at 05:58:06PM +0000, Brian Mak wrote:
> >> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>> One practical concern with this approach is that I think the ELF
> >>> specification says that program headers should be written in memory
> >>> order.  So a comment on your testing to see if gdb or rr or any of
> >>> the other debuggers that read core dumps cares would be appreciated.
> >> 
> >> I've already tested readelf and gdb on core dumps (truncated and whole)
> >> with this patch and it is able to read/use these core dumps in these
> >> scenarios with a proper backtrace.
> > 
> > Can you compare the "rr" selftest before/after the patch? They have been
> > the most sensitive to changes to ELF, ptrace, seccomp, etc, so I've
> > tried to double-check "user visible" changes with their tree. :)
> 
> Hi Kees,
> 
> Thanks for your reply!
> 
> Can you please give me some more information on these self tests?
> What/where are they? I'm not too familiar with rr.

I start from here whenever I go through their tests:

https://github.com/rr-debugger/rr/wiki/Building-And-Installing#tests


> > And those VMAs weren't thread stacks?
> 
> Admittedly, I did do all of this exploration months ago, and only have
> my notes to go off of here, but no, they should not have been thread
> stacks since I had pulled all of them in during a "first pass".

Okay, cool. I suspect you'd already explored that, but I wanted to be
sure we didn't have an "easy to explain" solution. ;)

> > It does also feel like part of the overall problem is that systemd
> > doesn't have a way to know the process is crashing, and then creates the
> > truncation problem. (i.e. we're trying to use the kernel to work around
> > a visibility issue in userspace.)
> 
> Even if systemd had visibility into the fact that a crash is happening,
> there's not much systemd can do in some circumstances. In applications
> with strict time to recovery limits, the process needs to restart within
> a certain time limit. We run into a similar issue as the issue I raised
> in my last reply on this thread: to keep the core dump intact and
> recover, we either need to start up a new process while the old one is
> core dumping, or wait until core dumping is complete to restart.
> 
> If we start up a new process while the old one is core dumping, we risk
> system stability in applications with a large memory footprint since we
> could run out of memory from the duplication of memory consumption. If
> we wait until core dumping is complete to restart, we're in the same
> scenario as before with the core being truncated or we miss recovery
> time objectives by waiting too long.
> 
> For this reason, I wouldn't say we're using the kernel to work around a
> visibility issue or that systemd is creating the truncation problem, but
> rather that the issue exists due to limitations in how we're truncating
> cores. That being said, there might be some use in this type of
> visibility for others with less strict recovery time objectives or
> applications with a lower memory footprint.

Yeah, this is interesting. This effectively makes the coredumping
activity rather "critical path": the replacement process can't start
until the dump has finished... hmm. It feels like there should be a way
to move the dumping process aside, but with all the VMAs still live, I
can see how this might go weird. I'll think some more about this...

Patch

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 19fa49cd9907..d45240b0748d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -13,6 +13,7 @@ 
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/debugfs.h>
 #include <linux/log2.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -37,6 +38,7 @@ 
 #include <linux/elf-randomize.h>
 #include <linux/utsname.h>
 #include <linux/coredump.h>
+#include <linux/sort.h>
 #include <linux/sched.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/task_stack.h>
@@ -1990,6 +1992,22 @@  static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
 	shdr4extnum->sh_info = segs;
 }
 
+static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
+{
+	const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
+		vma_meta_lhs_ptr;
+	const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
+		vma_meta_rhs_ptr;
+
+	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
+		return -1;
+	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
+		return 1;
+	return 0;
+}
+
+static bool sort_elf_core_vmas = true;
+
 /*
  * Actual dumper
  *
@@ -2008,6 +2026,7 @@  static int elf_core_dump(struct coredump_params *cprm)
 	struct elf_shdr *shdr4extnum = NULL;
 	Elf_Half e_phnum;
 	elf_addr_t e_shoff;
+	struct core_vma_metadata **sorted_vmas = NULL;
 
 	/*
 	 * The number of segs are recored into ELF header as 16bit value.
@@ -2071,11 +2090,27 @@  static int elf_core_dump(struct coredump_params *cprm)
 	if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
 		goto end_coredump;
 
+	/* Allocate memory to sort VMAs and sort if needed. */
+	if (sort_elf_core_vmas)
+		sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
+
+	if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
+		for (i = 0; i < cprm->vma_count; i++)
+			sorted_vmas[i] = cprm->vma_meta + i;
+
+		sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
+	}
+
 	/* Write program headers for segments dump */
 	for (i = 0; i < cprm->vma_count; i++) {
-		struct core_vma_metadata *meta = cprm->vma_meta + i;
+		struct core_vma_metadata *meta;
 		struct elf_phdr phdr;
 
+		if (ZERO_OR_NULL_PTR(sorted_vmas))
+			meta = cprm->vma_meta + i;
+		else
+			meta = sorted_vmas[i];
+
 		phdr.p_type = PT_LOAD;
 		phdr.p_offset = offset;
 		phdr.p_vaddr = meta->start;
@@ -2111,7 +2146,12 @@  static int elf_core_dump(struct coredump_params *cprm)
 	dump_skip_to(cprm, dataoff);
 
 	for (i = 0; i < cprm->vma_count; i++) {
-		struct core_vma_metadata *meta = cprm->vma_meta + i;
+		struct core_vma_metadata *meta;
+
+		if (ZERO_OR_NULL_PTR(sorted_vmas))
+			meta = cprm->vma_meta + i;
+		else
+			meta = sorted_vmas[i];
 
 		if (!dump_user_range(cprm, meta->start, meta->dump_size))
 			goto end_coredump;
@@ -2128,10 +2168,26 @@  static int elf_core_dump(struct coredump_params *cprm)
 end_coredump:
 	free_note_info(&info);
 	kfree(shdr4extnum);
+	kvfree(sorted_vmas);
 	kfree(phdr4note);
 	return has_dumped;
 }
 
+#ifdef CONFIG_DEBUG_FS
+
+static struct dentry *elf_core_debugfs;
+
+static int __init init_elf_core_debugfs(void)
+{
+	elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
+	debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
+	return 0;
+}
+
+fs_initcall(init_elf_core_debugfs);
+
+#endif		/* CONFIG_DEBUG_FS */
+
 #endif		/* CONFIG_ELF_CORE */
 
 static int __init init_elf_binfmt(void)
@@ -2144,6 +2200,10 @@  static void __exit exit_elf_binfmt(void)
 {
 	/* Remove the COFF and ELF loaders. */
 	unregister_binfmt(&elf_format);
+
+#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
+	debugfs_remove(elf_core_debugfs);
+#endif
 }
 
 core_initcall(init_elf_binfmt);
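
As a quick way to confirm the on-disk effect of the sorting described
above (complementing the readelf and gdb checks mentioned in the
thread), the following is a minimal userspace sketch, assuming a
64-bit little-endian core, that prints each PT_LOAD segment's virtual
address and file size in the order it appears:

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	Elf64_Ehdr eh;
	Elf64_Phdr ph;
	FILE *f;
	int i;

	if (argc != 2 || !(f = fopen(argv[1], "rb")))
		return 1;
	if (fread(&eh, sizeof(eh), 1, f) != 1)
		return 1;
	/* Program headers sit near the start of the file, so this also
	 * works on truncated cores. */
	for (i = 0; i < eh.e_phnum; i++) {
		if (fseek(f, eh.e_phoff + (long)i * eh.e_phentsize, SEEK_SET) ||
		    fread(&ph, sizeof(ph), 1, f) != 1)
			return 1;
		if (ph.p_type == PT_LOAD)
			printf("0x%llx filesz %llu\n",
			       (unsigned long long)ph.p_vaddr,
			       (unsigned long long)ph.p_filesz);
	}
	fclose(f);
	return 0;
}

With the patch applied and sort_elf_core_vmas left at its default of
true, the printed p_filesz values should come out in non-decreasing
order; readelf -l shows the same information. The knob itself is
exposed at /sys/kernel/debug/elf_core/sort_elf_core_vmas when
CONFIG_DEBUG_FS is enabled and debugfs is mounted at the usual place.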