[48/91] kernel/crash_core: add crashkernel=auto for vmcore creation

Message ID 20210507010432.IN24PudKT%akpm@linux-foundation.org (mailing list archive)
State New, archived
Series [01/91] alpha: eliminate old-style function definitions

Commit Message

Andrew Morton May 7, 2021, 1:04 a.m. UTC
From: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Subject: kernel/crash_core: add crashkernel=auto for vmcore creation

This adds crashkernel=auto feature to configure reserved memory for vmcore
creation.  CONFIG_CRASH_AUTO_STR is defined to be set for different kernel
distributions and different archs based on their needs.

Link: https://lkml.kernel.org/r/20210223174153.72802-1-saeed.mirzamohammadi@oracle.com
Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: John Donnelly <john.p.donnelly@oracle.com>
ed-by: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YiFei Zhu <yifeifz2@illinois.edu>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kdump/kdump.rst       |    3 +-
 Documentation/admin-guide/kernel-parameters.txt |    6 ++++
 arch/Kconfig                                    |   20 ++++++++++++++
 kernel/crash_core.c                             |    7 ++++
 4 files changed, 35 insertions(+), 1 deletion(-)

Comments

Linus Torvalds May 7, 2021, 7:25 a.m. UTC | #1
On Thu, May 6, 2021 at 6:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
> Subject: kernel/crash_core: add crashkernel=auto for vmcore creation
>
> This adds crashkernel=auto feature to configure reserved memory for vmcore
> creation.  CONFIG_CRASH_AUTO_STR is defined to be set for different kernel
> distributions and different archs based on their needs.

Ugh. I didn't realize how nasty this was until after I'd applied this patch.

I'm going to drop this patch, because the Kconfig thing for it is an
unmitigated mess. I was confused by the question, and then the help
message was actively misleading.

This is wrong for so many reasons:

 - this is a classic case of "you shouldn't ask a user this".

   The question makes no sense to any normal person, it certainly
didn't to me. Don't ask questions that don't have sane answers.

 - the config help text is actively misleading, and claims that the
option is about how much memory is reserved for a crash kernel

   Not so. It's the default string for when somebody uses "crashkernel=auto"

 - this shouldn't be a config option at all, it's clearly a distro
setting, and should be on the kernel command line with the other
distro settings.

So I'm dropping this, and I don't see it ever being applied in this
form for the above reasons.

People, I've said this before, and apparently I need to say it again:
the kernel config is likely the nastiest part of building a local
kernel, and the biggest impediment to people actually building their
own kernels.

And people building their own kernel is the first step to becoming a
kernel developer.

So the kernel configuration is already one of the less pleasant parts
of the kernel, but that does NOT mean that we should strive to make it
even worse.

Obscure, odd, strange config questions like this are a no-no. We're
not making an already bad experience worse for something like this.

           Linus
David Hildenbrand May 7, 2021, 8:16 a.m. UTC | #2
On 07.05.21 03:04, Andrew Morton wrote:
> From: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
> Subject: kernel/crash_core: add crashkernel=auto for vmcore creation
> 
> This adds crashkernel=auto feature to configure reserved memory for vmcore
> creation.  CONFIG_CRASH_AUTO_STR is defined to be set for different kernel
> distributions and different archs based on their needs.
> 
> Link: https://lkml.kernel.org/r/20210223174153.72802-1-saeed.mirzamohammadi@oracle.com
> Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
> Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
> Tested-by: John Donnelly <john.p.donnelly@oracle.com>
> ed-by: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> Cc: YiFei Zhu <yifeifz2@illinois.edu>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Masahiro Yamada <masahiroy@kernel.org>
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Stephen Boyd <sboyd@kernel.org>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>   Documentation/admin-guide/kdump/kdump.rst       |    3 +-
>   Documentation/admin-guide/kernel-parameters.txt |    6 ++++
>   arch/Kconfig                                    |   20 ++++++++++++++
>   kernel/crash_core.c                             |    7 ++++
>   4 files changed, 35 insertions(+), 1 deletion(-)
> 
> --- a/arch/Kconfig~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> +++ a/arch/Kconfig
> @@ -14,6 +14,26 @@ menu "General architecture-dependent opt
>   config CRASH_CORE
>   	bool
>   
> +config CRASH_AUTO_STR
> +	string "Memory reserved for crash kernel"
> +	depends on CRASH_CORE
> +	default "1G-64G:128M,64G-1T:256M,1T-:512M"
> +	help
> +	  This configures the reserved memory dependent
> +	  on the value of System RAM. The syntax is:
> +	  crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> +	              range=start-[end]
> +
> +	  For example:
> +	      crashkernel=512M-2G:64M,2G-:128M
> +
> +	  This would mean:
> +
> +	      1) if the RAM is smaller than 512M, then don't reserve anything
> +	         (this is the "rescue" case)
> +	      2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> +	      3) if the RAM size is larger than 2G, then reserve 128M
> +
>   config KEXEC_CORE
>   	select CRASH_CORE
>   	bool
> --- a/Documentation/admin-guide/kdump/kdump.rst~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> +++ a/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>       2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>       3) if the RAM size is larger than 2G, then reserve 128M
>   
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>   
>   Boot into System Kernel
>   =======================
> --- a/Documentation/admin-guide/kernel-parameters.txt~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> +++ a/Documentation/admin-guide/kernel-parameters.txt
> @@ -751,6 +751,12 @@
>   			a memory unit (amount[KMG]). See also
>   			Documentation/admin-guide/kdump/kdump.rst for an example.
>   
> +	crashkernel=auto
> +			[KNL] This parameter will set the reserved memory for
> +			the crash kernel based on the value of the CRASH_AUTO_STR
> +			that is the best effort estimation for each arch. See also
> +			arch/Kconfig for further details.
> +
>   	crashkernel=size[KMG],high
>   			[KNL, X86-64] range could be above 4G. Allow kernel
>   			to allocate physical memory region from top, so could
> --- a/kernel/crash_core.c~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> +++ a/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>   #include <linux/crash_core.h>
>   #include <linux/utsname.h>
>   #include <linux/vmalloc.h>
> +#include <linux/kexec.h>
>   
>   #include <asm/page.h>
>   #include <asm/sections.h>
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(ch
>   	if (suffix)
>   		return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   				suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> +	if (strncmp(ck_cmdline, "auto", 4) == 0) {
> +		ck_cmdline = CONFIG_CRASH_AUTO_STR;
> +		pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> +	}
> +#endif
I remember that the original "crashkernel=auto" as once proposed by Red 
Hat people did not receive a warm welcome.

Let me take a look .... oh, there it is from 2009

https://marc.info/?t=125006512600002&r=1&w=2

and then we had it in 2018

https://lkml.org/lkml/2018/5/20/262


The issue I have with this: it's just plain wrong when you take memory 
hotplug into serious account as we see it quite heavily in VMs. You 
don't know what you'll need when building a kernel. Just pass it via the 
cmdline ...
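
To make the hotplug point concrete, here is a minimal user-space sketch of how a range string is resolved against the RAM size seen at boot. It is not the kernel's parser: crash_size_for_ram() and parse_size() are names made up for this illustration, offsets (@offset) are not handled, and ranges are treated as [start, end) as the Kconfig help text describes.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a size such as "512M" or "2G". On return *end points past the
 * number and its optional K/M/G/T suffix, or equals s if no digits were found.
 */
static unsigned long long parse_size(const char *s, char **end)
{
    unsigned long long v = strtoull(s, end, 10);

    if (*end == s)
        return 0;
    switch (**end) {
    case 'T': v <<= 10; /* fall through */
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/*
 * Illustrative helper (not a kernel function): return the reservation in
 * bytes for `ram` bytes of RAM given a spec such as "512M-2G:64M,2G-:128M".
 * Returns 0 if no range matches or the spec is malformed.
 */
unsigned long long crash_size_for_ram(const char *spec, unsigned long long ram)
{
    const char *p = spec;
    char *end;

    assert(spec != NULL);
    while (*p) {
        unsigned long long start, stop = 0, size;
        int open_ended;

        start = parse_size(p, &end);
        if (end == p || *end != '-')
            return 0;
        p = end + 1;

        /* "2G-:128M" has no upper bound */
        open_ended = (*p == ':');
        if (!open_ended) {
            stop = parse_size(p, &end);
            if (end == p || *end != ':')
                return 0;
            p = end;
        }
        p++; /* skip ':' */

        size = parse_size(p, &end);
        if (end == p)
            return 0;
        if (ram >= start && (open_ended || ram < stop))
            return size;

        p = end;
        if (*p == ',')
            p++;
        else if (*p != '\0')
            return 0;
    }
    return 0;
}
```

With the default string from this patch, a VM booted with 4G gets 128M reserved and keeps that reservation after memory is hot-added up to 128G, while the same VM booted with 128G would have got 256M.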
Baoquan He May 8, 2021, 3:13 a.m. UTC | #3
Hi Linus,

On 05/07/21 at 12:25am, Linus Torvalds wrote:
> On Thu, May 6, 2021 at 6:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
> > Subject: kernel/crash_core: add crashkernel=auto for vmcore creation
> >
> > This adds crashkernel=auto feature to configure reserved memory for vmcore
> > creation.  CONFIG_CRASH_AUTO_STR is defined to be set for different kernel
> > distributions and different archs based on their needs.


> 
> Ugh. I didn't realize how nasty this was until after I'd applied this patch.
> 
> I'm going to drop this patch, because the Kconfig thing for it is an
> unmitigated mess. I was confused by the question, and then the help
> message was actively misleading.
> 
> This is wrong for so many reasons:
> 
>  - this is a classic case of "you shouldn't ask a user this".
> 
>    The question makes no sense to any normal person, it certainly
> didn't to me. Don't ask questions that don't have sane answers.
> 
>  - the config help text is actively misleading, and claims that the
> option is about how much memory is reserved for a crash kernel
> 
>    Not so. It's the default string for when somebody uses "crashkernel=auto"

Sorry for the confusion, we should have been more careful in reviewing and
writing the commit log and kernel config description.
> 
>  - this shouldn't be a config option at all, it's clearly a distro
> setting, and should be on the kernel command line with the other
> distro settings.

I didn't know that kernel config options are sometimes disliked; I will
remember this and be more cautious about adding them in the future.

Crashkernel=auto has existed in our distros for many years, and as David
mentioned in the other thread, we have been trying to add crashkernel=auto
support to upstream. We pursue adding crashkernel=auto to upstream
because:

1) Empirical value is given to user by default;

It was originally requested by customers, and has since become an important
part of the kdump feature, supported on several main ARCHes. With crashkernel=auto,
people w/o much knowledge of kdump details can use kdump to debug. Distros
can provide the suggested values with crashkernel=auto, which are obtained by
investigation and analysis and are widely tested in test environments.

2) Cover corner case/special case;

In some cases the kernel may need extra memory, and the kdump kernel is
no exception. E.g. when SME/SEV is enabled, SWIOTLB is necessarily enabled
too, even in the kdump kernel (see the SME/SEV related commits below for
reference). Then an extra 64M needs to be reserved for crashkernel. Users
don't need to know this; we have already done it for them.

commit c7753208a94c ("x86, swiotlb: Add memory encryption support")
commit aba2d9a6385a ("iommu/amd: Do not disable SWIOTLB if SME is active")
commit d7b417fa08d1 ("x86/mm: Add DMA support for SEV memory encryption")

We are eager to push crashkernel=auto to upstream because of our
UPSTREAM FIRST rule. Since it has been in RHEL for many years, each time
a new RHEL main release is anchored to an upstream kernel release and prepared,
these crashkernel=auto RHEL-only patches need to be reviewed inside Red Hat,
and then we are questioned and challenged about why they are not in upstream.

As for how to implement crashkernel=auto, we have tried several ways.

1) Add into kernel command line

If added to the kernel command line, the suggested value needs to be stored
in user space and then passed into the kernel. This separates the suggested
value from the kernel itself, which is not what we expect to see, because
the suggested crashkernel value is strongly related to the distro release.
We may adjust the value between sub-releases of the kernel because of
kernel changes. Adding it to the kernel command line makes us lose
track of it in the kernel.

2) Add a weak generic function and several arch dependent functions
3) Hardcode values in __parse_crashkernel()

Method 2) is taken in our RHEL7, 3) is used in RHEL8; RHEL-only patches
add them. If we try to push them into upstream, any later value
adjustment needs an upstream patch posting. Otherwise, a RHEL-only patch
needs to be introduced again, and Red Hat internal reviewers will challenge
us again. (The value hard-coding pieces are at the bottom for reference.)

4) Add kernel config to add default value

It's done in this patch. With the kernel config CRASH_AUTO_STR, distros can
add a default value and adjust it anytime in the future w/o bothering
upstream. If crashkernel=auto is specified, only the 3 LOC below are added,
to go and parse CONFIG_CRASH_AUTO_STR directly.

@@ -250,6 +251,12 @@ static int __init __parse_crashkernel(ch
        if (suffix)
                return parse_crashkernel_suffix(ck_cmdline, crash_size,
                                suffix);
+#ifdef CONFIG_CRASH_AUTO_STR
+       if (strncmp(ck_cmdline, "auto", 4) == 0) {
+               ck_cmdline = CONFIG_CRASH_AUTO_STR;
+               pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
+       }
+#endif
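
As a rough user-space sketch of what those lines do: resolve_crashkernel() and the CRASH_AUTO_STR macro below are stand-ins invented for this illustration, and the macro simply reuses the default from the patch's Kconfig entry.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for the distro-provided CONFIG_CRASH_AUTO_STR;
 * the value is the default from the patch's Kconfig entry.
 */
#define CRASH_AUTO_STR "1G-64G:128M,64G-1T:256M,1T-:512M"

/* Return the string that would be parsed for a given crashkernel= value. */
const char *resolve_crashkernel(const char *ck_cmdline)
{
    assert(ck_cmdline != NULL);

    /* Same prefix test as the patch: only the first four bytes are compared. */
    if (strncmp(ck_cmdline, "auto", 4) == 0)
        return CRASH_AUTO_STR;

    /* Anything else is parsed as given. */
    return ck_cmdline;
}
```

Because only the first four bytes are compared, any value that merely starts with "auto" is substituted as well.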


Before this, we didn't know Saeed Mirzamohammadi, the patch author. He
may have experienced the same torture. We were wild with joy when we noticed
his patch. We were planning to launch a new round of posting to add
crashkernel=auto, and kernel config was our final option too. We may have
been too happy and forgot to polish the commit log.

Not sure if I have made myself clear. Basically, we expect crashkernel=auto
to be added to the upstream kernel. About how to implement it in the kernel,
we would like to hear upstream people's suggestions.

Thanks
Baoquan


Hard code crashkernel=auto values in __parse_crashkernel()
===========================================================
static int __init __parse_crashkernel(char *cmdline,
                             unsigned long long system_ram,
                             unsigned long long *crash_size,
                             unsigned long long *crash_base,
                             const char *name,
                             const char *suffix)
{
......
        if (strncmp(ck_cmdline, "auto", 4) == 0) {
#if defined(CONFIG_X86_64) || defined(CONFIG_S390)
                ck_cmdline = "1G-4G:160M,4G-64G:192M,64G-1T:256M,1T-:512M";
#elif defined(CONFIG_ARM64)
                ck_cmdline = "2G-:448M";
#elif defined(CONFIG_PPC64)
                char *fadump_cmdline;

                fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", NULL);
                fadump_cmdline = fadump_cmdline ?
                                fadump_cmdline + strlen("fadump=") : NULL;
                if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 0))
                        ck_cmdline = "2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";
                else
                        ck_cmdline = "4G-16G:768M,16G-64G:1G,64G-128G:2G,128G-1T:4G,1T-2T:6G,2T-4T:12G,4T-8T:20G,8T-16T:36G,16T-32T:64G,32T-64T:128G,64T-:180G";
#endif
                pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
        }

......
}
==================================================================
Baoquan He May 8, 2021, 3:29 a.m. UTC | #4
Add Kairui to CC since he is taking care of the crashkernel=auto code in
our Distros.

Baoquan He May 8, 2021, 8:51 a.m. UTC | #5
On 05/07/21 at 10:16am, David Hildenbrand wrote:
> On 07.05.21 03:04, Andrew Morton wrote:
......
> > 
> >   Documentation/admin-guide/kdump/kdump.rst       |    3 +-
> >   Documentation/admin-guide/kernel-parameters.txt |    6 ++++
> >   arch/Kconfig                                    |   20 ++++++++++++++
> >   kernel/crash_core.c                             |    7 ++++
> >   4 files changed, 35 insertions(+), 1 deletion(-)
> > 
> > --- a/arch/Kconfig~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> > +++ a/arch/Kconfig
> > @@ -14,6 +14,26 @@ menu "General architecture-dependent opt
> >   config CRASH_CORE
> >   	bool
> > +config CRASH_AUTO_STR
> > +	string "Memory reserved for crash kernel"
> > +	depends on CRASH_CORE
> > +	default "1G-64G:128M,64G-1T:256M,1T-:512M"
> > +	help
> > +	  This configures the reserved memory dependent
> > +	  on the value of System RAM. The syntax is:
> > +	  crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> > +	              range=start-[end]
> > +
> > +	  For example:
> > +	      crashkernel=512M-2G:64M,2G-:128M
> > +
> > +	  This would mean:
> > +
> > +	      1) if the RAM is smaller than 512M, then don't reserve anything
> > +	         (this is the "rescue" case)
> > +	      2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> > +	      3) if the RAM size is larger than 2G, then reserve 128M
> > +
> >   config KEXEC_CORE
> >   	select CRASH_CORE
> >   	bool
> > --- a/Documentation/admin-guide/kdump/kdump.rst~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> > +++ a/Documentation/admin-guide/kdump/kdump.rst
> > @@ -285,7 +285,8 @@ This would mean:
> >       2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> >       3) if the RAM size is larger than 2G, then reserve 128M
> > -
> > +Or you can use crashkernel=auto to choose the crash kernel memory size
> > +based on the recommended configuration set for each arch.
> >   Boot into System Kernel
> >   =======================
> > --- a/Documentation/admin-guide/kernel-parameters.txt~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> > +++ a/Documentation/admin-guide/kernel-parameters.txt
> > @@ -751,6 +751,12 @@
> >   			a memory unit (amount[KMG]). See also
> >   			Documentation/admin-guide/kdump/kdump.rst for an example.
> > +	crashkernel=auto
> > +			[KNL] This parameter will set the reserved memory for
> > +			the crash kernel based on the value of the CRASH_AUTO_STR
> > +			that is the best effort estimation for each arch. See also
> > +			arch/Kconfig for further details.
> > +
> >   	crashkernel=size[KMG],high
> >   			[KNL, X86-64] range could be above 4G. Allow kernel
> >   			to allocate physical memory region from top, so could
> > --- a/kernel/crash_core.c~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
> > +++ a/kernel/crash_core.c
> > @@ -7,6 +7,7 @@
> >   #include <linux/crash_core.h>
> >   #include <linux/utsname.h>
> >   #include <linux/vmalloc.h>
> > +#include <linux/kexec.h>
> >   #include <asm/page.h>
> >   #include <asm/sections.h>
> > @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(ch
> >   	if (suffix)
> >   		return parse_crashkernel_suffix(ck_cmdline, crash_size,
> >   				suffix);
> > +#ifdef CONFIG_CRASH_AUTO_STR
> > +	if (strncmp(ck_cmdline, "auto", 4) == 0) {
> > +		ck_cmdline = CONFIG_CRASH_AUTO_STR;
> > +		pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> > +	}
> > +#endif
> I remember that the original "crashkernel=auto" as once proposed by Red Hat
> people did not receive a warm welcome.
> 
> Let me take a look .... oh, there it is from 2009
> 
> https://marc.info/?t=125006512600002&r=1&w=2
> 
> and then we had it in 2018
> 
> https://lkml.org/lkml/2018/5/20/262

Thanks for digging these two out; otherwise I might have needed to do it
so that people know the history better.

> 
> 
> The issue I have with this: it's just plain wrong when you take memory
> hotplug into serious account as we see it quite heavily in VMs. You don't
> know what you'll need when building a kernel. Just pass it via the cmdline

Hmm, kdump may have no issue with memory hotplug as far as the crashkernel
reservation is concerned. The system RAM size is not correlated to the
crashkernel size directly; that's why the default value in this patch is
not linearly related to system RAM size. The proportion of crashkernel
size to the total RAM size is the thing we take into account. Usually
crashkernel 160M is enough on most systems. If the system RAM size is
larger, extra memory can be added just in case, without much
impact on the system.
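
A quick way to check the proportion is to compute the reserved share for the default string in this patch. The helper below is only an illustration (reserved_ppm() is a made-up name) and reports the share in parts per million to stay in integer arithmetic.

```c
#include <assert.h>

/*
 * Illustrative helper: share of RAM taken by the crash reservation,
 * in parts per million. `ram` must be non-zero.
 */
unsigned long long reserved_ppm(unsigned long long crash_size,
                                unsigned long long ram)
{
    assert(ram != 0);
    return crash_size * 1000000ULL / ram;
}
```

For "1G-64G:128M,64G-1T:256M,1T-:512M", 128M out of 1G is 125000 ppm (12.5%), 128M out of 64G is about 1953 ppm (roughly 0.2%), and 512M out of 1T is about 488 ppm (under 0.05%).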

From our investigation, PCIe devices impact the crashkernel size, as does
the number of CPUs. There are always PCI devices whose drivers require tens
of KB of memory, even MB. E.g. in the patch below, my colleague Coiby found
that the i40e network card costs as much as 1.5G of memory to initialize its
ringbuffer on ppc, and 85M on x86_64.

[PATCH v1 0/3] Reducing memory usage of i40e for kdump
http://lists.infradead.org/pipermail/kexec/2021-March/022117.html

Even though not all PCI devices need a surprisingly large amount of memory
like i40e, a system with hundreds of PCI devices can also cost more memory
than expected. This kind of system is usually a high-end server, where a
specific crashkernel value needs to be set manually.

So system RAM size is the least important factor influencing crashkernel
cost. Take my x1 laptop: even if I extended the RAM to 100TB, 160M
crashkernel would still be enough. We would just like to add a tiny extra
part to crashkernel if the total RAM is very large; that's the rule
for crashkernel=auto. As for VMs, given their very few devices (virtio
disk, NAT nic, etc.), no matter how much memory is deployed and hot
added/removed, the crashkernel size won't be influenced very much. That is
my personal understanding of it.

Thanks
Baoquan
David Hildenbrand May 8, 2021, 9:22 a.m. UTC | #6
>> Let me take a look .... oh, there it is from 2009
>>
>> https://marc.info/?t=125006512600002&r=1&w=2
>>
>> and then we had it in 2018
>>
>> https://lkml.org/lkml/2018/5/20/262
> 
> Thanks for digging these two out, otherwise I may need do for people to
> know the history better.

Sure, I stumbled over this myself recently when wondering about what 
fadump is.


>> The issue I have with this: it's just plain wrong when you take memory
>> hotplug into serious account as we see it quite heavily in VMs. You don't
>> know what you'll need when building a kernel. Just pass it via the cmdline
> 
> Hmm, kdump may have no issue with memory hotplug in crashkernel
> reservation aspect. The system RAM size is not correlated to
> crashkernel size directly, that's why the default value in this patch is

"Not correlated directly" ...

"1G-64G:128M,64G-1T:256M,1T-:512M"

Am I still asleep and dreaming? :)


> not linear related to system RAM size. The proportion of crashkernel
> size to the total RAM size is thing we take into account. Usually
> crashkernel 160M is enough on most of systems. If system RAM size is
> larger, extra memory can be added just in case, and not bring much
> impact to system.

So, all the rules we have are essentially broken because they rely 
completely on the system RAM during boot.

> 
> With our investigation, PCIe devices impact the crashkernel size, and
> cpu number. There are always pci devices which driver require tens of KB
> memory, even MB. E.g in below patch, my colleague Coiby found out the
> i40e network card even cost 1.5G memory to initialize its ringbuffer on
> ppc, and 85M on x86_64.
> 
> [PATCH v1 0/3] Reducing memory usage of i40e for kdump
> http://lists.infradead.org/pipermail/kexec/2021-March/022117.html
> 
> Even though not all pci devices need surprisingly large memory like
> i40e, system with hundreds of pci devices can also cost more memory than
> expected. This kind of system usually is high end server, specified
> crashkernel value need be set manually.
> 
> So system RAM size is the least important part to influence crashkernel

Aehm, not with fadump, no?

> costing. Say my x1 laptop, even though I extended the RAM to 100TB, 160M
> crashkernel is still enough. Just we would like to get a tiny extra part
> to add to crashkernel if the total RAM is very large, that's the rule
> for crashkernel=auto. As for VMs, given their very few devices, virtio
> disk, NAT nic, etc, no matter how much memory is deployed and hot
> added/removed, crashkernel size won't be influenced very much. My
> personal understanding about it.

That's an interesting observation. But you're telling me that we end up 
wasting memory for the crashkernel because "crashkernel=auto" which is 
supposed to do something magical good automatically does something very 
suboptimal? Oh my ... this is broken.

Long story short: crashkernel=auto is pure ugliness.


Why can't we construct a crashkernel value in user space when
installing/activating kdump, and require a reboot for kdump to become
active as long as that crashkernel setting is not properly respected?

Just have a look at the system properties (is_qemu(), #PCI, ...) and 
propose a value for "crashkernel=". Check that that value is at least 
active when activating kdump. Otherwise don't enable kdump and fail.

Yes, it can be difficult with some newer/older kernels having some 
different demands, but things should change drastically, and a distro 
can always update its advice along with the kernel, no?

You could even have a kernel interface that gives you the current 
crashkernel size (maybe already there) vs. the recommended crashkernel 
size. Make kdump or *whoever* activate that in the cmdline and let kdump 
check if both values are satisfied when booting up.
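
A minimal sketch of such a user space check, assuming the current
reservation can be read in bytes from /sys/kernel/kexec_crash_size. The
recommended value and the helper names read_crash_size() and
crash_reservation_ok() are hypothetical, not an existing kdump interface:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the reserved crashkernel size (in bytes) from a sysfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_crash_size(const char *path, unsigned long long *bytes)
{
	FILE *f;
	int ok;

	assert(path && bytes);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llu", bytes) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Compare the current reservation against a recommended size.
 * Returns 1 if it is large enough, 0 if it is too small and
 * -1 if the current size could not be read.
 */
static int crash_reservation_ok(const char *path,
				unsigned long long recommended)
{
	unsigned long long cur;

	if (read_crash_size(path, &cur))
		return -1;
	return cur >= recommended;
}
```

A kdump service could call crash_reservation_ok("/sys/kernel/kexec_crash_size",
recommended) at start-up and refuse to enable kdump when the result is
not 1.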

Also: this approach here doesn't make any sense when you want to do 
something dependent on other cmdline parameters. Take "fadump=on" vs 
"fadump=off" as an example. You just cannot handle it properly as 
proposed in this patch. To me the approach in this patch makes the
least sense TBH.
Baoquan He May 10, 2021, 4:53 a.m. UTC | #7
On 05/08/21 at 11:22am, David Hildenbrand wrote:
> > > Let me take a look .... oh, there it is from 2009
> > > 
> > > https://marc.info/?t=125006512600002&r=1&w=2
> > > 
> > > and then we had it in 2018
> > > 
> > > https://lkml.org/lkml/2018/5/20/262
> > 
> > Thanks for digging these two out, otherwise I may need do for people to
> > know the history better.
> 
> Sure, I stumbled over this myself recently when wondering about what fadump
> is.
> 
> 
> > > The issue I have with this: it's just plain wrong when you take memory
> > > hotplug into serious account as we see it quite heavily in VMs. You don't
> > > know what you'll need when building a kernel. Just pass it via the cmdline
> > 
> > Hmm, kdump may have no issue with memory hotplug in crashkernel
> > reservation aspect. The system RAM size is not correlated to
> > crashkernel size directly, that's why the default value in this patch is
> 
> "Not correlated directly" ...
> 
> "1G-64G:128M,64G-1T:256M,1T-:512M"
> 
> Am I still asleep and dreaming? :)

Well, I said 'Not correlated directly', then gave sentences to explain
the reason. I would like to repeat them:

1) The crashkernel needs more memory on some systems mainly because of
device drivers. Take any system: no matter how much you increase or
decrease the total system RAM size, the crashkernel size needed stays
the same.

  - The extreme case is the i40e example I have given.
  - And the more devices, naturally the more memory is needed.

2) About "1G-64G:128M,64G-1T:256M,1T-:512M", I also said the values
differ because we take a very low proportion of extra memory to avoid
potential risk, which is cost effective. Here, adding another 90M is
0.13% of 64G, 0.0085% of 1TB.

Hope it can help people sober up.

> 
> 
> > not linear related to system RAM size. The proportion of crashkernel
> > size to the total RAM size is thing we take into account. Usually
> > crashkernel 160M is enough on most of systems. If system RAM size is
> > larger, extra memory can be added just in case, and not bring much
> > impact to system.
> 
> So, all the rules we have are essentially broken because they rely
> completely on the system RAM during boot.

How do you get this?

Crashkernel=auto is a default value. PCs, VMs, normal workstations and
servers, which are the overall majority, can work well with it. I can say
the number is 99%. Only very few high end workstations and servers which
contain many PCI devices need investigation to decide the crashkernel
size. A possible manual setting and reboot is needed for them. You call
this 'essentially broken'? So what you later suggested, constructing the
crashkernel value in user space and rebooting, is not broken? Even though
it's a similar thing? What is the logic behind your conclusion?

Crashkernel=auto mainly targets the majority of systems, helping people
w/o much knowledge of the kdump implementation to use it for debugging.

I can say more about the benefit of crashkernel=auto. On Fedora, the
community distro sponsored by Red Hat, kexec/kdump is also maintained
by us. The Fedora kernel is the mainline kernel, so no crashkernel=auto
is provided. We almost never get bug reports from users, which means
almost nobody uses it. We hope Fedora users' usage can help test the
functionality of the component.
> 
> > 
> > With our investigation, PCIe devices impact the crashkernel size, and
> > cpu number. There are always pci devices which driver require tens of KB
> > meomry, even MB. E.g in below patch, my colleague Coiby found out the
> > i40e network card even cost 1.5G memory to initialize its ringbuffer on
> > ppc, and 85M on x86_64.
> > 
> > [PATCH v1 0/3] Reducing memory usage of i40e for kdump
> > http://lists.infradead.org/pipermail/kexec/2021-March/022117.html
> > 
> > Even though not all pci devices need surprisingly large memory like
> > i40e, system with hundreds of pci devices can also cost more memory than
> > expected. This kind of system usually is high end server, specified
> > crashkernel value need be set manually.
> > 
> > So system RAM size is the least important part to influence crashkernel
> 
> Aehm, not with fadump, no?

Fadump makes use of the crashkernel reservation, but has a different
dumping mechanism. It needs a kernel config too if this patch is
accepted, or it can add it to the command line from a user space program;
I will talk about that later. This depends on IBM's decision. I have
added Hari to CC; they will make the best choice after consideration.

> 
> > costing. Say my x1 laptop, even though I extended the RAM to 100TB, 160M
> > crashkernel is still enough. Just we would like to get a tiny extra part
> > to add to crashkernel if the total RAM is very large, that's the rule
> > for crashkernel=auto. As for VMs, given their very few devices, virtio
> > disk, NAT nic, etc, no matter how much memory is deployed and hot
> > added/removed, crashkernel size won't be influenced very much. My
> > personal understanding about it.
> 
> That's an interesting observation. But you're telling me that we end up
> wasting memory for the crashkernel because "crashkernel=auto" which is
> supposed to do something magical good automatically does something very
> suboptimal? Oh my ... this is broken.
> 
> Long story short: crashkernel=auto is pure ugliness.

Very interesting. Your long story is clear to me, but your short story
confuses me a lot.

Let me try to sort this out. In your first reply, you asserted
"it's plain wrong when taking memory hotplug serious account as
we see it quite heavily in VMs", which means you didn't actually know
whether it's wrong, yet you said it's plain wrong. I answered 'no, not at
all' with a detailed explanation, which is the plain opposite of your
assertion. Then you quickly came to 'crashkernel=auto is pure ugliness'.
If a simple crashkernel=auto is added to cover 99% of systems, and
advanced operation only needs to be done for the rest, which is a tiny
proportion, and this is called pure ugliness, then what's pure beauty?
And when I say 99%, I could be very conservative.

> 
> Why can't we construct a crashkernel in user space when
> installing/activating kdump and requiring a reboot for kdump to be active as
> long as that crashkernel setting is not properly respected?
> 
> Just have a look at the system properties (is_qemu(), #PCI, ...) and propose
> a value for "crashkernel=". Check that that value is at least active when
> activating kdump. Otherwise don't enable kdump and fail.
> 
> Yes, it can be difficult with some newer/older kernels having some different
> demands, but things should change drastically, and a distro can always
> update its advises along with the kernel, no?
> 
> You could even have a kernel interface that gives you the current
> crashkernel size (maybe already there) vs. the recommended crashkernel size.
> Make kdump or *whoever* activate that in the cmdline and let kdump check if
> both values are satisfied when booting up.

Now, let's go to your long story.

Yes, in case you haven't seen our patch on the Fedora kexec-tools mailing
list: your suggested approach is exactly the same thing we are doing.
Please check the patch below.

[PATCH v2] kdumpctl: Add kdumpctl estimate
https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.org/thread/YCEOJHQXKVEIVNB23M2TDAJGYVNP5MJZ/

We will provide a new feature in the user space script to let users check
if their current crashkernel size is good or not. If not, they can adjust
accordingly.

But where does the current crashkernel size come from? Surely from
crashkernel=auto. You wouldn't add a random crashkernel size, then
compare it with the recommended crashkernel size, then reboot, would you?
If crashkernel=auto gets the expected size, there is no need to reboot.
That means 99% of systems have no need to reboot. Only very few systems
need a reboot after checking the recommended size.

Long story short: crashkernel=auto will give a default value, trying to
cover most systems. (Very few high end servers need to check if it's
enough and adjust with the help of user space tools, then reboot.)


> 
> Also: this approach here doesn't make any sense when you want to do
> something dependent on other cmdline parameters. Take "fadump=on" vs
> "fadump=off" as an example. You just cannot handle it properly as proposed
> in this patch. To me the approach in this patch makes least sense TBH.

Why? Don't we have this kind of judgement in the kernel? Crashkernel=auto
is a generic mechanism, and was added much earlier. Fadump was added
later by IBM for their needs on ppc only; it relies on the crashkernel
reservation but has a different dumping mechanism. If it needs a
different value than kdump, special handling is certainly needed. Who
says it has to be 'fadump=on'? They can check the value in a user space
program and add it to the cmdline as you suggested, or they can make it
auto as well. Whatever is most suitable is best.

And I have several questions to ask; I hope you can help answer them:

1) Have you ever seen crashkernel=auto broken on a virt platform?

I'm asking because you are from the Virt team, crashkernel=auto has been
in RHEL for many years, and we have been working with the Virt team to
support dumping. We haven't seen any bug report or complaint about
crashkernel=auto from Virt.

2) Option one: add crashkernel=auto, and use kdumpctl estimate as a user
space program to get a recommended size, then reboot. Option two: remove
crashkernel=auto, and use only kdumpctl estimate to get a recommended
size, always rebooting. In RHEL we will take the 1st option. Are you
willing to take the 2nd one for the Virt platform, since you think
crashkernel=auto is plain wrong, pure ugliness, essentially broken and
makes the least sense?

Thanks
Baoquan
David Hildenbrand May 10, 2021, 8:32 a.m. UTC | #8
>> "Not correlated directly" ...
>>
>> "1G-64G:128M,64G-1T:256M,1T-:512M"
>>
>> Am I still asleep and dreaming? :)
> 
> Well, I said 'Not correlated directly', then gave sentences to explan
> the reason. I would like to repeat them:
> 
> 1) Crashkernel need more memory on some systems mainly because of
> device driver. You can take a system, no matter how much memory you
> increse or decrease total system RAM size, the crashkernel size needed
> is invariable.
> 
>    - The extreme case I have give about the i40e.
>    - And the more devices, narutally the more memory needed.
> 
> 2) About "1G-64G:128M,64G-1T:256M,1T-:512M", I also said the different
> value is because taking very low proprotion of extra memory to avoid
> potential risk, it's cost effective. Here, add another 90M which is
> 0.13% of 64G, 0.0085% of 1TB.

Just let me clarify the problem I am having with all of this:

We model the crashkernel size as a function of the memory size. Yet, 
it's pretty much independent of the memory size. That screams for "ugly".

The main problem is that early during boot we don't have a clue how much
crashkernel memory we may need. So what I see is that we are mostly
using a heuristic based on the memory size to come up with the right
answer for how many devices we might have. That just feels very wrong.

I can understand the reasoning of "using a fraction of the memory size
when booting up just to be on the safe side as we don't know", and that
motivation is much better than what I read so far. But then I wonder if
we cannot handle that any better? Because this feels very suboptimal to
me and I feel like there can be cases where the heuristic is just wrong.

As one example, can I add a whole bunch of devices to a 32GB VM and 
break "crashkernel=auto"?

As another example, when I boot a 64G VM, the crashkernel size will be 
512MB, although I really only might need 128MB. That's an effective 
overhead of 0.5%. And especially when we take memory ballooning etc. 
into account it can effectively be more than that.

Let's do a more detailed look. PPC64 in kernel-ark:

"2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";

Assume I would only need 385M on a simple 16GB VM. We would have an 
overhead of ~4%. But maybe on ppc64 we do have to take the memory size 
into account (my assumption and, thus, my comment regarding memory hotplug)?
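
The arithmetic above can be sketched as follows, assuming the usual
crashkernel range semantics (start inclusive, end exclusive, a missing
end meaning unbounded). parse_size(), pick_crash_size() and
overhead_percent() are illustrative names, not kernel functions, and only
upper-case K/M/G/T suffixes are handled:

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a size such as "512M" or "1T" into bytes; *end is set past it. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 10);

	switch (**end) {
	case 'T': v <<= 10; /* fall through */
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; (*end)++; break;
	default: break;
	}
	return v;
}

/*
 * Return the reservation a rule like "start-[end]:size,..." selects for
 * 'ram' bytes, or 0 if no range matches or the rule is malformed.
 */
static unsigned long long pick_crash_size(const char *rule,
					  unsigned long long ram)
{
	const char *p = rule;
	char *end;

	assert(rule);
	while (*p) {
		unsigned long long start, stop = ~0ULL, size;

		start = parse_size(p, &end);
		if (*end != '-')
			return 0;
		p = end + 1;
		if (*p != ':') {		/* an explicit end was given */
			stop = parse_size(p, &end);
			p = end;
		}
		if (*p != ':')
			return 0;
		size = parse_size(p + 1, &end);
		if (ram >= start && ram < stop)
			return size;
		p = end;
		if (*p == ',')
			p++;
		else if (*p)
			return 0;
	}
	return 0;
}

/* Share of RAM (in percent) that is reserved but not actually needed. */
static double overhead_percent(unsigned long long reserved,
			       unsigned long long needed,
			       unsigned long long ram)
{
	if (!ram || reserved < needed)
		return 0.0;
	return 100.0 * (double)(reserved - needed) / (double)ram;
}
```

With the ppc64 rule quoted above, a 16G guest falls into the 16G-64G
range and gets 1G reserved; if only 385M is needed, the unused 639M is
about 3.9% of RAM.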


I wonder if we could always try allocating larger granularity (falling 
back to smaller if it fails), and once the kernel is able to come up 
with a better answer how many devices there are and, thus, how big the 
crashkernel area really should be, shrink the preallocated crashkernel 
(either from the kernel or from user space)? Not completely trivial but 
possible I think. It's trivial when we allocate a memmap for the 
crashkernel (I think we mostly do but I might be wrong).

Then "crashkernel=auto" would really do something magically good instead
of implementing some heuristic based on the memory size.

[...]
>> So, all the rules we have are essentially broken because they rely
>> completely on the system RAM during boot.
> 
> How do you get this?
> 
> Crashkernel=auto is a default value. PC, VMs, normal workstation and server
> which are the overall majority can work well with it. I can say the number
> is 99%. Only very few high end workstation, servers which contain
> many PCI devices need investigation to decide crashkernel size. A possible
> manual setting and rebooting is needed for them. You call this
> 'essentially broken'? So you later suggestd constructing crashkernel value
> in user space and rebooting is not broken? Even though it's the similar
> thing? what is your logic behind your conclusion?

A kernel early during boot can only guess. A kernel late during boot 
knows. Please correct me if I'm wrong.

> 
> Crashkernel=auto is mainly targetting most of systems, help people
> w/o much knowledge of kdump implementation to use it for debugging.
> 
> I can say more about the benefit of crashkernel=auto. On Fedora, the
> community distros sponsord by Redhat, the kexec/kdump is also maintained
> by us. Fedora kernel is mainline kernel, so no crashkernel=auto
> provided. We almost never get bug report from users, means almost nobody
> use  it. We hope Fedora users' usage can help test functionality of
> component.

I know how helpful "crashkernel=auto" was so far, but I am also aware
that there was strong pushback in the past and, as I remember, for the
reasons I gave. IMHO we should refine that approach instead of trying to
push the same thing upstream every couple of years.

I already ran into the "512MB crashkernel on a 64G VM with memory
ballooning" issue but didn't file a BZ, because so far I was under the
impression that more memory means more crashkernel. But you explained to
me that I was just running into a (for my use case) bad heuristic.

>>> So system RAM size is the least important part to influence crashkernel
>>
>> Aehm, not with fadump, no?
> 
> Fadump makes use of crashkernel reservation, but has different mechanism
> to dumping. It needs a kernel config too if this patch is accepted, or
> it can add it to command line from a user space program, I will talk
> about that later. This depends on IBM's decision, I have added Hari to CC,
> they will make the best choice after consideration.
> 

I was looking at RHEL8, and there we have

fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", NULL);
...
if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 0))
	ck_cmdline = ...
else
	ck_cmdline = ...

which was a runtime check for fadump.

Something that cannot be modeled properly at least with this patch here.

> }
>>
>>> costing. Say my x1 laptop, even though I extended the RAM to 100TB, 160M
>>> crashkernel is still enough. Just we would like to get a tiny extra part
>>> to add to crashkernel if the total RAM is very large, that's the rule
>>> for crashkernel=auto. As for VMs, given their very few devices, virtio
>>> disk, NAT nic, etc, no matter how much memory is deployed and hot
>>> added/removed, crashkernel size won't be influenced very much. My
>>> personal understanding about it.
>>
>> That's an interesting observation. But you're telling me that we end up
>> wasting memory for the crashkernel because "crashkernel=auto" which is
>> supposed to do something magical good automatically does something very
>> suboptimal? Oh my ... this is broken.
>>
>> Long story short: crashkernel=auto is pure ugliness.
> 
> Very interesting. Your long story is clear to me, but your short story
> confuses me a lot.
> 
> Let me try to sort out and understand. In your first reply, you asserted
> "it's plain wrong when taking memory hotplug serious account as
> we see it quite heavily in VMs", means you plain don't know if it's
> wrong, but you say it's plain wrong. I answered you 'no, not at all'
> with detailed explanation, means it's plain opposite to your assertion.

Yep, I might be partially wrong about memory hotplug thingy, mostly 
because I had the RHEL8 rule for ppc64 (including fadump) in mind. For 
dynamic resizing of VMs, the current rules for VMs can be very sub-optimal.

Let's relax "plain wrong" to "the heuristic can be very suboptimal 
because it uses something mostly unrelated to come up with an answer". 
And it's simply not plain wrong because in practice it gets the job 
done. Mostly.


> So then you quickly came to 'crashkernel=auto is pure ugliness'. If a
> simple crashkernel=auto is added to cover 99% systems, and advanced
> operation only need be done for the rest which is tiny proportion,
> this is called pure ugliness, what's pure beauty? Here I say 99%, I
> could be very conservative.

I don't like wasting memory just because we cannot come up with a better 
heuristic. Yes, it somewhat gets the job done, but I call that ugly. My 
humble opinion.

[...]

> 
> Yes, if you haven't seen our patch in fedora kexec-tools maining list,
> your suggested approach is the exactly same thing we are doing, please
> check below patch.
> 
> [PATCH v2] kdumpctl: Add kdumpctl estimate
> https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.org/thread/YCEOJHQXKVEIVNB23M2TDAJGYVNP5MJZ/
> 
> We will provide a new feature in user space script, to let user check if
> their current crashkernel size is good or not. If not, they can adjust
> accordingly.

That's good, thanks for the pointer -- wasn't aware of that.

> 
> But, where's the current crashkernel size coming from? Surely
> crashkernel=auto. You wouldn't add a random crashkernel size then
> compared with the recommended crashkernel size, then reboot, will you?
> If crashkernel=auto get the expected size, no need to reboot. Means 99%
> of systems has no need to reboot. Only very few of systems, need reboot
> after checking the recommended size.
> 
> Long story short. crashkernel=auto will give a default value, trying to
> cover most of systems. (Very few high end server need check if it's
> enough and adjust with the help of user space tools. Then reboot.)

Then we might really want to look into shrinking a possibly
larger allocation dynamically during boot.

>>
>> Also: this approach here doesn't make any sense when you want to do
>> something dependent on other cmdline parameters. Take "fadump=on" vs
>> "fadump=off" as an example. You just cannot handle it properly as proposed
>> in this patch. To me the approach in this patch makes least sense TBH.
> 
> Why? We don't have this kind of judgement in kernel? Crashkernel=auto is
> a generic mechanism, and has been added much earlier. Fadump was added
> later by IBM for their need on ppc only, it relies on crashkernel
> reservation but different mechanism of dumping. If it has different value
> than kdump, a special hanlding is certainly needed. Who tell it has to be
> 'fadump=on'? They can check the value in user space program and add into
> cmdline as you suggested, they can also make it into auto. The most suitable
> is the best.

Take a look at the RHEL8 handling to see where my comment is coming from.

> 
> And I have several questions to ask, hope you can help answer:
> 
> 1) Have you ever met crashkernel=auto broken on virt platform?

I have encountered it being very suboptimal. I call wasting hundreds of
MB problematic, especially when dynamically resizing VMs (for
example, using memory ballooning).

> 
> Asking this because you are from Virt team, and crashkernel=auto has been
> there in RHEL for many years, and we have been working with Virt team to
> support dumping. We haven't seen any bug report or complaint about
> crashkernel=auto from Virt.

I've had plenty of bug reports where people try inflating the balloon 
fairly heavily but don't take the crashkernel size into account. The 
bigger the crashkernel size, the bigger the issue when people try 
squeezing the last couple of MB out of their VMs. I keep repeating to
them "with crashkernel=auto, you have to be careful about how much
memory might get set aside for the crashkernel, which reduces your
effective guest OS RAM size and the maximum balloon size".

> 
> 2) Adding crashkernel=auto, and the kdumpctl estimate as user space
> program to get a recommended size, then reboot. Removing crashkernel=auto,
> only the kdumpctl estimate to get a recommended size, always reboot.
> In RHEL we will take the 1st option. Are you willing to take the 2nd one
> for Virt platform since you think crashkernel=auto is plain wrong, pure
> ugliness, essentially broken, least sense?

We are talking about upstreaming stuff here and I am wearing my upstream 
hat here. I'm stating (just like people decades ago) that this might not 
be the right approach for upstream, at least not as it stands.

And no, I don't have time to solve problems/implement solutions/upstream 
patches to tackle fundamental issues that have been there for decades.

I'll be happy to help looking into dynamic shrinking of the crashkernel 
size if that approach makes sense. We could even let user space trigger 
that resizing -- without a reboot.
Baoquan He May 10, 2021, 10:43 a.m. UTC | #9
On 05/10/21 at 10:32am, David Hildenbrand wrote:
> 
> > > "Not correlated directly" ...
> > > 
> > > "1G-64G:128M,64G-1T:256M,1T-:512M"
> > > 
> > > Am I still asleep and dreaming? :)
> > 
> > Well, I said 'Not correlated directly', then gave sentences to explan
> > the reason. I would like to repeat them:
> > 
> > 1) Crashkernel need more memory on some systems mainly because of
> > device driver. You can take a system, no matter how much memory you
> > increse or decrease total system RAM size, the crashkernel size needed
> > is invariable.
> > 
> >    - The extreme case I have give about the i40e.
> >    - And the more devices, narutally the more memory needed.
> > 
> > 2) About "1G-64G:128M,64G-1T:256M,1T-:512M", I also said the different
> > value is because taking very low proprotion of extra memory to avoid
> > potential risk, it's cost effective. Here, add another 90M which is
> > 0.13% of 64G, 0.0085% of 1TB.
> 
> Just let me clarify the problem I am having with all of this:
> 
> We model the crashkernel size as a function of the memory size. Yet, it's
> pretty much independent of the memory size. That screams for "ugly".
> 
> The main problem is that early during boot we don't have a clue how much
> crashkernel memory we may need. So what I see is that we are mostly using a
> heuristic based on the memory size to come up with the right answer how much
> devices we might have. That just feels very wrong.
> 
> I can understand the reasoning of "using a fraction of the memory size" when
> booting up just to be on the safe side as we don't know", and that
> motivation is much better than what I read so far. But then I wonder if we
> cannot handle that any better? Because this feels very suboptimal to me and
> I feel like there can be cases where the heuristic is just wrong.

Yes, I understand what you said. Our headache mainly comes from bare
metal systems, where we worry the reservation is not enough because of
many devices.

On VMs, it is truly different. With far fewer devices, it does waste some
memory. Usually a fixed minimal size can cover 99.9% of systems unless
too many devices are attached/added to the VM; I am not sure what the
probability of that is. Meanwhile, with the help of
/sys/kernel/kexec_crash_size, you can shrink it to a small enough but
still usable size. You may just need to reload the kdump kernel because
the loaded kernel would have been erased and be out of control. The
shrinking should be done at an early stage of kernel running, I would
say, lest a crash happen during that period.
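
A minimal sketch of such a shrink from user space, assuming
/sys/kernel/kexec_crash_size reports the reservation in bytes and accepts
a smaller value on write. shrink_crash_size() is a hypothetical helper;
the path is a parameter so the logic can be exercised on an ordinary
file:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Shrink the crashkernel reservation exposed at 'path' to 'new_bytes'.
 * Returns 0 on success, -1 if the file cannot be read or written or if
 * 'new_bytes' is not smaller than the current size (nothing to shrink).
 */
static int shrink_crash_size(const char *path, unsigned long long new_bytes)
{
	unsigned long long cur;
	FILE *f;
	int ok;

	assert(path);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llu", &cur) == 1;
	fclose(f);
	if (!ok || new_bytes >= cur)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", new_bytes) > 0;
	/* A rejected write may only show up when the stream is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

On a real system the kernel may refuse the write while a crash kernel is
loaded, so the kdump kernel would likely have to be unloaded before the
call and loaded again afterwards.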

We have tried several different ways to enlarge the crashkernel size
dynamically, but couldn't come up with a good way.

> 
> As one example, can I add a whole bunch of devices to a 32GB VM and break
> "crashkernel=auto"?
> 
> As another example, when I boot a 64G VM, the crashkernel size will be
> 512MB, although I really only might need 128MB. That's an effective overhead
> of 0.5%. And especially when we take memory ballooning etc. into account it
> can effectively be more than that.
> 
> Let's do a more detailed look. PPC64 in kernel-ark:
> 
> "2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";

Yes, the waste mainly happens on ppc. Its 64K page size causes the
difference from other arches. On x86_64 and s390 it's much better:
assuming most VMs won't own more than 64G of memory, their crashkernel
size will be 160M most of the time.

> 
> Assume I would only need 385M on a simple 16GB VM. We would have an overhead
> of ~4%. But maybe on ppc64 we do have to take the memory size into account
> (my assumption and, thus, my comment regarding memory hotplug)?
> 
> 
> I wonder if we could always try allocating larger granularity (falling back
> to smaller if it fails), and once the kernel is able to come up with a
> better answer how many devices there are and, thus, how big the crashkernel
> area really should be, shrink the preallocated crashkernel (either from the
> kernel or from user space)? Not completely trivial but possible I think.
> It's trivial when we allocate a memmap for the crashkernel (I think we
> mostly do but I might be wrong).
> 
> The "crashkernel=auto" would really do something magical good instead of
> implement some heuristic base don the memory size.
> 
> [...]
> > > So, all the rules we have are essentially broken because they rely
> > > completely on the system RAM during boot.
> > 
> > How do you get this?
> > 
> > Crashkernel=auto is a default value. PC, VMs, normal workstation and server
> > which are the overall majority can work well with it. I can say the number
> > is 99%. Only very few high end workstation, servers which contain
> > many PCI devices need investigation to decide crashkernel size. A possible
> > manual setting and rebooting is needed for them. You call this
> > 'essentially broken'? So you later suggestd constructing crashkernel value
> > in user space and rebooting is not broken? Even though it's the similar
> > thing? what is your logic behind your conclusion?
> 
> A kernel early during boot can only guess. A kernel late during boot knows.
> Please correct me if I'm wrong.

Well, I would not call it a guess; I would rather call them empirical
values from statistical data. With an a priori value given by 'auto',
normal users of kdump basically don't need to care about the setting.
E.g. on Fedora, 'auto' can cover all systems, assuming nobody would
deploy it on a high end server. Everything we do is to make things simple
enough. If you don't know how to set it, just add 'crashkernel=auto' to
the cmdline, and everything is done. I believe you agree that not
everybody wants to dig into kexec/kdump just to find out how big the
crashkernel size needs to be when they want to use the kdump
functionality.

> 
> > 
> > Crashkernel=auto is mainly targetting most of systems, help people
> > w/o much knowledge of kdump implementation to use it for debugging.
> > 
> > I can say more about the benefit of crashkernel=auto. On Fedora, the
> > community distros sponsord by Redhat, the kexec/kdump is also maintained
> > by us. Fedora kernel is mainline kernel, so no crashkernel=auto
> > provided. We almost never get bug report from users, means almost nobody
> > use  it. We hope Fedora users' usage can help test functionality of
> > component.
> 
> I know how helpful "crashkernel=auto" was so far, but I am also aware that
> there was strong pushback in the past, and I remember for the reasons I
> gave. IMHO we should refine that approach instead of trying to push the same
> thing upstream every couple of years.
> 
> I ran into the "512MB crashkernel" on a 64G VM with memory ballooning issue
> already but didn't report a BZ, because so far, I was under the impression
> that more memory means more crashkernel. But you explained to me that I was
> just running into a (for my use case) bad heuristic.

I re-read the old posts and didn't see strong push-back. People just gave
some different ideas instead. While we were silent, we tried different
ways, e.g. enlarging the crashkernel at run time as mentioned above, but
failed. Reusing free pages and user space pages of the 1st kernel in the
kdump kernel also failed. We also consulted people on whether it's
doable to remove 'auto' support; nobody would give an affirmative
answer. I know SUSE has been using the way you mentioned to get a
recommended size for a long time, but it needs several more steps and a
reboot. We prefer to take that way too, as an improvement. The simpler,
the better.

Besides, 'auto' doesn't introduce tons of complicated code, and we didn't
come up with it on a whim and then try to push it to pollute the kernel.

> 
> > > > So system RAM size is the least important part to influence crashkernel
> > > 
> > > Aehm, not with fadump, no?
> > 
> > Fadump makes use of crashkernel reservation, but has different mechanism
> > to dumping. It needs a kernel config too if this patch is accepted, or
> > it can add it to command line from a user space program, I will talk
> > about that later. This depends on IBM's decision, I have added Hari to CC,
> > they will make the best choice after consideration.
> > 
> 
> I was looking at RHEL8, and there we have
> 
> fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", NULL);
> ...
> if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 0))
> 	ck_cmdline = ...
> else
> 	ck_cmdline = ...
> 
> which was a runtime check for fadump.
> 
> Something that cannot be modeled properly at least with this patch here.

Yes, I believe it won't be done like that. A static detection or a
global switch variable can solve it.

> 
> > }
> > > 
> > > > costing. Say my x1 laptop, even though I extended the RAM to 100TB, 160M
> > > > crashkernel is still enough. Just we would like to get a tiny extra part
> > > > to add to crashkernel if the total RAM is very large, that's the rule
> > > > for crashkernel=auto. As for VMs, given their very few devices, virtio
> > > > disk, NAT nic, etc, no matter how much memory is deployed and hot
> > > > added/removed, crashkernel size won't be influenced very much. My
> > > > personal understanding about it.
> > > 
> > > That's an interesting observation. But you're telling me that we end up
> > > wasting memory for the crashkernel because "crashkernel=auto" which is
> > > supposed to do something magical good automatically does something very
> > > suboptimal? Oh my ... this is broken.
> > > 
> > > Long story short: crashkernel=auto is pure ugliness.
> > 
> > Very interesting. Your long story is clear to me, but your short story
> > confuses me a lot.
> > 
> > Let me try to sort out and understand. In your first reply, you asserted
> > "it's plain wrong when taking memory hotplug serious account as
> > we see it quite heavily in VMs", means you plain don't know if it's
> > wrong, but you say it's plain wrong. I answered you 'no, not at all'
> > with detailed explanation, means it's plain opposite to your assertion.
> 
> Yep, I might be partially wrong about memory hotplug thingy, mostly because
> I had the RHEL8 rule for ppc64 (including fadump) in mind. For dynamic
> resizing of VMs, the current rules for VMs can be very sub-optimal.
> 
> Let's relax "plain wrong" to "the heuristic can be very suboptimal because
> it uses something mostly unrelated to come up with an answer". And it's
> simply not plain wrong because in practice it gets the job done. Mostly.
> 
> 
> > So then you quickly came to 'crashkernel=auto is pure ugliness'. If a
> > simple crashkernel=auto is added to cover 99% systems, and advanced
> > operation only need be done for the rest which is tiny proportion,
> > this is called pure ugliness, what's pure beauty? Here I say 99%, I
> > could be very conservative.
> 
> I don't like wasting memory just because we cannot come up with a better
> heuristic. Yes, it somewhat gets the job done, but I call that ugly. My
> humble opinion.
> 
> [...]
> 
> > 
> > Yes, if you haven't seen our patch on the Fedora kexec-tools mailing list,
> > your suggested approach is exactly the same thing we are doing, please
> > check the patch below.
> > 
> > [PATCH v2] kdumpctl: Add kdumpctl estimate
> > https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.org/thread/YCEOJHQXKVEIVNB23M2TDAJGYVNP5MJZ/
> > 
> > We will provide a new feature in user space script, to let user check if
> > their current crashkernel size is good or not. If not, they can adjust
> > accordingly.
> 
> That's good, thanks for the pointer -- wasn't aware of that.
> 
> > 
> > But, where's the current crashkernel size coming from? Surely
> > crashkernel=auto. You wouldn't add a random crashkernel size then
> > compared with the recommended crashkernel size, then reboot, will you?
> > If crashkernel=auto get the expected size, no need to reboot. Means 99%
> > of systems has no need to reboot. Only very few of systems, need reboot
> > after checking the recommended size.
> > 
> > Long story short. crashkernel=auto will give a default value, trying to
> > cover most of systems. (Very few high end server need check if it's
> > enough and adjust with the help of user space tools. Then reboot.)
> 
> Then we might really want to investigate into shrinking a possibly larger
> allocation dynamically during boot.
> 
> > > 
> > > Also: this approach here doesn't make any sense when you want to do
> > > something dependent on other cmdline parameters. Take "fadump=on" vs
> > > "fadump=off" as an example. You just cannot handle it properly as proposed
> > > in this patch. To me the approach in this patch makes least sense TBH.
> > 
> > Why? We don't have this kind of judgement in kernel? Crashkernel=auto is
> > a generic mechanism, and has been added much earlier. Fadump was added
> > later by IBM for their need on ppc only, it relies on crashkernel
> > reservation but has a different mechanism of dumping. If it needs a
> > different value than kdump, special handling is certainly needed. Who says
> > it has to be 'fadump=on'? They can check the value in a user space program
> > and add it to the cmdline as you suggested, or they can also make it part
> > of auto. The most suitable is the best.
> 
> Take a look at the RHEL8 handling to see where my comment is coming from.
> 
> > 
> > And I have several questions to ask, hope you can help answer:
> > 
> > 1) Have you ever met crashkernel=auto broken on virt platform?
> 
> I have encountered it being very suboptimal. I call wasting hundreds of MB
> problematic, especially when dynamically resizing of VMs (for example, using
> memory ballooning)
> 
> > 
> > Asking this because you are from Virt team, and crashkernel=auto has been
> > there in RHEL for many years, and we have been working with Virt team to
> > support dumping. We haven't seen any bug report or complaint about
> > crashkernel=auto from Virt.
> 
> I've had plenty of bug reports where people try inflating the balloon fairly
> heavily but don't take the crashkernel size into account. The bigger the
> crashkernel size, the bigger the issue when people try squeezing the last
> couple of MB out of their VMs. I keep repeating to them "with
> crashkernel=auto, you have to be careful about how much memory might get set
> aside for the crashkernel and, therefore, reduces your effective guest OS
> RAM size and reduces the maximum balloon size".
> 
> > 
> > 2) Adding crashkernel=auto, and the kdumpctl estimate as user space
> > program to get a recommended size, then reboot. Removing crashkernel=auto,
> > only the kdumpctl estimate to get a recommended size, always reboot.
> > In RHEL we will take the 1st option. Are you willing to take the 2nd one
> > for Virt platform since you think crashkernel=auto is plain wrong, pure
> > ugliness, essentially broken, least sense?
> 
> We are talking about upstreaming stuff here and I am wearing my upstream hat
> here. I'm stating (just like people decades ago) that this might not be the
> right approach for upstream, at least not as it stands.
> 
> And no, I don't have time to solve problems/implement solutions/upstream
> patches to tackle fundamental issues that have been there for decades.
> 
> I'll be happy to help looking into dynamic shrinking of the crashkernel size
> if that approach makes sense. We could even let user space trigger that
> resizing -- without a reboot.

I'm not replying to each inline comment since I believe they have been
covered by the earlier reply. Thanks for looking into this and sharing your
thoughts, and for letting us know that you really care about the extra
memory on VMs; we had realized it was there, but didn't realize it really
causes issues.

Thanks
Baoquan
David Hildenbrand May 10, 2021, 11:01 a.m. UTC | #10
>> I can understand the reasoning of "using a fraction of the memory size" when
>> booting up just to be on the safe side as we don't know", and that
>> motivation is much better than what I read so far. But then I wonder if we
>> cannot handle that any better? Because this feels very suboptimal to me and
>> I feel like there can be cases where the heuristic is just wrong.
> 
> Yes, I understand what you said. Our headache mainly comes from bare metal
> systems, where we worry the reservation is not enough because of many devices.
> 
> On a VM, it is truly different. With far fewer devices, it does waste some
> memory. Usually a fixed minimal size can cover 99.9% of systems unless
> too many devices are attached/added to the VM; I am not sure how
> likely that is. Meanwhile, with the help of /sys/kernel/kexec_crash_size,
> you can shrink it to a small enough but still usable size. You may just
> need to reload the kdump kernel, because the loaded kernel will have been
> erased and is out of control. The shrinking should be done at an early stage
> of kernel running, I would say, lest a crash happen during that period.
> 
> We once tried several different ways to enlarge the crashkernel size
> dynamically, but couldn't come up with a good way.

Yes, enlarging it at runtime is much more difficult than shrinking.
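
For illustration, the shrinking via /sys/kernel/kexec_crash_size mentioned
above could be driven from user space roughly as follows. This is only a
minimal sketch and not part of the patch: the helper names write_crash_size()
and read_crash_size() are hypothetical, and the path is a parameter so the
code can be tried on a scratch file. On a real system the path would be
/sys/kernel/kexec_crash_size, root is required, and the kdump kernel may have
to be unloaded beforehand and reloaded afterwards.

```c
#include <stdio.h>

/*
 * Write a new crashkernel size (in bytes) to a sysfs-style file.
 * Hypothetical helper: on a real system 'path' would be
 * "/sys/kernel/kexec_crash_size". Returns 0 on success, -1 on error.
 */
int write_crash_size(const char *path, unsigned long long new_size)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%llu\n", new_size) < 0) {
		fclose(f);
		return -1;
	}
	/* A rejected value only shows up when the buffer is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}

/* Read the current size back. Returns 0 on success, -1 on error. */
int read_crash_size(const char *path, unsigned long long *size)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", size) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would read the current size first and only ever write a smaller
value, since the interface is meant for shrinking, not enlarging.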

[...]

>> A kernel early during boot can only guess. A kernel late during boot knows.
>> Please correct me if I'm wrong.
> 
> Well, I would not call it a guess; I would rather call them empirical
> values from statistical data. With an a priori value given by 'auto',
> normal users of kdump basically don't need to care about the setting.
> E.g. on Fedora, 'auto' can cover all systems, assuming nobody would deploy
> it on a high end server. Everything we do is to make things simple enough.
> If you don't know how to set it, just add 'crashkernel=auto' to the cmdline,
> then everything is done. I believe you agree that not everybody would
> like to dig into kexec/kdump just to find out how big the crashkernel size
> needs to be when they want to use kdump functionality.

Oh absolutely. But OTOH, most users will leave the value untouched if it
works -- and complain, at least in the VM environment, to me about the
surprising waste of system RAM with "crashkernel=auto".

[...]

>> I know how helpful "crashkernel=auto" was so far, but I am also aware that
>> there was strong pushback in the past, and I remember for the reasons I
>> gave. IMHO we should refine that approach instead of trying to push the same
>> thing upstream every couple of years.
>>
>> I ran into the "512MB crashkernel" on a 64G VM with memory ballooning issue
>> already but didn't report a BZ, because so far, I was under the impression
>> that more memory means more crashkernel. But you explained to me that I was
>> just running into a (for my use case) bad heuristic.
> 
> I re-read the old posts and didn't see strong push-back. People just gave
> some different ideas instead. While we were silent, we tried different
> approaches, e.g. enlarging the crashkernel at run time as mentioned above,
> but that failed. Reusing free pages and user space pages of the 1st kernel
> in the kdump kernel also failed. We also asked people whether it's

Thanks for an insight into the history.

> doable to remove 'auto' support; nobody would give an affirmative
> answer. I know SUSE has long been using the way you mentioned to get a
> recommended size, but it needs several more steps and needs a reboot. We
> would like to take that way too, as an improvement. The simpler, the better.

At least I'm happy to hear that other people had the same idea as me ;)

I can understand the desire for simplicity. It would be great to hear
SUSE's perception of the problem and how they would ideally want to move
forward with this.

[...]

>> I'll be happy to help looking into dynamic shrinking of the crashkernel size
>> if that approach makes sense. We could even let user space trigger that
>> resizing -- without a reboot.
> 
> I'm not replying to each inline comment since I believe they have been
> covered by the earlier reply. Thanks for looking into this and sharing your
> thoughts, and for letting us know that you really care about the extra
> memory on VMs; we had realized it was there, but didn't realize it really
> causes issues.

I mess with dynamic resizing of VMs, that's why I usually take a closer
look at all things that do stuff based on the initial VM size; yes,
there are still a lot of other such things out there.

It also bugged me for quite a bit that we don't have a sane way to 
achieve what we're doing here upstream. It somewhat feels like "this 
doesn't belong in the kernel and is user policy" but then, the existing 
kernel support is suboptimal.

Maybe reserving some "maybe too big but okayish to boot the system in a 
sane environment -- e.g., X% of system RAM and at least Y" size first 
and shrinking it later as triggered by user space early (where we do 
seem to have a way to pre-calculate things now) might actually be a good 
direction to look into.
Dave Young May 10, 2021, 11:44 a.m. UTC | #11
Hi David,
On 05/10/21 at 01:01pm, David Hildenbrand wrote:
[snip]
> It also bugged me for quite a bit that we don't have a sane way to achieve
> what we're doing here upstream. It somewhat feels like "this doesn't belong
> in the kernel and is user policy" but then, the existing kernel support is
> suboptimal.
> 
> Maybe reserving some "maybe too big but okayish to boot the system in a sane
> environment -- e.g., X% of system RAM and at least Y" size first and
> shrinking it later as triggered by user space early (where we do seem to
> have a way to pre-calculate things now) might actually be a good direction
> to look into.

Hmm, that is also an option we considered before.  Even for your
suggestion we still need a kernel option to set the default ratio/value,
and the ratio/value should be another patch which expands the crashkernel
syntax.

Actually the kconfig help text in this patch is indeed misleading: the patch
is not introducing crashkernel=a:b..., and there is no need to explain the
crashkernel syntax. The config option is really just an interface through
which any valid crashkernel setting can be provided as the default. So the
current help text describes the default value of the crash auto string
instead of describing what the crash auto string is.

And crashkernel=auto makes this more flexible. We can tune the values
easily when upgrading.  But if we pass a fixed value from userspace, we
cannot know whether the value was set automatically by the distribution or
manually by the user, and thus we cannot blindly update it.

Thanks
Dave
David Hildenbrand May 10, 2021, 11:56 a.m. UTC | #12
On 10.05.21 13:44, Dave Young wrote:
> Hi David,

Hi Dave,

> On 05/10/21 at 01:01pm, David Hildenbrand wrote:
> [snip]
>> It also bugged me for quite a bit that we don't have a sane way to achieve
>> what we're doing here upstream. It somewhat feels like "this doesn't belong
>> in the kernel and is user policy" but then, the existing kernel support is
>> suboptimal.
>>
>> Maybe reserving some "maybe too big but okayish to boot the system in a sane
>> environment -- e.g., X% of system RAM and at least Y" size first and
>> shrinking it later as triggered by user space early (where we do seem to
>> have a way to pre-calculate things now) might actually be a good direction
>> to look into.
> 
> Hmm, that is also an option we considered before.  Even for your
> suggestion we still need a kernel option to set the default ratio/value,
> and the ratio/value should be another patch which expands the crashkernel
> syntax.

Right.

> 
> Actually the kconfig help text in this patch is indeed misleading: the patch
> is not introducing crashkernel=a:b..., and there is no need to explain the
> crashkernel syntax. The config option is really just an interface through
> which any valid crashkernel setting can be provided as the default. So the
> current help text describes the default value of the crash auto string
> instead of describing what the crash auto string is.

Right. And I would much rather prefer either

a) handling "auto" completely in the kernel, not just setting some 
questionable default at compile time
b) passing it explicitly in via the cmdline

> 
> And crashkernel=auto makes this more flexible. We can tune the values
> easily when upgrading.  But if we pass a fixed value from userspace, we
> cannot know whether the value was set automatically by the distribution or
> manually by the user, and thus we cannot blindly update it.

I think there are two different cases:


1. kernel space updates the value later during boot. "crashkernel=auto" 
really does the right thing, meaning

a) allocate something reasonable and safe during early boot
b) update the allocation during late boot when we know what kind of 
system we're running on

Then, we indeed care about "crashkernel=auto" in the kernel and I think 
it would be a nice thing to have. The only question is on how to make 
that a little configurable, depending on different thingies we might 
want to run in the crashkernel (assuming someone doesn't want kdump).


2. user space updates the value later during boot

IMHO we don't really care who decided on the value as we do the update
from user space. If an admin messes with crashkernel=, the admin can
also mess with kdump not doing any overwrites (e.g., make that
configurable, or detect the overwrite in kdump somehow).
Baoquan He May 11, 2021, 1:36 p.m. UTC | #13
On 05/10/21 at 01:56pm, David Hildenbrand wrote:
> On 10.05.21 13:44, Dave Young wrote:
> > Hi David,
> 
> Hi Dave,
> 
> > On 05/10/21 at 01:01pm, David Hildenbrand wrote:
> > [snip]
> > > It also bugged me for quite a bit that we don't have a sane way to achieve
> > > what we're doing here upstream. It somewhat feels like "this doesn't belong
> > > in the kernel and is user policy" but then, the existing kernel support is
> > > suboptimal.
> > > 
> > > Maybe reserving some "maybe too big but okayish to boot the system in a sane
> > > environment -- e.g., X% of system RAM and at least Y" size first and
> > > shrinking it later as triggered by user space early (where we do seem to
> > > have a way to pre-calculate things now) might actually be a good direction
> > > to look into.
> > 
> > Hmm, that is also an option we considered before.  Even for your
> > suggestion we still need a kernel option to set the default ratio/value.
> > and the ratio/value should be another patch which expands crashkernel
> > syntax.
> 
> Right.
> 
> > 
> > Actually the kconfig help text in this patch is indeed misleading, it is
> > not introducing crashkernel=a:b... and no need to explain about the
> > crashkernel syntax, the config option is actually just some interface we
> > can add any valid crashkernel settings to be used by default. So current
> > patch help text describes the default value of crash auto str, instead
> > of describes what crash auto str is.
> 
> Right. And I would much rather prefer either
> 
> a) handling "auto" completely in the kernel, not just setting some
> questionable default at compile time

Thanks for the suggestions.

If adding a default value to the kernel config is disliked, then
option a) looks good. We can compute the value as x% of system RAM, but
clamp it to CRASH_KERNEL_MIN/MAX. CRASH_KERNEL_MIN/MAX may need to be
defined with a default value for each ARCH. That is very close to
our current implementation, and it handles 'auto' in the kernel.

And could a kernel config be provided so that people can tune the MIN/MAX
values without having to post a patch for each tuning?
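
The "x% of system RAM, clamped to MIN/MAX" rule could look roughly like the
sketch below. The concrete values of CRASH_KERNEL_MIN/MAX, the percentage
parameter and the function name crash_auto_size() are assumptions made up
for illustration; the thread only names the two bounds and gives no numbers,
and nothing like this is in the posted patch.

```c
#include <stdint.h>

#define SZ_1M (1024ULL * 1024ULL)

/* Hypothetical per-arch bounds; the thread gives no actual values. */
#define CRASH_KERNEL_MIN (160 * SZ_1M)
#define CRASH_KERNEL_MAX (512 * SZ_1M)

/*
 * Take 'percent' of system RAM, clamp the result to
 * [CRASH_KERNEL_MIN, CRASH_KERNEL_MAX] and round down to 1M.
 */
uint64_t crash_auto_size(uint64_t system_ram, unsigned int percent)
{
	uint64_t size = system_ram / 100 * percent;

	if (size < CRASH_KERNEL_MIN)
		size = CRASH_KERNEL_MIN;
	if (size > CRASH_KERNEL_MAX)
		size = CRASH_KERNEL_MAX;

	return size & ~(SZ_1M - 1);
}
```

With these made-up bounds, small machines always get the minimum and very
large machines never get more than the maximum, which is the "tiny extra
part for large RAM" behaviour described earlier in the thread.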


> b) passing it explicitly in via the cmdline
> 
> > 
> > And crashkernel=auto makes this more flexibly. We can tune the values
> > easily when upgrading.  But if we pass a fixed value in userspace we
> > can not know if the value is set by distribution automatically or by user
> > manually thus we can not blindly update it.
> 
> I think there are two different cases:
> 
> 
> 1. kernel space updates the value later during boot. "crashkernel=auto"
> really does the right thing, meaning
> 
> a) allocate something reasonable and safe during early boot
> b) update the allocation during late boot when we know what kind of system
> we're running on
> 
> Then, we indeed care about "crashkernel=auto" in the kernel and I think it
> would be a nice thing to have. The only question is on how to make that a
> little configurable, depending on different thingies we might want to run in
> the crashkernel (assuming someone doesn't want kdump).
> 
> 
> 2. user space updates the value later during boot
> 
> IMHO we don't really car who decided on the value as we do the update from
> user space. If an admin messes with crashkernel=, the admin can also mess
> with kdump not doing any overwrites (e.g., make that configurable, or detect
> the overwrite in kdump somehow).
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
Mike Rapoport May 11, 2021, 4:31 p.m. UTC | #14
Hi Baoquan,

On Tue, May 11, 2021 at 09:36:41PM +0800, Baoquan He wrote:
> On 05/10/21 at 01:56pm, David Hildenbrand wrote:
> > On 10.05.21 13:44, Dave Young wrote:
> > > Hi David,
> > 
> > Hi Dave,
> > 
> > > On 05/10/21 at 01:01pm, David Hildenbrand wrote:
> > > [snip]
> > > > It also bugged me for quite a bit that we don't have a sane way to achieve
> > > > what we're doing here upstream. It somewhat feels like "this doesn't belong
> > > > in the kernel and is user policy" but then, the existing kernel support is
> > > > suboptimal.
> > > > 
> > > > Maybe reserving some "maybe too big but okayish to boot the system in a sane
> > > > environment -- e.g., X% of system RAM and at least Y" size first and
> > > > shrinking it later as triggered by user space early (where we do seem to
> > > > have a way to pre-calculate things now) might actually be a good direction
> > > > to look into.
> > > 
> > > Hmm, that is also an option we considered before.  Even for your
> > > suggestion we still need a kernel option to set the default ratio/value.
> > > and the ratio/value should be another patch which expands crashkernel
> > > syntax.
> > 
> > Right.
> > 
> > > 
> > > Actually the kconfig help text in this patch is indeed misleading, it is
> > > not introducing crashkernel=a:b... and no need to explain about the
> > > crashkernel syntax, the config option is actually just some interface we
> > > can add any valid crashkernel settings to be used by default. So current
> > > patch help text describes the default value of crash auto str, instead
> > > of describes what crash auto str is.
> > 
> > Right. And I would much rather prefer either
> > 
> > a) handling "auto" completely in the kernel, not just setting some
> > questionable default at compile time
> 
> Thanks for the suggestions.
> 
> If the way adding default value into kernel config is disliked,
> this a) option looks good. We can get value with x% of system RAM, but
> clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> defined with a default value for different ARCHes. It's very close to
> our current implementation, and handling 'auto' in kernel.
> 
> And kernel config provided so that people can tune the MIN/MAX value,
> but no need to post patch to do the tuning each time if have to?
 
Maybe I'm missing something, but the whole point is to avoid a kernel
configuration option altogether. If crashkernel=auto works well for 99% of
the cases, there is no need to provide a build time configuration along with
it. There are plenty of ways users can control crashkernel reservations
with the existing 2-4 (depending on architecture) command line options.

Simply hard-coding reasonable defaults (e.g.
"1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
crashkernel=auto is set would cover the same 99% of users you referred to.
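
To make the range syntax concrete, here is a small user space sketch of how
such a default string maps system RAM to a reservation size. It is not the
kernel's parser: parse_size() and crash_size_for() are hypothetical names,
the "@offset" suffix is not handled, and the only behaviour assumed from the
kernel is that a range matches when start <= RAM < end.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse "<number>[K|M|G|T]" and advance *s past it. */
static uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t val = strtoull(*s, &end, 10);

	switch (*end) {
	case 'T': val <<= 10; /* fall through */
	case 'G': val <<= 10; /* fall through */
	case 'M': val <<= 10; /* fall through */
	case 'K': val <<= 10; end++; break;
	default: break;
	}
	*s = end;
	return val;
}

/*
 * Return the reservation for 'system_ram' from a string such as
 * "1G-64G:128M,64G-1T:256M,1T-:512M", or 0 if no range matches
 * or the string is malformed.
 */
uint64_t crash_size_for(const char *spec, uint64_t system_ram)
{
	const char *p = spec;

	while (*p) {
		uint64_t start, end = UINT64_MAX, size;

		start = parse_size(&p);
		if (*p++ != '-')
			return 0;
		if (*p != ':')		/* an open-ended range has no end */
			end = parse_size(&p);
		if (*p++ != ':')
			return 0;
		size = parse_size(&p);

		if (system_ram >= start && system_ram < end)
			return size;

		if (*p == ',')
			p++;
		else if (*p)
			return 0;
	}
	return 0;
}
```

With the example string, a machine below 1G gets no reservation at all, an
8G machine gets 128M and a 2T machine gets 512M.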

If we can resize the reservation later during boot this will also address
David's concern about the wasted memory.

You mentioned that the amount of memory required for the crash kernel
reservation depends on the devices present on the system. Is it possible to
detect how much memory is required at late stages of boot?

> > b) passing it explicitly in via the cmdline
> > 
> > > 
> > > And crashkernel=auto makes this more flexibly. We can tune the values
> > > easily when upgrading.  But if we pass a fixed value in userspace we
> > > can not know if the value is set by distribution automatically or by user
> > > manually thus we can not blindly update it.
> > 
> > I think there are two different cases:
> > 
> > 
> > 1. kernel space updates the value later during boot. "crashkernel=auto"
> > really does the right thing, meaning
> > 
> > a) allocate something reasonable and safe during early boot
> > b) update the allocation during late boot when we know what kind of system
> > we're running on
> > 
> > Then, we indeed care about "crashkernel=auto" in the kernel and I think it
> > would be a nice thing to have. The only question is on how to make that a
> > little configurable, depending on different thingies we might want to run in
> > the crashkernel (assuming someone doesn't want kdump).
> > 
> > 
> > 2. user space updates the value later during boot
> > 
> > IMHO we don't really car who decided on the value as we do the update from
> > user space. If an admin messes with crashkernel=, the admin can also mess
> > with kdump not doing any overwrites (e.g., make that configurable, or detect
> > the overwrite in kdump somehow).
> > 
> > -- 
> > Thanks,
> > 
> > David / dhildenb
> > 
> 
>
David Hildenbrand May 11, 2021, 5:07 p.m. UTC | #15
>> If the way adding default value into kernel config is disliked,
>> this a) option looks good. We can get value with x% of system RAM, but
>> clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
>> defined with a default value for different ARCHes. It's very close to
>> our current implementation, and handling 'auto' in kernel.
>>
>> And kernel config provided so that people can tune the MIN/MAX value,
>> but no need to post patch to do the tuning each time if have to?
>   
> Maybe I'm missing something, but the whole point is to avoid kernel
> configuration option at all. If the crashkernel=auto works good for 99% of
> the cases, there is no need to provide build time configuration along with
> it. There are plenty of ways users can control crashkernel reservations
> with the existing 2-4 (depending on architecture) command line options.
> 
> Simply hard coding a reasonable defaults (e.g.
> "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> crashkernel=auto is set would cover the same 99% of users you referred to.

Right, and we can easily allocate a bit more as a safety net temporarily 
when we can actually shrink the area later.

> 
> If we can resize the reservation later during boot this will also address
> David's concern about the wasted memory.
> 

Yes.

> You mentioned that the amount of memory required for the crash kernel
> reservation depends on the devices present on the system. Is it possible to
> detect how much memory is required at late stages of boot?

Here is my thinking:

There seems to be some kind of formula we can roughly use to come up 
with the final crashkernel size. Baoquan for sure knows all the dirty 
details, I assume it's roughly "core kernel + drivers + user space".

In the kernel, we can only come up with "core kernel + drivers" 
expecting that we will run

a) roughly the same kernel
b) with roughly the same drivers

The "user space" part is completely under user space control, depending 
on what application will be run after kexec.

So I wonder if something like

crashkernel=auto,100M

whereby "100M" corresponds to user space demands in addition to the
variable part that depends on the current kernel + drivers,

would already be somewhat sufficient for the main use cases, I guess.

Of course, that approach will get more complicated if the user space 
portion heavily depends on the drivers etc. Then we need more tunables.
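
A rough sketch of how the "core kernel + drivers + user space" split could
be combined with such a value is shown below. Note that "crashkernel=auto,100M"
is only the syntax floated above and does not exist in the kernel; the
function names parse_auto_user_part() and crash_total_size() are hypothetical,
and how the kernel would measure its own core and driver footprint is left
open here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the hypothetical "auto,<size>[K|M|G]" value and store the user
 * space part in bytes in *user. Returns 0 on success, -1 if malformed.
 */
int parse_auto_user_part(const char *val, uint64_t *user)
{
	char *end;
	uint64_t n;

	if (strncmp(val, "auto,", 5) != 0)
		return -1;
	val += 5;

	n = strtoull(val, &end, 10);
	if (end == val)
		return -1;

	switch (*end) {
	case 'G': n <<= 10; /* fall through */
	case 'M': n <<= 10; /* fall through */
	case 'K': n <<= 10; end++; break;
	default: break;
	}
	if (*end)
		return -1;

	*user = n;
	return 0;
}

/* Final size: the kernel-estimated parts plus the user-supplied part. */
uint64_t crash_total_size(uint64_t core_kernel, uint64_t drivers,
			  uint64_t user)
{
	return core_kernel + drivers + user;
}
```

The point of the split is that only the last term needs to come from the
admin or the distribution; the first two would be derived by the kernel.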
Dave Young May 12, 2021, 7:42 a.m. UTC | #16
On 05/10/21 at 01:56pm, David Hildenbrand wrote:
> On 10.05.21 13:44, Dave Young wrote:
> > Hi David,
> 
> Hi Dave,
> 
> > On 05/10/21 at 01:01pm, David Hildenbrand wrote:
> > [snip]
> > > It also bugged me for quite a bit that we don't have a sane way to achieve
> > > what we're doing here upstream. It somewhat feels like "this doesn't belong
> > > in the kernel and is user policy" but then, the existing kernel support is
> > > suboptimal.
> > > 
> > > Maybe reserving some "maybe too big but okayish to boot the system in a sane
> > > environment -- e.g., X% of system RAM and at least Y" size first and
> > > shrinking it later as triggered by user space early (where we do seem to
> > > have a way to pre-calculate things now) might actually be a good direction
> > > to look into.
> > 
> > Hmm, that is also an option we considered before.  Even for your
> > suggestion we still need a kernel option to set the default ratio/value.
> > and the ratio/value should be another patch which expands crashkernel
> > syntax.
> 
> Right.
> 
> > 
> > Actually the kconfig help text in this patch is indeed misleading, it is
> > not introducing crashkernel=a:b... and no need to explain about the
> > crashkernel syntax, the config option is actually just some interface we
> > can add any valid crashkernel settings to be used by default. So current
> > patch help text describes the default value of crash auto str, instead
> > of describes what crash auto str is.
> 
> Right. And I would much rather prefer either
> 
> a) handling "auto" completely in the kernel, not just setting some
> questionable default at compile time
> b) passing it explicitly in via the cmdline
> 
> > 
> > And crashkernel=auto makes this more flexibly. We can tune the values
> > easily when upgrading.  But if we pass a fixed value in userspace we
> > can not know if the value is set by distribution automatically or by user
> > manually thus we can not blindly update it.
> 
> I think there are two different cases:
> 
> 
> 1. kernel space updates the value later during boot. "crashkernel=auto"
> really does the right thing, meaning
> 
> a) allocate something reasonable and safe during early boot
> b) update the allocation during late boot when we know what kind of system
> we're running on

Sorry for my laggy reply :)

As for the kernel late boot action, the other notable issue is that most
device drivers are kernel modules, which are loaded by udev. Some complex
storage/network drivers in particular often use a lot of memory.
Kairui has a tool named "memstrack" which can be used to monitor the
peak memory of the module loading phase.  But that can only be done in
userspace for now.

And we have some different setups for the normal boot and the kdump
kernel, e.g. special cmdline options such as nr_cpus=1, and also some
in-kernel handling, for example patches merged in networking drivers to use
less memory in the kdump kernel via smaller queues etc.

Otherwise, the other kernel memory requirements can be estimated in the kernel.

> 
> Then, we indeed care about "crashkernel=auto" in the kernel and I think it
> would be a nice thing to have. The only question is on how to make that a
> little configurable, depending on different thingies we might want to run in
> the crashkernel (assuming someone doesn't want kdump).
> 
> 
> 2. user space updates the value later during boot
> 
> IMHO we don't really car who decided on the value as we do the update from
> user space. If an admin messes with crashkernel=, the admin can also mess
> with kdump not doing any overwrites (e.g., make that configurable, or detect
> the overwrite in kdump somehow).
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

Thanks
Dave
Baoquan He May 12, 2021, 2:13 p.m. UTC | #17
On 05/11/21 at 07:31pm, Mike Rapoport wrote:
> Hi Baoquan,
> 
> On Tue, May 11, 2021 at 09:36:41PM +0800, Baoquan He wrote:
> > On 05/10/21 at 01:56pm, David Hildenbrand wrote:
> > > On 10.05.21 13:44, Dave Young wrote:
> > > > Hi David,
> > > 
> > > Hi Dave,
> > > 
> > > > On 05/10/21 at 01:01pm, David Hildenbrand wrote:
> > > > [snip]
> > > > > It also bugged me for quite a bit that we don't have a sane way to achieve
> > > > > what we're doing here upstream. It somewhat feels like "this doesn't belong
> > > > > in the kernel and is user policy" but then, the existing kernel support is
> > > > > suboptimal.
> > > > > 
> > > > > Maybe reserving some "maybe too big but okayish to boot the system in a sane
> > > > > environment -- e.g., X% of system RAM and at least Y" size first and
> > > > > shrinking it later as triggered by user space early (where we do seem to
> > > > > have a way to pre-calculate things now) might actually be a good direction
> > > > > to look into.
> > > > 
> > > > Hmm, that is also an option we considered before.  Even for your
> > > > suggestion we still need a kernel option to set the default ratio/value.
> > > > and the ratio/value should be another patch which expands crashkernel
> > > > syntax.
> > > 
> > > Right.
> > > 
> > > > 
> > > > Actually the kconfig help text in this patch is indeed misleading, it is
> > > > not introducing crashkernel=a:b... and no need to explain about the
> > > > crashkernel syntax, the config option is actually just some interface we
> > > > can add any valid crashkernel settings to be used by default. So current
> > > > patch help text describes the default value of crash auto str, instead
> > > > of describes what crash auto str is.
> > > 
> > > Right. And I would much rather prefer either
> > > 
> > > a) handling "auto" completely in the kernel, not just setting some
> > > questionable default at compile time
> > 
> > Thanks for the suggestions.
> > 
> > If the way adding default value into kernel config is disliked,
> > this a) option looks good. We can get value with x% of system RAM, but
> > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > defined with a default value for different ARCHes. It's very close to
> > our current implementation, and handling 'auto' in kernel.
> > 
> > And kernel config provided so that people can tune the MIN/MAX value,
> > but no need to post patch to do the tuning each time if have to?
>  
> Maybe I'm missing something, but the whole point is to avoid kernel
> configuration option at all. If the crashkernel=auto works good for 99% of
> the cases, there is no need to provide build time configuration along with
> it. There are plenty of ways users can control crashkernel reservations
> with the existing 2-4 (depending on architecture) command line options.
> 
> Simply hard coding a reasonable defaults (e.g.
> "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> crashkernel=auto is set would cover the same 99% of users you referred to.

Thanks for looking into this, Mike.

crashkernel=auto works well for 99% of systems on the prerequisite
that the values behind 'auto' correspond to a specific kernel, e.g. a distro
kernel. I say so because the kernel config of a distro kernel determines the
kernel size, and also the initrd size. A generic default value for
crashkernel=auto doesn't make much sense once it is shipped in distros.
That's why we originally wanted to add the default value to the kernel config.
Asking for a minimal size with a kernel config tunable is then the second-best
choice, when 'auto' is handled in the kernel as David's option a) suggested.

It's not quite clear to me why a kernel config option has to be
avoided. We already have this kind of tunable, e.g. CONFIG_CMA_SIZE_SEL_MBYTES.
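
To illustrate how a range string like the one in Mike's example maps a
machine to a reservation, here is a minimal Python sketch. It is not the
kernel's parser: the function names are made up for illustration, only the
K/M/G/T suffixes are handled, and the optional "@offset" part is ignored.
It assumes a range matches when start <= RAM < end and that the first
matching range wins.

```python
import re

# A subset of the binary suffixes used on the kernel command line.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a string such as '128M' or '1T' into bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def resolve_crashkernel(ranges, system_ram):
    """Return the reservation in bytes for system_ram, or 0 if no range matches.

    Each entry is 'start-[end]:size'. An omitted end means "no upper bound".
    """
    for entry in ranges.split(","):
        span, size = entry.split(":")
        start, _, end = span.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None
        if system_ram >= low and (high is None or system_ram < high):
            return parse_size(size)
    return 0


if __name__ == "__main__":
    defaults = "1G-64G:128M,64G-1T:256M,1T-:512M"
    for ram in ("512M", "8G", "64G", "2T"):
        reserved = resolve_crashkernel(defaults, parse_size(ram))
        print(f"{ram:>5} RAM -> {reserved >> 20} MiB reserved")
```

The point of the distro argument is that the string passed as `ranges`
would differ per kernel build, while the lookup itself stays the same.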

> 
> If we can resize the reservation later during boot this will also address
> David's concern about the wasted memory.

We can't resize the reservation arbitrarily; currently we can only shrink it.
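
For reference, shrinking is done from user space by writing a smaller
byte count to /sys/kernel/kexec_crash_size. The following Python sketch
wraps that write; the helper name and the grow check are my own additions,
and the file path is a parameter so the logic can be tried on a plain file.
On a real system the write needs root, and as far as I know the kernel
rejects it while a crash kernel image is loaded.

```python
from pathlib import Path

# sysfs attribute exposing the crashkernel reservation size in bytes.
KEXEC_CRASH_SIZE = Path("/sys/kernel/kexec_crash_size")


def shrink_crashkernel(new_size, sysfs_file=KEXEC_CRASH_SIZE):
    """Shrink the crashkernel reservation to new_size bytes.

    Refuses to grow the reservation, because only shrinking is supported.
    Returns the size that was in effect before the write.
    """
    current = int(sysfs_file.read_text().strip())
    if new_size > current:
        raise ValueError(
            f"cannot grow reservation from {current} to {new_size} bytes"
        )
    if new_size < current:
        sysfs_file.write_text(f"{new_size}\n")
    return current
```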

> 
> You mentioned that amount of memory that is required for crash kernel
> reservation depends on the devices present on the system. Is it possible to
> detect how much memory is required at late stages of boot?

It may be doable to detect this at a late stage of boot, but that needs
investigation; for now we are working on doing it after system bootup. The
thing is, the detection is very coarse-grained. We count in all loaded kernel
modules. But in the kdump kernel, only the strictly necessary modules are
included in our distros. E.g. if we dump over the network, NIC modules are
collected; otherwise we filter them out to reduce memory usage in the kdump
kernel. For most normal systems with dozens of devices, the memory required
by device drivers in the kdump kernel is limited. On VM guests, it's even
much less since only the strictly necessary devices are added, e.g.
disk/NIC/serial.

So when I said 99% of systems can be covered by a default value, that was
based on a specific kernel with a fixed kernel config, i.e. mainly distro
kernels. Adding a permanent default value to the upstream kernel doesn't make
much sense if no tunable is provided for distros to adjust it.

> 
> > > b) passing it explicitly in via the cmdline
> > > 
> > > > 
> > > > And crashkernel=auto makes this more flexibly. We can tune the values
> > > > easily when upgrading.  But if we pass a fixed value in userspace we
> > > > can not know if the value is set by distribution automatically or by user
> > > > manually thus we can not blindly update it.
> > > 
> > > I think there are two different cases:
> > > 
> > > 
> > > 1. kernel space updates the value later during boot. "crashkernel=auto"
> > > really does the right thing, meaning
> > > 
> > > a) allocate something reasonable and safe during early boot
> > > b) update the allocation during late boot when we know what kind of system
> > > we're running on
> > > 
> > > Then, we indeed care about "crashkernel=auto" in the kernel and I think it
> > > would be a nice thing to have. The only question is on how to make that a
> > > little configurable, depending on different thingies we might want to run in
> > > the crashkernel (assuming someone doesn't want kdump).
> > > 
> > > 
> > > 2. user space updates the value later during boot
> > > 
> > > IMHO we don't really care who decided on the value as we do the update from
> > > user space. If an admin messes with crashkernel=, the admin can also mess
> > > with kdump not doing any overwrites (e.g., make that configurable, or detect
> > > the overwrite in kdump somehow).
> > > 
> > > -- 
> > > Thanks,
> > > 
> > > David / dhildenb
> > > 
> > 
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Baoquan He May 12, 2021, 2:51 p.m. UTC | #18
On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > If the way adding default value into kernel config is disliked,
> > > this a) option looks good. We can get value with x% of system RAM, but
> > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > defined with a default value for different ARCHes. It's very close to
> > > our current implementation, and handling 'auto' in kernel.
> > > 
> > > And kernel config provided so that people can tune the MIN/MAX value,
> > > but no need to post patch to do the tuning each time if have to?
> > Maybe I'm missing something, but the whole point is to avoid kernel
> > configuration option at all. If the crashkernel=auto works good for 99% of
> > the cases, there is no need to provide build time configuration along with
> > it. There are plenty of ways users can control crashkernel reservations
> > with the existing 2-4 (depending on architecture) command line options.
> > 
> > Simply hard coding a reasonable defaults (e.g.
> > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > crashkernel=auto is set would cover the same 99% of users you referred to.
> 
> Right, and we can easily allocate a bit more as a safety net temporarily
> when we can actually shrink the area later.
> 
> > 
> > If we can resize the reservation later during boot this will also address
> > David's concern about the wasted memory.
> > 
> 
> Yes.
> 
> > You mentioned that amount of memory that is required for crash kernel
> > reservation depends on the devices present on the system. Is it possible to
> > detect how much memory is required at late stages of boot?
> 
> Here is my thinking:
> 
> There seems to be some kind of formula we can roughly use to come up with
> the final crashkernel size. Baoquan for sure knows all the dirty details, I
> assume it's roughly "core kernel + drivers + user space".
> 
> In the kernel, we can only come up with "core kernel + drivers" expecting
> that we will run
> 
> a) roughly the same kernel
> b) with roughly the same drivers

As I replied to Mike, the kernel size varies between kernels with
different configs. We can define a default minimal size that covers the kernel
and drivers on systems without many devices, but hardcoding the size
upstream is not helpful. If the size is too big, users will always be asked to
check and shrink it. If the size is too small, a new value needs to be
worked out and added to the cmdline, followed by a reboot.

> 
> The "user space" part is completely under user space control, depending on
> what application will be run after kexec.
> 
> So I wonder if something like
> 
> crashkernel=auto,100M
> 
> whereby "100M" corresponds to user space demands in addition to the variable
> part depend on the current kernel + drivers.
> 
> would already be somewhat sufficient for main use cases I guess.
> 
> Of course, that approach will get more complicated if the user space portion
> heavily depends on the drivers etc. Then we need more tunables.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand May 12, 2021, 3:07 p.m. UTC | #19
On 12.05.21 16:51, Baoquan He wrote:
> On 05/11/21 at 07:07pm, David Hildenbrand wrote:
>>>> If the way adding default value into kernel config is disliked,
>>>> this a) option looks good. We can get value with x% of system RAM, but
>>>> clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
>>>> defined with a default value for different ARCHes. It's very close to
>>>> our current implementation, and handling 'auto' in kernel.
>>>>
>>>> And kernel config provided so that people can tune the MIN/MAX value,
>>>> but no need to post patch to do the tuning each time if have to?
>>> Maybe I'm missing something, but the whole point is to avoid kernel
>>> configuration option at all. If the crashkernel=auto works good for 99% of
>>> the cases, there is no need to provide build time configuration along with
>>> it. There are plenty of ways users can control crashkernel reservations
>>> with the existing 2-4 (depending on architecture) command line options.
>>>
>>> Simply hard coding a reasonable defaults (e.g.
>>> "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
>>> crashkernel=auto is set would cover the same 99% of users you referred to.
>>
>> Right, and we can easily allocate a bit more as a safety net temporarily
>> when we can actually shrink the area later.
>>
>>>
>>> If we can resize the reservation later during boot this will also address
>>> David's concern about the wasted memory.
>>>
>>
>> Yes.
>>
>>> You mentioned that amount of memory that is required for crash kernel
>>> reservation depends on the devices present on the system. Is it possible to
>>> detect how much memory is required at late stages of boot?
>>
>> Here is my thinking:
>>
>> There seems to be some kind of formula we can roughly use to come up with
>> the final crashkernel size. Baoquan for sure knows all the dirty details, I
>> assume it's roughly "core kernel + drivers + user space".
>>
>> In the kernel, we can only come up with "core kernel + drivers" expecting
>> that we will run
>>
>> a) roughly the same kernel
>> b) with roughly the same drivers
> 
> As replied to Mike, kernel size is undecided for different kernel with
> different configs. We can define a default minimal size to cover kernel
> and driver on systems with not many devices, but hardcoding the size

I never talked about hardcoding, did I?

> into upstream is not helpful. If the size is big, users will be asked to
> check and shrink always. If the size is too small, a new value need be
> got and added to cmdline and reboot.
Kairui Song May 12, 2021, 7:03 p.m. UTC | #20
On Wed, May 12, 2021 at 10:52 PM Baoquan He <bhe@redhat.com> wrote:
> On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > > If the way adding default value into kernel config is disliked,
> > > > this a) option looks good. We can get value with x% of system RAM, but
> > > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > > defined with a default value for different ARCHes. It's very close to
> > > > our current implementation, and handling 'auto' in kernel.
> > > >
> > > > And kernel config provided so that people can tune the MIN/MAX value,
> > > > but no need to post patch to do the tuning each time if have to?
> > > Maybe I'm missing something, but the whole point is to avoid kernel
> > > configuration option at all. If the crashkernel=auto works good for 99% of
> > > the cases, there is no need to provide build time configuration along with
> > > it. There are plenty of ways users can control crashkernel reservations
> > > with the existing 2-4 (depending on architecture) command line options.
> > >
> > > Simply hard coding a reasonable defaults (e.g.
> > > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > > crashkernel=auto is set would cover the same 99% of users you referred to.
> >
> > Right, and we can easily allocate a bit more as a safety net temporarily
> > when we can actually shrink the area later.
> >
> > >
> > > If we can resize the reservation later during boot this will also address
> > > David's concern about the wasted memory.
> > >
> >
> > Yes.
> >
> > > You mentioned that amount of memory that is required for crash kernel
> > > reservation depends on the devices present on the system. Is it possible to
> > > detect how much memory is required at late stages of boot?
> >
> > Here is my thinking:
> >
> > There seems to be some kind of formula we can roughly use to come up with
> > the final crashkernel size. Baoquan for sure knows all the dirty details, I
> > assume it's roughly "core kernel + drivers + user space".
> >
> > In the kernel, we can only come up with "core kernel + drivers" expecting
> > that we will run
> >
> > a) roughly the same kernel
> > b) with roughly the same drivers
>
> As replied to Mike, kernel size is undecided for different kernel with
> different configs. We can define a default minimal size to cover kernel
> and driver on systems with not many devices, but hardcoding the size
> into upstream is not helpful. If the size is big, users will be asked to
> check and shrink always. If the size is too small, a new value need be
> got and added to cmdline and reboot.
>
> >
> > The "user space" part is completely under user space control, depending on
> > what application will be run after kexec.
> >
> > So I wonder if something like
> >
> > crashkernel=auto,100M
> >
> > whereby "100M" corresponds to user space demands in addition to the variable
> > part depend on the current kernel + drivers.
> >
> > would already be somewhat sufficient for main use cases I guess.
> >
> > Of course, that approach will get more complicated if the user space portion
> > heavily depends on the drivers etc. Then we need more tunables.
> >

At first look I actually like this idea of "crashkernel=auto,100M": it
gives userspace some room for tuning, and the kernel can just take care
of its own memory usage. Userspace usage cannot be determined by the kernel.
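
To make the split concrete, here is a small Python sketch of how such an
option could be combined with a kernel-side estimate. The "auto,<size>"
form is only David's proposal and does not exist in the kernel; the
function names and the idea of passing the kernel estimate in as a plain
number are assumptions for illustration.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_auto_option(value):
    """Parse the hypothetical 'auto[,<size>]' form.

    Returns the user space share in bytes (0 when no size is given).
    """
    mode, _, extra = value.partition(",")
    if mode != "auto":
        raise ValueError(f"not an auto request: {value!r}")
    if not extra:
        return 0
    match = re.fullmatch(r"(\d+)([KMG]?)", extra.upper())
    if match is None:
        raise ValueError(f"invalid size: {extra!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def total_reservation(value, kernel_estimate):
    """Add the user space share to a kernel-side estimate, both in bytes."""
    return kernel_estimate + parse_auto_option(value)
```

The hard part is of course the `kernel_estimate` argument, which is what
the rest of this mail is about.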

But unfortunately, estimating the kernel's memory usage for kdump is also
very hard. It depends heavily on the kdump kernel's cmdline, the kernel has
many kdump-specific behaviors/workarounds that affect memory usage, and
the kernel's kconfig affects it as well.

For example, `nr_cpus=1` and `noefi` are commonly used on the kdump
kernel cmdline to reduce memory usage, but it's also completely
acceptable not to use such kernel params for the kdump kernel. Even a
rough estimation most likely won't work, since those moving parts can change
the memory usage by a lot.

So basically kdump's memory usage (userspace or kernel) cannot be
estimated from the kernel side in a generic way. It's strictly bound to
the distro's implementation and config.

That's also why this patch started by adding a kconfig option, so
distros can set a value that corresponds to their default setup.

Baoquan has explained why passing the `crashkernel=` config via the
cmdline also messes things up. So at the time this patch was sent, having
a tunable (via kconfig) `crashkernel=auto` seemed the most helpful
way. I'm not sure whether there is a better way to make it distro-tunable
if not through kconfig.

--
Best Regards,
Kairui Song
Baoquan He May 13, 2021, 5:04 a.m. UTC | #21
On 05/12/21 at 05:07pm, David Hildenbrand wrote:
> On 12.05.21 16:51, Baoquan He wrote:
> > On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > > > If the way adding default value into kernel config is disliked,
> > > > > this a) option looks good. We can get value with x% of system RAM, but
> > > > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > > > defined with a default value for different ARCHes. It's very close to
> > > > > our current implementation, and handling 'auto' in kernel.
> > > > > 
> > > > > And kernel config provided so that people can tune the MIN/MAX value,
> > > > > but no need to post patch to do the tuning each time if have to?
> > > > Maybe I'm missing something, but the whole point is to avoid kernel
> > > > configuration option at all. If the crashkernel=auto works good for 99% of
> > > > the cases, there is no need to provide build time configuration along with
> > > > it. There are plenty of ways users can control crashkernel reservations
> > > > with the existing 2-4 (depending on architecture) command line options.
> > > > 
> > > > Simply hard coding a reasonable defaults (e.g.
> > > > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > > > crashkernel=auto is set would cover the same 99% of users you referred to.
> > > 
> > > Right, and we can easily allocate a bit more as a safety net temporarily
> > > when we can actually shrink the area later.
> > > 
> > > > 
> > > > If we can resize the reservation later during boot this will also address
> > > > David's concern about the wasted memory.
> > > > 
> > > 
> > > Yes.
> > > 
> > > > You mentioned that amount of memory that is required for crash kernel
> > > > reservation depends on the devices present on the system. Is it possible to
> > > > detect how much memory is required at late stages of boot?
> > > 
> > > Here is my thinking:
> > > 
> > > There seems to be some kind of formula we can roughly use to come up with
> > > the final crashkernel size. Baoquan for sure knows all the dirty details, I
> > > assume it's roughly "core kernel + drivers + user space".
> > > 
> > > In the kernel, we can only come up with "core kernel + drivers" expecting
> > > that we will run
> > > 
> > > a) roughly the same kernel
> > > b) with roughly the same drivers
> > 
> > As replied to Mike, kernel size is undecided for different kernel with
> > different configs. We can define a default minimal size to cover kernel
> > and driver on systems with not many devices, but hardcoding the size
> 
> I never talked about hardcoding, did I?

Sorry, I didn't make it clear. By hardcoding I meant a hardcoded
min value. No matter what formula we take, it needs a default MIN value
to bound the lowest size, right? That MIN value is the hardcoding I
meant. With it properly chosen, most systems have no need to shrink
or adjust the crashkernel, given that most systems have less than
64G of memory. Not to mention that the later estimation is done in the 1st
kernel, so very likely it will come up with a bigger value than really needed.
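
As a rough illustration of the "x% of system RAM, clamped by MIN/MAX"
formula, here is a Python sketch. The percentage and the MIN/MAX numbers
used in the example are invented placeholders, not proposed values, and
the function name is hypothetical.

```python
MIB = 1 << 20
GIB = 1 << 30


def auto_crashkernel_size(system_ram, percent, min_size, max_size):
    """Return percent of system_ram in bytes, clamped to [min_size, max_size]."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    wanted = system_ram * percent // 100
    return max(min_size, min(wanted, max_size))


if __name__ == "__main__":
    # Placeholder tunables: 2% of RAM, at least 160 MiB, at most 512 MiB.
    for ram_gib in (4, 16, 64):
        size = auto_crashkernel_size(ram_gib * GIB, 2, 160 * MIB, 512 * MIB)
        print(f"{ram_gib:>3} GiB RAM -> {size // MIB} MiB")
```

With these placeholder numbers the small machine is lifted to the MIN
value and the large one is capped at MAX; only the machine in between
gets a size that actually follows the percentage.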

> 
> > into upstream is not helpful. If the size is big, users will be asked to
> > check and shrink always. If the size is too small, a new value need be
> > got and added to cmdline and reboot.
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand May 17, 2021, 8:22 a.m. UTC | #22
On 12.05.21 16:51, Baoquan He wrote:
> On 05/11/21 at 07:07pm, David Hildenbrand wrote:
>>>> If the way adding default value into kernel config is disliked,
>>>> this a) option looks good. We can get value with x% of system RAM, but
>>>> clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
>>>> defined with a default value for different ARCHes. It's very close to
>>>> our current implementation, and handling 'auto' in kernel.
>>>>
>>>> And kernel config provided so that people can tune the MIN/MAX value,
>>>> but no need to post patch to do the tuning each time if have to?
>>> Maybe I'm missing something, but the whole point is to avoid kernel
>>> configuration option at all. If the crashkernel=auto works good for 99% of
>>> the cases, there is no need to provide build time configuration along with
>>> it. There are plenty of ways users can control crashkernel reservations
>>> with the existing 2-4 (depending on architecture) command line options.
>>>
>>> Simply hard coding a reasonable defaults (e.g.
>>> "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
>>> crashkernel=auto is set would cover the same 99% of users you referred to.
>>
>> Right, and we can easily allocate a bit more as a safety net temporarily
>> when we can actually shrink the area later.
>>
>>>
>>> If we can resize the reservation later during boot this will also address
>>> David's concern about the wasted memory.
>>>
>>
>> Yes.
>>
>>> You mentioned that amount of memory that is required for crash kernel
>>> reservation depends on the devices present on the system. Is it possible to
>>> detect how much memory is required at late stages of boot?
>>
>> Here is my thinking:
>>
>> There seems to be some kind of formula we can roughly use to come up with
>> the final crashkernel size. Baoquan for sure knows all the dirty details, I
>> assume it's roughly "core kernel + drivers + user space".
>>
>> In the kernel, we can only come up with "core kernel + drivers" expecting
>> that we will run
>>
>> a) roughly the same kernel
>> b) with roughly the same drivers
> 
> As replied to Mike, kernel size is undecided for different kernel with
> different configs. We can define a default minimal size to cover kernel
> and driver on systems with not many devices, but hardcoding the size
> into upstream is not helpful. If the size is big, users will be asked to
> check and shrink always. If the size is too small, a new value need be
> got and added to cmdline and reboot.
> 

Hi Baoquan, Kairui, Dave,

so IIUC now, our "old" kernel cannot actually tell us any reliable 
"crashkernel area size" because

a) it has no idea with which cmdline parameters the crashkernel will be
    started with, and these can have a big impact.
b) it has no idea which driver will be loaded in the crashkernel.
c) It has no idea what will be running in the crashkernel user space.


AFAIKS, best we can do without further information is, therefore, use 
some heuristic to a) allocate some memory early during boot in the 
kernel and b) later refine our allocation, triggered by user space (-> 
shrink the crashkernel area).

I dislike calling a) "auto". It provides a default based on some 
heuristic (boot memory size), and that default might be very unfortunate 
in some scenarios (-> waste memory).

While we could discuss calling the current approach (a)
"crashkernel=default", whereby the default is encoded at compile time
as determined by a distributor, I still don't quite like it
because it feels like this is not necessary. We have a way to pass
something like that via the cmdline, so it's just a matter of properly
using that feature from user space.


AFAIKS, all you want is most probably a more dynamic way to construct a 
kernel cmdline, with some properties specific to a kernel.

Let's assume the following:

a) When a distributor ships a kernel, he also ships some kind of 
defaults file. Let's assume for simplicity

/lib/modules/5.11.19-200.fc33.x86_64/defaults.conf

The file might contain

CRASHKERNEL_DEFAULT=WHATEVER


b) When generating the cmdline for e.g.,
/boot/loader/entries/XXX-5.11.19-200.fc33.x86_64.conf we run some script
that consults that file in addition to /etc/default/grub. For example, if
the kdump service is installed and /etc/default/grub does not contain
"crashkernel=" (except when we encounter "crashkernel=auto" for compat
handling), we add "crashkernel=WHATEVER". Of course, we might do more
involved stuff based on the current setup, user config, etc.


c) When we install the kdump service, all we have to do is re-generate 
the boot entries AFAIKS. Just like we would when adding 
"crashkernel=auto" right now.


The end result would also allow for having per-kernel defaults and
changing them on kernel updates. It would require some thought on how to
make it fly in user space, how to "ship" the defaults etc.
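
Steps a) to c) could be sketched roughly as follows in Python. This is
only an illustration of the idea, not an existing tool: the function
names are made up, the defaults file is passed in as text rather than
read from disk, and leaving the cmdline untouched when kdump is not
installed is my own assumption.

```python
def read_crashkernel_default(defaults_text):
    """Return the CRASHKERNEL_DEFAULT value from a defaults file, or None."""
    for line in defaults_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CRASHKERNEL_DEFAULT":
            return value.strip().strip('"')
    return None


def build_cmdline(user_cmdline, defaults_text, kdump_installed):
    """Compose the kernel cmdline for one boot entry.

    An explicit crashkernel= set by the admin always wins. A legacy
    crashkernel=auto is replaced by the per-kernel default.
    """
    default = read_crashkernel_default(defaults_text)
    args = user_cmdline.split()
    explicit = any(
        arg.startswith("crashkernel=") and arg != "crashkernel=auto"
        for arg in args
    )
    if not kdump_installed or default is None or explicit:
        return " ".join(args)
    # Compat handling: drop the old "auto" token before adding the default.
    args = [arg for arg in args if arg != "crashkernel=auto"]
    args.append(f"crashkernel={default}")
    return " ".join(args)
```

Because the default lives next to each kernel, regenerating the boot
entries after a kernel update picks up the new value without touching
the admin's own configuration.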
Baoquan He May 18, 2021, 8:49 a.m. UTC | #23
On 05/17/21 at 10:22am, David Hildenbrand wrote:
> On 12.05.21 16:51, Baoquan He wrote:
> > On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > > > If the way adding default value into kernel config is disliked,
> > > > > this a) option looks good. We can get value with x% of system RAM, but
> > > > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > > > defined with a default value for different ARCHes. It's very close to
> > > > > our current implementation, and handling 'auto' in kernel.
> > > > > 
> > > > > And kernel config provided so that people can tune the MIN/MAX value,
> > > > > but no need to post patch to do the tuning each time if have to?
> > > > Maybe I'm missing something, but the whole point is to avoid kernel
> > > > configuration option at all. If the crashkernel=auto works good for 99% of
> > > > the cases, there is no need to provide build time configuration along with
> > > > it. There are plenty of ways users can control crashkernel reservations
> > > > with the existing 2-4 (depending on architecture) command line options.
> > > > 
> > > > Simply hard coding a reasonable defaults (e.g.
> > > > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > > > crashkernel=auto is set would cover the same 99% of users you referred to.
> > > 
> > > Right, and we can easily allocate a bit more as a safety net temporarily
> > > when we can actually shrink the area later.
> > > 
> > > > 
> > > > If we can resize the reservation later during boot this will also address
> > > > David's concern about the wasted memory.
> > > > 
> > > 
> > > Yes.
> > > 
> > > > You mentioned that amount of memory that is required for crash kernel
> > > > reservation depends on the devices present on the system. Is it possible to
> > > > detect how much memory is required at late stages of boot?
> > > 
> > > Here is my thinking:
> > > 
> > > There seems to be some kind of formula we can roughly use to come up with
> > > the final crashkernel size. Baoquan for sure knows all the dirty details, I
> > > assume it's roughly "core kernel + drivers + user space".
> > > 
> > > In the kernel, we can only come up with "core kernel + drivers" expecting
> > > that we will run
> > > 
> > > a) roughly the same kernel
> > > b) with roughly the same drivers
> > 
> > As replied to Mike, kernel size is undecided for different kernel with
> > different configs. We can define a default minimal size to cover kernel
> > and driver on systems with not many devices, but hardcoding the size
> > into upstream is not helpful. If the size is big, users will be asked to
> > check and shrink always. If the size is too small, a new value need be
> > got and added to cmdline and reboot.
> > 
> 
> Hi Baoquan, Kairui, Dave,
> 
> so IIUC now, our "old" kernel cannot actually tell us any reliable
> "crashkernel area size" because
> 
> a) it has no idea with which cmdline parameters the crashkernel will be
>    started with, and these can have a big impact.
> b) it has no idea which driver will be loaded in the crashkernel.
> c) It has no idea what will be running in the crashkernel user space.
> 
> 
> AFAIKS, best we can do without further information is, therefore, use some
> heuristic to a) allocate some memory early during boot in the kernel and b)
> later refine our allocation, triggered by user space (-> shrink the
> crashkernel area).
> 
> I dislike calling a) "auto". It provides a default based on some heuristic
> (boot memory size), and that default might be very unfortunate in some
> scenarios (-> waste memory).
> 
> While we could discuss calling the current approach ( a)
> )"crashkernel=default", whereby the default is encoded at compile time as
> determined by a distributor, I still don't quite like it because it
> feels like this is not necessary. We have a way to pass something like that
> via the cmdline, so it's just a matter of properly using that feature from
> user space.
> 
> 
> AFAIKS, all you want is most probably a more dynamic way to construct a
> kernel cmdline, with some properties specific to a kernel.
> 
> Let's assume the following:
> 
> a) When a distributor ships a kernel, he also ships some kind of defaults
> file. Let's assume for simplicity
> 
> /lib/modules/5.11.19-200.fc33.x86_64/defaults.conf
> 
> The file might contain
> 
> CRASHKERNEL_DEFAULT=WHATEVER
> 
> 
> b) When generating the cmdline for e.g.,
> /boot/loader/entries/XXX-5.11.19-200.fc33.x86_64.conf we run some script
> that consult that file in addition to /etc/default/grub. For example, if the
> kdump service was installed and /etc/default/grub does not contain
> "crashkernel=" (except when we encounter "crashkernel=auto" for compat
> handling), we add "crashkernel=WHATEVER". Of course, we might do more
> involved stuff based on the current setup, user config, etc.
> 
> 
> c) When we install the kdump service, all we have to do is re-generate the
> boot entries AFAIKS. Just like we would when adding "crashkernel=auto" right
> now.
> 
> 
> The end result would also allow for having per-kernel defaults and change
> them on kernel updates. Would require some thought on how to make it fly in
> user space, how to "ship" the defaults etc.

Thanks for looking into this; I really appreciate your insight,
comments and patience.

We had a team sync about the various viable solutions the other day,
and also talked about one similar to what you suggest here, since
it seems able to resolve the concerns we have about a replacement
for crashkernel=auto. We will try this in userspace on our side, and hope it
won't introduce risk and can replace crashkernel=auto completely.
David Hildenbrand May 18, 2021, 8:51 a.m. UTC | #24
On 18.05.21 10:49, Baoquan He wrote:
> On 05/17/21 at 10:22am, David Hildenbrand wrote:
>> On 12.05.21 16:51, Baoquan He wrote:
>>> On 05/11/21 at 07:07pm, David Hildenbrand wrote:
>>>>>> If the way adding default value into kernel config is disliked,
>>>>>> this a) option looks good. We can get value with x% of system RAM, but
>>>>>> clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
>>>>>> defined with a default value for different ARCHes. It's very close to
>>>>>> our current implementation, and handling 'auto' in kernel.
>>>>>>
>>>>>> And kernel config provided so that people can tune the MIN/MAX value,
>>>>>> but no need to post patch to do the tuning each time if have to?
>>>>> Maybe I'm missing something, but the whole point is to avoid kernel
>>>>> configuration option at all. If the crashkernel=auto works good for 99% of
>>>>> the cases, there is no need to provide build time configuration along with
>>>>> it. There are plenty of ways users can control crashkernel reservations
>>>>> with the existing 2-4 (depending on architecture) command line options.
>>>>>
>>>>> Simply hard coding a reasonable defaults (e.g.
>>>>> "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
>>>>> crashkernel=auto is set would cover the same 99% of users you referred to.
>>>>
>>>> Right, and we can easily allocate a bit more as a safety net temporarily
>>>> when we can actually shrink the area later.
>>>>
>>>>>
>>>>> If we can resize the reservation later during boot this will also address
>>>>> David's concern about the wasted memory.
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>> You mentioned that amount of memory that is required for crash kernel
>>>>> reservation depends on the devices present on the system. Is it possible to
>>>>> detect how much memory is required at late stages of boot?
>>>>
>>>> Here is my thinking:
>>>>
>>>> There seems to be some kind of formula we can roughly use to come up with
>>>> the final crashkernel size. Baoquan for sure knows all the dirty details, I
>>>> assume it's roughly "core kernel + drivers + user space".
>>>>
>>>> In the kernel, we can only come up with "core kernel + drivers" expecting
>>>> that we will run
>>>>
>>>> a) roughly the same kernel
>>>> b) with roughly the same drivers
>>>
>>> As replied to Mike, kernel size is undecided for different kernel with
>>> different configs. We can define a default minimal size to cover kernel
>>> and driver on systems with not many devices, but hardcoding the size
>>> into upstream is not helpful. If the size is big, users will be asked to
>>> check and shrink always. If the size is too small, a new value need be
>>> got and added to cmdline and reboot.
>>>
>>
>> Hi Baoquan, Kairui, Dave,
>>
>> so IIUC now, our "old" kernel cannot actually tell us any reliable
>> "crashkernel area size" because
>>
>> a) it has no idea with which cmdline parameters the crashkernel will be
>>     started with, and these can have a big impact.
>> b) it has no idea which driver will be loaded in the crashkernel.
>> c) It has no idea what will be running in the crashkernel user space.
>>
>>
>> AFAIKS, best we can do without further information is, therefore, use some
>> heuristic to a) allocate some memory early during boot in the kernel and b)
>> later refine our allocation, triggered by user space (-> shrink the
>> crashkernel area).
>>
>> I dislike calling a) "auto". It provides a default based on some heuristic
>> (boot memory size), and that default might be very unfortunate in some
>> scenarios (-> waste memory).
>>
>> While we could discuss calling the current approach (a)
>> "crashkernel=default", whereby the default is encoded at compile time as
>> determined by a distributor, I still don't quite like it because it
>> feels like this is not necessary. We have a way to pass something like that
>> via the cmdline, so it's just a matter of properly using that feature from
>> user space.
>>
>>
>> AFAIKS, all you want is most probably a more dynamic way to construct a
>> kernel cmdline, with some properties specific to a kernel.
>>
>> Let's assume the following:
>>
>> a) When a distributor ships a kernel, he also ships some kind of defaults
>> file. Let's assume for simplicity
>>
>> /lib/modules/5.11.19-200.fc33.x86_64/defaults.conf
>>
>> The file might contain
>>
>> CRASHKERNEL_DEFAULT=WHATEVER
>>
>>
>> b) When generating the cmdline for e.g.,
>> /boot/loader/entries/XXX-5.11.19-200.fc33.x86_64.conf we run some script
>> that consult that file in addition to /etc/default/grub. For example, if the
>> kdump service was installed and /etc/default/grub does not contain
>> "crashkernel=" (except when we encounter "crashkernel=auto" for compat
>> handling), we add "crashkernel=WHATEVER". Of course, we might do more
>> involved stuff based on the current setup, user config, etc.
>>
>>
>> c) When we install the kdump service, all we have to do is re-generate the
>> boot entries AFAIKS. Just like we would when adding "crashkernel=auto" right
>> now.
>>
>>
>> The end result would also allow for having per-kernel defaults and change
>> them on kernel updates. Would require some thought on how to make it fly in
>> user space, how to "ship" the defaults etc.
> 
> Thanks for looking into this, and really appreciate your insight,
> comments and patience.

Thanks for being patient with me :)

> 
> We had a sync in team about various viable solutions the other day,
> and also talked about the similar one as you suggested here since
> it seems to be able to resolve the concerns we have for a replacement
> of crashkernel=auto. We will try these in userspace in our side, hope it
> won't introduce risk and can replace crashkernel=auto perfectly.

Sure, and as I said, if we want to look into shrinking of the 
crashkernel area triggered by user space, I'm happy to help.
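
A minimal Python sketch of what "shrinking triggered by user space" could look like, assuming the existing /sys/kernel/kexec_crash_size sysfs file is the mechanism (the thread does not name one). The function name and the `path` parameter are hypothetical. As far as I know the kernel can refuse the write, for example while a crash kernel is loaded, and may align the value, so the size is read back rather than assumed.

```python
from pathlib import Path

# Existing sysfs file reporting the crashkernel reservation in bytes.
CRASH_SIZE = Path("/sys/kernel/kexec_crash_size")


def shrink_crashkernel(new_bytes, path=CRASH_SIZE):
    """Try to shrink the reservation to new_bytes; return the resulting size.

    Hypothetical helper: `path` is overridable only so the logic can be
    exercised against a plain file instead of sysfs.
    """
    current = int(path.read_text().strip())
    if new_bytes >= current:
        # Assumption: the interface only shrinks, so there is nothing to do.
        return current
    path.write_text(str(new_bytes))
    # Read back instead of trusting the request; the kernel may adjust it.
    return int(path.read_text().strip())
```

Something like this would run late in boot, after user space has decided how much memory the crash kernel really needs.
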
Dave Young May 18, 2021, 9:24 a.m. UTC | #25
[Adding the kexec list. For people interested in the earlier replies, please find them in the linux-mm archive.]
On 05/18/21 at 10:51am, David Hildenbrand wrote:
> On 18.05.21 10:49, Baoquan He wrote:
> > On 05/17/21 at 10:22am, David Hildenbrand wrote:
> > > On 12.05.21 16:51, Baoquan He wrote:
> > > > On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > > > > > If the way adding default value into kernel config is disliked,
> > > > > > > this a) option looks good. We can get value with x% of system RAM, but
> > > > > > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > > > > > defined with a default value for different ARCHes. It's very close to
> > > > > > > our current implementation, and handling 'auto' in kernel.
> > > > > > > 
> > > > > > > And kernel config provided so that people can tune the MIN/MAX value,
> > > > > > > but no need to post patch to do the tuning each time if have to?
> > > > > > Maybe I'm missing something, but the whole point is to avoid kernel
> > > > > > configuration option at all. If the crashkernel=auto works good for 99% of
> > > > > > the cases, there is no need to provide build time configuration along with
> > > > > > it. There are plenty of ways users can control crashkernel reservations
> > > > > > with the existing 2-4 (depending on architecture) command line options.
> > > > > > 
> > > > > > Simply hard coding a reasonable defaults (e.g.
> > > > > > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > > > > > crashkernel=auto is set would cover the same 99% of users you referred to.
> > > > > 
> > > > > Right, and we can easily allocate a bit more as a safety net temporarily
> > > > > when we can actually shrink the area later.
> > > > > 
> > > > > > 
> > > > > > If we can resize the reservation later during boot this will also address
> > > > > > David's concern about the wasted memory.
> > > > > > 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > You mentioned that amount of memory that is required for crash kernel
> > > > > > > reservation depends on the devices present on the system. Is it possible to
> > > > > > detect how much memory is required at late stages of boot?
> > > > > 
> > > > > Here is my thinking:
> > > > > 
> > > > > There seems to be some kind of formula we can roughly use to come up with
> > > > > the final crashkernel size. Baoquan for sure knows all the dirty details, I
> > > > > assume it's roughly "core kernel + drivers + user space".
> > > > > 
> > > > > In the kernel, we can only come up with "core kernel + drivers" expecting
> > > > > that we will run
> > > > > 
> > > > > a) roughly the same kernel
> > > > > b) with roughly the same drivers
> > > > 
> > > > As replied to Mike, kernel size is undecided for different kernel with
> > > > different configs. We can define a default minimal size to cover kernel
> > > > and driver on systems with not many devices, but hardcoding the size
> > > > into upstream is not helpful. If the size is big, users will be asked to
> > > > check and shrink always. If the size is too small, a new value need be
> > > > got and added to cmdline and reboot.
> > > > 
> > > 
> > > Hi Baoquan, Kairui, Dave,
> > > 
> > > so IIUC now, our "old" kernel cannot actually tell us any reliable
> > > "crashkernel area size" because
> > > 
> > > a) it has no idea with which cmdline parameters the crashkernel will be
> > >     started with, and these can have a big impact.
> > > b) it has no idea which driver will be loaded in the crashkernel.
> > > c) It has no idea what will be running in the crashkernel user space.
> > > 
> > > 
> > > AFAIKS, best we can do without further information is, therefore, use some
> > > heuristic to a) allocate some memory early during boot in the kernel and b)
> > > later refine our allocation, triggered by user space (-> shrink the
> > > crashkernel area).
> > > 
> > > I dislike calling a) "auto". It provides a default based on some heuristic
> > > (boot memory size), and that default might be very unfortunate in some
> > > scenarios (-> waste memory).
> > > 
> > > While we could discuss calling the current approach (a)
> > > "crashkernel=default", whereby the default is encoded at compile time as
> > > determined by a distributor, I still don't quite like it because it
> > > feels like this is not necessary. We have a way to pass something like that
> > > via the cmdline, so it's just a matter of properly using that feature from
> > > user space.
> > > 
> > > 
> > > AFAIKS, all you want is most probably a more dynamic way to construct a
> > > kernel cmdline, with some properties specific to a kernel.
> > > 
> > > Let's assume the following:
> > > 
> > > a) When a distributor ships a kernel, he also ships some kind of defaults
> > > file. Let's assume for simplicity
> > > 
> > > /lib/modules/5.11.19-200.fc33.x86_64/defaults.conf
> > > 
> > > The file might contain
> > > 
> > > CRASHKERNEL_DEFAULT=WHATEVER
> > > 
> > > 
> > > b) When generating the cmdline for e.g.,
> > > /boot/loader/entries/XXX-5.11.19-200.fc33.x86_64.conf we run some script
> > > that consult that file in addition to /etc/default/grub. For example, if the
> > > kdump service was installed and /etc/default/grub does not contain
> > > "crashkernel=" (except when we encounter "crashkernel=auto" for compat
> > > handling), we add "crashkernel=WHATEVER". Of course, we might do more
> > > involved stuff based on the current setup, user config, etc.
> > > 
> > > 
> > > c) When we install the kdump service, all we have to do is re-generate the
> > > boot entries AFAIKS. Just like we would when adding "crashkernel=auto" right
> > > now.
> > > 
> > > 
> > > The end result would also allow for having per-kernel defaults and change
> > > them on kernel updates. Would require some thought on how to make it fly in
> > > user space, how to "ship" the defaults etc.
> > 
> > Thanks for looking into this, and really appreciate your insight,
> > comments and patience.
> 
> Thanks for being patient with me :)
> 
> > 
> > We had a sync in team about various viable solutions the other day,
> > and also talked about the similar one as you suggested here since
> > it seems to be able to resolve the concerns we have for a replacement
> > of crashkernel=auto. We will try these in userspace in our side, hope it
> > won't introduce risk and can replace crashkernel=auto perfectly.
> 
> Sure, and as I said, if we want to look into shrinking of the crashkernel
> area triggered by user space, I'm happy to help.
> 

David, Baoquan, thank you both for exploring the issue.  Let's try to do
it like this in downstream.
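
To make "like this" concrete, here is a rough Python sketch of steps a) and b) from David's proposal quoted above: read a per-kernel default and add it to the cmdline unless the admin already set an explicit crashkernel= value. The function names, the `kdump_installed` flag and the idea of operating on a plain cmdline string are assumptions for illustration; a real implementation would hook into the distribution's boot-entry tooling.

```python
def read_default(defaults_text, key="CRASHKERNEL_DEFAULT"):
    """Return the value of KEY=... from a defaults.conf-style text, or None."""
    for line in defaults_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"')
    return None


def build_cmdline(cmdline, default, kdump_installed=True):
    """Add crashkernel=<default> unless an explicit value is already present.

    "crashkernel=auto" is treated as a compat placeholder and replaced.
    """
    if not kdump_installed or default is None:
        return cmdline
    args = cmdline.split()
    current = [a for a in args if a.startswith("crashkernel=")]
    if any(a != "crashkernel=auto" for a in current):
        return cmdline  # the admin chose an explicit value; leave it alone
    args = [a for a in args if a != "crashkernel=auto"]
    args.append(f"crashkernel={default}")
    return " ".join(args)
```

For example, `build_cmdline("ro crashkernel=auto quiet", "256M")` gives `"ro quiet crashkernel=256M"`, while a cmdline that already carries `crashkernel=512M` is returned unchanged.
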

The kdump initramfs contains only what kdump needs, so its memory
requirements are lower; fadump depends on the normal kernel initramfs
and thus needs more memory than kdump.

Hari, with this new no-auto approach, another thing we need to consider is how
fadump can use the same value if you do not introduce a new parameter.  As you
are working in dracut on packing the kdump initramfs into the first kernel's
initramfs, it may be possible for kdump and fadump to use the same value, maybe
the kdump crashkernel value plus some static number for powerpc only. Anyway,
just a thought.  Please share your comments, if any.

Thanks
Dave
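
For readers of the patch that follows, this is a small Python sketch of how the range syntax used by CRASH_AUTO_STR (and discussed above as a hard-coded default) maps a RAM size to a reservation. It is an illustration in user-space terms, not the kernel's parser: the function names are hypothetical, only K/M/G/T suffixes are handled, and an optional @offset is ignored.

```python
import re

# Binary unit suffixes; the kernel's memparse() accepts more than these.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as '128M' into bytes."""
    m = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if m is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


def reserved_for(ram_bytes, spec):
    """Return the bytes reserved for ram_bytes under '<start>-[<end>]:<size>,...'."""
    for entry in spec.split(","):
        rng, _, size = entry.partition(":")
        start, _, end = rng.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None  # empty end means unbounded
        # The range end is exclusive, matching the help text in the patch.
        if ram_bytes >= low and (high is None or ram_bytes < high):
            return parse_size(size.split("@")[0])  # drop an optional @offset
    return 0  # no range matched: nothing is reserved
```

With the default from the patch, `reserved_for(8 << 30, "1G-64G:128M,64G-1T:256M,1T-:512M")` returns 134217728 (128M), and a machine with 512M of RAM gets 0.
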

Patch

--- a/arch/Kconfig~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
+++ a/arch/Kconfig
@@ -14,6 +14,26 @@  menu "General architecture-dependent opt
 config CRASH_CORE
 	bool
 
+config CRASH_AUTO_STR
+	string "Memory reserved for crash kernel"
+	depends on CRASH_CORE
+	default "1G-64G:128M,64G-1T:256M,1T-:512M"
+	help
+	  This configures the reserved memory dependent
+	  on the value of System RAM. The syntax is:
+	  crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
+	              range=start-[end]
+
+	  For example:
+	      crashkernel=512M-2G:64M,2G-:128M
+
+	  This would mean:
+
+	      1) if the RAM is smaller than 512M, then don't reserve anything
+	         (this is the "rescue" case)
+	      2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
+	      3) if the RAM size is larger than 2G, then reserve 128M
+
 config KEXEC_CORE
 	select CRASH_CORE
 	bool
--- a/Documentation/admin-guide/kdump/kdump.rst~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
+++ a/Documentation/admin-guide/kdump/kdump.rst
@@ -285,7 +285,8 @@  This would mean:
     2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
     3) if the RAM size is larger than 2G, then reserve 128M
 
-
+Or you can use crashkernel=auto to choose the crash kernel memory size
+based on the recommended configuration set for each arch.
 
 Boot into System Kernel
 =======================
--- a/Documentation/admin-guide/kernel-parameters.txt~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -751,6 +751,12 @@ 
 			a memory unit (amount[KMG]). See also
 			Documentation/admin-guide/kdump/kdump.rst for an example.
 
+	crashkernel=auto
+			[KNL] This parameter will set the reserved memory for
+			the crash kernel based on the value of the CRASH_AUTO_STR
+			that is the best effort estimation for each arch. See also
+			arch/Kconfig for further details.
+
 	crashkernel=size[KMG],high
 			[KNL, X86-64] range could be above 4G. Allow kernel
 			to allocate physical memory region from top, so could
--- a/kernel/crash_core.c~kernel-crash_core-add-crashkernel=auto-for-vmcore-creation
+++ a/kernel/crash_core.c
@@ -7,6 +7,7 @@ 
 #include <linux/crash_core.h>
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
+#include <linux/kexec.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -250,6 +251,12 @@  static int __init __parse_crashkernel(ch
 	if (suffix)
 		return parse_crashkernel_suffix(ck_cmdline, crash_size,
 				suffix);
+#ifdef CONFIG_CRASH_AUTO_STR
+	if (strncmp(ck_cmdline, "auto", 4) == 0) {
+		ck_cmdline = CONFIG_CRASH_AUTO_STR;
+		pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
+	}
+#endif
 	/*
 	 * if the commandline contains a ':', then that's the extended
 	 * syntax -- if not, it must be the classic syntax