[18/19] arm64: kdump: update a kernel doc

Message ID	fb4506943bbaed8b0657ed367ebd57e4064d57ed.1452884156.git.geoff@infradead.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> Message-Id: <fb4506943bbaed8b0657ed367ebd57e4064d57ed.1452884156.git.geoff@infradead.org> In-Reply-To: <cover.1452884156.git.geoff@infradead.org> References: <cover.1452884156.git.geoff@infradead.org> From: Geoff Levand <geoff@infradead.org> Patch-Date: Wed, 19 Aug 2015 11:50:20 +0900 Subject: [PATCH 18/19] arm64: kdump: update a kernel doc To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will.deacon@arm.com> Date: Fri, 15 Jan 2016 19:18:38 +0000 Precedence: list Cc: Mark Rutland <mark.rutland@arm.com>, marc.zyngier@arm.com, kexec@lists.infradead.org, AKASHI Takahiro <takahiro.akashi@linaro.org>, James Morse <james.morse@arm.com>, linux-arm-kernel@lists.infradead.org, christoffer.dall@linaro.org MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Geoff Levand Jan. 15, 2016, 7:18 p.m. UTC

From: AKASHI Takahiro <takahiro.akashi@linaro.org>

This patch adds arch specific descriptions about kdump usage on arm64
to kdump.txt.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
---
 Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Mark Rutland Jan. 15, 2016, 8:16 p.m. UTC | #1

On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> This patch adds arch specific descriptions about kdump usage on arm64
> to kdump.txt.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index bc4bd5a..36cf978 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>  a remote system.
>  
>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> -s390x and arm architectures.
> +s390x, arm and arm64 architectures.
>  
>  When the system kernel boots, it reserves a small section of memory for
>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> @@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
>  
>      AUTO_ZRELADDR=y
>  
> +Dump-capture kernel config options (Arch Dependent, arm64)
> +----------------------------------------------------------
> +
> +1) The maximum memory size on the dump-capture kernel must be limited by
> +   specifying:
> +
> +   mem=X[MG]
> +
> +   where X should be less than or equal to the size in "crashkernel="
> +   boot parameter. Kexec-tools will automatically add this.


This is extremely fragile, and will trivially fail when the kernel can
be loaded anywhere (see [1]).

We must explicitly describe the set of regions the crash kernel may use
(i.e. we need base and size). NAK in the absence of that.

Thanks,
Mark.

> +
> +2) Currently, kvm will not be enabled on the dump-capture kernel even
> +   if it is configured.
> +
>  Extended crashkernel syntax
>  ===========================
>  
> @@ -312,6 +326,8 @@ Boot into System Kernel
>     any space below the alignment point may be overwritten by the dump-capture kernel,
>     which means it is possible that the vmcore is not that precise as expected.
>  
> +   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
> +   the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
>  
>  Load the Dump-capture Kernel
>  ============================
> @@ -334,6 +350,8 @@ For s390x:
>  	- Use image or bzImage
>  For arm:
>  	- Use zImage
> +For arm64:
> +	- Use vmlinux or Image
>  
>  If you are using a uncompressed vmlinux image then use following command
>  to load dump-capture kernel.
> @@ -377,6 +395,9 @@ For s390x:
>  For arm:
>  	"1 maxcpus=1 reset_devices"
>  
> +For arm64:
> +	"1 mem=X[MG] maxcpus=1 reset_devices"
> +
>  Notes on loading the dump-capture kernel:
>  
>  * By default, the ELF headers are stored in ELF64 format to support
> -- 
> 2.5.0
> 
> 

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398527.html

AKASHI Takahiro Jan. 18, 2016, 10:26 a.m. UTC | #2

On 01/16/2016 05:16 AM, Mark Rutland wrote:
> On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> This patch adds arch specific descriptions about kdump usage on arm64
>> to kdump.txt.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>   Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
>>   1 file changed, 22 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>> index bc4bd5a..36cf978 100644
>> --- a/Documentation/kdump/kdump.txt
>> +++ b/Documentation/kdump/kdump.txt
>> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>>   a remote system.
>>
>>   Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>> -s390x and arm architectures.
>> +s390x, arm and arm64 architectures.
>>
>>   When the system kernel boots, it reserves a small section of memory for
>>   the dump-capture kernel. This ensures that ongoing Direct Memory Access
>> @@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
>>
>>       AUTO_ZRELADDR=y
>>
>> +Dump-capture kernel config options (Arch Dependent, arm64)
>> +----------------------------------------------------------
>> +
>> +1) The maximum memory size on the dump-capture kernel must be limited by
>> +   specifying:
>> +
>> +   mem=X[MG]
>> +
>> +   where X should be less than or equal to the size in "crashkernel="
>> +   boot parameter. Kexec-tools will automatically add this.
>
>
> This is extremely fragile, and will trivially fail when the kernel can
> be loaded anywhere (see [1]).

As I said before, this restriction also exists on arm, but I understand
that recent Ard's patches break it.

> We must explicitly describe the set of regions the crash kernel may use
> (i.e. we need base and size). NAK in the absence of that.

There seem to exist several approaches:
(a) use a device-tree property, "linux,usable-memory", in addition to "reg"
     under "memory" node
(b) use a kernel's early parameter, "memmap=nn[@#$]ss"

Power PC takes (a), while this does not work on efi-started kernel
because dtb has no "memory" nodes under efi.
X86 takes (b). If we take this, we will need to overwrite a weak
early_init_dt_add_memory().
(I thought that this approach was not smart as we have three different
ways to specify memory regions, dtb, efi and this kernel parameter.)

Do you have any other ideas?

Thanks,
-Takahiro AKASHI


> Thanks,
> Mark.
>
>> +
>> +2) Currently, kvm will not be enabled on the dump-capture kernel even
>> +   if it is configured.
>> +
>>   Extended crashkernel syntax
>>   ===========================
>>
>> @@ -312,6 +326,8 @@ Boot into System Kernel
>>      any space below the alignment point may be overwritten by the dump-capture kernel,
>>      which means it is possible that the vmcore is not that precise as expected.
>>
>> +   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
>> +   the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
>>
>>   Load the Dump-capture Kernel
>>   ============================
>> @@ -334,6 +350,8 @@ For s390x:
>>   	- Use image or bzImage
>>   For arm:
>>   	- Use zImage
>> +For arm64:
>> +	- Use vmlinux or Image
>>
>>   If you are using a uncompressed vmlinux image then use following command
>>   to load dump-capture kernel.
>> @@ -377,6 +395,9 @@ For s390x:
>>   For arm:
>>   	"1 maxcpus=1 reset_devices"
>>
>> +For arm64:
>> +	"1 mem=X[MG] maxcpus=1 reset_devices"
>> +
>>   Notes on loading the dump-capture kernel:
>>
>>   * By default, the ELF headers are stored in ELF64 format to support
>> --
>> 2.5.0
>>
>>
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398527.html
>

Mark Rutland Jan. 18, 2016, 11:29 a.m. UTC | #3

On Mon, Jan 18, 2016 at 07:26:04PM +0900, AKASHI Takahiro wrote:
> On 01/16/2016 05:16 AM, Mark Rutland wrote:
> >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>
> >>This patch adds arch specific descriptions about kdump usage on arm64
> >>to kdump.txt.
> >>
> >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>---
> >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> >>  1 file changed, 22 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> >>index bc4bd5a..36cf978 100644
> >>--- a/Documentation/kdump/kdump.txt
> >>+++ b/Documentation/kdump/kdump.txt
> >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> >>  a remote system.
> >>
> >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> >>-s390x and arm architectures.
> >>+s390x, arm and arm64 architectures.
> >>
> >>  When the system kernel boots, it reserves a small section of memory for
> >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> >>
> >>      AUTO_ZRELADDR=y
> >>
> >>+Dump-capture kernel config options (Arch Dependent, arm64)
> >>+----------------------------------------------------------
> >>+
> >>+1) The maximum memory size on the dump-capture kernel must be limited by
> >>+   specifying:
> >>+
> >>+   mem=X[MG]
> >>+
> >>+   where X should be less than or equal to the size in "crashkernel="
> >>+   boot parameter. Kexec-tools will automatically add this.
> >
> >
> >This is extremely fragile, and will trivially fail when the kernel can
> >be loaded anywhere (see [1]).
> 
> As I said before, this restriction also exists on arm, but I understand
> that recent Ard's patches break it.
> 
> >We must explicitly describe the set of regions the crash kernel may use
> >(i.e. we need base and size). NAK in the absence of that.
> 
> There seem to exist several approaches:
> (a) use a device-tree property, "linux,usable-memory", in addition to "reg"

I'm not opposed to the idea of a DT property, though I think that should
live under /chosen.

I see that "linux,usable-memory" exists already, though I'm confused as
to exactly what it is for as there is no documentation (neither in the
kernel nor in ePAPR). It's also painful to alter multiple memory nodes
to use that, and I can see that going wrong.

>     under "memory" node
> (b) use a kernel's early parameter, "memmap=nn[@#$]ss"

I'm not too keen on this, as I think it's fragile, and logically
somewhat distinct from what mem= is for (a best effort testing tool).

> Power PC takes (a), while this does not work on efi-started kernel
> because dtb has no "memory" nodes under efi.

A property under /chosen would work for EFI too.

> X86 takes (b). If we take this, we will need to overwrite a weak
> early_init_dt_add_memory().
> (I thought that this approach was not smart as we have three different
> ways to specify memory regions, dtb, efi and this kernel parameter.)

I'm not sure that's a big problem. We may be able to make this generic,
also.

We don't necessarily need a weak add memory function if we can guarantee
nothing gets memblock_alloc'd before we carve it out.

Something like the nomap stuff Ard put together might be useful here.

Thanks,
Mark.

Dave Young Jan. 19, 2016, 1:43 a.m. UTC | #4

On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> On 01/16/2016 05:16 AM, Mark Rutland wrote:
> >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>
> >>This patch adds arch specific descriptions about kdump usage on arm64
> >>to kdump.txt.
> >>
> >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>---
> >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> >>  1 file changed, 22 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> >>index bc4bd5a..36cf978 100644
> >>--- a/Documentation/kdump/kdump.txt
> >>+++ b/Documentation/kdump/kdump.txt
> >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> >>  a remote system.
> >>
> >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> >>-s390x and arm architectures.
> >>+s390x, arm and arm64 architectures.
> >>
> >>  When the system kernel boots, it reserves a small section of memory for
> >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> >>
> >>      AUTO_ZRELADDR=y
> >>
> >>+Dump-capture kernel config options (Arch Dependent, arm64)
> >>+----------------------------------------------------------
> >>+
> >>+1) The maximum memory size on the dump-capture kernel must be limited by
> >>+   specifying:
> >>+
> >>+   mem=X[MG]
> >>+
> >>+   where X should be less than or equal to the size in "crashkernel="
> >>+   boot parameter. Kexec-tools will automatically add this.
> >
> >
> >This is extremely fragile, and will trivially fail when the kernel can
> >be loaded anywhere (see [1]).
> 
> As I said before, this restriction also exists on arm, but I understand
> that recent Ard's patches break it.
> 
> >We must explicitly describe the set of regions the crash kernel may use
> >(i.e. we need base and size). NAK in the absence of that.
> 
> There seem to exist several approaches:
> (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
>     under "memory" node
> (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> 
> Power PC takes (a), while this does not work on efi-started kernel
> because dtb has no "memory" nodes under efi.
> X86 takes (b). If we take this, we will need to overwrite a weak
> early_init_dt_add_memory().

X86 takes another way in latest kexec-tools and kexec_file_load, that is
recreating E820 table and pass it to kexec/kdump kernel, if the entries
are over E820 limitation then turn to use setup_data list for remain
entries.

I think it is X86 specific. Personally I think device tree property is
better.

Thanks
Dave

Dave Young Jan. 19, 2016, 1:50 a.m. UTC | #5

On 01/19/16 at 09:43am, Dave Young wrote:
> On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>
> > >>This patch adds arch specific descriptions about kdump usage on arm64
> > >>to kdump.txt.
> > >>
> > >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>---
> > >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > >>  1 file changed, 22 insertions(+), 1 deletion(-)
> > >>
> > >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > >>index bc4bd5a..36cf978 100644
> > >>--- a/Documentation/kdump/kdump.txt
> > >>+++ b/Documentation/kdump/kdump.txt
> > >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > >>  a remote system.
> > >>
> > >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > >>-s390x and arm architectures.
> > >>+s390x, arm and arm64 architectures.
> > >>
> > >>  When the system kernel boots, it reserves a small section of memory for
> > >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > >>
> > >>      AUTO_ZRELADDR=y
> > >>
> > >>+Dump-capture kernel config options (Arch Dependent, arm64)
> > >>+----------------------------------------------------------
> > >>+
> > >>+1) The maximum memory size on the dump-capture kernel must be limited by
> > >>+   specifying:
> > >>+
> > >>+   mem=X[MG]
> > >>+
> > >>+   where X should be less than or equal to the size in "crashkernel="
> > >>+   boot parameter. Kexec-tools will automatically add this.
> > >
> > >
> > >This is extremely fragile, and will trivially fail when the kernel can
> > >be loaded anywhere (see [1]).
> > 
> > As I said before, this restriction also exists on arm, but I understand
> > that recent Ard's patches break it.
> > 
> > >We must explicitly describe the set of regions the crash kernel may use
> > >(i.e. we need base and size). NAK in the absence of that.
> > 
> > There seem to exist several approaches:
> > (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> >     under "memory" node
> > (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > 
> > Power PC takes (a), while this does not work on efi-started kernel
> > because dtb has no "memory" nodes under efi.
> > X86 takes (b). If we take this, we will need to overwrite a weak
> > early_init_dt_add_memory().
> 
> X86 takes another way in latest kexec-tools and kexec_file_load, that is
> recreating E820 table and pass it to kexec/kdump kernel, if the entries
> are over E820 limitation then turn to use setup_data list for remain
> entries.

One reason about the changing is kernel cmdline space is limited. We need
to pass not only usable memory ranges, also reserved and other different
memory ranges, cmdline array size is not enough sometimes.

> 
> I think it is X86 specific. Personally I think device tree property is
> better.
> 
> Thanks
> Dave

AKASHI Takahiro Jan. 19, 2016, 5:31 a.m. UTC | #6

On 01/18/2016 08:29 PM, Mark Rutland wrote:
> On Mon, Jan 18, 2016 at 07:26:04PM +0900, AKASHI Takahiro wrote:
>> On 01/16/2016 05:16 AM, Mark Rutland wrote:
>>> On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> This patch adds arch specific descriptions about kdump usage on arm64
>>>> to kdump.txt.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>   Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
>>>>   1 file changed, 22 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>>> index bc4bd5a..36cf978 100644
>>>> --- a/Documentation/kdump/kdump.txt
>>>> +++ b/Documentation/kdump/kdump.txt
>>>> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>>>>   a remote system.
>>>>
>>>>   Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>>>> -s390x and arm architectures.
>>>> +s390x, arm and arm64 architectures.
>>>>
>>>>   When the system kernel boots, it reserves a small section of memory for
>>>>   the dump-capture kernel. This ensures that ongoing Direct Memory Access
>>>> @@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
>>>>
>>>>       AUTO_ZRELADDR=y
>>>>
>>>> +Dump-capture kernel config options (Arch Dependent, arm64)
>>>> +----------------------------------------------------------
>>>> +
>>>> +1) The maximum memory size on the dump-capture kernel must be limited by
>>>> +   specifying:
>>>> +
>>>> +   mem=X[MG]
>>>> +
>>>> +   where X should be less than or equal to the size in "crashkernel="
>>>> +   boot parameter. Kexec-tools will automatically add this.
>>>
>>>
>>> This is extremely fragile, and will trivially fail when the kernel can
>>> be loaded anywhere (see [1]).
>>
>> As I said before, this restriction also exists on arm, but I understand
>> that recent Ard's patches break it.
>>
>>> We must explicitly describe the set of regions the crash kernel may use
>>> (i.e. we need base and size). NAK in the absence of that.
>>
>> There seem to exist several approaches:
>> (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
>
> I'm not opposed to the idea of a DT property, though I think that should
> live under /chosen.

In fact, powerpc uses another property, "linux,crashkernel-base(& size)",
under /chosen in order for the *1st kernel* to export info about a memory
region for the 2nd(crash dump) kernel to user apps (kexec-tools).

> I see that "linux,usable-memory" exists already, though I'm confused as
> to exactly what it is for as there is no documentation (neither in the
> kernel nor in ePAPR).

For example,
   memory@0x80000000 {
     reg = <0x0 0x80000000 0x0 0x80000000>;
     linux,usable-memory = <0x0 0x8c000000 0x0 0x4000000>;
   }
There exists 2GB memory available on the system, but the last 64MB can be
used as a system ram. See early_init_dt_scan_memory() in fdt.c.

> It's also painful to alter multiple memory nodes
> to use that, and I can see that going wrong.

Yeah, I implemented this feature in my old versions experimentally,
but didn't like it as we had to touch all the memory nodes.

>>      under "memory" node
>> (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
>
> I'm not too keen on this, as I think it's fragile, and logically
> somewhat distinct from what mem= is for (a best effort testing tool).

I'm not sure whether it is fragile, and contrary to x86, as Dave
described, I think we will only need a single memmap= on arm64 as
efi's mem map table is accessible even on the crash kernel.

>> Power PC takes (a), while this does not work on efi-started kernel
>> because dtb has no "memory" nodes under efi.
>
> A property under /chosen would work for EFI too.
>
>> X86 takes (b). If we take this, we will need to overwrite a weak
>> early_init_dt_add_memory().
>> (I thought that this approach was not smart as we have three different
>> ways to specify memory regions, dtb, efi and this kernel parameter.)
>
> I'm not sure that's a big problem. We may be able to make this generic,
> also.
>
> We don't necessarily need a weak add memory function if we can guarantee
> nothing gets memblock_alloc'd before we carve it out.
>
> Something like the nomap stuff Ard put together might be useful here.

I'm afraid it doesn't work.
It doesn't matter whether it is linearly mapped or not. We should prevent
any part of memory regions used by the 1st kernel from being reclaimed
by memblock_alloc() and others.
Or do you mean we can introduce another memblock flag?

-Takahiro AKASHI

> Thanks,
> Mark.
>

AKASHI Takahiro Jan. 19, 2016, 5:35 a.m. UTC | #7

On 01/19/2016 10:43 AM, Dave Young wrote:
> On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
>> On 01/16/2016 05:16 AM, Mark Rutland wrote:
>>> On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> This patch adds arch specific descriptions about kdump usage on arm64
>>>> to kdump.txt.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>   Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
>>>>   1 file changed, 22 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>>> index bc4bd5a..36cf978 100644
>>>> --- a/Documentation/kdump/kdump.txt
>>>> +++ b/Documentation/kdump/kdump.txt
>>>> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>>>>   a remote system.
>>>>
>>>>   Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>>>> -s390x and arm architectures.
>>>> +s390x, arm and arm64 architectures.
>>>>
>>>>   When the system kernel boots, it reserves a small section of memory for
>>>>   the dump-capture kernel. This ensures that ongoing Direct Memory Access
>>>> @@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
>>>>
>>>>       AUTO_ZRELADDR=y
>>>>
>>>> +Dump-capture kernel config options (Arch Dependent, arm64)
>>>> +----------------------------------------------------------
>>>> +
>>>> +1) The maximum memory size on the dump-capture kernel must be limited by
>>>> +   specifying:
>>>> +
>>>> +   mem=X[MG]
>>>> +
>>>> +   where X should be less than or equal to the size in "crashkernel="
>>>> +   boot parameter. Kexec-tools will automatically add this.
>>>
>>>
>>> This is extremely fragile, and will trivially fail when the kernel can
>>> be loaded anywhere (see [1]).
>>
>> As I said before, this restriction also exists on arm, but I understand
>> that recent Ard's patches break it.
>>
>>> We must explicitly describe the set of regions the crash kernel may use
>>> (i.e. we need base and size). NAK in the absence of that.
>>
>> There seem to exist several approaches:
>> (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
>>      under "memory" node
>> (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
>>
>> Power PC takes (a), while this does not work on efi-started kernel
>> because dtb has no "memory" nodes under efi.
>> X86 takes (b). If we take this, we will need to overwrite a weak
>> early_init_dt_add_memory().
>
> X86 takes another way in latest kexec-tools and kexec_file_load, that is
> recreating E820 table and pass it to kexec/kdump kernel, if the entries
> are over E820 limitation then turn to use setup_data list for remain
> entries.

Thanks. I will visit x86 code again.

> I think it is X86 specific. Personally I think device tree property is
> better.

Do you think so?

-Takahiro AKASHI


>
> Thanks
> Dave
>

Mark Rutland Jan. 19, 2016, 12:10 p.m. UTC | #8

On Tue, Jan 19, 2016 at 02:31:05PM +0900, AKASHI Takahiro wrote:
> On 01/18/2016 08:29 PM, Mark Rutland wrote:
> >On Mon, Jan 18, 2016 at 07:26:04PM +0900, AKASHI Takahiro wrote:
> >>On 01/16/2016 05:16 AM, Mark Rutland wrote:
> >>>On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> >>>>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>>>
> >>>>This patch adds arch specific descriptions about kdump usage on arm64
> >>>>to kdump.txt.
> >>>>
> >>>>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>>>---
> >>>>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> >>>>  1 file changed, 22 insertions(+), 1 deletion(-)
> >>>>
> >>>>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> >>>>index bc4bd5a..36cf978 100644
> >>>>--- a/Documentation/kdump/kdump.txt
> >>>>+++ b/Documentation/kdump/kdump.txt
> >>>>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> >>>>  a remote system.
> >>>>
> >>>>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> >>>>-s390x and arm architectures.
> >>>>+s390x, arm and arm64 architectures.
> >>>>
> >>>>  When the system kernel boots, it reserves a small section of memory for
> >>>>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> >>>>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> >>>>
> >>>>      AUTO_ZRELADDR=y
> >>>>
> >>>>+Dump-capture kernel config options (Arch Dependent, arm64)
> >>>>+----------------------------------------------------------
> >>>>+
> >>>>+1) The maximum memory size on the dump-capture kernel must be limited by
> >>>>+   specifying:
> >>>>+
> >>>>+   mem=X[MG]
> >>>>+
> >>>>+   where X should be less than or equal to the size in "crashkernel="
> >>>>+   boot parameter. Kexec-tools will automatically add this.
> >>>
> >>>
> >>>This is extremely fragile, and will trivially fail when the kernel can
> >>>be loaded anywhere (see [1]).
> >>
> >>As I said before, this restriction also exists on arm, but I understand
> >>that recent Ard's patches break it.
> >>
> >>>We must explicitly describe the set of regions the crash kernel may use
> >>>(i.e. we need base and size). NAK in the absence of that.
> >>
> >>There seem to exist several approaches:
> >>(a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> >
> >I'm not opposed to the idea of a DT property, though I think that should
> >live under /chosen.
> 
> In fact, powerpc uses another property, "linux,crashkernel-base(& size)",
> under /chosen in order for the *1st kernel* to export info about a memory
> region for the 2nd(crash dump) kernel to user apps (kexec-tools).

Do you mean that said property is provided _to_ the 1st kernel, or
provided _by_ the first kernel?

> >I see that "linux,usable-memory" exists already, though I'm confused as
> >to exactly what it is for as there is no documentation (neither in the
> >kernel nor in ePAPR).
> 
> For example,
>   memory@0x80000000 {
>     reg = <0x0 0x80000000 0x0 0x80000000>;
>     linux,usable-memory = <0x0 0x8c000000 0x0 0x4000000>;
>   }
> There exists 2GB memory available on the system, but the last 64MB can be
> used as a system ram. See early_init_dt_scan_memory() in fdt.c.

Sure, except that's the implementation rather than the intended
semantics (which are not defined).

> >It's also painful to alter multiple memory nodes
> >to use that, and I can see that going wrong.
> 
> Yeah, I implemented this feature in my old versions experimentally,
> but didn't like it as we had to touch all the memory nodes.
> 
> >>     under "memory" node
> >>(b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> >
> >I'm not too keen on this, as I think it's fragile, and logically
> >somewhat distinct from what mem= is for (a best effort testing tool).
> 
> I'm not sure whether it is fragile, and contrary to x86, as Dave
> described, I think we will only need a single memmap= on arm64 as
> efi's mem map table is accessible even on the crash kernel.

I just realised I misread this as "mem=", apologies.

It looks like memmap= to force a specific region of memory to be used
may work.

I'd still err on the side of preferring an explicit property in the DT.

> >>Power PC takes (a), while this does not work on efi-started kernel
> >>because dtb has no "memory" nodes under efi.
> >
> >A property under /chosen would work for EFI too.
> >
> >>X86 takes (b). If we take this, we will need to overwrite a weak
> >>early_init_dt_add_memory().
> >>(I thought that this approach was not smart as we have three different
> >>ways to specify memory regions, dtb, efi and this kernel parameter.)
> >
> >I'm not sure that's a big problem. We may be able to make this generic,
> >also.
> >
> >We don't necessarily need a weak add memory function if we can guarantee
> >nothing gets memblock_alloc'd before we carve it out.
> >
> >Something like the nomap stuff Ard put together might be useful here.
> 
> I'm afraid it doesn't work.
> It doesn't matter whether it is linearly mapped or not. We should prevent
> any part of memory regions used by the 1st kernel from being reclaimed
> by memblock_alloc() and others.

Are you certain that nomap memory can be allocated? That sounds like a
major bug.

Nomap memory should act like reserved memory with the additional
property that the kernel must not map it implicitly.

> Or do you mean we can introduce another memblock flag?

That wasn't what I meant, but that would be a potential solution.

Thanks,
Mark.

Mark Rutland Jan. 19, 2016, 12:17 p.m. UTC | #9

On Tue, Jan 19, 2016 at 09:43:32AM +0800, Dave Young wrote:
> On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>
> > >>This patch adds arch specific descriptions about kdump usage on arm64
> > >>to kdump.txt.
> > >>
> > >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>---
> > >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > >>  1 file changed, 22 insertions(+), 1 deletion(-)
> > >>
> > >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > >>index bc4bd5a..36cf978 100644
> > >>--- a/Documentation/kdump/kdump.txt
> > >>+++ b/Documentation/kdump/kdump.txt
> > >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > >>  a remote system.
> > >>
> > >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > >>-s390x and arm architectures.
> > >>+s390x, arm and arm64 architectures.
> > >>
> > >>  When the system kernel boots, it reserves a small section of memory for
> > >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > >>
> > >>      AUTO_ZRELADDR=y
> > >>
> > >>+Dump-capture kernel config options (Arch Dependent, arm64)
> > >>+----------------------------------------------------------
> > >>+
> > >>+1) The maximum memory size on the dump-capture kernel must be limited by
> > >>+   specifying:
> > >>+
> > >>+   mem=X[MG]
> > >>+
> > >>+   where X should be less than or equal to the size in "crashkernel="
> > >>+   boot parameter. Kexec-tools will automatically add this.
> > >
> > >
> > >This is extremely fragile, and will trivially fail when the kernel can
> > >be loaded anywhere (see [1]).
> > 
> > As I said before, this restriction also exists on arm, but I understand
> > that recent Ard's patches break it.
> > 
> > >We must explicitly describe the set of regions the crash kernel may use
> > >(i.e. we need base and size). NAK in the absence of that.
> > 
> > There seem to exist several approaches:
> > (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> >     under "memory" node
> > (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > 
> > Power PC takes (a), while this does not work on efi-started kernel
> > because dtb has no "memory" nodes under efi.
> > X86 takes (b). If we take this, we will need to overwrite a weak
> > early_init_dt_add_memory().
> 
> X86 takes another way in latest kexec-tools and kexec_file_load, that is
> recreating E820 table and pass it to kexec/kdump kernel, if the entries
> are over E820 limitation then turn to use setup_data list for remain
> entries.

This would imply modifying the EFI memory map or the memory nodes, which
I'm not keen on.

I would prefer that they are left _pristine_, and we describe the
restriction on the kdump kernel with additional properties under
/chosen.

That leaves us with more useful information about the environment of the
first kernel, is simpler for userspace (it's resilient to updates to the
UEFI memory map spec, for example), and is simple for the crash kernel.

> I think it is X86 specific. Personally I think device tree property is
> better.

As above, agreed.

Thanks,
Mark.

Dave Young Jan. 19, 2016, 12:28 p.m. UTC | #10

On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> On 01/19/2016 10:43 AM, Dave Young wrote:
> >On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> >>On 01/16/2016 05:16 AM, Mark Rutland wrote:
> >>>On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> >>>>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>>>
> >>>>This patch adds arch specific descriptions about kdump usage on arm64
> >>>>to kdump.txt.
> >>>>
> >>>>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>>>---
> >>>>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> >>>>  1 file changed, 22 insertions(+), 1 deletion(-)
> >>>>
> >>>>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> >>>>index bc4bd5a..36cf978 100644
> >>>>--- a/Documentation/kdump/kdump.txt
> >>>>+++ b/Documentation/kdump/kdump.txt
> >>>>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> >>>>  a remote system.
> >>>>
> >>>>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> >>>>-s390x and arm architectures.
> >>>>+s390x, arm and arm64 architectures.
> >>>>
> >>>>  When the system kernel boots, it reserves a small section of memory for
> >>>>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> >>>>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> >>>>
> >>>>      AUTO_ZRELADDR=y
> >>>>
> >>>>+Dump-capture kernel config options (Arch Dependent, arm64)
> >>>>+----------------------------------------------------------
> >>>>+
> >>>>+1) The maximum memory size on the dump-capture kernel must be limited by
> >>>>+   specifying:
> >>>>+
> >>>>+   mem=X[MG]
> >>>>+
> >>>>+   where X should be less than or equal to the size in "crashkernel="
> >>>>+   boot parameter. Kexec-tools will automatically add this.
> >>>
> >>>
> >>>This is extremely fragile, and will trivially fail when the kernel can
> >>>be loaded anywhere (see [1]).
> >>
> >>As I said before, this restriction also exists on arm, but I understand
> >>that recent Ard's patches break it.
> >>
> >>>We must explicitly describe the set of regions the crash kernel may use
> >>>(i.e. we need base and size). NAK in the absence of that.
> >>
> >>There seem to exist several approaches:
> >>(a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> >>     under "memory" node
> >>(b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> >>
> >>Power PC takes (a), while this does not work on efi-started kernel
> >>because dtb has no "memory" nodes under efi.
> >>X86 takes (b). If we take this, we will need to overwrite a weak
> >>early_init_dt_add_memory().
> >
> >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> >are over E820 limitation then turn to use setup_data list for remain
> >entries.
> 
> Thanks. I will visit x86 code again.
> 
> >I think it is X86 specific. Personally I think device tree property is
> >better.
> 
> Do you think so?

I'm not sure it is the best way. For X86 we run into problem with
memmap= design, one example is pci domain X (X>1) need the pci memory
ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
to 2nd kernel we find that cmdline[] array is not big enough.

Do you think for arm64 only usable memory is necessary to let kdump kernel
know? I'm curious about how arm64 kernel get all memory layout from boot loader,
via UEFI memmap?

Thanks
Dave

Mark Rutland Jan. 19, 2016, 12:51 p.m. UTC | #11

On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > On 01/19/2016 10:43 AM, Dave Young wrote:
> > >On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > >>On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > >>>On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > >>>>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>>>
> > >>>>This patch adds arch specific descriptions about kdump usage on arm64
> > >>>>to kdump.txt.
> > >>>>
> > >>>>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > >>>>---
> > >>>>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > >>>>  1 file changed, 22 insertions(+), 1 deletion(-)
> > >>>>
> > >>>>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > >>>>index bc4bd5a..36cf978 100644
> > >>>>--- a/Documentation/kdump/kdump.txt
> > >>>>+++ b/Documentation/kdump/kdump.txt
> > >>>>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > >>>>  a remote system.
> > >>>>
> > >>>>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > >>>>-s390x and arm architectures.
> > >>>>+s390x, arm and arm64 architectures.
> > >>>>
> > >>>>  When the system kernel boots, it reserves a small section of memory for
> > >>>>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > >>>>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > >>>>
> > >>>>      AUTO_ZRELADDR=y
> > >>>>
> > >>>>+Dump-capture kernel config options (Arch Dependent, arm64)
> > >>>>+----------------------------------------------------------
> > >>>>+
> > >>>>+1) The maximum memory size on the dump-capture kernel must be limited by
> > >>>>+   specifying:
> > >>>>+
> > >>>>+   mem=X[MG]
> > >>>>+
> > >>>>+   where X should be less than or equal to the size in "crashkernel="
> > >>>>+   boot parameter. Kexec-tools will automatically add this.
> > >>>
> > >>>
> > >>>This is extremely fragile, and will trivially fail when the kernel can
> > >>>be loaded anywhere (see [1]).
> > >>
> > >>As I said before, this restriction also exists on arm, but I understand
> > >>that recent Ard's patches break it.
> > >>
> > >>>We must explicitly describe the set of regions the crash kernel may use
> > >>>(i.e. we need base and size). NAK in the absence of that.
> > >>
> > >>There seem to exist several approaches:
> > >>(a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> > >>     under "memory" node
> > >>(b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > >>
> > >>Power PC takes (a), while this does not work on efi-started kernel
> > >>because dtb has no "memory" nodes under efi.
> > >>X86 takes (b). If we take this, we will need to overwrite a weak
> > >>early_init_dt_add_memory().
> > >
> > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > >are over E820 limitation then turn to use setup_data list for remain
> > >entries.
> > 
> > Thanks. I will visit x86 code again.
> > 
> > >I think it is X86 specific. Personally I think device tree property is
> > >better.
> > 
> > Do you think so?
> 
> I'm not sure it is the best way. For X86 we run into problem with
> memmap= design, one example is pci domain X (X>1) need the pci memory
> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> to 2nd kernel we find that cmdline[] array is not big enough.

I'm not sure how PCI ranges relate to the memory map used for normal
memory (i.e. RAM), though I'm probably missing some caveat with the way
ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?

If the kernel got the rest of its system topology from DT, the PCI
regions would be described there.

> Do you think for arm64 only usable memory is necessary to let kdump kernel
> know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> via UEFI memmap?

When booted via EFI, we use the EFI memory map. The EFI stub handles
acquring the relevant information and passing that to the first kernel
in the DTB (see Documentation/arm/uefi.txt).

A kexec'd kernel should simply inherit that. So long as the DTB and/or
UEFI tables in memory are the same, it would be the same as a cold boot.

In the !EFI case, we use the memory nodes in the DTB. Only in this case
could usable-memory properties in memory nodes make sense. I'd prefer a
uniform property under /chosen for both cases.

Thanks,
Mark.

Dave Young Jan. 19, 2016, 1:45 p.m. UTC | #12

On 01/19/16 at 12:51pm, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > >On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > > >>On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > > >>>On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > > >>>>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > >>>>
> > > >>>>This patch adds arch specific descriptions about kdump usage on arm64
> > > >>>>to kdump.txt.
> > > >>>>
> > > >>>>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > >>>>---
> > > >>>>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > > >>>>  1 file changed, 22 insertions(+), 1 deletion(-)
> > > >>>>
> > > >>>>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > > >>>>index bc4bd5a..36cf978 100644
> > > >>>>--- a/Documentation/kdump/kdump.txt
> > > >>>>+++ b/Documentation/kdump/kdump.txt
> > > >>>>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > > >>>>  a remote system.
> > > >>>>
> > > >>>>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > > >>>>-s390x and arm architectures.
> > > >>>>+s390x, arm and arm64 architectures.
> > > >>>>
> > > >>>>  When the system kernel boots, it reserves a small section of memory for
> > > >>>>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > > >>>>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > > >>>>
> > > >>>>      AUTO_ZRELADDR=y
> > > >>>>
> > > >>>>+Dump-capture kernel config options (Arch Dependent, arm64)
> > > >>>>+----------------------------------------------------------
> > > >>>>+
> > > >>>>+1) The maximum memory size on the dump-capture kernel must be limited by
> > > >>>>+   specifying:
> > > >>>>+
> > > >>>>+   mem=X[MG]
> > > >>>>+
> > > >>>>+   where X should be less than or equal to the size in "crashkernel="
> > > >>>>+   boot parameter. Kexec-tools will automatically add this.
> > > >>>
> > > >>>
> > > >>>This is extremely fragile, and will trivially fail when the kernel can
> > > >>>be loaded anywhere (see [1]).
> > > >>
> > > >>As I said before, this restriction also exists on arm, but I understand
> > > >>that recent Ard's patches break it.
> > > >>
> > > >>>We must explicitly describe the set of regions the crash kernel may use
> > > >>>(i.e. we need base and size). NAK in the absence of that.
> > > >>
> > > >>There seem to exist several approaches:
> > > >>(a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> > > >>     under "memory" node
> > > >>(b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > > >>
> > > >>Power PC takes (a), while this does not work on efi-started kernel
> > > >>because dtb has no "memory" nodes under efi.
> > > >>X86 takes (b). If we take this, we will need to overwrite a weak
> > > >>early_init_dt_add_memory().
> > > >
> > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > >are over E820 limitation then turn to use setup_data list for remain
> > > >entries.
> > > 
> > > Thanks. I will visit x86 code again.
> > > 
> > > >I think it is X86 specific. Personally I think device tree property is
> > > >better.
> > > 
> > > Do you think so?
> > 
> > I'm not sure it is the best way. For X86 we run into problem with
> > memmap= design, one example is pci domain X (X>1) need the pci memory
> > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > to 2nd kernel we find that cmdline[] array is not big enough.
> 
> I'm not sure how PCI ranges relate to the memory map used for normal
> memory (i.e. RAM), though I'm probably missing some caveat with the way
> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?

Here is the old patch which was rejected in kexec-tools:
http://lists.infradead.org/pipermail/kexec/2013-February/007924.html

> 
> If the kernel got the rest of its system topology from DT, the PCI
> regions would be described there.

Yes, if kdump kernel use same DT as 1st kernel.

> 
> > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > via UEFI memmap?
> 
> When booted via EFI, we use the EFI memory map. The EFI stub handles
> acquring the relevant information and passing that to the first kernel
> in the DTB (see Documentation/arm/uefi.txt).

Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
instead of memory nodes details.. 

> 
> A kexec'd kernel should simply inherit that. So long as the DTB and/or
> UEFI tables in memory are the same, it would be the same as a cold boot.

For kexec all memory ranges are same, for kdump we need use original reserved
range with crashkernel= as usable memory and all other orignal usable ranges
are not usable anymore. 

Is it possible to modify uefi memmap for kdump case?

> 
> In the !EFI case, we use the memory nodes in the DTB. Only in this case
> could usable-memory properties in memory nodes make sense. I'd prefer a
> uniform property under /chosen for both cases.

We stil use same DTB, need to modify the DT and update the usable and unusable
nodes for kdump?

> Thanks,
> Mark.

Thanks
Dave

Dave Young Jan. 19, 2016, 1:52 p.m. UTC | #13

On 01/19/16 at 12:17pm, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 09:43:32AM +0800, Dave Young wrote:
> > On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > > On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > > >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > > >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > >>
> > > >>This patch adds arch specific descriptions about kdump usage on arm64
> > > >>to kdump.txt.
> > > >>
> > > >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > >>---
> > > >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > > >>  1 file changed, 22 insertions(+), 1 deletion(-)
> > > >>
> > > >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > > >>index bc4bd5a..36cf978 100644
> > > >>--- a/Documentation/kdump/kdump.txt
> > > >>+++ b/Documentation/kdump/kdump.txt
> > > >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > > >>  a remote system.
> > > >>
> > > >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > > >>-s390x and arm architectures.
> > > >>+s390x, arm and arm64 architectures.
> > > >>
> > > >>  When the system kernel boots, it reserves a small section of memory for
> > > >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > > >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > > >>
> > > >>      AUTO_ZRELADDR=y
> > > >>
> > > >>+Dump-capture kernel config options (Arch Dependent, arm64)
> > > >>+----------------------------------------------------------
> > > >>+
> > > >>+1) The maximum memory size on the dump-capture kernel must be limited by
> > > >>+   specifying:
> > > >>+
> > > >>+   mem=X[MG]
> > > >>+
> > > >>+   where X should be less than or equal to the size in "crashkernel="
> > > >>+   boot parameter. Kexec-tools will automatically add this.
> > > >
> > > >
> > > >This is extremely fragile, and will trivially fail when the kernel can
> > > >be loaded anywhere (see [1]).
> > > 
> > > As I said before, this restriction also exists on arm, but I understand
> > > that recent Ard's patches break it.
> > > 
> > > >We must explicitly describe the set of regions the crash kernel may use
> > > >(i.e. we need base and size). NAK in the absence of that.
> > > 
> > > There seem to exist several approaches:
> > > (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> > >     under "memory" node
> > > (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > > 
> > > Power PC takes (a), while this does not work on efi-started kernel
> > > because dtb has no "memory" nodes under efi.
> > > X86 takes (b). If we take this, we will need to overwrite a weak
> > > early_init_dt_add_memory().
> > 
> > X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > are over E820 limitation then turn to use setup_data list for remain
> > entries.
> 
> This would imply modifying the EFI memory map or the memory nodes, which
> I'm not keen on.
> 
> I would prefer that they are left _pristine_, and we describe the
> restriction on the kdump kernel with additional properties under
> /chosen.
> 
> That leaves us with more useful information about the environment of the
> first kernel, is simpler for userspace (it's resilient to updates to the
> UEFI memory map spec, for example), and is simple for the crash kernel.

In theory kexec as boot loader should prepare correct efi memmap and pass
to kernel, but as you said yes it will increase complexity. We need banlance
them.

Thanks
Dave

Mark Rutland Jan. 19, 2016, 2:01 p.m. UTC | #14

On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> On 01/19/16 at 12:51pm, Mark Rutland wrote:
> > On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > >are over E820 limitation then turn to use setup_data list for remain
> > > > >entries.
> > > > 
> > > > Thanks. I will visit x86 code again.
> > > > 
> > > > >I think it is X86 specific. Personally I think device tree property is
> > > > >better.
> > > > 
> > > > Do you think so?
> > > 
> > > I'm not sure it is the best way. For X86 we run into problem with
> > > memmap= design, one example is pci domain X (X>1) need the pci memory
> > > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > > to 2nd kernel we find that cmdline[] array is not big enough.
> > 
> > I'm not sure how PCI ranges relate to the memory map used for normal
> > memory (i.e. RAM), though I'm probably missing some caveat with the way
> > ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> 
> Here is the old patch which was rejected in kexec-tools:
> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> 
> > 
> > If the kernel got the rest of its system topology from DT, the PCI
> > regions would be described there.
> 
> Yes, if kdump kernel use same DT as 1st kernel.

Other than for testing purposes, I don't see why you'd pass the kdump
kernel a DTB inconsistent with that the 1st kernel was passsed (other
than some proerties under /chosen).

We added /sys/firmware/fdt specifically to allow the kexec tools to get
the exact DTB the first kernel used. There's no reason for tools to have
to make something up.

> > > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > > via UEFI memmap?
> > 
> > When booted via EFI, we use the EFI memory map. The EFI stub handles
> > acquring the relevant information and passing that to the first kernel
> > in the DTB (see Documentation/arm/uefi.txt).
> 
> Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> instead of memory nodes details.. 

When booted via EFI, yes.

For NUMA topology in !ACPI kernels, we might need to also retain and
parse memory nodes, but only for toplogy information. The kernel would
still only use memory as described by the EFI memory map.

There's a horrible edge case I've spotted if performing a chain of
cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
respect the EFI memory map so as to avoid corrupting it for the
subsequent LE kernel. Other than this I believe everything should just
work.

> > A kexec'd kernel should simply inherit that. So long as the DTB and/or
> > UEFI tables in memory are the same, it would be the same as a cold boot.
> 
> For kexec all memory ranges are same, for kdump we need use original reserved
> range with crashkernel= as usable memory and all other orignal usable ranges
> are not usable anymore. 

Sure. This is what I believe we should expose with an additional
property under /chosen, while keeping everything else pristine.

The crash kernel can then limit itself to that region, while it would
have the information of the full memory map (which it could log and/or
use to drive other dumping).

> Is it possible to modify uefi memmap for kdump case?

Technically it would be possible, however I don't think it's necessary,
and I think it would be disadvantageous to do so.

Describing the range(s) the crash kernel can use in separate properties
under /chosen has a number of advantages.

> > In the !EFI case, we use the memory nodes in the DTB. Only in this case
> > could usable-memory properties in memory nodes make sense. I'd prefer a
> > uniform property under /chosen for both cases.
> 
> We stil use same DTB, need to modify the DT and update the usable and unusable
> nodes for kdump?

We'd have a (slightly) modified DTB that contained additional properties
describing the range(s) reserved for use by the crash kernel.

Other than those properties under /chosen (e.g. the command line, initrd
pointers if any), it would be the original DTB.

Thanks,
Mark.

Mark Rutland Jan. 19, 2016, 2:05 p.m. UTC | #15

On Tue, Jan 19, 2016 at 09:52:33PM +0800, Dave Young wrote:
> On 01/19/16 at 12:17pm, Mark Rutland wrote:
> > On Tue, Jan 19, 2016 at 09:43:32AM +0800, Dave Young wrote:
> > > On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > > > On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > > > >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > > > >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > >>
> > > > >>This patch adds arch specific descriptions about kdump usage on arm64
> > > > >>to kdump.txt.
> > > > >>
> > > > >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > >>---
> > > > >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > > > >>  1 file changed, 22 insertions(+), 1 deletion(-)
> > > > >>
> > > > >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > > > >>index bc4bd5a..36cf978 100644
> > > > >>--- a/Documentation/kdump/kdump.txt
> > > > >>+++ b/Documentation/kdump/kdump.txt
> > > > >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > > > >>  a remote system.
> > > > >>
> > > > >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > > > >>-s390x and arm architectures.
> > > > >>+s390x, arm and arm64 architectures.
> > > > >>
> > > > >>  When the system kernel boots, it reserves a small section of memory for
> > > > >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > > > >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > > > >>
> > > > >>      AUTO_ZRELADDR=y
> > > > >>
> > > > >>+Dump-capture kernel config options (Arch Dependent, arm64)
> > > > >>+----------------------------------------------------------
> > > > >>+
> > > > >>+1) The maximum memory size on the dump-capture kernel must be limited by
> > > > >>+   specifying:
> > > > >>+
> > > > >>+   mem=X[MG]
> > > > >>+
> > > > >>+   where X should be less than or equal to the size in "crashkernel="
> > > > >>+   boot parameter. Kexec-tools will automatically add this.
> > > > >
> > > > >
> > > > >This is extremely fragile, and will trivially fail when the kernel can
> > > > >be loaded anywhere (see [1]).
> > > > 
> > > > As I said before, this restriction also exists on arm, but I understand
> > > > that recent Ard's patches break it.
> > > > 
> > > > >We must explicitly describe the set of regions the crash kernel may use
> > > > >(i.e. we need base and size). NAK in the absence of that.
> > > > 
> > > > There seem to exist several approaches:
> > > > (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> > > >     under "memory" node
> > > > (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > > > 
> > > > Power PC takes (a), while this does not work on efi-started kernel
> > > > because dtb has no "memory" nodes under efi.
> > > > X86 takes (b). If we take this, we will need to overwrite a weak
> > > > early_init_dt_add_memory().
> > > 
> > > X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > are over E820 limitation then turn to use setup_data list for remain
> > > entries.
> > 
> > This would imply modifying the EFI memory map or the memory nodes, which
> > I'm not keen on.
> > 
> > I would prefer that they are left _pristine_, and we describe the
> > restriction on the kdump kernel with additional properties under
> > /chosen.
> > 
> > That leaves us with more useful information about the environment of the
> > first kernel, is simpler for userspace (it's resilient to updates to the
> > UEFI memory map spec, for example), and is simple for the crash kernel.
> 
> In theory kexec as boot loader should prepare correct efi memmap and pass
> to kernel, but as you said yes it will increase complexity. We need banlance
> them.

I'd argue that the "correct efi memmap" is what we were given by the
firmware initially -- none of that information is any less true.

For kdump all we need to ensure is that the kdump kernel only uses the
memory that was specially reserved for it by the first kernel. The
simplest way of doing that is to tell the kdump kernel which specific
region(s) of memory were reserved for it, leaving the EFI memory map
alone.

Thanks,
Mark.

Dave Young Jan. 20, 2016, 2:49 a.m. UTC | #16

On 01/19/16 at 02:01pm, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> > On 01/19/16 at 12:51pm, Mark Rutland wrote:
> > > On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > > > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > > >are over E820 limitation then turn to use setup_data list for remain
> > > > > >entries.
> > > > > 
> > > > > Thanks. I will visit x86 code again.
> > > > > 
> > > > > >I think it is X86 specific. Personally I think device tree property is
> > > > > >better.
> > > > > 
> > > > > Do you think so?
> > > > 
> > > > I'm not sure it is the best way. For X86 we run into problem with
> > > > memmap= design, one example is pci domain X (X>1) need the pci memory
> > > > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > > > to 2nd kernel we find that cmdline[] array is not big enough.
> > > 
> > > I'm not sure how PCI ranges relate to the memory map used for normal
> > > memory (i.e. RAM), though I'm probably missing some caveat with the way
> > > ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> > 
> > Here is the old patch which was rejected in kexec-tools:
> > http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> > 
> > > 
> > > If the kernel got the rest of its system topology from DT, the PCI
> > > regions would be described there.
> > 
> > Yes, if kdump kernel use same DT as 1st kernel.
> 
> Other than for testing purposes, I don't see why you'd pass the kdump
> kernel a DTB inconsistent with that the 1st kernel was passsed (other
> than some proerties under /chosen).
> 
> We added /sys/firmware/fdt specifically to allow the kexec tools to get
> the exact DTB the first kernel used. There's no reason for tools to have
> to make something up.

Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
how one will use it unless dropping the option and use /sys/firmware/fdt
unconditionally. 

If we choose to implement kexec_file_load only in kernel, the interfaces
provided are kernel, initrd and cmdline. We can always use same dtb.
 
> 
> > > > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > > > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > > > via UEFI memmap?
> > > 
> > > When booted via EFI, we use the EFI memory map. The EFI stub handles
> > > acquring the relevant information and passing that to the first kernel
> > > in the DTB (see Documentation/arm/uefi.txt).
> > 
> > Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> > instead of memory nodes details.. 
> 
> When booted via EFI, yes.
> 
> For NUMA topology in !ACPI kernels, we might need to also retain and
> parse memory nodes, but only for toplogy information. The kernel would
> still only use memory as described by the EFI memory map.
> 
> There's a horrible edge case I've spotted if performing a chain of
> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> respect the EFI memory map so as to avoid corrupting it for the
> subsequent LE kernel. Other than this I believe everything should just
> work.

Firmware do not know kernel endianniess, kernel should respect firmware
maps and adapt to it, it sounds like a generic issue not specfic to kexec.

> 
> > > A kexec'd kernel should simply inherit that. So long as the DTB and/or
> > > UEFI tables in memory are the same, it would be the same as a cold boot.
> > 
> > For kexec all memory ranges are same, for kdump we need use original reserved
> > range with crashkernel= as usable memory and all other orignal usable ranges
> > are not usable anymore. 
> 
> Sure. This is what I believe we should expose with an additional
> property under /chosen, while keeping everything else pristine.
> 
> The crash kernel can then limit itself to that region, while it would
> have the information of the full memory map (which it could log and/or
> use to drive other dumping).

In this way kernel should be aware it is a kdump booting, it is doable though
I feel it is better for kdump kernel in a black box with infomations it
can use just like the 1st kernel. Things here is where we choose to cook
the memory infomation in boot loader or in kernel itself.

> 
> > Is it possible to modify uefi memmap for kdump case?
> 
> Technically it would be possible, however I don't think it's necessary,
> and I think it would be disadvantageous to do so.
> 
> Describing the range(s) the crash kernel can use in separate properties
> under /chosen has a number of advantages.

Ok, I got the points. We have a is_kdump_kernel() by checking if there is
elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
do not work well in kdump kernel some uncertain reasons. But ideally I
think kernel should handle things just like in 1st kernel and avoid to use
it. 

> 
> > > In the !EFI case, we use the memory nodes in the DTB. Only in this case
> > > could usable-memory properties in memory nodes make sense. I'd prefer a
> > > uniform property under /chosen for both cases.
> > 
> > We stil use same DTB, need to modify the DT and update the usable and unusable
> > nodes for kdump?
> 
> We'd have a (slightly) modified DTB that contained additional properties
> describing the range(s) reserved for use by the crash kernel.
> 
> Other than those properties under /chosen (e.g. the command line, initrd
> pointers if any), it would be the original DTB.
> 
> Thanks,
> Mark.

Thanks
Dave

Dave Young Jan. 20, 2016, 2:54 a.m. UTC | #17

On 01/19/16 at 02:05pm, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 09:52:33PM +0800, Dave Young wrote:
> > On 01/19/16 at 12:17pm, Mark Rutland wrote:
> > > On Tue, Jan 19, 2016 at 09:43:32AM +0800, Dave Young wrote:
> > > > On 01/18/16 at 07:26pm, AKASHI Takahiro wrote:
> > > > > On 01/16/2016 05:16 AM, Mark Rutland wrote:
> > > > > >On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
> > > > > >>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > >>
> > > > > >>This patch adds arch specific descriptions about kdump usage on arm64
> > > > > >>to kdump.txt.
> > > > > >>
> > > > > >>Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > >>---
> > > > > >>  Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
> > > > > >>  1 file changed, 22 insertions(+), 1 deletion(-)
> > > > > >>
> > > > > >>diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> > > > > >>index bc4bd5a..36cf978 100644
> > > > > >>--- a/Documentation/kdump/kdump.txt
> > > > > >>+++ b/Documentation/kdump/kdump.txt
> > > > > >>@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
> > > > > >>  a remote system.
> > > > > >>
> > > > > >>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> > > > > >>-s390x and arm architectures.
> > > > > >>+s390x, arm and arm64 architectures.
> > > > > >>
> > > > > >>  When the system kernel boots, it reserves a small section of memory for
> > > > > >>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> > > > > >>@@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
> > > > > >>
> > > > > >>      AUTO_ZRELADDR=y
> > > > > >>
> > > > > >>+Dump-capture kernel config options (Arch Dependent, arm64)
> > > > > >>+----------------------------------------------------------
> > > > > >>+
> > > > > >>+1) The maximum memory size on the dump-capture kernel must be limited by
> > > > > >>+   specifying:
> > > > > >>+
> > > > > >>+   mem=X[MG]
> > > > > >>+
> > > > > >>+   where X should be less than or equal to the size in "crashkernel="
> > > > > >>+   boot parameter. Kexec-tools will automatically add this.
> > > > > >
> > > > > >
> > > > > >This is extremely fragile, and will trivially fail when the kernel can
> > > > > >be loaded anywhere (see [1]).
> > > > > 
> > > > > As I said before, this restriction also exists on arm, but I understand
> > > > > that recent Ard's patches break it.
> > > > > 
> > > > > >We must explicitly describe the set of regions the crash kernel may use
> > > > > >(i.e. we need base and size). NAK in the absence of that.
> > > > > 
> > > > > There seem to exist several approaches:
> > > > > (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
> > > > >     under "memory" node
> > > > > (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
> > > > > 
> > > > > Power PC takes (a), while this does not work on efi-started kernel
> > > > > because dtb has no "memory" nodes under efi.
> > > > > X86 takes (b). If we take this, we will need to overwrite a weak
> > > > > early_init_dt_add_memory().
> > > > 
> > > > X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > are over E820 limitation then turn to use setup_data list for remain
> > > > entries.
> > > 
> > > This would imply modifying the EFI memory map or the memory nodes, which
> > > I'm not keen on.
> > > 
> > > I would prefer that they are left _pristine_, and we describe the
> > > restriction on the kdump kernel with additional properties under
> > > /chosen.
> > > 
> > > That leaves us with more useful information about the environment of the
> > > first kernel, is simpler for userspace (it's resilient to updates to the
> > > UEFI memory map spec, for example), and is simple for the crash kernel.
> > 
> > In theory kexec as boot loader should prepare correct efi memmap and pass
> > to kernel, but as you said yes it will increase complexity. We need banlance
> > them.
> 
> I'd argue that the "correct efi memmap" is what we were given by the
> firmware initially -- none of that information is any less true.

In X86 boot loader will cook a E820 map for kernel use, there's no such needs
in arm so maybe it is acceptable to use same memmap to avoid modifying it only
for kdump.

I think I will not insist though I like more about doing something in
bootloader instead of in kernel.

> 
> For kdump all we need to ensure is that the kdump kernel only uses the
> memory that was specially reserved for it by the first kernel. The
> simplest way of doing that is to tell the kdump kernel which specific
> region(s) of memory were reserved for it, leaving the EFI memory map
> alone.

Yes, agreed that it is simpler.

> 
> Thanks,
> Mark.

Thanks
Dave

AKASHI Takahiro Jan. 20, 2016, 4:34 a.m. UTC | #18

On 01/19/2016 09:10 PM, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 02:31:05PM +0900, AKASHI Takahiro wrote:
>> On 01/18/2016 08:29 PM, Mark Rutland wrote:
>>> On Mon, Jan 18, 2016 at 07:26:04PM +0900, AKASHI Takahiro wrote:
>>>> On 01/16/2016 05:16 AM, Mark Rutland wrote:
>>>>> On Fri, Jan 15, 2016 at 07:18:38PM +0000, Geoff Levand wrote:
>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>
>>>>>> This patch adds arch specific descriptions about kdump usage on arm64
>>>>>> to kdump.txt.
>>>>>>
>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>> ---
>>>>>>   Documentation/kdump/kdump.txt | 23 ++++++++++++++++++++++-
>>>>>>   1 file changed, 22 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>>>>> index bc4bd5a..36cf978 100644
>>>>>> --- a/Documentation/kdump/kdump.txt
>>>>>> +++ b/Documentation/kdump/kdump.txt
>>>>>> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>>>>>>   a remote system.
>>>>>>
>>>>>>   Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>>>>>> -s390x and arm architectures.
>>>>>> +s390x, arm and arm64 architectures.
>>>>>>
>>>>>>   When the system kernel boots, it reserves a small section of memory for
>>>>>>   the dump-capture kernel. This ensures that ongoing Direct Memory Access
>>>>>> @@ -249,6 +249,20 @@ Dump-capture kernel config options (Arch Dependent, arm)
>>>>>>
>>>>>>       AUTO_ZRELADDR=y
>>>>>>
>>>>>> +Dump-capture kernel config options (Arch Dependent, arm64)
>>>>>> +----------------------------------------------------------
>>>>>> +
>>>>>> +1) The maximum memory size on the dump-capture kernel must be limited by
>>>>>> +   specifying:
>>>>>> +
>>>>>> +   mem=X[MG]
>>>>>> +
>>>>>> +   where X should be less than or equal to the size in "crashkernel="
>>>>>> +   boot parameter. Kexec-tools will automatically add this.
>>>>>
>>>>>
>>>>> This is extremely fragile, and will trivially fail when the kernel can
>>>>> be loaded anywhere (see [1]).
>>>>
>>>> As I said before, this restriction also exists on arm, but I understand
>>>> that recent Ard's patches break it.
>>>>
>>>>> We must explicitly describe the set of regions the crash kernel may use
>>>>> (i.e. we need base and size). NAK in the absence of that.
>>>>
>>>> There seem to exist several approaches:
>>>> (a) use a device-tree property, "linux,usable-memory", in addition to "reg"
>>>
>>> I'm not opposed to the idea of a DT property, though I think that should
>>> live under /chosen.
>>
>> In fact, powerpc uses another property, "linux,crashkernel-base(& size)",
>> under /chosen in order for the *1st kernel* to export info about a memory
>> region for the 2nd(crash dump) kernel to user apps (kexec-tools).
>
> Do you mean that said property is provided _to_ the 1st kernel, or
> provided _by_ the first kernel?

_by_ the 1st kernel.

Based on a kernel parameter, "crashkernel=", the 1st kernel reserve some
memory region at boot time and export its information through this property.
Most architectures other than powerpc, however, use an iomem resource entry,
"Crash kernel", in /proc/iomem instead for this purpose.

>>> I see that "linux,usable-memory" exists already, though I'm confused as
>>> to exactly what it is for as there is no documentation (neither in the
>>> kernel nor in ePAPR).
>>
>> For example,
>>    memory@0x80000000 {
>>      reg = <0x0 0x80000000 0x0 0x80000000>;
>>      linux,usable-memory = <0x0 0x8c000000 0x0 0x4000000>;
>>    }
>> There exists 2GB memory available on the system, but the last 64MB can be
>> used as a system ram. See early_init_dt_scan_memory() in fdt.c.
>
> Sure, except that's the implementation rather than the intended
> semantics (which are not defined).

Yeah, but the code itself was ack'ed (actually committed) by Grant:)

>>> It's also painful to alter multiple memory nodes
>>> to use that, and I can see that going wrong.
>>
>> Yeah, I implemented this feature in my old versions experimentally,
>> but didn't like it as we had to touch all the memory nodes.
>>
>>>>      under "memory" node
>>>> (b) use a kernel's early parameter, "memmap=nn[@#$]ss"
>>>
>>> I'm not too keen on this, as I think it's fragile, and logically
>>> somewhat distinct from what mem= is for (a best effort testing tool).
>>
>> I'm not sure whether it is fragile, and contrary to x86, as Dave
>> described, I think we will only need a single memmap= on arm64 as
>> efi's mem map table is accessible even on the crash kernel.
>
> I just realised I misread this as "mem=", apologies.
>
> It looks like memmap= to force a specific region of memory to be used
> may work.
>
> I'd still err on the side of preferring an explicit property in the DT.

Let's discuss in succeeding replies.

>>>> Power PC takes (a), while this does not work on efi-started kernel
>>>> because dtb has no "memory" nodes under efi.
>>>
>>> A property under /chosen would work for EFI too.
>>>
>>>> X86 takes (b). If we take this, we will need to overwrite a weak
>>>> early_init_dt_add_memory().
>>>> (I thought that this approach was not smart as we have three different
>>>> ways to specify memory regions, dtb, efi and this kernel parameter.)
>>>
>>> I'm not sure that's a big problem. We may be able to make this generic,
>>> also.
>>>
>>> We don't necessarily need a weak add memory function if we can guarantee
>>> nothing gets memblock_alloc'd before we carve it out.
>>>
>>> Something like the nomap stuff Ard put together might be useful here.
>>
>> I'm afraid it doesn't work.
>> It doesn't matter whether it is linearly mapped or not. We should prevent
>> any part of memory regions used by the 1st kernel from being reclaimed
>> by memblock_alloc() and others.
>
> Are you certain that nomap memory can be allocated? That sounds like a
> major bug.

I misunderstood. __next_mem_range() called by mem_alloc stuff has some check.

-Takahiro AKASHI

> Nomap memory should act like reserved memory with the additional
> property that the kernel must not map it implicitly.
>
>> Or do you mean we can introduce another memblock flag?
>
> That wasn't what I meant, but that would be a potential solution.
>
> Thanks,
> Mark.
>

AKASHI Takahiro Jan. 20, 2016, 5:25 a.m. UTC | #19

On 01/19/2016 11:01 PM, Mark Rutland wrote:
> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
>> On 01/19/16 at 12:51pm, Mark Rutland wrote:
>>> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
>>>> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
>>>>> On 01/19/2016 10:43 AM, Dave Young wrote:
>>>>>> X86 takes another way in latest kexec-tools and kexec_file_load, that is
>>>>>> recreating E820 table and pass it to kexec/kdump kernel, if the entries
>>>>>> are over E820 limitation then turn to use setup_data list for remain
>>>>>> entries.
>>>>>
>>>>> Thanks. I will visit x86 code again.
>>>>>
>>>>>> I think it is X86 specific. Personally I think device tree property is
>>>>>> better.
>>>>>
>>>>> Do you think so?
>>>>
>>>> I'm not sure it is the best way. For X86 we run into problem with
>>>> memmap= design, one example is pci domain X (X>1) need the pci memory
>>>> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
>>>> to 2nd kernel we find that cmdline[] array is not big enough.
>>>
>>> I'm not sure how PCI ranges relate to the memory map used for normal
>>> memory (i.e. RAM), though I'm probably missing some caveat with the way
>>> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
>>
>> Here is the old patch which was rejected in kexec-tools:
>> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
>>
>>>
>>> If the kernel got the rest of its system topology from DT, the PCI
>>> regions would be described there.
>>
>> Yes, if kdump kernel use same DT as 1st kernel.
>
> Other than for testing purposes, I don't see why you'd pass the kdump
> kernel a DTB inconsistent with that the 1st kernel was passsed (other
> than some proerties under /chosen).
>
> We added /sys/firmware/fdt specifically to allow the kexec tools to get
> the exact DTB the first kernel used. There's no reason for tools to have
> to make something up.

Currently, arm64 kexec-tools modifies only a cmdline property in dtb
to pass a "elfcorehdr=" parameter as well as other restrictions (like maxcpus=1).

>>>> Do you think for arm64 only usable memory is necessary to let kdump kernel
>>>> know? I'm curious about how arm64 kernel get all memory layout from boot loader,
>>>> via UEFI memmap?
>>>
>>> When booted via EFI, we use the EFI memory map. The EFI stub handles
>>> acquring the relevant information and passing that to the first kernel
>>> in the DTB (see Documentation/arm/uefi.txt).
>>
>> Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
>> instead of memory nodes details..
>
> When booted via EFI, yes.
>
> For NUMA topology in !ACPI kernels, we might need to also retain and
> parse memory nodes, but only for toplogy information. The kernel would
> still only use memory as described by the EFI memory map.
>
> There's a horrible edge case I've spotted if performing a chain of
> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> respect the EFI memory map so as to avoid corrupting it for the
> subsequent LE kernel. Other than this I believe everything should just
> work.

BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
(as in the case of LE -> LE) and require users to provide a dtb file explicitly.

For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)

>>> A kexec'd kernel should simply inherit that. So long as the DTB and/or
>>> UEFI tables in memory are the same, it would be the same as a cold boot.
>>
>> For kexec all memory ranges are same, for kdump we need use original reserved
>> range with crashkernel= as usable memory and all other orignal usable ranges
>> are not usable anymore.
>
> Sure. This is what I believe we should expose with an additional
> property under /chosen, while keeping everything else pristine.
>
> The crash kernel can then limit itself to that region, while it would
> have the information of the full memory map (which it could log and/or
> use to drive other dumping).

FYI,
all the original usable memory regions used by the 1st kernel are also
described in an ELF core header specified by "elfcorehdr=" parameter to
the crash dump kernel.

-Takahiro AKASHI

>> Is it possible to modify uefi memmap for kdump case?
>
> Technically it would be possible, however I don't think it's necessary,
> and I think it would be disadvantageous to do so.
>
> Describing the range(s) the crash kernel can use in separate properties
> under /chosen has a number of advantages.
>
>>> In the !EFI case, we use the memory nodes in the DTB. Only in this case
>>> could usable-memory properties in memory nodes make sense. I'd prefer a
>>> uniform property under /chosen for both cases.
>>
>> We stil use same DTB, need to modify the DT and update the usable and unusable
>> nodes for kdump?
>
> We'd have a (slightly) modified DTB that contained additional properties
> describing the range(s) reserved for use by the crash kernel.
>
> Other than those properties under /chosen (e.g. the command line, initrd
> pointers if any), it would be the original DTB.
>
> Thanks,
> Mark.
>

AKASHI Takahiro Jan. 20, 2016, 6:07 a.m. UTC | #20

On 01/20/2016 11:49 AM, Dave Young wrote:
> On 01/19/16 at 02:01pm, Mark Rutland wrote:
>> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
>>> On 01/19/16 at 12:51pm, Mark Rutland wrote:
>>>> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
>>>>> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
>>>>>> On 01/19/2016 10:43 AM, Dave Young wrote:
>>>>>>> X86 takes another way in latest kexec-tools and kexec_file_load, that is
>>>>>>> recreating E820 table and pass it to kexec/kdump kernel, if the entries
>>>>>>> are over E820 limitation then turn to use setup_data list for remain
>>>>>>> entries.
>>>>>>
>>>>>> Thanks. I will visit x86 code again.
>>>>>>
>>>>>>> I think it is X86 specific. Personally I think device tree property is
>>>>>>> better.
>>>>>>
>>>>>> Do you think so?
>>>>>
>>>>> I'm not sure it is the best way. For X86 we run into problem with
>>>>> memmap= design, one example is pci domain X (X>1) need the pci memory
>>>>> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
>>>>> to 2nd kernel we find that cmdline[] array is not big enough.
>>>>
>>>> I'm not sure how PCI ranges relate to the memory map used for normal
>>>> memory (i.e. RAM), though I'm probably missing some caveat with the way
>>>> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
>>>
>>> Here is the old patch which was rejected in kexec-tools:
>>> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
>>>
>>>>
>>>> If the kernel got the rest of its system topology from DT, the PCI
>>>> regions would be described there.
>>>
>>> Yes, if kdump kernel use same DT as 1st kernel.
>>
>> Other than for testing purposes, I don't see why you'd pass the kdump
>> kernel a DTB inconsistent with that the 1st kernel was passsed (other
>> than some proerties under /chosen).
>>
>> We added /sys/firmware/fdt specifically to allow the kexec tools to get
>> the exact DTB the first kernel used. There's no reason for tools to have
>> to make something up.
>
> Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> how one will use it unless dropping the option and use /sys/firmware/fdt
> unconditionally.

As a matter of fact, specifying proper command line parameters as well as
dtb is partly users' responsibility for kdump to work correctly.
(especially for BE kernel)

> If we choose to implement kexec_file_load only in kernel, the interfaces
> provided are kernel, initrd and cmdline. We can always use same dtb.

I would say that we can always use the same dtb even with kexec_load
from user's perspective. Right?
(The difference is whether changes are made by kernel itself or kexec-tools.)

>>
>>>>> Do you think for arm64 only usable memory is necessary to let kdump kernel
>>>>> know? I'm curious about how arm64 kernel get all memory layout from boot loader,
>>>>> via UEFI memmap?
>>>>
>>>> When booted via EFI, we use the EFI memory map. The EFI stub handles
>>>> acquring the relevant information and passing that to the first kernel
>>>> in the DTB (see Documentation/arm/uefi.txt).
>>>
>>> Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
>>> instead of memory nodes details..
>>
>> When booted via EFI, yes.
>>
>> For NUMA topology in !ACPI kernels, we might need to also retain and
>> parse memory nodes, but only for toplogy information. The kernel would
>> still only use memory as described by the EFI memory map.
>>
>> There's a horrible edge case I've spotted if performing a chain of
>> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>> respect the EFI memory map so as to avoid corrupting it for the
>> subsequent LE kernel. Other than this I believe everything should just
>> work.
>
> Firmware do not know kernel endianniess, kernel should respect firmware
> maps and adapt to it, it sounds like a generic issue not specfic to kexec.

On arm64, a kernel image header has a bit field to specify the image's endianness.
Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.

>>
>>>> A kexec'd kernel should simply inherit that. So long as the DTB and/or
>>>> UEFI tables in memory are the same, it would be the same as a cold boot.
>>>
>>> For kexec all memory ranges are same, for kdump we need use original reserved
>>> range with crashkernel= as usable memory and all other orignal usable ranges
>>> are not usable anymore.
>>
>> Sure. This is what I believe we should expose with an additional
>> property under /chosen, while keeping everything else pristine.
>>
>> The crash kernel can then limit itself to that region, while it would
>> have the information of the full memory map (which it could log and/or
>> use to drive other dumping).
>
> In this way kernel should be aware it is a kdump booting, it is doable though
> I feel it is better for kdump kernel in a black box with infomations it
> can use just like the 1st kernel. Things here is where we choose to cook
> the memory infomation in boot loader or in kernel itself.
>
>>
>>> Is it possible to modify uefi memmap for kdump case?
>>
>> Technically it would be possible, however I don't think it's necessary,
>> and I think it would be disadvantageous to do so.
>>
>> Describing the range(s) the crash kernel can use in separate properties
>> under /chosen has a number of advantages.
>
> Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> do not work well in kdump kernel some uncertain reasons. But ideally I
> think kernel should handle things just like in 1st kernel and avoid to use
> it.

So I'm not still sure about what are advantages of a property under /chosen
over "memmap=" kernel parameter.
Both are simple and can have the same effect with minimizing changes to dtb.
(But if, in the latter case, we have to provide *all* the memory-related information
through "memmap=" parameters, it would be much complicated.)

-Takahiro AKASHI

>>
>>>> In the !EFI case, we use the memory nodes in the DTB. Only in this case
>>>> could usable-memory properties in memory nodes make sense. I'd prefer a
>>>> uniform property under /chosen for both cases.
>>>
>>> We stil use same DTB, need to modify the DT and update the usable and unusable
>>> nodes for kdump?
>>
>> We'd have a (slightly) modified DTB that contained additional properties
>> describing the range(s) reserved for use by the crash kernel.
>>
>> Other than those properties under /chosen (e.g. the command line, initrd
>> pointers if any), it would be the original DTB.
>>
>> Thanks,
>> Mark.
>
> Thanks
> Dave
>

Dave Young Jan. 20, 2016, 6:38 a.m. UTC | #21

On 01/20/16 at 03:07pm, AKASHI Takahiro wrote:
> On 01/20/2016 11:49 AM, Dave Young wrote:
> >On 01/19/16 at 02:01pm, Mark Rutland wrote:
> >>On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> >>>On 01/19/16 at 12:51pm, Mark Rutland wrote:
> >>>>On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> >>>>>On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> >>>>>>On 01/19/2016 10:43 AM, Dave Young wrote:
> >>>>>>>X86 takes another way in latest kexec-tools and kexec_file_load, that is
> >>>>>>>recreating E820 table and pass it to kexec/kdump kernel, if the entries
> >>>>>>>are over E820 limitation then turn to use setup_data list for remain
> >>>>>>>entries.
> >>>>>>
> >>>>>>Thanks. I will visit x86 code again.
> >>>>>>
> >>>>>>>I think it is X86 specific. Personally I think device tree property is
> >>>>>>>better.
> >>>>>>
> >>>>>>Do you think so?
> >>>>>
> >>>>>I'm not sure it is the best way. For X86 we run into problem with
> >>>>>memmap= design, one example is pci domain X (X>1) need the pci memory
> >>>>>ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> >>>>>to 2nd kernel we find that cmdline[] array is not big enough.
> >>>>
> >>>>I'm not sure how PCI ranges relate to the memory map used for normal
> >>>>memory (i.e. RAM), though I'm probably missing some caveat with the way
> >>>>ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> >>>
> >>>Here is the old patch which was rejected in kexec-tools:
> >>>http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> >>>
> >>>>
> >>>>If the kernel got the rest of its system topology from DT, the PCI
> >>>>regions would be described there.
> >>>
> >>>Yes, if kdump kernel use same DT as 1st kernel.
> >>
> >>Other than for testing purposes, I don't see why you'd pass the kdump
> >>kernel a DTB inconsistent with that the 1st kernel was passsed (other
> >>than some proerties under /chosen).
> >>
> >>We added /sys/firmware/fdt specifically to allow the kexec tools to get
> >>the exact DTB the first kernel used. There's no reason for tools to have
> >>to make something up.
> >
> >Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> >how one will use it unless dropping the option and use /sys/firmware/fdt
> >unconditionally.
> 
> As a matter of fact, specifying proper command line parameters as well as
> dtb is partly users' responsibility for kdump to work correctly.
> (especially for BE kernel)

Right.

> 
> >If we choose to implement kexec_file_load only in kernel, the interfaces
> >provided are kernel, initrd and cmdline. We can always use same dtb.
> 
> I would say that we can always use the same dtb even with kexec_load
> from user's perspective. Right?
> (The difference is whether changes are made by kernel itself or kexec-tools.)

Right.

> 
> >>
> >>>>>Do you think for arm64 only usable memory is necessary to let kdump kernel
> >>>>>know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> >>>>>via UEFI memmap?
> >>>>
> >>>>When booted via EFI, we use the EFI memory map. The EFI stub handles
> >>>>acquring the relevant information and passing that to the first kernel
> >>>>in the DTB (see Documentation/arm/uefi.txt).
> >>>
> >>>Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> >>>instead of memory nodes details..
> >>
> >>When booted via EFI, yes.
> >>
> >>For NUMA topology in !ACPI kernels, we might need to also retain and
> >>parse memory nodes, but only for toplogy information. The kernel would
> >>still only use memory as described by the EFI memory map.
> >>
> >>There's a horrible edge case I've spotted if performing a chain of
> >>cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >>respect the EFI memory map so as to avoid corrupting it for the
> >>subsequent LE kernel. Other than this I believe everything should just
> >>work.
> >
> >Firmware do not know kernel endianniess, kernel should respect firmware
> >maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> 
> On arm64, a kernel image header has a bit field to specify the image's endianness.
> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.

Ok, I means uefi memmap are same, not specific to LE or BE.

> 
> >>
> >>>>A kexec'd kernel should simply inherit that. So long as the DTB and/or
> >>>>UEFI tables in memory are the same, it would be the same as a cold boot.
> >>>
> >>>For kexec all memory ranges are same, for kdump we need use original reserved
> >>>range with crashkernel= as usable memory and all other orignal usable ranges
> >>>are not usable anymore.
> >>
> >>Sure. This is what I believe we should expose with an additional
> >>property under /chosen, while keeping everything else pristine.
> >>
> >>The crash kernel can then limit itself to that region, while it would
> >>have the information of the full memory map (which it could log and/or
> >>use to drive other dumping).
> >
> >In this way kernel should be aware it is a kdump booting, it is doable though
> >I feel it is better for kdump kernel in a black box with infomations it
> >can use just like the 1st kernel. Things here is where we choose to cook
> >the memory infomation in boot loader or in kernel itself.
> >
> >>
> >>>Is it possible to modify uefi memmap for kdump case?
> >>
> >>Technically it would be possible, however I don't think it's necessary,
> >>and I think it would be disadvantageous to do so.
> >>
> >>Describing the range(s) the crash kernel can use in separate properties
> >>under /chosen has a number of advantages.
> >
> >Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> >elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> >do not work well in kdump kernel some uncertain reasons. But ideally I
> >think kernel should handle things just like in 1st kernel and avoid to use
> >it.
> 
> So I'm not still sure about what are advantages of a property under /chosen
> over "memmap=" kernel parameter.
> Both are simple and can have the same effect with minimizing changes to dtb.
> (But if, in the latter case, we have to provide *all* the memory-related information
> through "memmap=" parameters, it would be much complicated.)

Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
or uefi-memmap so that we do not need any extra kernel cmdline.

For x86 we would like to drop the memmap= usage in kexec-tools but we can not
do that for a compatibility problem about calgary iommu. So that currently
kexec-tools supports both recreating E820 maps and passing memmap=.

We should think it carefully because it will be hard to remove once we support it.
IMO handling it in code is better than using an external interface.

> 
> -Takahiro AKASHI
> 
> >>
> >>>>In the !EFI case, we use the memory nodes in the DTB. Only in this case
> >>>>could usable-memory properties in memory nodes make sense. I'd prefer a
> >>>>uniform property under /chosen for both cases.
> >>>
> >>>We stil use same DTB, need to modify the DT and update the usable and unusable
> >>>nodes for kdump?
> >>
> >>We'd have a (slightly) modified DTB that contained additional properties
> >>describing the range(s) reserved for use by the crash kernel.
> >>
> >>Other than those properties under /chosen (e.g. the command line, initrd
> >>pointers if any), it would be the original DTB.
> >>
> >>Thanks,
> >>Mark.
> >
> >Thanks
> >Dave
> >

Thanks
Dave

Dave Young Jan. 20, 2016, 7 a.m. UTC | #22

> > So I'm not still sure about what are advantages of a property under /chosen
> > over "memmap=" kernel parameter.
> > Both are simple and can have the same effect with minimizing changes to dtb.
> > (But if, in the latter case, we have to provide *all* the memory-related information
> > through "memmap=" parameters, it would be much complicated.)
> 
> Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
> or uefi-memmap so that we do not need any extra kernel cmdline.
> 
> For x86 we would like to drop the memmap= usage in kexec-tools but we can not
> do that for a compatibility problem about calgary iommu. So that currently
> kexec-tools supports both recreating E820 maps and passing memmap=.
> 
> We should think it carefully because it will be hard to remove once we support it.
> IMO handling it in code is better than using an external interface.
> 

Also seems semantic of memmap=exactmap is different than current use in the implementation
exactmap means we need pass each range seperately including reserved, acpi and other types
We can not reuse ranges in uefi memmap for other than usable memory.

It will also have the cmdline array size issue.

Thanks
Dave

AKASHI Takahiro Jan. 20, 2016, 8:01 a.m. UTC | #23

On 01/20/2016 04:00 PM, Dave Young wrote:
>>> So I'm not still sure about what are advantages of a property under /chosen
>>> over "memmap=" kernel parameter.
>>> Both are simple and can have the same effect with minimizing changes to dtb.
>>> (But if, in the latter case, we have to provide *all* the memory-related information
>>> through "memmap=" parameters, it would be much complicated.)
>>
>> Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
>> or uefi-memmap so that we do not need any extra kernel cmdline.

Yes, I understand.
But on arm64, kexec-tools can generate a "memmap=" parameter for crash kernel's
memory region without any user's interaction.
(please note that this parameter eventually goes into dtb's cmdline property in
/chosen.)

In this sense, it is no different from an extra property under /chosen
as kexec-tools can also add it to dtb passed to the crash dump kernel.

(See what I mean?)

>> For x86 we would like to drop the memmap= usage in kexec-tools

I didn't know that :)

>> but we can not
>> do that for a compatibility problem about calgary iommu. So that currently
>> kexec-tools supports both recreating E820 maps and passing memmap=.
>>
>> We should think it carefully because it will be hard to remove once we support it.

Absolutely.

>> IMO handling it in code is better than using an external interface.
>
> Also seems semantic of memmap=exactmap is different than current use in the implementation
> exactmap means we need pass each range seperately including reserved, acpi and other types
> We can not reuse ranges in uefi memmap for other than usable memory.


If necessary, we may use a different name, say, "usablememmap=" for arm64
or just extend "mem=" semantics (allowing XX@YY format) to avoid any confusion.

Thanks,
-Takahiro AKASHI

> It will also have the cmdline array size issue.k
 >
> Thanks
> Dave
>

Dave Young Jan. 20, 2016, 8:26 a.m. UTC | #24

On 01/20/16 at 05:01pm, AKASHI Takahiro wrote:
> On 01/20/2016 04:00 PM, Dave Young wrote:
> >>>So I'm not still sure about what are advantages of a property under /chosen
> >>>over "memmap=" kernel parameter.
> >>>Both are simple and can have the same effect with minimizing changes to dtb.
> >>>(But if, in the latter case, we have to provide *all* the memory-related information
> >>>through "memmap=" parameters, it would be much complicated.)
> >>
> >>Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
> >>or uefi-memmap so that we do not need any extra kernel cmdline.
> 
> Yes, I understand.
> But on arm64, kexec-tools can generate a "memmap=" parameter for crash kernel's
> memory region without any user's interaction.
> (please note that this parameter eventually goes into dtb's cmdline property in
> /chosen.)
> 
> In this sense, it is no different from an extra property under /chosen
> as kexec-tools can also add it to dtb passed to the crash dump kernel.
> 
> (See what I mean?)

I think I understand your points, what I would prefer is not an extra property
but modifying uefi memmap or recreating memory nodes for !EFI to be used in kdump kernel.

> 
> >>For x86 we would like to drop the memmap= usage in kexec-tools
> 
> I didn't know that :)
> 
> >>but we can not
> >>do that for a compatibility problem about calgary iommu. So that currently
> >>kexec-tools supports both recreating E820 maps and passing memmap=.
> >>
> >>We should think it carefully because it will be hard to remove once we support it.
> 
> Absolutely.
> 
> >>IMO handling it in code is better than using an external interface.
> >
> >Also seems semantic of memmap=exactmap is different than current use in the implementation
> >exactmap means we need pass each range seperately including reserved, acpi and other types
> >We can not reuse ranges in uefi memmap for other than usable memory.
> 
> 
> If necessary, we may use a different name, say, "usablememmap=" for arm64
> or just extend "mem=" semantics (allowing XX@YY format) to avoid any confusion.

For either of above what is the 1st kernel behavior with these params?

> 
> Thanks,
> -Takahiro AKASHI
> 
> >It will also have the cmdline array size issue.k
> >
> >Thanks
> >Dave
> >

Mark Rutland Jan. 20, 2016, 11:28 a.m. UTC | #25

On Wed, Jan 20, 2016 at 10:49:46AM +0800, Dave Young wrote:
> On 01/19/16 at 02:01pm, Mark Rutland wrote:
> > On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> > > On 01/19/16 at 12:51pm, Mark Rutland wrote:
> > > > On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > > > > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > > > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > > > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > > > >are over E820 limitation then turn to use setup_data list for remain
> > > > > > >entries.
> > > > > > 
> > > > > > Thanks. I will visit x86 code again.
> > > > > > 
> > > > > > >I think it is X86 specific. Personally I think device tree property is
> > > > > > >better.
> > > > > > 
> > > > > > Do you think so?
> > > > > 
> > > > > I'm not sure it is the best way. For X86 we run into problem with
> > > > > memmap= design, one example is pci domain X (X>1) need the pci memory
> > > > > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > > > > to 2nd kernel we find that cmdline[] array is not big enough.
> > > > 
> > > > I'm not sure how PCI ranges relate to the memory map used for normal
> > > > memory (i.e. RAM), though I'm probably missing some caveat with the way
> > > > ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> > > 
> > > Here is the old patch which was rejected in kexec-tools:
> > > http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> > > 
> > > > 
> > > > If the kernel got the rest of its system topology from DT, the PCI
> > > > regions would be described there.
> > > 
> > > Yes, if kdump kernel use same DT as 1st kernel.
> > 
> > Other than for testing purposes, I don't see why you'd pass the kdump
> > kernel a DTB inconsistent with that the 1st kernel was passsed (other
> > than some proerties under /chosen).
> > 
> > We added /sys/firmware/fdt specifically to allow the kexec tools to get
> > the exact DTB the first kernel used. There's no reason for tools to have
> > to make something up.
> 
> Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> how one will use it unless dropping the option and use /sys/firmware/fdt
> unconditionally. 

I think this is a tangential discussion. I think it's fine to say that
for kdump we do not expect this -- a user would be shooting themselves
in the foot if they did. Regardless, I was under the impression that
kdump was usually set up by distribution-provided init code.

or kdump, which typically is set up automatically by the OS, 

> If we choose to implement kexec_file_load only in kernel, the interfaces
> provided are kernel, initrd and cmdline. We can always use same dtb.

There are use-cases where being in complete control of the purgatory
code is necessary. For example, the next OS might not be Linux (and
might not accept a DTB, or have different requirements on the initial
register state).

Regardless of the need for something like kexec_file_load for kdump in
Secure Boot environments, there is also a need for kexec_load with the
user having complete control.

> > > > > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > > > > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > > > > via UEFI memmap?
> > > > 
> > > > When booted via EFI, we use the EFI memory map. The EFI stub handles
> > > > acquring the relevant information and passing that to the first kernel
> > > > in the DTB (see Documentation/arm/uefi.txt).
> > > 
> > > Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> > > instead of memory nodes details.. 
> > 
> > When booted via EFI, yes.
> > 
> > For NUMA topology in !ACPI kernels, we might need to also retain and
> > parse memory nodes, but only for toplogy information. The kernel would
> > still only use memory as described by the EFI memory map.
> > 
> > There's a horrible edge case I've spotted if performing a chain of
> > cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> > respect the EFI memory map so as to avoid corrupting it for the
> > subsequent LE kernel. Other than this I believe everything should just
> > work.
> 
> Firmware do not know kernel endianniess, kernel should respect firmware
> maps and adapt to it, it sounds like a generic issue not specfic to kexec.

I agree that this isn't kexec's fault as such, but in the absence of
kexec, the above issue does not exist, so one can't consider it in
isolation.

> > > > A kexec'd kernel should simply inherit that. So long as the DTB and/or
> > > > UEFI tables in memory are the same, it would be the same as a cold boot.
> > > 
> > > For kexec all memory ranges are same, for kdump we need use original reserved
> > > range with crashkernel= as usable memory and all other orignal usable ranges
> > > are not usable anymore. 
> > 
> > Sure. This is what I believe we should expose with an additional
> > property under /chosen, while keeping everything else pristine.
> > 
> > The crash kernel can then limit itself to that region, while it would
> > have the information of the full memory map (which it could log and/or
> > use to drive other dumping).
> 
> In this way kernel should be aware it is a kdump booting, it is doable though
> I feel it is better for kdump kernel in a black box with infomations it
> can use just like the 1st kernel. Things here is where we choose to cook
> the memory infomation in boot loader or in kernel itself.

Sorry, I can't follow what you are trying to say here. Could you
elaborate?

> > > Is it possible to modify uefi memmap for kdump case?
> > 
> > Technically it would be possible, however I don't think it's necessary,
> > and I think it would be disadvantageous to do so.
> > 
> > Describing the range(s) the crash kernel can use in separate properties
> > under /chosen has a number of advantages.
> 
> Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> do not work well in kdump kernel some uncertain reasons. But ideally I
> think kernel should handle things just like in 1st kernel and avoid to use
> it. 

I agree that we should not have kexec/kdump knowledge spread throughout
the kernel, and that the boot protocol should be uniform with a cold
boot as far as possible.

However, requiring userspace or the first kernel to modify
firmware-provided information has a number of risks and reduces the
amount of information available to the kdump kernel. To that end I am
opposed to modifying the memory nodes in the DTB, or to modifying the
EFI memory map.

Having a property in the DTB describing the range(s) of memory reserved
for use by the kdump kernel is vastly simpler, and avoids those risks:

* It requires a tiny amount of self-contained code in the kdump kernel
  to parse the property and apply the constraints imposed (i.e. carve up
  memblock).

  This is easy to contain in a single function (or at least within a
  single file), and need not affect drivers or other code.

* It is uniform regardless of whether the EFI memory map, DT memory
  nodes, or some other mechanism is used to discover memory in the
  systems.

  This makes it easy to impose the restrictions consistently, and is
  somewhat future-proof.

* Userspace or the first kernel to not need to parse and modify an
  arbitrary amount of data (which might be in an extended format it
  doesn't fully understand). There is less risk for this to go wrong.

  It is far easier to add a property than it is to correctly modify the
  EFI memory map, memory nodes, or some other data structure. There is
  less risk, and it is somewhat future-proof.

* The original memory map information is preserved, even though unused.
  This may be useful for debugging, and it may turn out that the kdump
  kernel needs to know about certain portions of the original memory
  map, even if we are not currently aware of why we would need this.

Thanks,
Mark.

Mark Rutland Jan. 20, 2016, 11:49 a.m. UTC | #26

On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
> On 01/20/2016 11:49 AM, Dave Young wrote:
> >On 01/19/16 at 02:01pm, Mark Rutland wrote:
> >>On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> >>>On 01/19/16 at 12:51pm, Mark Rutland wrote:
> >>>>On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> >>>>>On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> >>>>>>On 01/19/2016 10:43 AM, Dave Young wrote:
> >>>>>>>X86 takes another way in latest kexec-tools and kexec_file_load, that is
> >>>>>>>recreating E820 table and pass it to kexec/kdump kernel, if the entries
> >>>>>>>are over E820 limitation then turn to use setup_data list for remain
> >>>>>>>entries.
> >>>>>>
> >>>>>>Thanks. I will visit x86 code again.
> >>>>>>
> >>>>>>>I think it is X86 specific. Personally I think device tree property is
> >>>>>>>better.
> >>>>>>
> >>>>>>Do you think so?
> >>>>>
> >>>>>I'm not sure it is the best way. For X86 we run into problem with
> >>>>>memmap= design, one example is pci domain X (X>1) need the pci memory
> >>>>>ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> >>>>>to 2nd kernel we find that cmdline[] array is not big enough.
> >>>>
> >>>>I'm not sure how PCI ranges relate to the memory map used for normal
> >>>>memory (i.e. RAM), though I'm probably missing some caveat with the way
> >>>>ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> >>>
> >>>Here is the old patch which was rejected in kexec-tools:
> >>>http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> >>>
> >>>>
> >>>>If the kernel got the rest of its system topology from DT, the PCI
> >>>>regions would be described there.
> >>>
> >>>Yes, if kdump kernel use same DT as 1st kernel.
> >>
> >>Other than for testing purposes, I don't see why you'd pass the kdump
> >>kernel a DTB inconsistent with that the 1st kernel was passsed (other
> >>than some proerties under /chosen).
> >>
> >>We added /sys/firmware/fdt specifically to allow the kexec tools to get
> >>the exact DTB the first kernel used. There's no reason for tools to have
> >>to make something up.
> >
> >Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> >how one will use it unless dropping the option and use /sys/firmware/fdt
> >unconditionally.
> 
> As a matter of fact, specifying proper command line parameters as well as
> dtb is partly users' responsibility for kdump to work correctly.
> (especially for BE kernel)
> 
> >If we choose to implement kexec_file_load only in kernel, the interfaces
> >provided are kernel, initrd and cmdline. We can always use same dtb.
> 
> I would say that we can always use the same dtb even with kexec_load
> from user's perspective. Right?

No.

This breaks using kexec for boot-loader purposes, and imposes a policy.

For better or worse kexec_file_load has always imposed a constrained
Linux-only policy, so that's a different story.

> >>There's a horrible edge case I've spotted if performing a chain of
> >>cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >>respect the EFI memory map so as to avoid corrupting it for the
> >>subsequent LE kernel. Other than this I believe everything should just
> >>work.
> >
> >Firmware do not know kernel endianniess, kernel should respect firmware
> >maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> 
> On arm64, a kernel image header has a bit field to specify the image's endianness.
> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.

The firmware should _never_ care about the kernel's endianness. The
bootlaoder or first kernel shouldn't care about the next kernel's
endianness apart from in exceptional circumstances. The DTB for a LE
kernel should look identical to that passed to a BE kernel.

In my mind, the only valid reason to look at that bit is so that
bootloaders can provide a warning if the CPU does not implement that
endianness.

The issue I mention above should be solved by changes to the BE kernel.

> >>>Is it possible to modify uefi memmap for kdump case?
> >>
> >>Technically it would be possible, however I don't think it's necessary,
> >>and I think it would be disadvantageous to do so.
> >>
> >>Describing the range(s) the crash kernel can use in separate properties
> >>under /chosen has a number of advantages.
> >
> >Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> >elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> >do not work well in kdump kernel some uncertain reasons. But ideally I
> >think kernel should handle things just like in 1st kernel and avoid to use
> >it.
> 
> So I'm not still sure about what are advantages of a property under /chosen
> over "memmap=" kernel parameter.
> Both are simple and can have the same effect with minimizing changes to dtb.
> (But if, in the latter case, we have to provide *all* the memory-related information
> through "memmap=" parameters, it would be much complicated.)

The reason I prefer a property over command line additions include:

* It keeps the command line simple (as you mention the opposite is
  "complicated").

* It is logically separate from options the user may pass to the kernel
  in that the restricted region(s) of memory avaialble are effectively
  properties of the system (in that the crashed OS is part of the system
  state).

* The semantics of the command line parsing can change subtly over time
  (for example, see 51e158c12aca3c9a, which terminates command line
  parseing at "--"). Maknig sure that a command line option will
  actually be parsed by the next kernel is not trivial.
  
  Keeping this information isolated from the command line is more
  robust.

* Addition of a property is a self-contained operation, that doesn't
  require any knowledge about the command line.

Thanks,
Mark.

Mark Rutland Jan. 20, 2016, 11:54 a.m. UTC | #27

On Wed, Jan 20, 2016 at 02:38:56PM +0800, Dave Young wrote:
> Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
> or uefi-memmap so that we do not need any extra kernel cmdline.

I am strongly opposed to modifying the FW-provided memroy map
information, for the reasons I expressed in other replies.

What are your concerns with a property under /chosen?

> For x86 we would like to drop the memmap= usage in kexec-tools but we can not
> do that for a compatibility problem about calgary iommu. So that currently
> kexec-tools supports both recreating E820 maps and passing memmap=.
> 
> We should think it carefully because it will be hard to remove once we support it.

I agree that we don't want a plethora of solutions that we have to
support forever.

> IMO handling it in code is better than using an external interface.

I'm not sure what you mean by this. What is the "external interface",
and which code do you beleive it is better to handle this in?

Thanks,
Mark.

Mark Rutland Jan. 20, 2016, 12:02 p.m. UTC | #28

On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
> On 01/19/2016 11:01 PM, Mark Rutland wrote:
> >For NUMA topology in !ACPI kernels, we might need to also retain and
> >parse memory nodes, but only for toplogy information. The kernel would
> >still only use memory as described by the EFI memory map.
> >
> >There's a horrible edge case I've spotted if performing a chain of
> >cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >respect the EFI memory map so as to avoid corrupting it for the
> >subsequent LE kernel. Other than this I believe everything should just
> >work.
> 
> BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
> for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
> (as in the case of LE -> LE) and require users to provide a dtb file explicitly.

As I mentioned above, the problem exists when memory nodes also exist
(for describing NUMA topology). In that case the BE kernel would try to
use the information from the memory nodes.

> For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
> and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)

See above. The problem I imagine is:

LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes

    v       kexec

BE kernel - uses DT memory nodes
          - clobbers EFI runtime regions as it sees them as available

    v       kexec

LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
          - tries to call EFI runtime services, and explodes.

> >>>A kexec'd kernel should simply inherit that. So long as the DTB and/or
> >>>UEFI tables in memory are the same, it would be the same as a cold boot.
> >>
> >>For kexec all memory ranges are same, for kdump we need use original reserved
> >>range with crashkernel= as usable memory and all other orignal usable ranges
> >>are not usable anymore.
> >
> >Sure. This is what I believe we should expose with an additional
> >property under /chosen, while keeping everything else pristine.
> >
> >The crash kernel can then limit itself to that region, while it would
> >have the information of the full memory map (which it could log and/or
> >use to drive other dumping).
> 
> FYI,
> all the original usable memory regions used by the 1st kernel are also
> described in an ELF core header specified by "elfcorehdr=" parameter to
> the crash dump kernel.

That only describes what the first kernel parsed and thus believed, not
exactly what the firmware described.

Thanks,
Mark.

Mark Rutland Jan. 20, 2016, 12:36 p.m. UTC | #29

Ard, Ganapatrao, the below is something we need to consider for the
combination of the NUMA & kexec approaches. It only becomes a problem
if/when we preserve DT memory nodes in the presence of EFI, though it
would be nice to not box ourselves into a corner.

On Wed, Jan 20, 2016 at 12:02:58PM +0000, Mark Rutland wrote:
> On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
> > On 01/19/2016 11:01 PM, Mark Rutland wrote:
> > >For NUMA topology in !ACPI kernels, we might need to also retain and
> > >parse memory nodes, but only for toplogy information. The kernel would
> > >still only use memory as described by the EFI memory map.
> > >
> > >There's a horrible edge case I've spotted if performing a chain of
> > >cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> > >respect the EFI memory map so as to avoid corrupting it for the
> > >subsequent LE kernel. Other than this I believe everything should just
> > >work.
> > 
> > BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
> > for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
> > (as in the case of LE -> LE) and require users to provide a dtb file explicitly.
> 
> As I mentioned above, the problem exists when memory nodes also exist
> (for describing NUMA topology). In that case the BE kernel would try to
> use the information from the memory nodes.
> 
> > For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
> > and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)
> 
> See above. The problem I imagine is:
> 
> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
> 
>     v       kexec
> 
> BE kernel - uses DT memory nodes
>           - clobbers EFI runtime regions as it sees them as available
> 
>     v       kexec
> 
> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
>           - tries to call EFI runtime services, and explodes.

I'm not really sure what the best approach is here, but I thought that
it would be good to raise awareness of the edge-case.

Mark.

Ard Biesheuvel Jan. 20, 2016, 2:59 p.m. UTC | #30

On 20 January 2016 at 13:36, Mark Rutland <mark.rutland@arm.com> wrote:
> Ard, Ganapatrao, the below is something we need to consider for the
> combination of the NUMA & kexec approaches. It only becomes a problem
> if/when we preserve DT memory nodes in the presence of EFI, though it
> would be nice to not box ourselves into a corner.
>
> On Wed, Jan 20, 2016 at 12:02:58PM +0000, Mark Rutland wrote:
>> On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
>> > On 01/19/2016 11:01 PM, Mark Rutland wrote:
>> > >For NUMA topology in !ACPI kernels, we might need to also retain and
>> > >parse memory nodes, but only for toplogy information. The kernel would
>> > >still only use memory as described by the EFI memory map.
>> > >
>> > >There's a horrible edge case I've spotted if performing a chain of
>> > >cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>> > >respect the EFI memory map so as to avoid corrupting it for the
>> > >subsequent LE kernel. Other than this I believe everything should just
>> > >work.
>> >
>> > BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
>> > for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
>> > (as in the case of LE -> LE) and require users to provide a dtb file explicitly.
>>
>> As I mentioned above, the problem exists when memory nodes also exist
>> (for describing NUMA topology). In that case the BE kernel would try to
>> use the information from the memory nodes.
>>
>> > For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
>> > and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)
>>
>> See above. The problem I imagine is:
>>
>> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
>>
>>     v       kexec
>>
>> BE kernel - uses DT memory nodes
>>           - clobbers EFI runtime regions as it sees them as available
>>
>>     v       kexec
>>
>> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
>>           - tries to call EFI runtime services, and explodes.
>
> I'm not really sure what the best approach is here, but I thought that
> it would be good to raise awareness of the edge-case.
>

I think we should simply allow the BE kernel to deal with a UEFI
memory map. It only involves a bit of byte swapping (which I already
implemented at some point)

It would require some minor refactoring to make the UEFI init code
separate from all the other bits, but I don't see any major issues
here

Mark Rutland Jan. 20, 2016, 3:04 p.m. UTC | #31

On Wed, Jan 20, 2016 at 03:59:08PM +0100, Ard Biesheuvel wrote:
> On 20 January 2016 at 13:36, Mark Rutland <mark.rutland@arm.com> wrote:
> > Ard, Ganapatrao, the below is something we need to consider for the
> > combination of the NUMA & kexec approaches. It only becomes a problem
> > if/when we preserve DT memory nodes in the presence of EFI, though it
> > would be nice to not box ourselves into a corner.
> >
> > On Wed, Jan 20, 2016 at 12:02:58PM +0000, Mark Rutland wrote:
> >> On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
> >> > On 01/19/2016 11:01 PM, Mark Rutland wrote:
> >> > >For NUMA topology in !ACPI kernels, we might need to also retain and
> >> > >parse memory nodes, but only for toplogy information. The kernel would
> >> > >still only use memory as described by the EFI memory map.
> >> > >
> >> > >There's a horrible edge case I've spotted if performing a chain of
> >> > >cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >> > >respect the EFI memory map so as to avoid corrupting it for the
> >> > >subsequent LE kernel. Other than this I believe everything should just
> >> > >work.
> >> >
> >> > BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
> >> > for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
> >> > (as in the case of LE -> LE) and require users to provide a dtb file explicitly.
> >>
> >> As I mentioned above, the problem exists when memory nodes also exist
> >> (for describing NUMA topology). In that case the BE kernel would try to
> >> use the information from the memory nodes.
> >>
> >> > For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
> >> > and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)
> >>
> >> See above. The problem I imagine is:
> >>
> >> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
> >>
> >>     v       kexec
> >>
> >> BE kernel - uses DT memory nodes
> >>           - clobbers EFI runtime regions as it sees them as available
> >>
> >>     v       kexec
> >>
> >> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
> >>           - tries to call EFI runtime services, and explodes.
> >
> > I'm not really sure what the best approach is here, but I thought that
> > it would be good to raise awareness of the edge-case.
> >
> 
> I think we should simply allow the BE kernel to deal with a UEFI
> memory map. It only involves a bit of byte swapping (which I already
> implemented at some point)
> 
> It would require some minor refactoring to make the UEFI init code
> separate from all the other bits, but I don't see any major issues
> here

Ok. I had assumed that getting the BE kernel to deal with the UEFI
memory map would be a bit more involved.

I'm happy to be proven wrong. :)

Thanks,
Mark.

Dave Young Jan. 21, 2016, 2:54 a.m. UTC | #32

On 01/20/16 at 11:28am, Mark Rutland wrote:
> On Wed, Jan 20, 2016 at 10:49:46AM +0800, Dave Young wrote:
> > On 01/19/16 at 02:01pm, Mark Rutland wrote:
> > > On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> > > > On 01/19/16 at 12:51pm, Mark Rutland wrote:
> > > > > On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > > > > > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > > > > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > > > > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > > > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > > > > >are over E820 limitation then turn to use setup_data list for remain
> > > > > > > >entries.
> > > > > > > 
> > > > > > > Thanks. I will visit x86 code again.
> > > > > > > 
> > > > > > > >I think it is X86 specific. Personally I think device tree property is
> > > > > > > >better.
> > > > > > > 
> > > > > > > Do you think so?
> > > > > > 
> > > > > > I'm not sure it is the best way. For X86 we run into problem with
> > > > > > memmap= design, one example is pci domain X (X>1) need the pci memory
> > > > > > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > > > > > to 2nd kernel we find that cmdline[] array is not big enough.
> > > > > 
> > > > > I'm not sure how PCI ranges relate to the memory map used for normal
> > > > > memory (i.e. RAM), though I'm probably missing some caveat with the way
> > > > > ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> > > > 
> > > > Here is the old patch which was rejected in kexec-tools:
> > > > http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> > > > 
> > > > > 
> > > > > If the kernel got the rest of its system topology from DT, the PCI
> > > > > regions would be described there.
> > > > 
> > > > Yes, if kdump kernel use same DT as 1st kernel.
> > > 
> > > Other than for testing purposes, I don't see why you'd pass the kdump
> > > kernel a DTB inconsistent with that the 1st kernel was passsed (other
> > > than some proerties under /chosen).
> > > 
> > > We added /sys/firmware/fdt specifically to allow the kexec tools to get
> > > the exact DTB the first kernel used. There's no reason for tools to have
> > > to make something up.
> > 
> > Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> > how one will use it unless dropping the option and use /sys/firmware/fdt
> > unconditionally. 
> 
> I think this is a tangential discussion. I think it's fine to say that
> for kdump we do not expect this -- a user would be shooting themselves
> in the foot if they did. Regardless, I was under the impression that
> kdump was usually set up by distribution-provided init code.
> 
> or kdump, which typically is set up automatically by the OS, 

Yes, usually os setup kdump but with some config file user can tune the kexec
arguments. Anyway I agree that one should do right but if we are sure exact
fdt in 1st kernel is needed we can drop kexec-tools --dtb option. 

> 
> > If we choose to implement kexec_file_load only in kernel, the interfaces
> > provided are kernel, initrd and cmdline. We can always use same dtb.
> 
> There are use-cases where being in complete control of the purgatory
> code is necessary. For example, the next OS might not be Linux (and
> might not accept a DTB, or have different requirements on the initial
> register state).
> 
> Regardless of the need for something like kexec_file_load for kdump in
> Secure Boot environments, there is also a need for kexec_load with the
> user having complete control.

I'm not sure if there are such use cases in arm64 in real life.
But indeed it is a reason kexec_load can exist if there really are such requests. 

> 
> > > > > > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > > > > > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > > > > > via UEFI memmap?
> > > > > 
> > > > > When booted via EFI, we use the EFI memory map. The EFI stub handles
> > > > > acquring the relevant information and passing that to the first kernel
> > > > > in the DTB (see Documentation/arm/uefi.txt).
> > > > 
> > > > Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> > > > instead of memory nodes details.. 
> > > 
> > > When booted via EFI, yes.
> > > 
> > > For NUMA topology in !ACPI kernels, we might need to also retain and
> > > parse memory nodes, but only for toplogy information. The kernel would
> > > still only use memory as described by the EFI memory map.
> > > 
> > > There's a horrible edge case I've spotted if performing a chain of
> > > cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> > > respect the EFI memory map so as to avoid corrupting it for the
> > > subsequent LE kernel. Other than this I believe everything should just
> > > work.
> > 
> > Firmware do not know kernel endianniess, kernel should respect firmware
> > maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> 
> I agree that this isn't kexec's fault as such, but in the absence of
> kexec, the above issue does not exist, so one can't consider it in
> isolation.
> 
> > > > > A kexec'd kernel should simply inherit that. So long as the DTB and/or
> > > > > UEFI tables in memory are the same, it would be the same as a cold boot.
> > > > 
> > > > For kexec all memory ranges are same, for kdump we need use original reserved
> > > > range with crashkernel= as usable memory and all other orignal usable ranges
> > > > are not usable anymore. 
> > > 
> > > Sure. This is what I believe we should expose with an additional
> > > property under /chosen, while keeping everything else pristine.
> > > 
> > > The crash kernel can then limit itself to that region, while it would
> > > have the information of the full memory map (which it could log and/or
> > > use to drive other dumping).
> > 
> > In this way kernel should be aware it is a kdump booting, it is doable though
> > I feel it is better for kdump kernel in a black box with infomations it
> > can use just like the 1st kernel. Things here is where we choose to cook
> > the memory infomation in boot loader or in kernel itself.
> 
> Sorry, I can't follow what you are trying to say here. Could you
> elaborate?

Hmm, I menas if we prepare a kdump usable uefi memmap then we do not need introduce
the extra dtb property and kdump kernel just works like a normal boot.
I think we have understand each other upon latter part of this mail :)

One additianl issue with the simple way is if it can be used only in kdump kernel
Or it applys to both normal boot and kdump kernel boot so that it becomes a
general interface instead of only for kdump.

That means in 1st kernel we need override all system ram sections from uefi if
the usable chosen property is provided in 1st kernel.

> 
> > > > Is it possible to modify uefi memmap for kdump case?
> > > 
> > > Technically it would be possible, however I don't think it's necessary,
> > > and I think it would be disadvantageous to do so.
> > > 
> > > Describing the range(s) the crash kernel can use in separate properties
> > > under /chosen has a number of advantages.
> > 
> > Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> > elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> > do not work well in kdump kernel some uncertain reasons. But ideally I
> > think kernel should handle things just like in 1st kernel and avoid to use
> > it. 
> 
> I agree that we should not have kexec/kdump knowledge spread throughout
> the kernel, and that the boot protocol should be uniform with a cold
> boot as far as possible.
> 
> However, requiring userspace or the first kernel to modify
> firmware-provided information has a number of risks and reduces the
> amount of information available to the kdump kernel. To that end I am
> opposed to modifying the memory nodes in the DTB, or to modifying the
> EFI memory map.
> 
> Having a property in the DTB describing the range(s) of memory reserved
> for use by the kdump kernel is vastly simpler, and avoids those risks:
> 
> * It requires a tiny amount of self-contained code in the kdump kernel
>   to parse the property and apply the constraints imposed (i.e. carve up
>   memblock).
> 
>   This is easy to contain in a single function (or at least within a
>   single file), and need not affect drivers or other code.
> 
> * It is uniform regardless of whether the EFI memory map, DT memory
>   nodes, or some other mechanism is used to discover memory in the
>   systems.
> 
>   This makes it easy to impose the restrictions consistently, and is
>   somewhat future-proof.

Ok, considering arm64 specific complexity of the several combind cases
especially this one I would say choosing a simple solution may be the
best choice.

> 
> * Userspace or the first kernel to not need to parse and modify an
>   arbitrary amount of data (which might be in an extended format it
>   doesn't fully understand). There is less risk for this to go wrong.
> 
>   It is far easier to add a property than it is to correctly modify the
>   EFI memory map, memory nodes, or some other data structure. There is
>   less risk, and it is somewhat future-proof.
>   
> * The original memory map information is preserved, even though unused.
>   This may be useful for debugging, and it may turn out that the kdump
>   kernel needs to know about certain portions of the original memory
>   map, even if we are not currently aware of why we would need this.

Thanks
Dave

> 
> Thanks,
> Mark.

Dave Young Jan. 21, 2016, 2:57 a.m. UTC | #33

Hi, Mark

On 01/20/16 at 11:54am, Mark Rutland wrote:
> On Wed, Jan 20, 2016 at 02:38:56PM +0800, Dave Young wrote:
> > Maybe I did not say it clearly, I prefer kexec syscall/tool to modifiy dtb
> > or uefi-memmap so that we do not need any extra kernel cmdline.
> 
> I am strongly opposed to modifying the FW-provided memroy map
> information, for the reasons I expressed in other replies.
> 
> What are your concerns with a property under /chosen?
> 
> > For x86 we would like to drop the memmap= usage in kexec-tools but we can not
> > do that for a compatibility problem about calgary iommu. So that currently
> > kexec-tools supports both recreating E820 maps and passing memmap=.
> > 
> > We should think it carefully because it will be hard to remove once we support it.
> 
> I agree that we don't want a plethora of solutions that we have to
> support forever.
> 
> > IMO handling it in code is better than using an external interface.
> 
> I'm not sure what you mean by this. What is the "external interface",
> and which code do you beleive it is better to handle this in?

I think we have covered all these in another reply. Let's continue
the discussion if needed in that thread.

Thanks
Dave

Dave Young Jan. 21, 2016, 3:03 a.m. UTC | #34

> I am strongly opposed to modifying the FW-provided memroy map
> information, for the reasons I expressed in other replies.
> 
> What are your concerns with a property under /chosen?

If we choose a simpler way between memmap= and /chosen property then
I think a property in /chosen looks better to me.

Thanks
Dave

AKASHI Takahiro Jan. 21, 2016, 5:43 a.m. UTC | #35

On 01/20/2016 11:59 PM, Ard Biesheuvel wrote:
> On 20 January 2016 at 13:36, Mark Rutland <mark.rutland@arm.com> wrote:
>> Ard, Ganapatrao, the below is something we need to consider for the
>> combination of the NUMA & kexec approaches. It only becomes a problem
>> if/when we preserve DT memory nodes in the presence of EFI, though it
>> would be nice to not box ourselves into a corner.
>>
>> On Wed, Jan 20, 2016 at 12:02:58PM +0000, Mark Rutland wrote:
>>> On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
>>>> On 01/19/2016 11:01 PM, Mark Rutland wrote:
>>>>> For NUMA topology in !ACPI kernels, we might need to also retain and
>>>>> parse memory nodes, but only for toplogy information. The kernel would
>>>>> still only use memory as described by the EFI memory map.
>>>>>
>>>>> There's a horrible edge case I've spotted if performing a chain of
>>>>> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>>>>> respect the EFI memory map so as to avoid corrupting it for the
>>>>> subsequent LE kernel. Other than this I believe everything should just
>>>>> work.
>>>>
>>>> BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
>>>> for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
>>>> (as in the case of LE -> LE) and require users to provide a dtb file explicitly.
>>>
>>> As I mentioned above, the problem exists when memory nodes also exist
>>> (for describing NUMA topology). In that case the BE kernel would try to
>>> use the information from the memory nodes.
>>>
>>>> For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
>>>> and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)
>>>
>>> See above. The problem I imagine is:
>>>
>>> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
>>>
>>>      v       kexec
>>>
>>> BE kernel - uses DT memory nodes
>>>            - clobbers EFI runtime regions as it sees them as available
>>>
>>>      v       kexec
>>>
>>> LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
>>>            - tries to call EFI runtime services, and explodes.
>>
>> I'm not really sure what the best approach is here, but I thought that
>> it would be good to raise awareness of the edge-case.
>>
>
> I think we should simply allow the BE kernel to deal with a UEFI
> memory map. It only involves a bit of byte swapping (which I already
> implemented at some point)

Just from my curiosity,
will runtime services be also available on BE kernel with LE uefi?

-Takahiro AKASHI

> It would require some minor refactoring to make the UEFI init code
> separate from all the other bits, but I don't see any major issues
> here
>

AKASHI Takahiro Jan. 21, 2016, 6:53 a.m. UTC | #36

On 01/20/2016 08:49 PM, Mark Rutland wrote:
> On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
>> On 01/20/2016 11:49 AM, Dave Young wrote:
>>> On 01/19/16 at 02:01pm, Mark Rutland wrote:
>>>> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
>>>>> On 01/19/16 at 12:51pm, Mark Rutland wrote:
>>>>>> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
>>>>>>> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
>>>>>>>> On 01/19/2016 10:43 AM, Dave Young wrote:
>>>>>>>>> X86 takes another way in latest kexec-tools and kexec_file_load, that is
>>>>>>>>> recreating E820 table and pass it to kexec/kdump kernel, if the entries
>>>>>>>>> are over E820 limitation then turn to use setup_data list for remain
>>>>>>>>> entries.
>>>>>>>>
>>>>>>>> Thanks. I will visit x86 code again.
>>>>>>>>
>>>>>>>>> I think it is X86 specific. Personally I think device tree property is
>>>>>>>>> better.
>>>>>>>>
>>>>>>>> Do you think so?
>>>>>>>
>>>>>>> I'm not sure it is the best way. For X86 we run into problem with
>>>>>>> memmap= design, one example is pci domain X (X>1) need the pci memory
>>>>>>> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
>>>>>>> to 2nd kernel we find that cmdline[] array is not big enough.
>>>>>>
>>>>>> I'm not sure how PCI ranges relate to the memory map used for normal
>>>>>> memory (i.e. RAM), though I'm probably missing some caveat with the way
>>>>>> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
>>>>>
>>>>> Here is the old patch which was rejected in kexec-tools:
>>>>> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
>>>>>
>>>>>>
>>>>>> If the kernel got the rest of its system topology from DT, the PCI
>>>>>> regions would be described there.
>>>>>
>>>>> Yes, if kdump kernel use same DT as 1st kernel.
>>>>
>>>> Other than for testing purposes, I don't see why you'd pass the kdump
>>>> kernel a DTB inconsistent with that the 1st kernel was passsed (other
>>>> than some proerties under /chosen).
>>>>
>>>> We added /sys/firmware/fdt specifically to allow the kexec tools to get
>>>> the exact DTB the first kernel used. There's no reason for tools to have
>>>> to make something up.
>>>
>>> Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
>>> how one will use it unless dropping the option and use /sys/firmware/fdt
>>> unconditionally.
>>
>> As a matter of fact, specifying proper command line parameters as well as
>> dtb is partly users' responsibility for kdump to work correctly.
>> (especially for BE kernel)
>>
>>> If we choose to implement kexec_file_load only in kernel, the interfaces
>>> provided are kernel, initrd and cmdline. We can always use same dtb.
>>
>> I would say that we can always use the same dtb even with kexec_load
>> from user's perspective. Right?
>
> No.
>
> This breaks using kexec for boot-loader purposes, and imposes a policy.

What kind of policy?
I said "can", but if we want to use other setting/configuration, we can
still have a full control.

> For better or worse kexec_file_load has always imposed a constrained
> Linux-only policy, so that's a different story.
>
>>>> There's a horrible edge case I've spotted if performing a chain of
>>>> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>>>> respect the EFI memory map so as to avoid corrupting it for the
>>>> subsequent LE kernel. Other than this I believe everything should just
>>>> work.
>>>
>>> Firmware do not know kernel endianniess, kernel should respect firmware
>>> maps and adapt to it, it sounds like a generic issue not specfic to kexec.
>>
>> On arm64, a kernel image header has a bit field to specify the image's endianness.
>> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
>
> The firmware should _never_ care about the kernel's endianness. The
> bootlaoder or first kernel shouldn't care about the next kernel's
> endianness apart from in exceptional circumstances. The DTB for a LE
> kernel should look identical to that passed to a BE kernel.

Please note that I didn't say anything different from your last two statements.
The current arm64 kexec implementation doesn't do anything specific to BE,
but as far as BE kernel doesn't support UEFI, users are responsible for
providing a proper dtb.

> In my mind, the only valid reason to look at that bit is so that
> bootloaders can provide a warning if the CPU does not implement that
> endianness.
>
> The issue I mention above should be solved by changes to the BE kernel.
>
>>>>> Is it possible to modify uefi memmap for kdump case?
>>>>
>>>> Technically it would be possible, however I don't think it's necessary,
>>>> and I think it would be disadvantageous to do so.
>>>>
>>>> Describing the range(s) the crash kernel can use in separate properties
>>>> under /chosen has a number of advantages.
>>>
>>> Ok, I got the points. We have a is_kdump_kernel() by checking if there is
>>> elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
>>> do not work well in kdump kernel some uncertain reasons. But ideally I
>>> think kernel should handle things just like in 1st kernel and avoid to use
>>> it.
>>
>> So I'm not still sure about what are advantages of a property under /chosen
>> over "memmap=" kernel parameter.
>> Both are simple and can have the same effect with minimizing changes to dtb.
>> (But if, in the latter case, we have to provide *all* the memory-related information
>> through "memmap=" parameters, it would be much complicated.)
>
> The reason I prefer a property over command line additions include:

Take some examples:
(a) a property under /chosen
   {
     chosen = {
       cmdline = "elfcorehdr=AA@BB maxcpus=1 ...";
     }
     usable-memory = <XX YY>;
     memory {
       ...
     }
   }

(b) a kernel command line parameter
   (I use the same name, "usable-memory", to show the similarity. may use another name though.)
   {
     chosen = {
       cmdline = "elfcorehdr=AA@BB maxcpus=1 usable-memory=YY@XX ...";
     }
     memory {
       ...
     }
   }

> * It keeps the command line simple (as you mention the opposite is
>    "complicated").

I think both are simple.

> * It is logically separate from options the user may pass to the kernel
>    in that the restricted region(s) of memory avaialble are effectively
>    properties of the system (in that the crashed OS is part of the system
>    state).

"elfcorehdr=" parameter already breaks your point.
"elfcorehdr=" looks to be, what you say, a system property, and is actually
added by kexec-tools on all architectures, and "usable-memory", whether it is
a DT property or a kernel parameter, will also be added by kexec-tools.
(Users don't have to care.)

> * The semantics of the command line parsing can change subtly over time
>    (for example, see 51e158c12aca3c9a, which terminates command line
>    parseing at "--"). Maknig sure that a command line option will
>    actually be parsed by the next kernel is not trivial.
>
>    Keeping this information isolated from the command line is more
>    robust.

Even so, who wants to use kdump without testing?
and this is not a kdump specific issue.

> * Addition of a property is a self-contained operation, that doesn't
>    require any knowledge about the command line.

I don't get your point here.
For a kernel parameter, early_param() can encapsulate all the stuffs necessary.
Once the kernel recognizes a usable memory region, limiting available
memory should be done in the exact same way.

Thanks,
-Takahiro AKASHI


> Thanks,
> Mark.
>

Mark Rutland Jan. 21, 2016, 12:02 p.m. UTC | #37

On Thu, Jan 21, 2016 at 03:53:42PM +0900, AKASHI Takahiro wrote:
> On 01/20/2016 08:49 PM, Mark Rutland wrote:
> >On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
> >>On 01/20/2016 11:49 AM, Dave Young wrote:
> >>>On 01/19/16 at 02:01pm, Mark Rutland wrote:
> >>>>On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> >>>>>On 01/19/16 at 12:51pm, Mark Rutland wrote:
> >>>>>>On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> >>>>>>>On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> >>>>>>>>On 01/19/2016 10:43 AM, Dave Young wrote:
> >>>>>>>>>X86 takes another way in latest kexec-tools and kexec_file_load, that is
> >>>>>>>>>recreating E820 table and pass it to kexec/kdump kernel, if the entries
> >>>>>>>>>are over E820 limitation then turn to use setup_data list for remain
> >>>>>>>>>entries.
> >>>>>>>>
> >>>>>>>>Thanks. I will visit x86 code again.
> >>>>>>>>
> >>>>>>>>>I think it is X86 specific. Personally I think device tree property is
> >>>>>>>>>better.
> >>>>>>>>
> >>>>>>>>Do you think so?
> >>>>>>>
> >>>>>>>I'm not sure it is the best way. For X86 we run into problem with
> >>>>>>>memmap= design, one example is pci domain X (X>1) need the pci memory
> >>>>>>>ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> >>>>>>>to 2nd kernel we find that cmdline[] array is not big enough.
> >>>>>>
> >>>>>>I'm not sure how PCI ranges relate to the memory map used for normal
> >>>>>>memory (i.e. RAM), though I'm probably missing some caveat with the way
> >>>>>>ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> >>>>>
> >>>>>Here is the old patch which was rejected in kexec-tools:
> >>>>>http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> >>>>>
> >>>>>>
> >>>>>>If the kernel got the rest of its system topology from DT, the PCI
> >>>>>>regions would be described there.
> >>>>>
> >>>>>Yes, if kdump kernel use same DT as 1st kernel.
> >>>>
> >>>>Other than for testing purposes, I don't see why you'd pass the kdump
> >>>>kernel a DTB inconsistent with that the 1st kernel was passsed (other
> >>>>than some proerties under /chosen).
> >>>>
> >>>>We added /sys/firmware/fdt specifically to allow the kexec tools to get
> >>>>the exact DTB the first kernel used. There's no reason for tools to have
> >>>>to make something up.
> >>>
> >>>Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> >>>how one will use it unless dropping the option and use /sys/firmware/fdt
> >>>unconditionally.
> >>
> >>As a matter of fact, specifying proper command line parameters as well as
> >>dtb is partly users' responsibility for kdump to work correctly.
> >>(especially for BE kernel)
> >>
> >>>If we choose to implement kexec_file_load only in kernel, the interfaces
> >>>provided are kernel, initrd and cmdline. We can always use same dtb.
> >>
> >>I would say that we can always use the same dtb even with kexec_load
> >>from user's perspective. Right?
> >
> >No.
> >
> >This breaks using kexec for boot-loader purposes, and imposes a policy.
> 
> What kind of policy?
> I said "can", but if we want to use other setting/configuration, we can
> still have a full control.

Apologies, I misunderstood.

In most cases, using the existing DTB (with minor modifications to
/chosen for bootargs and such) is fine. If the user just wants to boot
another Linux kernel, that's generally what they should do.

I think we're agreed on that.

However, there are cases when the user may want to use a different DTB,
or use a different purgatory. So we cannot mandate that the existing DTB
is reused, nor that an in-kernel purgatory must be used, as that imposes
a policy.

> >For better or worse kexec_file_load has always imposed a constrained
> >Linux-only policy, so that's a different story.
> >
> >>>>There's a horrible edge case I've spotted if performing a chain of
> >>>>cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >>>>respect the EFI memory map so as to avoid corrupting it for the
> >>>>subsequent LE kernel. Other than this I believe everything should just
> >>>>work.
> >>>
> >>>Firmware do not know kernel endianniess, kernel should respect firmware
> >>>maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> >>
> >>On arm64, a kernel image header has a bit field to specify the image's endianness.
> >>Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
> >
> >The firmware should _never_ care about the kernel's endianness. The
> >bootlaoder or first kernel shouldn't care about the next kernel's
> >endianness apart from in exceptional circumstances. The DTB for a LE
> >kernel should look identical to that passed to a BE kernel.
> 
> Please note that I didn't say anything different from your last two statements.
> The current arm64 kexec implementation doesn't do anything specific to BE,
> but as far as BE kernel doesn't support UEFI, users are responsible for
> providing a proper dtb.

I'm just confused as to what you mean by a "proper dtb" in that case.

If you just mean one with memory nodes hacked in, then that would
currently be a way to make that work, yes.

It seems like the better option is to fix the BE kernel to support a
UEFI memory map, as that solves other issues.

> 
> >In my mind, the only valid reason to look at that bit is so that
> >bootloaders can provide a warning if the CPU does not implement that
> >endianness.
> >
> >The issue I mention above should be solved by changes to the BE kernel.
> >
> >>>>>Is it possible to modify uefi memmap for kdump case?
> >>>>
> >>>>Technically it would be possible, however I don't think it's necessary,
> >>>>and I think it would be disadvantageous to do so.
> >>>>
> >>>>Describing the range(s) the crash kernel can use in separate properties
> >>>>under /chosen has a number of advantages.
> >>>
> >>>Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> >>>elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> >>>do not work well in kdump kernel some uncertain reasons. But ideally I
> >>>think kernel should handle things just like in 1st kernel and avoid to use
> >>>it.
> >>
> >>So I'm not still sure about what are advantages of a property under /chosen
> >>over "memmap=" kernel parameter.
> >>Both are simple and can have the same effect with minimizing changes to dtb.
> >>(But if, in the latter case, we have to provide *all* the memory-related information
> >>through "memmap=" parameters, it would be much complicated.)
> >
> >The reason I prefer a property over command line additions include:
> 
> Take some examples:
> (a) a property under /chosen
>   {
>     chosen = {
>       cmdline = "elfcorehdr=AA@BB maxcpus=1 ...";
>     }
>     usable-memory = <XX YY>;
>     memory {
>       ...
>     }
>   }
> 
> (b) a kernel command line parameter
>   (I use the same name, "usable-memory", to show the similarity. may use another name though.)
>   {
>     chosen = {
>       cmdline = "elfcorehdr=AA@BB maxcpus=1 usable-memory=YY@XX ...";
>     }
>     memory {
>       ...
>     }
>   }
> 
> >* It keeps the command line simple (as you mention the opposite is
> >   "complicated").
> 
> I think both are simple.
> 
> >* It is logically separate from options the user may pass to the kernel
> >   in that the restricted region(s) of memory avaialble are effectively
> >   properties of the system (in that the crashed OS is part of the system
> >   state).
> 
> "elfcorehdr=" parameter already breaks your point.
> "elfcorehdr=" looks to be, what you say, a system property, and is actually
> added by kexec-tools on all architectures, and "usable-memory", whether it is
> a DT property or a kernel parameter, will also be added by kexec-tools.
> (Users don't have to care.)

Just because architectures do one thing today does not mean that we have
to follow it.

I don't think that breaks my point so much as shows that a different
approach is taken by others today.

There's also no reason this cannot be a property under /chosen.

> >* The semantics of the command line parsing can change subtly over time
> >   (for example, see 51e158c12aca3c9a, which terminates command line
> >   parseing at "--"). Maknig sure that a command line option will
> >   actually be parsed by the next kernel is not trivial.
> >
> >   Keeping this information isolated from the command line is more
> >   robust.
> 
> Even so, who wants to use kdump without testing?
> and this is not a kdump specific issue.
> 
> >* Addition of a property is a self-contained operation, that doesn't
> >   require any knowledge about the command line.
> 
> I don't get your point here.
> For a kernel parameter, early_param() can encapsulate all the stuffs necessary.
> Once the kernel recognizes a usable memory region, limiting available
> memory should be done in the exact same way.

I mean when modifying the command line.

To place "elfcorehdr=" or "memmap="/"usable-memory=" into the command
line, one needs to know where it is valid to place it. Appending doesn't
always work, as per the example above with 51e158c12aca3c9a.

For both of these my point was that generally there is some fragility in
this area. While it's easy to say that breaking this would be someone
else's fault, we can easily avoid the possibility of that happening, and
avoid a set of problems trying to maintain backwards compatibility if
there were a sensible change that happened to break things.

Thanks,
Mark.

Mark Rutland Jan. 21, 2016, 1:02 p.m. UTC | #38

On Thu, Jan 21, 2016 at 02:43:15PM +0900, AKASHI Takahiro wrote:
> On 01/20/2016 11:59 PM, Ard Biesheuvel wrote:
> >On 20 January 2016 at 13:36, Mark Rutland <mark.rutland@arm.com> wrote:
> >>Ard, Ganapatrao, the below is something we need to consider for the
> >>combination of the NUMA & kexec approaches. It only becomes a problem
> >>if/when we preserve DT memory nodes in the presence of EFI, though it
> >>would be nice to not box ourselves into a corner.
> >>
> >>On Wed, Jan 20, 2016 at 12:02:58PM +0000, Mark Rutland wrote:
> >>>On Wed, Jan 20, 2016 at 02:25:07PM +0900, AKASHI Takahiro wrote:
> >>>>On 01/19/2016 11:01 PM, Mark Rutland wrote:
> >>>>>For NUMA topology in !ACPI kernels, we might need to also retain and
> >>>>>parse memory nodes, but only for toplogy information. The kernel would
> >>>>>still only use memory as described by the EFI memory map.
> >>>>>
> >>>>>There's a horrible edge case I've spotted if performing a chain of
> >>>>>cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> >>>>>respect the EFI memory map so as to avoid corrupting it for the
> >>>>>subsequent LE kernel. Other than this I believe everything should just
> >>>>>work.
> >>>>
> >>>>BE kernel doesn't support UEFI yet and cannot access UEFI memmap table. So,
> >>>>for LE -> BE, we don't use a dtb generated from /sys/firmware/fdt (or /proc/device-tree)
> >>>>(as in the case of LE -> LE) and require users to provide a dtb file explicitly.
> >>>
> >>>As I mentioned above, the problem exists when memory nodes also exist
> >>>(for describing NUMA topology). In that case the BE kernel would try to
> >>>use the information from the memory nodes.
> >>>
> >>>>For BE -> LE, BE kernel doesn't know wther UEFI memmap table is available or not
> >>>>and so use the same (explicitly-provided) dtb (as LE -> LE in !UEFI)
> >>>
> >>>See above. The problem I imagine is:
> >>>
> >>>LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
> >>>
> >>>     v       kexec
> >>>
> >>>BE kernel - uses DT memory nodes
> >>>           - clobbers EFI runtime regions as it sees them as available
> >>>
> >>>     v       kexec
> >>>
> >>>LE kernel - uses EFI mmap, takes NUMA information from DT memory nodes
> >>>           - tries to call EFI runtime services, and explodes.
> >>
> >>I'm not really sure what the best approach is here, but I thought that
> >>it would be good to raise awareness of the edge-case.
> >>
> >
> >I think we should simply allow the BE kernel to deal with a UEFI
> >memory map. It only involves a bit of byte swapping (which I already
> >implemented at some point)
> 
> Just from my curiosity,
> will runtime services be also available on BE kernel with LE uefi?

It may be possible to implement that (I recall that Ard had a go), but
that's far more complicated than simply supporting the EFI memory map,
as you need separate (endian-swapped) page tables and other data
structures, lose the ability to handle exceptions, etc.

All that's suggested above is supporting the memory map.

Thanks,
Mark.

AKASHI Takahiro Jan. 22, 2016, 6:23 a.m. UTC | #39

On 01/21/2016 09:02 PM, Mark Rutland wrote:
> On Thu, Jan 21, 2016 at 03:53:42PM +0900, AKASHI Takahiro wrote:
>> On 01/20/2016 08:49 PM, Mark Rutland wrote:
>>> On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
>>>> On 01/20/2016 11:49 AM, Dave Young wrote:
>>>>> On 01/19/16 at 02:01pm, Mark Rutland wrote:
>>>>>> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
>>>>>>> On 01/19/16 at 12:51pm, Mark Rutland wrote:
>>>>>>>> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
>>>>>>>>> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
>>>>>>>>>> On 01/19/2016 10:43 AM, Dave Young wrote:
>>>>>>>>>>> X86 takes another way in latest kexec-tools and kexec_file_load, that is
>>>>>>>>>>> recreating E820 table and pass it to kexec/kdump kernel, if the entries
>>>>>>>>>>> are over E820 limitation then turn to use setup_data list for remain
>>>>>>>>>>> entries.
>>>>>>>>>>
>>>>>>>>>> Thanks. I will visit x86 code again.
>>>>>>>>>>
>>>>>>>>>>> I think it is X86 specific. Personally I think device tree property is
>>>>>>>>>>> better.
>>>>>>>>>>
>>>>>>>>>> Do you think so?
>>>>>>>>>
>>>>>>>>> I'm not sure it is the best way. For X86 we run into problem with
>>>>>>>>> memmap= design, one example is pci domain X (X>1) need the pci memory
>>>>>>>>> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
>>>>>>>>> to 2nd kernel we find that cmdline[] array is not big enough.
>>>>>>>>
>>>>>>>> I'm not sure how PCI ranges relate to the memory map used for normal
>>>>>>>> memory (i.e. RAM), though I'm probably missing some caveat with the way
>>>>>>>> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
>>>>>>>
>>>>>>> Here is the old patch which was rejected in kexec-tools:
>>>>>>> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
>>>>>>>
>>>>>>>>
>>>>>>>> If the kernel got the rest of its system topology from DT, the PCI
>>>>>>>> regions would be described there.
>>>>>>>
>>>>>>> Yes, if kdump kernel use same DT as 1st kernel.
>>>>>>
>>>>>> Other than for testing purposes, I don't see why you'd pass the kdump
>>>>>> kernel a DTB inconsistent with that the 1st kernel was passsed (other
>>>>>> than some proerties under /chosen).
>>>>>>
>>>>>> We added /sys/firmware/fdt specifically to allow the kexec tools to get
>>>>>> the exact DTB the first kernel used. There's no reason for tools to have
>>>>>> to make something up.
>>>>>
>>>>> Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
>>>>> how one will use it unless dropping the option and use /sys/firmware/fdt
>>>>> unconditionally.
>>>>
>>>> As a matter of fact, specifying proper command line parameters as well as
>>>> dtb is partly users' responsibility for kdump to work correctly.
>>>> (especially for BE kernel)
>>>>
>>>>> If we choose to implement kexec_file_load only in kernel, the interfaces
>>>>> provided are kernel, initrd and cmdline. We can always use same dtb.
>>>>
>>>> I would say that we can always use the same dtb even with kexec_load
>>> >from user's perspective. Right?
>>>
>>> No.
>>>
>>> This breaks using kexec for boot-loader purposes, and imposes a policy.
>>
>> What kind of policy?
>> I said "can", but if we want to use other setting/configuration, we can
>> still have a full control.
>
> Apologies, I misunderstood.
>
> In most cases, using the existing DTB (with minor modifications to
> /chosen for bootargs and such) is fine. If the user just wants to boot
> another Linux kernel, that's generally what they should do.
>
> I think we're agreed on that.

Yes.

> However, there are cases when the user may want to use a different DTB,
> or use a different purgatory. So we cannot mandate that the existing DTB
> is reused, nor that an in-kernel purgatory must be used, as that imposes
> a policy.

Agree!

>>> For better or worse kexec_file_load has always imposed a constrained
>>> Linux-only policy, so that's a different story.
>>>
>>>>>> There's a horrible edge case I've spotted if performing a chain of
>>>>>> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>>>>>> respect the EFI memory map so as to avoid corrupting it for the
>>>>>> subsequent LE kernel. Other than this I believe everything should just
>>>>>> work.
>>>>>
>>>>> Firmware do not know kernel endianniess, kernel should respect firmware
>>>>> maps and adapt to it, it sounds like a generic issue not specfic to kexec.
>>>>
>>>> On arm64, a kernel image header has a bit field to specify the image's endianness.
>>>> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
>>>
>>> The firmware should _never_ care about the kernel's endianness. The
>>> bootlaoder or first kernel shouldn't care about the next kernel's
>>> endianness apart from in exceptional circumstances. The DTB for a LE
>>> kernel should look identical to that passed to a BE kernel.
>>
>> Please note that I didn't say anything different from your last two statements.
>> The current arm64 kexec implementation doesn't do anything specific to BE,
>> but as far as BE kernel doesn't support UEFI, users are responsible for
>> providing a proper dtb.
>
> I'm just confused as to what you mean by a "proper dtb" in that case.
>
> If you just mean one with memory nodes hacked in, then that would
> currently be a way to make that work, yes.

One of useful cases that I have in my mind is kdump.
We may want to use a small sub-set of dtb, especially devices, to
make the reboot more reliable. Device drivers are likely to be vulnerable
at crash.

> It seems like the better option is to fix the BE kernel to support a
> UEFI memory map, as that solves other issues.

Why did Ard throw away his patch?

>>
>>> In my mind, the only valid reason to look at that bit is so that
>>> bootloaders can provide a warning if the CPU does not implement that
>>> endianness.
>>>
>>> The issue I mention above should be solved by changes to the BE kernel.
>>>
>>>>>>> Is it possible to modify uefi memmap for kdump case?
>>>>>>
>>>>>> Technically it would be possible, however I don't think it's necessary,
>>>>>> and I think it would be disadvantageous to do so.
>>>>>>
>>>>>> Describing the range(s) the crash kernel can use in separate properties
>>>>>> under /chosen has a number of advantages.
>>>>>
>>>>> Ok, I got the points. We have a is_kdump_kernel() by checking if there is
>>>>> elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
>>>>> do not work well in kdump kernel some uncertain reasons. But ideally I
>>>>> think kernel should handle things just like in 1st kernel and avoid to use
>>>>> it.
>>>>
>>>> So I'm not still sure about what are advantages of a property under /chosen
>>>> over "memmap=" kernel parameter.
>>>> Both are simple and can have the same effect with minimizing changes to dtb.
>>>> (But if, in the latter case, we have to provide *all* the memory-related information
>>>> through "memmap=" parameters, it would be much complicated.)
>>>
>>> The reason I prefer a property over command line additions include:
>>
>> Take some examples:
>> (a) a property under /chosen
>>    {
>>      chosen = {
>>        cmdline = "elfcorehdr=AA@BB maxcpus=1 ...";
>>      }
>>      usable-memory = <XX YY>;
>>      memory {
>>        ...
>>      }
>>    }
>>
>> (b) a kernel command line parameter
>>    (I use the same name, "usable-memory", to show the similarity. may use another name though.)
>>    {
>>      chosen = {
>>        cmdline = "elfcorehdr=AA@BB maxcpus=1 usable-memory=YY@XX ...";
>>      }
>>      memory {
>>        ...
>>      }
>>    }
>>
>>> * It keeps the command line simple (as you mention the opposite is
>>>    "complicated").
>>
>> I think both are simple.
>>
>>> * It is logically separate from options the user may pass to the kernel
>>>    in that the restricted region(s) of memory avaialble are effectively
>>>    properties of the system (in that the crashed OS is part of the system
>>>    state).
>>
>> "elfcorehdr=" parameter already breaks your point.
>> "elfcorehdr=" looks to be, what you say, a system property, and is actually
>> added by kexec-tools on all architectures, and "usable-memory", whether it is
>> a DT property or a kernel parameter, will also be added by kexec-tools.
>> (Users don't have to care.)
>
> Just because architectures do one thing today does not mean that we have
> to follow it.
>
> I don't think that breaks my point so much as shows that a different
> approach is taken by others today.
>
> There's also no reason this cannot be a property under /chosen.

No, but no strong reason to be so IMO.

>>> * The semantics of the command line parsing can change subtly over time
>>>    (for example, see 51e158c12aca3c9a, which terminates command line
>>>    parseing at "--"). Maknig sure that a command line option will
>>>    actually be parsed by the next kernel is not trivial.
>>>
>>>    Keeping this information isolated from the command line is more
>>>    robust.
>>
>> Even so, who wants to use kdump without testing?
>> and this is not a kdump specific issue.
>>
>>> * Addition of a property is a self-contained operation, that doesn't
>>>    require any knowledge about the command line.
>>
>> I don't get your point here.
>> For a kernel parameter, early_param() can encapsulate all the stuffs necessary.
>> Once the kernel recognizes a usable memory region, limiting available
>> memory should be done in the exact same way.
>
> I mean when modifying the command line.

OK, I understand what you mean.

> To place "elfcorehdr=" or "memmap="/"usable-memory=" into the command
> line, one needs to know where it is valid to place it. Appending doesn't
> always work, as per the example above with 51e158c12aca3c9a.

So, are you now suggesting that we put both "elfcorehdr=" and
"usable-memory=" under /chosen in dtb? That's fair enough.
(as far as nobody cares about incompatibility with other archs.)

-Takahiro AKASHI


> For both of these my point was that generally there is some fragility in
> this area. While it's easy to say that breaking this would be someone
> else's fault, we can easily avoid the possibility of that happening, and
> avoid a set of problems trying to maintain backwards compatibility if
> there were a sensible change that happened to break things.
>
> Thanks,
> Mark.
>

Mark Rutland Jan. 22, 2016, 11:13 a.m. UTC | #40

On Fri, Jan 22, 2016 at 03:23:14PM +0900, AKASHI Takahiro wrote:
> On 01/21/2016 09:02 PM, Mark Rutland wrote:
> >On Thu, Jan 21, 2016 at 03:53:42PM +0900, AKASHI Takahiro wrote:
> >>On 01/20/2016 08:49 PM, Mark Rutland wrote:
> >>>On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
> >>>>On 01/20/2016 11:49 AM, Dave Young wrote:
> >>>>>Firmware do not know kernel endianniess, kernel should respect firmware
> >>>>>maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> >>>>
> >>>>On arm64, a kernel image header has a bit field to specify the image's endianness.
> >>>>Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
> >>>
> >>>The firmware should _never_ care about the kernel's endianness. The
> >>>bootlaoder or first kernel shouldn't care about the next kernel's
> >>>endianness apart from in exceptional circumstances. The DTB for a LE
> >>>kernel should look identical to that passed to a BE kernel.
> >>
> >>Please note that I didn't say anything different from your last two statements.
> >>The current arm64 kexec implementation doesn't do anything specific to BE,
> >>but as far as BE kernel doesn't support UEFI, users are responsible for
> >>providing a proper dtb.
> >
> >I'm just confused as to what you mean by a "proper dtb" in that case.
> >
> >If you just mean one with memory nodes hacked in, then that would
> >currently be a way to make that work, yes.
> 
> One of useful cases that I have in my mind is kdump.
> We may want to use a small sub-set of dtb, especially devices, to
> make the reboot more reliable. Device drivers are likely to be vulnerable
> at crash.

I don't think that we can reliably have userspace carve out devices from
the DTB or from ACPI tables in order to achieve that. That's going to
end up complex and/or incomplete. We also can't do this in the
kexec_load_file / Secure Boot case.

That's not to say we cannot try, as it's possible when using kexec_load.
However, it's only going to be possible on a subset of systems, and it
would probably make sense to reserve this approach to those cases we
cannot work around by other means (e.g. whitelisting "safe" devices in
the kdump kernel, forcing explicit resets, etc).

> >It seems like the better option is to fix the BE kernel to support a
> >UEFI memory map, as that solves other issues.
> 
> Why did Ard throw away his patch?

In the absence of kexec it wasn't necessary, it only supported a subset
of the runtime services (and no other features like DMI IIRC), and it
looked like it would be painful to debug (if something went wrong while
a CPU was in LE mode, we couldn't even panic()).

Given BE kernels on UEFI were never supported until that point, there
wasn't a compelling reason to support that case.

Even if we support the UEFI memory map, I don't think it's worth the
effort to support runtime services, ACPI, and related code that's only
ever been tested on LE. So realistically this would only work on systems
using UEFI && DT rather than UEFI && ACPI.

> So, are you now suggesting that we put both "elfcorehdr=" and
> "usable-memory=" under /chosen in dtb?

Yes.

> That's fair enough.  (as far as nobody cares about incompatibility
> with other archs.)

Glad to hear! :)

Thanks,
Mark.

Dave Young Jan. 25, 2016, 3:19 a.m. UTC | #41

Hi, AKASHI

> >To place "elfcorehdr=" or "memmap="/"usable-memory=" into the command
> >line, one needs to know where it is valid to place it. Appending doesn't
> >always work, as per the example above with 51e158c12aca3c9a.
> 
> So, are you now suggesting that we put both "elfcorehdr=" and
> "usable-memory=" under /chosen in dtb? That's fair enough.
> (as far as nobody cares about incompatibility with other archs.)

You may need move is_kdump_kernel as a weak function so that in arm64 you
can still use it in kdump kernel.

Thanks
Dave

Dave Young Jan. 25, 2016, 4:23 a.m. UTC | #42

On 01/25/16 at 11:19am, Dave Young wrote:
> Hi, AKASHI
> 
> > >To place "elfcorehdr=" or "memmap="/"usable-memory=" into the command
> > >line, one needs to know where it is valid to place it. Appending doesn't
> > >always work, as per the example above with 51e158c12aca3c9a.
> > 
> > So, are you now suggesting that we put both "elfcorehdr=" and
> > "usable-memory=" under /chosen in dtb? That's fair enough.
> > (as far as nobody cares about incompatibility with other archs.)
> 
> You may need move is_kdump_kernel as a weak function so that in arm64 you
> can still use it in kdump kernel.

Ignore the comment please, you just use a different way to set the variable
is_kdump_kernel will still work..

It is pretty fine to use dtb along with usable memories then.

Thanks
Dave

AKASHI Takahiro Feb. 2, 2016, 5:18 a.m. UTC | #43

Mark,

On 01/22/2016 08:13 PM, Mark Rutland wrote:
> On Fri, Jan 22, 2016 at 03:23:14PM +0900, AKASHI Takahiro wrote:
>> On 01/21/2016 09:02 PM, Mark Rutland wrote:
>>> On Thu, Jan 21, 2016 at 03:53:42PM +0900, AKASHI Takahiro wrote:
>>>> On 01/20/2016 08:49 PM, Mark Rutland wrote:
>>>>> On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
>>>>>> On 01/20/2016 11:49 AM, Dave Young wrote:
>>>>>>> Firmware do not know kernel endianniess, kernel should respect firmware
>>>>>>> maps and adapt to it, it sounds like a generic issue not specfic to kexec.
>>>>>>
>>>>>> On arm64, a kernel image header has a bit field to specify the image's endianness.
>>>>>> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
>>>>>
>>>>> The firmware should _never_ care about the kernel's endianness. The
>>>>> bootlaoder or first kernel shouldn't care about the next kernel's
>>>>> endianness apart from in exceptional circumstances. The DTB for a LE
>>>>> kernel should look identical to that passed to a BE kernel.
>>>>
>>>> Please note that I didn't say anything different from your last two statements.
>>>> The current arm64 kexec implementation doesn't do anything specific to BE,
>>>> but as far as BE kernel doesn't support UEFI, users are responsible for
>>>> providing a proper dtb.
>>>
>>> I'm just confused as to what you mean by a "proper dtb" in that case.
>>>
>>> If you just mean one with memory nodes hacked in, then that would
>>> currently be a way to make that work, yes.
>>
>> One of useful cases that I have in my mind is kdump.
>> We may want to use a small sub-set of dtb, especially devices, to
>> make the reboot more reliable. Device drivers are likely to be vulnerable
>> at crash.
>
> I don't think that we can reliably have userspace carve out devices from
> the DTB or from ACPI tables in order to achieve that. That's going to
> end up complex and/or incomplete. We also can't do this in the
> kexec_load_file / Secure Boot case.
>
> That's not to say we cannot try, as it's possible when using kexec_load.
> However, it's only going to be possible on a subset of systems, and it
> would probably make sense to reserve this approach to those cases we
> cannot work around by other means (e.g. whitelisting "safe" devices in
> the kdump kernel, forcing explicit resets, etc).
>
>>> It seems like the better option is to fix the BE kernel to support a
>>> UEFI memory map, as that solves other issues.
>>
>> Why did Ard throw away his patch?
>
> In the absence of kexec it wasn't necessary, it only supported a subset
> of the runtime services (and no other features like DMI IIRC), and it
> looked like it would be painful to debug (if something went wrong while
> a CPU was in LE mode, we couldn't even panic()).
>
> Given BE kernels on UEFI were never supported until that point, there
> wasn't a compelling reason to support that case.
>
> Even if we support the UEFI memory map, I don't think it's worth the
> effort to support runtime services, ACPI, and related code that's only
> ever been tested on LE. So realistically this would only work on systems
> using UEFI && DT rather than UEFI && ACPI.
>
>> So, are you now suggesting that we put both "elfcorehdr=" and
>> "usable-memory=" under /chosen in dtb?
>
> Yes.
>
>> That's fair enough.  (as far as nobody cares about incompatibility
>> with other archs.)
>
> Glad to hear! :)

I'm preparing for a new version based on our discussions.
Do you think that UEFI memory map support on BE kernel is a prerequisite
for accepting my kdump?

-Takahiro AKASHI

> Thanks,
> Mark.
>

[18/19] arm64: kdump: update a kernel doc

Commit Message

Comments

Patch