diff mbox series

arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo

Message ID 1548850991-11879-1-git-send-email-bhsharma@redhat.com (mailing list archive)
State New, archived
Headers show
Series arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo | expand

Commit Message

Bhupesh Sharma Jan. 30, 2019, 12:23 p.m. UTC
With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
supports these extensions can support upto 52-bit virtual and 52-bit
physical addresses respectively.

Since at the moment we enable the support of these extensions via CONFIG
flags, e.g.
 - LPA via CONFIG_ARM64_PA_BITS_52

there are no clear mechanisms in user-space right now to
deteremine these CONFIG flag values and also determine the PARange and
VARange address values.

User-space tools like 'makedumpfile' and 'crash-utility' can instead
use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
the maximum virtual address and physical address (respectively)
supported by underlying kernel.

A reference 'makedumpfile' implementation which uses this approach to
determining the maximum physical address is available in [0].

[0]. https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490

Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
---
 arch/arm64/kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

James Morse Jan. 30, 2019, 3:21 p.m. UTC | #1
Hi Bhupesh,

On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:
> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> supports these extensions can support upto 52-bit virtual and 52-bit
> physical addresses respectively.
> 
> Since at the moment we enable the support of these extensions via CONFIG
> flags, e.g.
>   - LPA via CONFIG_ARM64_PA_BITS_52
> 
> there are no clear mechanisms in user-space right now to
> deteremine these CONFIG flag values and also determine the PARange and
> VARange address values.
> User-space tools like 'makedumpfile' and 'crash-utility' can instead
> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> the maximum virtual address and physical address (respectively)
> supported by underlying kernel.
> 
> A reference 'makedumpfile' implementation which uses this approach to
> determining the maximum physical address is available in [0].

Why does it need to know?

(Suzuki asked the same question on your earlier version)
https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c821@arm.com/


 From your github link it looks like you use this to re-assemble the two 
bits of the PFN from the pte. Can't you always do this for 64K pages? 
CPUs with the feature always do this too, its not something the kernel 
turns on.


Thanks,

James


> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index ca4c3e12d8c5..ad231be5c0d8 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -10,6 +10,8 @@
>   void arch_crash_save_vmcoreinfo(void)
>   {
>   	VMCOREINFO_NUMBER(VA_BITS);
> +	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
> +	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>   	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>   	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
>   						kimage_voffset);
>
Bhupesh Sharma Jan. 30, 2019, 9:39 p.m. UTC | #2
Hi James,

Thanks for review.
Please see my comments inline.


On 01/30/2019 08:51 PM, James Morse wrote:
> Hi Bhupesh,
> 
> On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:
>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>> supports these extensions can support upto 52-bit virtual and 52-bit
>> physical addresses respectively.
>>
>> Since at the moment we enable the support of these extensions via CONFIG
>> flags, e.g.
>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>
>> there are no clear mechanisms in user-space right now to
>> deteremine these CONFIG flag values and also determine the PARange and
>> VARange address values.
>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>> the maximum virtual address and physical address (respectively)
>> supported by underlying kernel.
>>
>> A reference 'makedumpfile' implementation which uses this approach to
>> determining the maximum physical address is available in [0].
> 
> Why does it need to know?
> 
> (Suzuki asked the same question on your earlier version)
> https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c821@arm.com/ 

I have shared some details (after discussion with our test teams) in 
reply to the review comments from Suzuki here:
http://lists.infradead.org/pipermail/kexec/2019-January/022389.html, and
http://lists.infradead.org/pipermail/kexec/2019-January/022390.html

Just to summarize, I mentioned in my replies to the review comments tha 
the makedumpfile implementation (for decoding the PTE) was just as an 
example, however there can be other user-space applications, for e.g a 
user-space application running with 48-bit kernel VA and 52-bit user 
space VA and requesting allocation in 'high' address via a 'hint' to mmap.

>  From your github link it looks like you use this to re-assemble the two 
> bits of the PFN from the pte. Can't you always do this for 64K pages? 
> CPUs with the feature always do this too, its not something the kernel 
> turns on.

Ok, let me try to give some perspective of a common makedumpfile 
use-case before I jump into the details:

(a) makedumpfile tool can be used to generate a vmcore and analyze it 
later. So for example we can create vmcore for a system running with 
page-size = 64K and analyze it later on a different system using 
page-size = 4K.

Since several makedumpfile code legs (for page-table walk) are common in 
both the paths (creating a vmcore and analyzing a vmcore), we cannot 
hardcode the PTE calculation masks for either 48-bit or 52-bit address 
spaces (or 4K/64K page sizes). The example invocations for the two cases 
is given below:

Create a vmcore dump on a 64K machine:
# makedumpfile -l --message-level 1 -d 31 /proc/vmcore vmcore

Analyze the vmcore dump on a 4K machine:
# makedumpfile -d 31 -x vmlinux vmcore dumpfile

Also hardcoding the PTE calculation to use the high address bit mask 
always will break the backward compatibility with older kernels (which 
don't support 52-bit address space extensions).

(b). Also x86_64 already has a vmcoreinfo export for 'pgtable_l5_enabled':

void arch_crash_save_vmcoreinfo(void)
{
	<.. snip..>
	vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
			pgtable_l5_enabled());
}

And the makedumpfile code uses the same to determine support for 5-level 
page tables in x86_64, see 
<https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/x86_64.c#L36> 
for example.

Thanks,
Bhupesh

> 
>> diff --git a/arch/arm64/kernel/crash_core.c 
>> b/arch/arm64/kernel/crash_core.c
>> index ca4c3e12d8c5..ad231be5c0d8 100644
>> --- a/arch/arm64/kernel/crash_core.c
>> +++ b/arch/arm64/kernel/crash_core.c
>> @@ -10,6 +10,8 @@
>>   void arch_crash_save_vmcoreinfo(void)
>>   {
>>       VMCOREINFO_NUMBER(VA_BITS);
>> +    VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
>> +    VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>>       /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>>       vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
>>                           kimage_voffset);
>>
>
Dave Young Jan. 31, 2019, 1:48 a.m. UTC | #3
+ more people
On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> supports these extensions can support upto 52-bit virtual and 52-bit
> physical addresses respectively.
> 
> Since at the moment we enable the support of these extensions via CONFIG
> flags, e.g.
>  - LPA via CONFIG_ARM64_PA_BITS_52
> 
> there are no clear mechanisms in user-space right now to
> deteremine these CONFIG flag values and also determine the PARange and
> VARange address values.
> 
> User-space tools like 'makedumpfile' and 'crash-utility' can instead
> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> the maximum virtual address and physical address (respectively)
> supported by underlying kernel.
> 
> A reference 'makedumpfile' implementation which uses this approach to
> determining the maximum physical address is available in [0].
> 
> [0]. https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490

I'm not objecting the patch, just want to make sure to make clear about
things and make sure these issues are aware by people, and leave arm
people to review the arm bits.

1. MAX_PHYSMEM_BITS
As we previously found, back to 2014 makedumpfile took a patch to read the
value from vmcore but the kernel patch was not accepted.
So we should first make clear if this is really needed, why other arches
do not need this in makedumpfile.

If we really need it then should it be arm64 only?

If it is arm64 only then the makedumpfile code should read this number
only for arm64.

Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
tree,  need to make sure to document this as well.

2. MAX_USER_VA_BITS
Does makedumpfile care about userspace VA bits?  I do not see other code
doing this,  Kazu and Dave A should be able to comment.

I tend to doubt about this.

> 
> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: James Morse <james.morse@arm.com>
> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> ---
>  arch/arm64/kernel/crash_core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index ca4c3e12d8c5..ad231be5c0d8 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -10,6 +10,8 @@
>  void arch_crash_save_vmcoreinfo(void)
>  {
>  	VMCOREINFO_NUMBER(VA_BITS);
> +	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
> +	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>  	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
>  						kimage_voffset);
> -- 
> 2.7.4
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave
Bhupesh Sharma Jan. 31, 2019, 10 a.m. UTC | #4
On 01/31/2019 07:18 AM, Dave Young wrote:
> + more people
> On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>> supports these extensions can support upto 52-bit virtual and 52-bit
>> physical addresses respectively.
>>
>> Since at the moment we enable the support of these extensions via CONFIG
>> flags, e.g.
>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>
>> there are no clear mechanisms in user-space right now to
>> deteremine these CONFIG flag values and also determine the PARange and
>> VARange address values.
>>
>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>> the maximum virtual address and physical address (respectively)
>> supported by underlying kernel.
>>
>> A reference 'makedumpfile' implementation which uses this approach to
>> determining the maximum physical address is available in [0].
>>
>> [0]. https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
> 
> I'm not objecting the patch, just want to make sure to make clear about
> things and make sure these issues are aware by people, and leave arm
> people to review the arm bits.
> 
> 1. MAX_PHYSMEM_BITS
> As we previously found, back to 2014 makedumpfile took a patch to read the
> value from vmcore but the kernel patch was not accepted.
> So we should first make clear if this is really needed, why other arches
> do not need this in makedumpfile.

I explained this a bit in my reply to Suzuki's and James's review 
comments yesterday, but let me summarize the same again for better clarity:

Let's take the example of x86. We have CONFIG_X86_5LEVEL config flag to 
indicate 5 level page-table support. We export the same in vmcoreinfo 
for x86_64 using:

void arch_crash_save_vmcoreinfo(void)
{
     <.. snip..>
     vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
             pgtable_l5_enabled());
}

Also a simple grep in makedumpfile and crash for the same indicates that 
the user-space code determines the MAX_PHYSMEM_BITS value using the 
'pgtable_l5_enabled' value available in vmcoreinfo:

Example from makedumpfile:
-------------------------
int
get_versiondep_info_x86_64(void)
{
	/*
	 * On linux-2.6.26, MAX_PHYSMEM_BITS is changed to 44 from 40.
	 */
	if (info->kernel_version < KERNEL_VERSION(2, 6, 26))
		info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG;
	else if (info->kernel_version < KERNEL_VERSION(2, 6, 31))
		info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_26;
	else if(check_5level_paging())
		info->max_physmem_bits  = _MAX_PHYSMEM_BITS_5LEVEL;
	else
		info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_31;

	...
}

As we can see above, we use several if-else cases to determine the 
'MAX_PHYSMEM_BITS', one of which is setting it to 52-bit, if the 
'pgtable_l5_enabled' value is available and TRUE in vmcoreinfo.

So, we determine 'MAX_PHYSMEM_BITS' value in makedumpfile via a 
vmcoreinfo export'ed variable (rather than using  named 
'MAX_PHYSMEM_BITS' its 'pgtable_l5_enabled' that is being used).

Since for arm64, we don't have a single CONFIG flag for 52-bit addresses 
spaces (kernel VA, user-space VA and PA), so its better to export the 
respective CONFIG flags in the vmcoreinfo directly.

> If we really need it then should it be arm64 only?

See above, since archs like x86 use a single flag: CONFIG_X86_5LEVEL, 
whereas arm64 can use the combination of following flags to indicate 
combinations of various address spaces:

- 48-bit kernel VA + 48-bit user-space VA + 52-bit PA
- 48-bit kernel VA + 52-bit user-space VA + 52-bit PA
- 52-bit kernel VA + 52-bit user-space VA + 52-bit PA


CONFIG_ARM64_64K_PAGES
CONFIG_ARM64_USER_VA_BITS_52
CONFIG_ARM64_VA_BITS
CONFIG_ARM64_PA_BITS_52
CONFIG_ARM64_PA_BITS
CONFIG_EXPERT, and
CONFIG_ARM64_FORCE_52BIT

so probably its not correct to compare the two cases one-to-one (its 
more like an apple and an orange comparison).

> If it is arm64 only then the makedumpfile code should read this number
> only for arm64.
> 
> Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
> tree,  need to make sure to document this as well.

Sure, I will send a separate patch to fix the same, once this gets in.

> 
> 2. MAX_USER_VA_BITS
> Does makedumpfile care about userspace VA bits? 

Yes. Consider the case 48-bit kernel VA and 52-bit user-space VA, which 
is a perfectly valid case on arm64. In such cases VA_BITS is set to 48, 
whereas  MAX_USER_VA_BITS is set to 52, which allows user-space 
applications which use 52-bit virtual address to pass a hint to 'mmap' 
to get high addresses.

  I do not see other code
> doing this,  Kazu and Dave A should be able to comment.

I talked to Dave A. yesterday off-list, I think he mentioned that these 
changes are useful for crash-utility as well and he was hoping it gets 
accepted soon so that kernel-debugging tools can handle increased 
address spaces on arm64.

But it will be great to have reviews/ACKs from Dave A and others as well.

Thanks,
Bhupesh

> 
> I tend to doubt about this.
> 
>>
>> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: James Morse <james.morse@arm.com>
>> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
>> ---
>>   arch/arm64/kernel/crash_core.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
>> index ca4c3e12d8c5..ad231be5c0d8 100644
>> --- a/arch/arm64/kernel/crash_core.c
>> +++ b/arch/arm64/kernel/crash_core.c
>> @@ -10,6 +10,8 @@
>>   void arch_crash_save_vmcoreinfo(void)
>>   {
>>   	VMCOREINFO_NUMBER(VA_BITS);
>> +	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
>> +	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>>   	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>>   	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
>>   						kimage_voffset);
>> -- 
>> 2.7.4
>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
> 
> Thanks
> Dave
>
Dave Anderson Jan. 31, 2019, 2:03 p.m. UTC | #5
----- Original Message -----
> + more people
> On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
> > With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> > supports these extensions can support upto 52-bit virtual and 52-bit
> > physical addresses respectively.
> > 
> > Since at the moment we enable the support of these extensions via CONFIG
> > flags, e.g.
> >  - LPA via CONFIG_ARM64_PA_BITS_52
> > 
> > there are no clear mechanisms in user-space right now to
> > deteremine these CONFIG flag values and also determine the PARange and
> > VARange address values.
> > 
> > User-space tools like 'makedumpfile' and 'crash-utility' can instead
> > use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> > the maximum virtual address and physical address (respectively)
> > supported by underlying kernel.
> > 
> > A reference 'makedumpfile' implementation which uses this approach to
> > determining the maximum physical address is available in [0].
> > 
> > [0].
> > https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
> 
> I'm not objecting the patch, just want to make sure to make clear about
> things and make sure these issues are aware by people, and leave arm
> people to review the arm bits.
> 
> 1. MAX_PHYSMEM_BITS
> As we previously found, back to 2014 makedumpfile took a patch to read the
> value from vmcore but the kernel patch was not accepted.
> So we should first make clear if this is really needed, why other arches
> do not need this in makedumpfile.
> 
> If we really need it then should it be arm64 only?
> 
> If it is arm64 only then the makedumpfile code should read this number
> only for arm64.

With respect to the crash utility, this is the first time/architecture where the
determination of MAX_PHYSMEM_BITS cannot be determined.  

> 
> Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
> tree,  need to make sure to document this as well.
> 
> 2. MAX_USER_VA_BITS
> Does makedumpfile care about userspace VA bits?  I do not see other code
> doing this,  Kazu and Dave A should be able to comment.
> 
> I tend to doubt about this.

With respect to the crash utility, this won't be required because there is
the exported "vabits_user" variable.

Dave Anderson


> 
> >  
> > Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: James Morse <james.morse@arm.com>
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> > ---
> >  arch/arm64/kernel/crash_core.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/crash_core.c
> > b/arch/arm64/kernel/crash_core.c
> > index ca4c3e12d8c5..ad231be5c0d8 100644
> > --- a/arch/arm64/kernel/crash_core.c
> > +++ b/arch/arm64/kernel/crash_core.c
> > @@ -10,6 +10,8 @@
> >  void arch_crash_save_vmcoreinfo(void)
> >  {
> >  	VMCOREINFO_NUMBER(VA_BITS);
> > +	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
> > +	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
> >  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
> >  	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
> >  						kimage_voffset);
> > --
> > 2.7.4
> > 
> > 
> > _______________________________________________
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> 
> Thanks
> Dave
>
Bhupesh Sharma Feb. 4, 2019, 2:35 p.m. UTC | #6
Hi James, All,

On 01/31/2019 03:09 AM, Bhupesh Sharma wrote:
> Hi James,
> 
> Thanks for review.
> Please see my comments inline.
> 
> 
> On 01/30/2019 08:51 PM, James Morse wrote:
>> Hi Bhupesh,
>>
>> On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:
>>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>>> supports these extensions can support upto 52-bit virtual and 52-bit
>>> physical addresses respectively.
>>>
>>> Since at the moment we enable the support of these extensions via CONFIG
>>> flags, e.g.
>>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>>
>>> there are no clear mechanisms in user-space right now to
>>> deteremine these CONFIG flag values and also determine the PARange and
>>> VARange address values.
>>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>>> the maximum virtual address and physical address (respectively)
>>> supported by underlying kernel.
>>>
>>> A reference 'makedumpfile' implementation which uses this approach to
>>> determining the maximum physical address is available in [0].
>>
>> Why does it need to know?
>>
>> (Suzuki asked the same question on your earlier version)
>> https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c821@arm.com/ 
> 
> 
> I have shared some details (after discussion with our test teams) in 
> reply to the review comments from Suzuki here:
> http://lists.infradead.org/pipermail/kexec/2019-January/022389.html, and
> http://lists.infradead.org/pipermail/kexec/2019-January/022390.html
> 
> Just to summarize, I mentioned in my replies to the review comments tha 
> the makedumpfile implementation (for decoding the PTE) was just as an 
> example, however there can be other user-space applications, for e.g a 
> user-space application running with 48-bit kernel VA and 52-bit user 
> space VA and requesting allocation in 'high' address via a 'hint' to mmap.
> 
>>  From your github link it looks like you use this to re-assemble the 
>> two bits of the PFN from the pte. Can't you always do this for 64K 
>> pages? CPUs with the feature always do this too, its not something the 
>> kernel turns on.
> 
> Ok, let me try to give some perspective of a common makedumpfile 
> use-case before I jump into the details:
> 
> (a) makedumpfile tool can be used to generate a vmcore and analyze it 
> later. So for example we can create vmcore for a system running with 
> page-size = 64K and analyze it later on a different system using 
> page-size = 4K.
> 
> Since several makedumpfile code legs (for page-table walk) are common in 
> both the paths (creating a vmcore and analyzing a vmcore), we cannot 
> hardcode the PTE calculation masks for either 48-bit or 52-bit address 
> spaces (or 4K/64K page sizes). The example invocations for the two cases 
> is given below:
> 
> Create a vmcore dump on a 64K machine:
> # makedumpfile -l --message-level 1 -d 31 /proc/vmcore vmcore
> 
> Analyze the vmcore dump on a 4K machine:
> # makedumpfile -d 31 -x vmlinux vmcore dumpfile
> 
> Also hardcoding the PTE calculation to use the high address bit mask 
> always will break the backward compatibility with older kernels (which 
> don't support 52-bit address space extensions).
> 
> (b). Also x86_64 already has a vmcoreinfo export for 'pgtable_l5_enabled':
> 
> void arch_crash_save_vmcoreinfo(void)
> {
>      <.. snip..>
>      vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>              pgtable_l5_enabled());
> }
> 
> And the makedumpfile code uses the same to determine support for 5-level 
> page tables in x86_64, see 
> <https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/x86_64.c#L36> 
> for example.

Ping. Since this patch fixes a regression with user-space tools like 
makedumpfile and crash-utility which are broken since arm64 kernels 
with 52-bit VA and PA support are available (and distributions which 
enable them), would request review comments/ack on this simple change.

Thanks,
Bhupesh

>>
>>> diff --git a/arch/arm64/kernel/crash_core.c 
>>> b/arch/arm64/kernel/crash_core.c
>>> index ca4c3e12d8c5..ad231be5c0d8 100644
>>> --- a/arch/arm64/kernel/crash_core.c
>>> +++ b/arch/arm64/kernel/crash_core.c
>>> @@ -10,6 +10,8 @@
>>>   void arch_crash_save_vmcoreinfo(void)
>>>   {
>>>       VMCOREINFO_NUMBER(VA_BITS);
>>> +    VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
>>> +    VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>>>       /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>>>       vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
>>>                           kimage_voffset);
>>>
>>
>
Robin Murphy Feb. 4, 2019, 3:31 p.m. UTC | #7
On 04/02/2019 14:35, Bhupesh Sharma wrote:
[...]
>> Also hardcoding the PTE calculation to use the high address bit mask 
>> always will break the backward compatibility with older kernels (which 
>> don't support 52-bit address space extensions).

No it won't. There's no difference between an old kernel, a new kernel 
on a CPU without ARMv8.2-LPA, or a new kernel on a CPU with ARMv8.2-LPA 
in a system which happens to have less than 49 bits of physical memory 
map - in all those cases the relevant bits are either RES0 or just 
actually 0 in the PTE, so replacing 4 bits of zeros with 4 bits of other 
zeros in the final physical address has no effect whatsoever other than 
taking a couple of extra instructions to perform.

If you're running a 64K page kernel on a system with an SMMU, note how 
that's already been "broken" for nearly a year now ;)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/io-pgtable-arm.c#n211


Robin.
Kazuhito Hagio Feb. 4, 2019, 4:04 p.m. UTC | #8
On 1/30/2019 8:48 PM, Dave Young wrote:
> + more people
> On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
> > With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> > supports these extensions can support upto 52-bit virtual and 52-bit
> > physical addresses respectively.
> >
> > Since at the moment we enable the support of these extensions via CONFIG
> > flags, e.g.
> >  - LPA via CONFIG_ARM64_PA_BITS_52
> >
> > there are no clear mechanisms in user-space right now to
> > deteremine these CONFIG flag values and also determine the PARange and
> > VARange address values.
> >
> > User-space tools like 'makedumpfile' and 'crash-utility' can instead
> > use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> > the maximum virtual address and physical address (respectively)
> > supported by underlying kernel.
> >
> > A reference 'makedumpfile' implementation which uses this approach to
> > determining the maximum physical address is available in [0].
> >
> > [0].
> https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
> 
> I'm not objecting the patch, just want to make sure to make clear about
> things and make sure these issues are aware by people, and leave arm
> people to review the arm bits.
> 
> 1. MAX_PHYSMEM_BITS
> As we previously found, back to 2014 makedumpfile took a patch to read the
> value from vmcore but the kernel patch was not accepted.
> So we should first make clear if this is really needed, why other arches
> do not need this in makedumpfile.
> 
> If we really need it then should it be arm64 only?
> 
> If it is arm64 only then the makedumpfile code should read this number
> only for arm64.

Sorry for the delay.

According to the kernel patch, some of arm32 platforms may need it
http://lists.infradead.org/pipermail/kexec/2014-May/011909.html
but except for them (and arm64), makedumpfile can manage with kernel
version and some switches to determine this value so far.

> 
> Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
> tree,  need to make sure to document this as well.
> 
> 2. MAX_USER_VA_BITS
> Does makedumpfile care about userspace VA bits?  I do not see other code
> doing this,  Kazu and Dave A should be able to comment.

The mapping makedumpfile uses on arm64 is swapper_pg_dir only, so
unless the config affects its structure or something, makedumpfile
will not need this value.

Thanks,
Kazu

> 
> I tend to doubt about this.
> 
> >
> > Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: James Morse <james.morse@arm.com>
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> > ---
> >  arch/arm64/kernel/crash_core.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> > index ca4c3e12d8c5..ad231be5c0d8 100644
> > --- a/arch/arm64/kernel/crash_core.c
> > +++ b/arch/arm64/kernel/crash_core.c
> > @@ -10,6 +10,8 @@
> >  void arch_crash_save_vmcoreinfo(void)
> >  {
> >  	VMCOREINFO_NUMBER(VA_BITS);
> > +	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
> > +	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
> >  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
> >  	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
> >  						kimage_voffset);
> > --
> > 2.7.4
> >
> >
> > _______________________________________________
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> 
> Thanks
> Dave
James Morse Feb. 4, 2019, 4:56 p.m. UTC | #9
Hi Bhupesh,

On 04/02/2019 14:35, Bhupesh Sharma wrote:
> On 01/31/2019 03:09 AM, Bhupesh Sharma wrote:
>> On 01/30/2019 08:51 PM, James Morse wrote:
>>> On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:
>>>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>>>> supports these extensions can support upto 52-bit virtual and 52-bit
>>>> physical addresses respectively.
>>>>
>>>> Since at the moment we enable the support of these extensions via CONFIG
>>>> flags, e.g.
>>>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>>>
>>>> there are no clear mechanisms in user-space right now to
>>>> deteremine these CONFIG flag values and also determine the PARange and
>>>> VARange address values.
>>>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>>>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>>>> the maximum virtual address and physical address (respectively)
>>>> supported by underlying kernel.
>>>>
>>>> A reference 'makedumpfile' implementation which uses this approach to
>>>> determining the maximum physical address is available in [0].
>>>
>>> Why does it need to know?
>>>
>>> (Suzuki asked the same question on your earlier version)
>>> https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c821@arm.com/


>> I have shared some details (after discussion with our test teams) in reply to
>> the review comments from Suzuki here:
>> http://lists.infradead.org/pipermail/kexec/2019-January/022389.html, and
>> http://lists.infradead.org/pipermail/kexec/2019-January/022390.html
>>
>> Just to summarize, I mentioned in my replies to the review comments tha the
>> makedumpfile implementation (for decoding the PTE) was just as an example,
>> however there can be other user-space applications, for e.g a user-space
>> application running with 48-bit kernel VA and 52-bit user space VA and
>> requesting allocation in 'high' address via a 'hint' to mmap.

But vmcoreinfo is the wrong place to expose this information. (it can be
configured off, and is only accessible to root)


>>>  From your github link it looks like you use this to re-assemble the two bits
>>> of the PFN from the pte. Can't you always do this for 64K pages? CPUs with
>>> the feature always do this too, its not something the kernel turns on.
>>
>> Ok, let me try to give some perspective of a common makedumpfile use-case
>> before I jump into the details:


>> Also hardcoding the PTE calculation to use the high address bit mask always
>> will break the backward compatibility with older kernels (which don't support
>> 52-bit address space extensions).

What would go wrong?

The hardware ignores those bits and supplies zero. As far as I can see the
kernel has always generated zero there.


>> (b). Also x86_64 already has a vmcoreinfo export for 'pgtable_l5_enabled':

So? 5-level page tables is a different feature. I agree you need to know the
number of levels to walk the page-tables, but that isn't how arm64's 52bit stuff
works.


> Ping. Since this patch fixes a regression with user-space tools like
> makedumpfile and crash-utility which are broken since arm64 kernels with 52-bit
> VA and PA support are available (and distributions which enable them), would
> request review comments/ack on this simple change.

Broken how? What goes wrong?

I can see how a not-52bit-aware crash/gdb/whatever would be confused by high
bits being set in the physical address, and possibly throw them away.
But once it supports this for 64K pages, I don't see what can go wrong if those
bits aren't set. Why does it need to know?


Thanks,

James
Bhupesh Sharma Feb. 12, 2019, 4:55 a.m. UTC | #10
Hi Robin,

On 02/04/2019 09:01 PM, Robin Murphy wrote:
> On 04/02/2019 14:35, Bhupesh Sharma wrote:
> [...]
>>> Also hardcoding the PTE calculation to use the high address bit mask 
>>> always will break the backward compatibility with older kernels 
>>> (which don't support 52-bit address space extensions).
> 
> No it won't. There's no difference between an old kernel, a new kernel 
> on a CPU without ARMv8.2-LPA, or a new kernel on a CPU with ARMv8.2-LPA 
> in a system which happens to have less than 49 bits of physical memory 
> map - in all those cases the relevant bits are either RES0 or just 
> actually 0 in the PTE, so replacing 4 bits of zeros with 4 bits of other 
> zeros in the final physical address has no effect whatsoever other than 
> taking a couple of extra instructions to perform.

Right, but lets think this from a distribution user p-o-v. Why would a 
user with newer kernel and CPU which doesn't support ARMv8.2 LPA want to 
update a user-space utility? Well unless something is broken in the 
user-space. Requesting an upgrade is easier in this case, as the 
user-space components like crash-utility/kexec-tools (or for that matter 
any user-space tool which requires a page-table walk to determine 
vaddr/paddr values) are indeed broken with this combination.

On the other hand, if a user a having a old kernel, CPU which doesn't 
support LPA, but if we still want them to upgrade to a newer version of 
user-space utility (with a upgrade fix note reading something like: "Add 
52-bit PA support"), surely that would not be an ideal case requiring an 
upgrade as his underlying environment doesn't support the same.

That's what I meant when I said that we need to take care of the 
backward compatibility also when we propose new code changes to 
user-space packages.

Thanks,
Bhupesh

> If you're running a 64K page kernel on a system with an SMMU, note how 
> that's already been "broken" for nearly a year now ;)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/io-pgtable-arm.c#n211 
> 
> 
> 
> Robin.
Bhupesh Sharma Feb. 12, 2019, 5:07 a.m. UTC | #11
Hi Kazu,

On 02/04/2019 09:34 PM, Kazuhito Hagio wrote:
> On 1/30/2019 8:48 PM, Dave Young wrote:
>> + more people
>> On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
>>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>>> supports these extensions can support upto 52-bit virtual and 52-bit
>>> physical addresses respectively.
>>>
>>> Since at the moment we enable the support of these extensions via CONFIG
>>> flags, e.g.
>>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>>
>>> there are no clear mechanisms in user-space right now to
>>> deteremine these CONFIG flag values and also determine the PARange and
>>> VARange address values.
>>>
>>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>>> the maximum virtual address and physical address (respectively)
>>> supported by underlying kernel.
>>>
>>> A reference 'makedumpfile' implementation which uses this approach to
>>> determining the maximum physical address is available in [0].
>>>
>>> [0].
>> https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
>>
>> I'm not objecting the patch, just want to make sure to make clear about
>> things and make sure these issues are aware by people, and leave arm
>> people to review the arm bits.
>>
>> 1. MAX_PHYSMEM_BITS
>> As we previously found, back to 2014 makedumpfile took a patch to read the
>> value from vmcore but the kernel patch was not accepted.
>> So we should first make clear if this is really needed, why other arches
>> do not need this in makedumpfile.
>>
>> If we really need it then should it be arm64 only?
>>
>> If it is arm64 only then the makedumpfile code should read this number
>> only for arm64.
> 
> Sorry for the delay.
> 
> According to the kernel patch, some of arm32 platforms may need it
> http://lists.infradead.org/pipermail/kexec/2014-May/011909.html
> but except for them (and arm64), makedumpfile can manage with kernel
> version and some switches to determine this value so far.
> 
>>
>> Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
>> tree,  need to make sure to document this as well.
>>
>> 2. MAX_USER_VA_BITS
>> Does makedumpfile care about userspace VA bits?  I do not see other code
>> doing this,  Kazu and Dave A should be able to comment.
> 
> The mapping makedumpfile uses on arm64 is swapper_pg_dir only, so
> unless the config affects its structure or something, makedumpfile
> will not need this value.

I captured this case in more details while sending out the makedumpfile  
enablement patch for ARMv8.2-LVA (see [0]), but here is a brief summary  
on the same:

Since at the moment we enable the support of the ARMv8.2-LVA extension  
for 52-bit user-space VA in the kernel via a CONFIG flags  
(CONFIG_ARM64_USER_VA_BITS_52), so there are no clear mechanisms in  
user-space to determine this CONFIG
flag value and use it to determine the address range values.

Since 'VA_BITS' are already exported via vmcoreinfo, if we export  
'MAX_USER_VA_BITS' as well, we can use the same in user-space to check  
if the 'MAX_USER_VA_BITS' value is greater than 'VA_BITS'. If yes, then  
we are running a use-case where user-space is 52-bit while the  
underlying kernel is still 48-bit.

The increased 'PTRS_PER_PGD' value for such cases needs to be then  
calculated as is done by the underlying kernel (see  
'arch/arm64/include/asm/pgtable-hwdef.h' for details):

#define PTRS_PER_PGD		(1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))

Also, note that 'arch/arm64/include/asm/memory.h' defines  
'MAX_USER_VA_BITS' as 'VA_BITS' in case 'CONFIG_ARM64_USER_VA_BITS_52'  
is set to 'n':

#ifdef CONFIG_ARM64_USER_VA_BITS_52
#define MAX_USER_VA_BITS	52
#else
#define MAX_USER_VA_BITS	VA_BITS
#endif

So, makedumpfile will need this symbol exported in vmcore to make the  
above determination.

[0]. http://lists.infradead.org/pipermail/kexec/2019-February/022425.html

Thanks,
Bhupesh
Dave Young Feb. 12, 2019, 10:44 a.m. UTC | #12
On 02/12/19 at 10:37am, Bhupesh Sharma wrote:
> Hi Kazu,
> 
> On 02/04/2019 09:34 PM, Kazuhito Hagio wrote:
> > On 1/30/2019 8:48 PM, Dave Young wrote:
> > > + more people
> > > On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
> > > > With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> > > > supports these extensions can support upto 52-bit virtual and 52-bit
> > > > physical addresses respectively.
> > > > 
> > > > Since at the moment we enable the support of these extensions via CONFIG
> > > > flags, e.g.
> > > >   - LPA via CONFIG_ARM64_PA_BITS_52
> > > > 
> > > > there are no clear mechanisms in user-space right now to
> > > > deteremine these CONFIG flag values and also determine the PARange and
> > > > VARange address values.
> > > > 
> > > > User-space tools like 'makedumpfile' and 'crash-utility' can instead
> > > > use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> > > > the maximum virtual address and physical address (respectively)
> > > > supported by underlying kernel.
> > > > 
> > > > A reference 'makedumpfile' implementation which uses this approach to
> > > > determining the maximum physical address is available in [0].
> > > > 
> > > > [0].
> > > https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
> > > 
> > > I'm not objecting the patch, just want to make sure to make clear about
> > > things and make sure these issues are aware by people, and leave arm
> > > people to review the arm bits.
> > > 
> > > 1. MAX_PHYSMEM_BITS
> > > As we previously found, back to 2014 makedumpfile took a patch to read the
> > > value from vmcore but the kernel patch was not accepted.
> > > So we should first make clear if this is really needed, why other arches
> > > do not need this in makedumpfile.
> > > 
> > > If we really need it then should it be arm64 only?
> > > 
> > > If it is arm64 only then the makedumpfile code should read this number
> > > only for arm64.
> > 
> > Sorry for the delay.
> > 
> > According to the kernel patch, some of arm32 platforms may need it
> > http://lists.infradead.org/pipermail/kexec/2014-May/011909.html
> > but except for them (and arm64), makedumpfile can manage with kernel
> > version and some switches to determine this value so far.
> > 
> > > 
> > > Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
> > > tree,  need to make sure to document this as well.
> > > 
> > > 2. MAX_USER_VA_BITS
> > > Does makedumpfile care about userspace VA bits?  I do not see other code
> > > doing this,  Kazu and Dave A should be able to comment.
> > 
> > The mapping makedumpfile uses on arm64 is swapper_pg_dir only, so
> > unless the config affects its structure or something, makedumpfile
> > will not need this value.
> 
> I captured this case in more details while sending out the makedumpfile
> enablement patch for ARMv8.2-LVA (see [0]), but here is a brief summary on
> the same:
> 
> Since at the moment we enable the support of the ARMv8.2-LVA extension for
> 52-bit user-space VA in the kernel via a CONFIG flags
> (CONFIG_ARM64_USER_VA_BITS_52), so there are no clear mechanisms in
> user-space to determine this CONFIG
> flag value and use it to determine the address range values.
> 
> Since 'VA_BITS' are already exported via vmcoreinfo, if we export
> 'MAX_USER_VA_BITS' as well, we can use the same in user-space to check if
> the 'MAX_USER_VA_BITS' value is greater than 'VA_BITS'. If yes, then we are
> running a use-case where user-space is 52-bit while the underlying kernel is
> still 48-bit.

Problem is why this is needed, it sounds like you are talking about some
non-exist use case.

> 
> The increased 'PTRS_PER_PGD' value for such cases needs to be then
> calculated as is done by the underlying kernel (see
> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> 
> #define PTRS_PER_PGD		(1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> 
> Also, note that 'arch/arm64/include/asm/memory.h' defines 'MAX_USER_VA_BITS'
> as 'VA_BITS' in case 'CONFIG_ARM64_USER_VA_BITS_52' is set to 'n':
> 
> #ifdef CONFIG_ARM64_USER_VA_BITS_52
> #define MAX_USER_VA_BITS	52
> #else
> #define MAX_USER_VA_BITS	VA_BITS
> #endif
> 
> So, makedumpfile will need this symbol exported in vmcore to make the above
> determination.
> 
> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022425.html
> 
> Thanks,
> Bhupesh

Thanks
Dave
Robin Murphy Feb. 12, 2019, 10:49 a.m. UTC | #13
On 12/02/2019 04:55, Bhupesh Sharma wrote:
> Hi Robin,
> 
> On 02/04/2019 09:01 PM, Robin Murphy wrote:
>> On 04/02/2019 14:35, Bhupesh Sharma wrote:
>> [...]
>>>> Also hardcoding the PTE calculation to use the high address bit mask 
>>>> always will break the backward compatibility with older kernels 
>>>> (which don't support 52-bit address space extensions).
>>
>> No it won't. There's no difference between an old kernel, a new kernel 
>> on a CPU without ARMv8.2-LPA, or a new kernel on a CPU with 
>> ARMv8.2-LPA in a system which happens to have less than 49 bits of 
>> physical memory map - in all those cases the relevant bits are either 
>> RES0 or just actually 0 in the PTE, so replacing 4 bits of zeros with 
>> 4 bits of other zeros in the final physical address has no effect 
>> whatsoever other than taking a couple of extra instructions to perform.
> 
> Right, but lets think this from a distribution user p-o-v. Why would a 
> user with newer kernel and CPU which doesn't support ARMv8.2 LPA want to 
> update a user-space utility? Well unless something is broken in the 
> user-space. Requesting an upgrade is easier in this case, as the 
> user-space components like crash-utility/kexec-tools (or for that matter 
> any user-space tool which requires a page-table walk to determine 
> vaddr/paddr values) are indeed broken with this combination.

Huh? Nothing gets broken unless we go out of our way to break it. The 
point I'm making is that the format of vmcoreinfo *does not* need to 
change in order for the userspace tool to support LPA. There is no 
compatibility issue either way. If a user's machine doesn't support or 
make use of LPA, then yeah, they should absolutely not have to upgrade 
their tools for the sake of LPA regardless of what kernel they use.

> On the other hand, if a user a having a old kernel, CPU which doesn't 
> support LPA, but if we still want them to upgrade to a newer version of 
> user-space utility (with a upgrade fix note reading something like: "Add 
> 52-bit PA support"), surely that would not be an ideal case requiring an 
> upgrade as his underlying environment doesn't support the same.

I don't get why we would "still want them to upgrade" if there's no 
technical reason for them to do so. That sounds like the kind of crap 
that proprietary OS vendors pull :/

> That's what I meant when I said that we need to take care of the 
> backward compatibility also when we propose new code changes to 
> user-space packages.

And what I meant is that you can trivially update the userspace tools to 
support LPA entirely transparently, without requiring kernel changes or 
creating any compatibility issues in any direction.

Robin.
Bhupesh Sharma Feb. 12, 2019, 7:59 p.m. UTC | #14
On Tue, Feb 12, 2019 at 4:14 PM Dave Young <dyoung@redhat.com> wrote:
>
> On 02/12/19 at 10:37am, Bhupesh Sharma wrote:
> > Hi Kazu,
> >
> > On 02/04/2019 09:34 PM, Kazuhito Hagio wrote:
> > > On 1/30/2019 8:48 PM, Dave Young wrote:
> > > > + more people
> > > > On 01/30/19 at 05:53pm, Bhupesh Sharma wrote:
> > > > > With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
> > > > > supports these extensions can support upto 52-bit virtual and 52-bit
> > > > > physical addresses respectively.
> > > > >
> > > > > Since at the moment we enable the support of these extensions via CONFIG
> > > > > flags, e.g.
> > > > >   - LPA via CONFIG_ARM64_PA_BITS_52
> > > > >
> > > > > there are no clear mechanisms in user-space right now to
> > > > > deteremine these CONFIG flag values and also determine the PARange and
> > > > > VARange address values.
> > > > >
> > > > > User-space tools like 'makedumpfile' and 'crash-utility' can instead
> > > > > use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
> > > > > the maximum virtual address and physical address (respectively)
> > > > > supported by underlying kernel.
> > > > >
> > > > > A reference 'makedumpfile' implementation which uses this approach to
> > > > > determining the maximum physical address is available in [0].
> > > > >
> > > > > [0].
> > > > https://github.com/bhupesh-sharma/makedumpfile/blob/52-bit-pa-support-via-vmcore-v1/arch/arm64.c#L490
> > > >
> > > > I'm not objecting the patch, just want to make sure to make clear about
> > > > things and make sure these issues are aware by people, and leave arm
> > > > people to review the arm bits.
> > > >
> > > > 1. MAX_PHYSMEM_BITS
> > > > As we previously found, back to 2014 makedumpfile took a patch to read the
> > > > value from vmcore but the kernel patch was not accepted.
> > > > So we should first make clear if this is really needed, why other arches
> > > > do not need this in makedumpfile.
> > > >
> > > > If we really need it then should it be arm64 only?
> > > >
> > > > If it is arm64 only then the makedumpfile code should read this number
> > > > only for arm64.
> > >
> > > Sorry for the delay.
> > >
> > > According to the kernel patch, some of arm32 platforms may need it
> > > http://lists.infradead.org/pipermail/kexec/2014-May/011909.html
> > > but except for them (and arm64), makedumpfile can manage with kernel
> > > version and some switches to determine this value so far.
> > >
> > > >
> > > > Also Lianbo added the vmcoreinfo documents, I believe it stays in -tip
> > > > tree,  need to make sure to document this as well.
> > > >
> > > > 2. MAX_USER_VA_BITS
> > > > Does makedumpfile care about userspace VA bits?  I do not see other code
> > > > doing this,  Kazu and Dave A should be able to comment.
> > >
> > > The mapping makedumpfile uses on arm64 is swapper_pg_dir only, so
> > > unless the config affects its structure or something, makedumpfile
> > > will not need this value.
> >
> > I captured this case in more details while sending out the makedumpfile
> > enablement patch for ARMv8.2-LVA (see [0]), but here is a brief summary on
> > the same:
> >
> > Since at the moment we enable the support of the ARMv8.2-LVA extension for
> > 52-bit user-space VA in the kernel via a CONFIG flags
> > (CONFIG_ARM64_USER_VA_BITS_52), so there are no clear mechanisms in
> > user-space to determine this CONFIG
> > flag value and use it to determine the address range values.
> >
> > Since 'VA_BITS' are already exported via vmcoreinfo, if we export
> > 'MAX_USER_VA_BITS' as well, we can use the same in user-space to check if
> > the 'MAX_USER_VA_BITS' value is greater than 'VA_BITS'. If yes, then we are
> > running a use-case where user-space is 52-bit while the underlying kernel is
> > still 48-bit.
>
> Problem is why this is needed, it sounds like you are talking about some
> non-exist use case.

I already explained this in an earlier reply to your initial comments
(see <http://lists.infradead.org/pipermail/kexec/2019-January/022395.html>).
Perhaps you missed reading it, so here are the architecturally
possible and kernel supported uses cases with ARMv8.2 extensions
(depending on the combination of CONFIG flags and kernel version):

- 48-bit kernel VA + 48-bit user-space VA + 52-bit PA
- 48-bit kernel VA + 52-bit user-space VA + 52-bit PA
- 52-bit kernel VA + 52-bit user-space VA + 52-bit PA

Please see 'config ARM64_USER_VA_BITS_52'  help text inside
'arch/arm64/Kconfig' for more details:
config ARM64_USER_VA_BITS_52
    bool "52-bit (user)"
    depends on ARM64_64K_PAGES && (ARM64_PAN || !ARM64_SW_TTBR0_PAN)
    help
      Enable 52-bit virtual addressing for userspace when explicitly
      requested via a hint to mmap(). The kernel will continue to
      use 48-bit virtual addresses for its own mappings.

BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
(which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
mentioned that the changes look necessary.

[0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html

Thanks,
Bhupesh

> >
> > The increased 'PTRS_PER_PGD' value for such cases needs to be then
> > calculated as is done by the underlying kernel (see
> > 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> >
> > #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> >
> > Also, note that 'arch/arm64/include/asm/memory.h' defines 'MAX_USER_VA_BITS'
> > as 'VA_BITS' in case 'CONFIG_ARM64_USER_VA_BITS_52' is set to 'n':
> >
> > #ifdef CONFIG_ARM64_USER_VA_BITS_52
> > #define MAX_USER_VA_BITS      52
> > #else
> > #define MAX_USER_VA_BITS      VA_BITS
> > #endif
> >
> > So, makedumpfile will need this symbol exported in vmcore to make the above
> > determination.
> >
> > [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022425.html
> >
> > Thanks,
> > Bhupesh
>
> Thanks
> Dave
Kazuhito Hagio Feb. 12, 2019, 11:03 p.m. UTC | #15
On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
> mentioned that the changes look necessary.
> 
> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html

> > > The increased 'PTRS_PER_PGD' value for such cases needs to be then
> > > calculated as is done by the underlying kernel (see
> > > 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> > >
> > > #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))

Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
It is used for pgd_index() also in makedumpfile to walk page tables.

/* to find an entry in a page-table-directory */
#define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))

Thanks,
Kazu

> > >
> > > Also, note that 'arch/arm64/include/asm/memory.h' defines 'MAX_USER_VA_BITS'
> > > as 'VA_BITS' in case 'CONFIG_ARM64_USER_VA_BITS_52' is set to 'n':
> > >
> > > #ifdef CONFIG_ARM64_USER_VA_BITS_52
> > > #define MAX_USER_VA_BITS      52
> > > #else
> > > #define MAX_USER_VA_BITS      VA_BITS
> > > #endif
> > >
> > > So, makedumpfile will need this symbol exported in vmcore to make the above
> > > determination.
> > >
> > > [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022425.html
> > >
> > > Thanks,
> > > Bhupesh
> >
> > Thanks
> > Dave
Dave Young Feb. 13, 2019, 11:15 a.m. UTC | #16
On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
> 
> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
> > BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
> > (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
> > mentioned that the changes look necessary.
> > 
> > [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
> 
> > > > The increased 'PTRS_PER_PGD' value for such cases needs to be then
> > > > calculated as is done by the underlying kernel (see
> > > > 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> > > >
> > > > #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> 
> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
> It is used for pgd_index() also in makedumpfile to walk page tables.
> 
> /* to find an entry in a page-table-directory */
> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))

Since Dave mentioned crash tool does not need it, but crash should also
travel the pg tables.

If this is really necessary it would be good to describe what will
happen without the patch, eg. some user visible error from an actual test etc.

> 
> Thanks,
> Kazu
> 
> > > >
> > > > Also, note that 'arch/arm64/include/asm/memory.h' defines 'MAX_USER_VA_BITS'
> > > > as 'VA_BITS' in case 'CONFIG_ARM64_USER_VA_BITS_52' is set to 'n':
> > > >
> > > > #ifdef CONFIG_ARM64_USER_VA_BITS_52
> > > > #define MAX_USER_VA_BITS      52
> > > > #else
> > > > #define MAX_USER_VA_BITS      VA_BITS
> > > > #endif
> > > >
> > > > So, makedumpfile will need this symbol exported in vmcore to make the above
> > > > determination.
> > > >
> > > > [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022425.html
> > > >
> > > > Thanks,
> > > > Bhupesh
> > >
> > > Thanks
> > > Dave
>
James Morse Feb. 13, 2019, 6:22 p.m. UTC | #17
Hi guys,

On 13/02/2019 11:15, Dave Young wrote:
> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>> mentioned that the changes look necessary.
>>>
>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>
>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
>>>>> calculated as is done by the underlying kernel

Aha! Nothing to do with which-bits-are-pfn in the tables...

You need to know if the top level PGD is 512bytes or bigger. As we use a
kmem-cache the adjacent data could be some else's page tables.

Is this really a problem though? You can't pull the user-space pgd pointers out
of no-where, you must have walked some task_struct and struct_mm's to find them.
In which case you would have the VMAs on hand to tell you if its in the mapped
user range.

It would be good to avoid putting something arch-specific in here if we can at
all help it.


>>>>> (see
>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>>>>>
>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>
>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>
>> /* to find an entry in a page-table-directory */
>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
> 
> Since Dave mentioned crash tool does not need it, but crash should also
> travel the pg tables.
> 
> If this is really necessary it would be good to describe what will
> happen without the patch, eg. some user visible error from an actual test etc.

Yes please, it would really help if there was a specific example we could discuss.


Thanks,

James
Kazuhito Hagio Feb. 13, 2019, 7:52 p.m. UTC | #18
On 2/13/2019 1:22 PM, James Morse wrote:
> Hi guys,
> 
> On 13/02/2019 11:15, Dave Young wrote:
> > On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
> >> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
> >>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
> >>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
> >>> mentioned that the changes look necessary.
> >>>
> >>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
> >>
> >>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
> >>>>> calculated as is done by the underlying kernel
> 
> Aha! Nothing to do with which-bits-are-pfn in the tables...
> 
> You need to know if the top level PGD is 512bytes or bigger. As we use a
> kmem-cache the adjacent data could be some else's page tables.
> 
> Is this really a problem though? You can't pull the user-space pgd pointers out
> of no-where, you must have walked some task_struct and struct_mm's to find them.
> In which case you would have the VMAs on hand to tell you if its in the mapped
> user range.
> 
> It would be good to avoid putting something arch-specific in here if we can at
> all help it.
> 
> 
> >>>>> (see
> >>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> >>>>>
> >>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> >>
> >> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
> >> It is used for pgd_index() also in makedumpfile to walk page tables.
> >>
> >> /* to find an entry in a page-table-directory */
> >> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
> >
> > Since Dave mentioned crash tool does not need it, but crash should also
> > travel the pg tables.

The crash utility is always invoked with vmlinux, so it can read the
vabits_user variable directly from vmcore, but makedumpfile can not.

> > If this is really necessary it would be good to describe what will
> > happen without the patch, eg. some user visible error from an actual test etc.
> 
> Yes please, it would really help if there was a specific example we could discuss.

With 52-bit user space and 48-bit kernel space configuration,
makedumpfile will not be able to convert a virtual kernel address
to a physical address, and fail to capture a dumpfile, because the
pgd_index() will return a wrong index.

But I don't have any suitable test system on hand, so have not tried
the kernel configuration actually. If found, I'll try.

Bhupesh, do you have any test result?

Thanks,
Kazu

> 
> 
> Thanks,
> 
> James
Bhupesh Sharma Feb. 14, 2019, 7:30 p.m. UTC | #19
Hi James,

On 02/13/2019 11:52 PM, James Morse wrote:
> Hi guys,
> 
> On 13/02/2019 11:15, Dave Young wrote:
>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>>> mentioned that the changes look necessary.
>>>>
>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>>
>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
>>>>>> calculated as is done by the underlying kernel
> 
> Aha! Nothing to do with which-bits-are-pfn in the tables...
> 
> You need to know if the top level PGD is 512bytes or bigger. As we use a
> kmem-cache the adjacent data could be some else's page tables.
> 
> Is this really a problem though? You can't pull the user-space pgd pointers out
> of no-where, you must have walked some task_struct and struct_mm's to find them.
> In which case you would have the VMAs on hand to tell you if its in the mapped
> user range.
> 
> It would be good to avoid putting something arch-specific in here if we can at
> all help it.
> 
> 
>>>>>> (see
>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>>>>>>
>>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>>
>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>>
>>> /* to find an entry in a page-table-directory */
>>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>>
>> Since Dave mentioned crash tool does not need it, but crash should also
>> travel the pg tables.
>>
>> If this is really necessary it would be good to describe what will
>> happen without the patch, eg. some user visible error from an actual test etc.
> 
> Yes please, it would really help if there was a specific example we could discuss.

Sure. Here are two use-cases/regressions reported and which I have been 
able to reproduce. Note that I tested them both on a CPU which does not 
support ARMv8.2-LPA/LVA and on ARMv8 FVP model (which supports ARMv8.2 
extensions).

Environment:
------------
Latest Upstream kernel: sha-id: 1f947a7a011fcceb14cb912f5481a53b18f1879a 
("Merge branch 'akpm' (patches from Andrew)")

Latest makedumpfile code: (git://git.code.sf.net/p/makedumpfile/code , 
branch: devel)

crash-utility code: (https://github.com/crash-utility/crash.git, sha-id: 
e082c372c7f1a782b058ec359dfbbbee0f0b6aad)

Note that Dave A. has since fixed crash-utility by using a hardcoded 
value of 'MAX_PHYSMEM_BITS' (via sha id: 
ac5a7889d31bb37aa0687110ecea08837f8a66a8) and determining 'vabits_user' 
value via vmlinux (via sha id: 8618ddd817621c40c1f44f0ab6df7c7805234416)

(1). Regression Case 1 (ARMv8.2-LPA enabled kernel):

- Upstream makedumpfile and crash-utility (with sha-id 
e082c372c7f1a782b058ec359dfbbbee0f0b6aad) are broken on following kind 
of platforms:

a. Upstream Kernel 5.0.0-rc6+ with the following kernel configuration:

CONFIG_ARM64_64K_PAGES=y
# CONFIG_ARM64_VA_BITS_42 is not set
CONFIG_ARM64_VA_BITS_48=y
# CONFIG_ARM64_USER_VA_BITS_52 is not set
CONFIG_ARM64_VA_BITS=48
# CONFIG_ARM64_PA_BITS_48 is not set
CONFIG_ARM64_PA_BITS_52=y
CONFIG_ARM64_PA_BITS=52

b. Both on CPUs which don't support ARMv8.2 LPA extension and on ARMv8 
FVP model with ARMv8.2 LPA extensions.

- Error message from makedumpfile:

$ makedumpfile -f --mem-usage /proc/kcore -D
max_mapnr    : a00000
kimage_voffset   : fffeffff80000000
max_physmem_bits : 30
section_size_bits: 1e
vaddr_to_paddr_arm64: pgda=90d70000, pudv.pgd=ffffff80ffffffd0
vaddr_to_paddr_arm64: puda=90d70000, pudv.pgd=9fffff0003
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0003
vaddr_to_paddr_arm64: paddr=911a962c
va_bits      : 48
page_offset  : ffff800000000000
vaddr_to_paddr_arm64: pgda=90d70000, pudv.pgd=41a474
vaddr_to_paddr_arm64: puda=90d70000, pudv.pgd=9fffff0003
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0003
vaddr_to_paddr_arm64: paddr=911a1320
num of NODEs : 1
Memory type  : SPARSEMEM

vaddr_to_paddr_arm64: pgda=90d70100, pudv.pgd=a657470206461
vaddr_to_paddr_arm64: puda=90d70100, pudv.pgd=9ffffd0003
vaddr_to_paddr_arm64: pmda=9ffffd27d8, pmdv.pud=9ffeee0003
vaddr_to_paddr_arm64: paddr=9ffeedc600
vaddr_to_paddr_arm64: pgda=90d70100, pudv.pgd=ffff97d98224
vaddr_to_paddr_arm64: puda=90d70100, pudv.pgd=9ffffd0003
vaddr_to_paddr_arm64: pmda=9ffffd27d8, pmdv.pud=9ffeee0003
vaddr_to_paddr_arm64: paddr=9ffeedc600
get_mm_sparsemem: Can't get the address of mem_section.

c. Root Cause Analysis -

- After the PA_BITS changes in arm64 kernel we set:
#define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS

- For SPARSEMEM, this value is used to calculate the bits space required 
to store a section:
#define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)

- User-space tools use a similar mechanism to determine the SPARSEMEM 
type (extreme or not) using the 'NR_MEM_SECTIONS' value (an example from 
makedumpfile code):

int
is_sparsemem_extreme(void)
{
	if ((ARRAY_LENGTH(mem_section)
	     == divideup(NR_MEM_SECTIONS(), _SECTIONS_PER_ROOT_EXTREME()))
	    || (ARRAY_LENGTH(mem_section) == NOT_FOUND_STRUCTURE))
		return TRUE;
	else
		return FALSE;
}

- Since MAX_PHYSMEM_BITS are 48 bits for normal cases and are 52 bits 
for extended PA address space, the memory type is incorrectly calculated 
as SPARSEMEM rather than SPARSEMEM_EX in above case.

- Exporting correct 'MAX_PHYSMEM_BITS' via vmcoreinfo for 52-bit PA 
case, fixes the above mentioned issue:

$ makedumpfile -f --mem-usage /proc/kcore -D

<..snip..>
  NUMBER(MAX_PHYSMEM_BITS)=52

<..snip..>
max_mapnr    : a00000
kimage_voffset   : fffeffff80000000
max_physmem_bits : 30
section_size_bits: 1e
vaddr_to_paddr_arm64: pgda=90d70000, pudv.pgd=ffffff80ffffffd0
vaddr_to_paddr_arm64: puda=90d70000, pudv.pgd=9fffff0003
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0003
vaddr_to_paddr_arm64: paddr=911a962c
va_bits      : 48
page_offset  : ffff800000000000
vaddr_to_paddr_arm64: pgda=90d70000, pudv.pgd=41a474
vaddr_to_paddr_arm64: puda=90d70000, pudv.pgd=9fffff0003
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0003
vaddr_to_paddr_arm64: paddr=911a1320
num of NODEs : 1
Memory type  : SPARSEMEM_EX
<..snip..>

TYPE		PAGES			EXCLUDABLE	DESCRIPTION
----------------------------------------------------------------------
ZERO		2626            	yes		Pages filled with zero
NON_PRI_CACHE	569             	yes		Cache pages without private flag
PRI_CACHE	5446            	yes		Cache pages with private flag
USER		3213            	yes		User process pages
FREE		2048971         	yes		Free pages
KERN_DATA	19034           	no		Dumpable kernel data

page size:		65536
Total pages on system:	2079859
Total size on system:	136305639424     Byte

(2). Regression Case 2 (ARMv8.2-LPA + LVA [52-bit user-space VA] enabled 
kernel):

- Upstream makedumpfile and crash-utility (with sha-id 
e082c372c7f1a782b058ec359dfbbbee0f0b6aad) are broken on following kind 
of platforms:

a. Upstream Kernel 5.0.0-rc6+ with the following kernel configuration:

CONFIG_ARM64_64K_PAGES=y
# CONFIG_ARM64_VA_BITS_42 is not set
# CONFIG_ARM64_VA_BITS_48 is not set
CONFIG_ARM64_USER_VA_BITS_52=y
CONFIG_ARM64_VA_BITS=48
# CONFIG_ARM64_PA_BITS_48 is not set
CONFIG_ARM64_PA_BITS_52=y
CONFIG_ARM64_PA_BITS=52

b. Both on CPUs which don't support ARMv8.2 extensions and on ARMv8 FVP 
model with  ARMv8.2 extensions.

- Error message from makedumpfile:

$ makedumpfile -f --mem-usage /proc/kcore -D
max_mapnr    : a00000
kimage_voffset   : fffeffff78000000
max_physmem_bits : 30
section_size_bits: 1e
vaddr_to_paddr_arm64: pgda=90f30000, pudv.pgd=ffffff80ffffffd0
vaddr_to_paddr_arm64: puda=90f30000, pudv.pgd=0
readpage_elf: Attempt to read non-existent page at 0x0.
readmem: type_addr: 1, addr:0, size:8
vaddr_to_paddr_arm64: Can't read pmd
readmem: Can't convert a virtual address(ffff0000093c576c) to physical 
address.
readmem: type_addr: 0, addr:ffff0000093c576c, size:390
check_release: Can't get the address of system_utsname.

c. Root Cause Analysis -

- After the 52-bit user-space VA_BIT changes in arm64 kernel we set:
#define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))

- User-space tools like makedumpfile and crash use the 'PTRS_PER_PGD' 
value to calculate the 'pgd_index()' of a vaddr:

#define pgd_index(vaddr) 		(((vaddr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))

- Since the kernel now defines 'MAX_USER_VA_BITS' as:
#ifdef CONFIG_ARM64_USER_VA_BITS_52
#define MAX_USER_VA_BITS	52
#else
#define MAX_USER_VA_BITS	VA_BITS
#endif

so, the user-space also needs this value to calculate the 'PTRS_PER_PGD' 
and hence 'pgd_index()' correctly.

- Exporting correct 'MAX_USER_VA_BITS' via vmcoreinfo for the above 
case, fixes the above mentioned issue:

$ makedumpfile -f --mem-usage /proc/kcore -D

<..snip..>
max_mapnr    : a00000
pa_bits    : 52
va_bits    : 48 (vmcoreinfo)
max_user_va_bits : 52 (vmcoreinfo)
kimage_voffset   : fffeffff78000000
max_physmem_bits : 52
section_size_bits: 30
vaddr_to_paddr_arm64: pgda=90f31e00, pudv.pgd=ffffff80ffffffd0
vaddr_to_paddr_arm64: puda=90f31e00, pudv.pgd=9fffff0803
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0803
vaddr_to_paddr_arm64: paddr=913c576c
page_offset  : ffff800000000000
vaddr_to_paddr_arm64: pgda=90f31e00, pudv.pgd=16e28e8bed294900
vaddr_to_paddr_arm64: puda=90f31e00, pudv.pgd=9fffff0803
vaddr_to_paddr_arm64: pmda=9fffff0000, pmdv.pud=9ffffe0803
vaddr_to_paddr_arm64: paddr=913bd2f8
num of NODEs : 1
Memory type  : SPARSEMEM_EX
<..snip..>

Other important notes
---------------------

1. I have quoted only one makedumpfile use-case failure above (i.e. 
calculating --mem-usage on the primary kernel). Other use-cases like 
creating a dumpfile using /proc/vmcore or post-processing a vmcore are 
also broken similarly and get fixed when a kernel which exports 
'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo is used along 
with a modified user-space which can read this information from the 
vmcoreinfo.

2. I was also going through some of the suggestions on earlier threads 
about the PTE calculations for the 52-bit LPA case and discussed them 
with some partner arm64 SoC enggs.

The suggestions to convert a page table entry to a physical address 
without awareness of 52-bit (with an assumption of 64k page size) can be 
risky.

With 64k page and older non-52-bit kernels, while it looks like in the 
current checks that bits [15:12] are zero, and we can move the zeros to 
bits [51:48] (because the zeros don't affect the overall PA) to generate 
the overall 52-bit PA. However, this can cause IMPLEMENTATION SPECIFIC 
issues on different platforms while generating a PA and IPA.

Lets see what the ARMv8 architecture reference manual says about the 
Bits [15:12] for a 64KB page size:

"Bits [15:12] of each valid translation table descriptor hold Bits 
[51:48] of the output address, or of the address
of the translation table to be used for the initial lookup at the next 
level of translation. If the implementation
does not support 52-bit physical addresses, then it is IMPLEMENTATION 
DEFINED whether non-zero values for
these bits generate an Address size fault. In this case, not generating 
an Address Size Fault is deprecated."

As per the vendors, we should not assume that hardware (which does not 
support 52-bit physical addresses) would generate an Address size fault 
for non-zero values of Bits[15:12], so extending them to bits [51:48] 
always can lead to PA address which might cause UNDEFINED behavior on 
some SoCs.

Hope the above text clarifies the problem and what I am trying to fix 
via this patch. Please let me know if something is missing here.

Thanks,
Bhupesh
James Morse Feb. 15, 2019, 5:34 p.m. UTC | #20
Hi guys,

(CC: +Steve, +Kristina) "What's the best way of letting user-space know the MMU
config when 52-bit VA and pointer-auth may be in use?"

On 13/02/2019 19:52, Kazuhito Hagio wrote:
> On 2/13/2019 1:22 PM, James Morse wrote:
>> On 13/02/2019 11:15, Dave Young wrote:
>>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>>>> mentioned that the changes look necessary.
>>>>>
>>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>>>
>>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
>>>>>>> calculated as is done by the underlying kernel
>>
>> Aha! Nothing to do with which-bits-are-pfn in the tables...
>>
>> You need to know if the top level PGD is 512bytes or bigger. As we use a
>> kmem-cache the adjacent data could be some else's page tables.
>>
>> Is this really a problem though? You can't pull the user-space pgd pointers out
>> of no-where, you must have walked some task_struct and struct_mm's to find them.
>> In which case you would have the VMAs on hand to tell you if its in the mapped
>> user range.
>>
>> It would be good to avoid putting something arch-specific in here if we can at
>> all help it.

>>>>>>> (see
>>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>>>>>>>
>>>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>>>
>>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>>>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>>>
>>>> /* to find an entry in a page-table-directory */
>>>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>>>
>>> Since Dave mentioned crash tool does not need it, but crash should also
>>> travel the pg tables.
> 
> The crash utility is always invoked with vmlinux, so it can read the
> vabits_user variable directly from vmcore, but makedumpfile can not.

(This sounds fragile. That symbol's name may change, it may disappear
completely! ... but I guess crash changes with every kernel release anyway)


>>> If this is really necessary it would be good to describe what will
>>> happen without the patch, eg. some user visible error from an actual test etc.
>>
>> Yes please, it would really help if there was a specific example we could discuss.
> 
> With 52-bit user space and 48-bit kernel space configuration,
> makedumpfile will not be able to convert a virtual kernel address
> to a physical address, and fail to capture a dumpfile, because the
> pgd_index() will return a wrong index.

Got it, thanks!
(all this user stuff had me thinking it was user-space you were trying to walk).

Yes, this is because of commit e842dfb5a2d3 ("arm64: mm: Offset TTBR1 to allow
52-bit PTRS_PER_PGD"). The kernel has offset the ttbr1 value, if you try and
walk it without knowing the offset you get junk.

Ideally we tell you the offset with some 'ttbr1_offset=' in vmcoreinfo, but if
the offsetting code disappears, the kernel would still have to provide
'ttbr1_offset=0' for user-space to keep working.

I'd like to find something future-proof that always has an unambiguous meaning,
and isn't a problem if the kernel variable/symbol/kconfig names change.

With pointer-auth in use too you can't guess which bits are address and which
bits are data.

Taking arch-specific to its extreme, we could expose TCR_EL1, but this is a
problem if we ever switch that per task (some new bits may turn up with a new
feature). Some of those bits vary per cpu too, so we'd have to mask them out in
case user-space tries to conclude something from them.


My current best suggestion is to export:
from core code:
* USER_MMAP_END, the maximum value a user-space can try and mmap().
This would normally be TASK_SIZE, but x86 and powerpc also have support for
larger VA space, and its plumbed into mm slightly differently. We should have
one arch-independent property that covers all these. On arm64 this would be the
runtime va bits for user-space's TTBR. (This assumes the value isn't per-task)

arch specific:
* ARM64_TCR.T1SZ, the va bits mapped by the kernel's TTBR. (We can assume we'll
never flip user/kernel space). This has to be arch specific, it will always have
a value and its meaning comes from the ARM-ARM (so linux can't change it in the
future). It should be the same on every CPU.
* ARM64_TTBR1.BADDR, the pa of the kernel page tables, which implicitly has the
offset. Again this always has a value, and its meaning comes from the ARM-ARM.
If we ever get clever with different page-tables/TCR values on different CPUs,
these two should come from the same CPU.


I think this gives you what you need if user/kernel may both be using
pointer-auth and both may be using 52-bit va. I'm pretty sure the 48:52 bits can
be picked at boot time depending on the kernel kconfig and the hardware support.

Does anyone have a better idea? (or a corner where this won't work?)


Thanks,

James
Bhupesh Sharma Feb. 15, 2019, 6:01 p.m. UTC | #21
Hi James,

On Fri, Feb 15, 2019 at 11:04 PM James Morse <james.morse@arm.com> wrote:
>
> Hi guys,
>
> (CC: +Steve, +Kristina) "What's the best way of letting user-space know the MMU
> config when 52-bit VA and pointer-auth may be in use?"
>
> On 13/02/2019 19:52, Kazuhito Hagio wrote:
> > On 2/13/2019 1:22 PM, James Morse wrote:
> >> On 13/02/2019 11:15, Dave Young wrote:
> >>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
> >>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
> >>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
> >>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
> >>>>> mentioned that the changes look necessary.
> >>>>>
> >>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
> >>>>
> >>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
> >>>>>>> calculated as is done by the underlying kernel
> >>
> >> Aha! Nothing to do with which-bits-are-pfn in the tables...
> >>
> >> You need to know if the top level PGD is 512bytes or bigger. As we use a
> >> kmem-cache the adjacent data could be some else's page tables.
> >>
> >> Is this really a problem though? You can't pull the user-space pgd pointers out
> >> of no-where, you must have walked some task_struct and struct_mm's to find them.
> >> In which case you would have the VMAs on hand to tell you if its in the mapped
> >> user range.
> >>
> >> It would be good to avoid putting something arch-specific in here if we can at
> >> all help it.
>
> >>>>>>> (see
> >>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> >>>>>>>
> >>>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> >>>>
> >>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
> >>>> It is used for pgd_index() also in makedumpfile to walk page tables.
> >>>>
> >>>> /* to find an entry in a page-table-directory */
> >>>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
> >>>
> >>> Since Dave mentioned crash tool does not need it, but crash should also
> >>> travel the pg tables.
> >
> > The crash utility is always invoked with vmlinux, so it can read the
> > vabits_user variable directly from vmcore, but makedumpfile can not.
>
> (This sounds fragile. That symbol's name may change, it may disappear
> completely! ... but I guess crash changes with every kernel release anyway)
>
>
> >>> If this is really necessary it would be good to describe what will
> >>> happen without the patch, eg. some user visible error from an actual test etc.
> >>
> >> Yes please, it would really help if there was a specific example we could discuss.
> >
> > With 52-bit user space and 48-bit kernel space configuration,
> > makedumpfile will not be able to convert a virtual kernel address
> > to a physical address, and fail to capture a dumpfile, because the
> > pgd_index() will return a wrong index.
>
> Got it, thanks!
> (all this user stuff had me thinking it was user-space you were trying to walk).
>
> Yes, this is because of commit e842dfb5a2d3 ("arm64: mm: Offset TTBR1 to allow
> 52-bit PTRS_PER_PGD"). The kernel has offset the ttbr1 value, if you try and
> walk it without knowing the offset you get junk.
>
> Ideally we tell you the offset with some 'ttbr1_offset=' in vmcoreinfo, but if
> the offsetting code disappears, the kernel would still have to provide
> 'ttbr1_offset=0' for user-space to keep working.
>
> I'd like to find something future-proof that always has an unambiguous meaning,
> and isn't a problem if the kernel variable/symbol/kconfig names change.
>
> With pointer-auth in use too you can't guess which bits are address and which
> bits are data.
>
> Taking arch-specific to its extreme, we could expose TCR_EL1, but this is a
> problem if we ever switch that per task (some new bits may turn up with a new
> feature). Some of those bits vary per cpu too, so we'd have to mask them out in
> case user-space tries to conclude something from them.
>
>
> My current best suggestion is to export:
> from core code:
> * USER_MMAP_END, the maximum value a user-space can try and mmap().
> This would normally be TASK_SIZE, but x86 and powerpc also have support for
> larger VA space, and its plumbed into mm slightly differently. We should have
> one arch-independent property that covers all these. On arm64 this would be the
> runtime va bits for user-space's TTBR. (This assumes the value isn't per-task)
>
> arch specific:
> * ARM64_TCR.T1SZ, the va bits mapped by the kernel's TTBR. (We can assume we'll
> never flip user/kernel space). This has to be arch specific, it will always have
> a value and its meaning comes from the ARM-ARM (so linux can't change it in the
> future). It should be the same on every CPU.
> * ARM64_TTBR1.BADDR, the pa of the kernel page tables, which implicitly has the
> offset. Again this always has a value, and its meaning comes from the ARM-ARM.
> If we ever get clever with different page-tables/TCR values on different CPUs,
> these two should come from the same CPU.
>
>
> I think this gives you what you need if user/kernel may both be using
> pointer-auth and both may be using 52-bit va. I'm pretty sure the 48:52 bits can
> be picked at boot time depending on the kernel kconfig and the hardware support.
>
> Does anyone have a better idea? (or a corner where this won't work?)

I am not sure you got a chance to look at the two regression cases I
reported here:
<http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>

Unfortunately the above suggestion doesn't provide any fix for
ARMv8.2-LPA regression (see text under heading '
(1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')

After going through the regression reports, I think exporting
'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo is sufficient
for the above regressions (without over-complicating the stuff) as
ARM64_TCR.T1SZ and friends seem to arch specific as compared to
VA_BITS + 'MAX_USER_VA_BITS' .

Thanks,
Bhupesh
Steve Capper Feb. 18, 2019, 3:27 p.m. UTC | #22
Hi Bhupesh,

Sorry for joining this thread late...

On Fri, Feb 15, 2019 at 11:31:56PM +0530, Bhupesh Sharma wrote:
> Hi James,
> 
> On Fri, Feb 15, 2019 at 11:04 PM James Morse <james.morse@arm.com> wrote:
> >
> > Hi guys,
> >
> > (CC: +Steve, +Kristina) "What's the best way of letting user-space know the MMU
> > config when 52-bit VA and pointer-auth may be in use?"
> >
> > On 13/02/2019 19:52, Kazuhito Hagio wrote:
> > > On 2/13/2019 1:22 PM, James Morse wrote:
> > >> On 13/02/2019 11:15, Dave Young wrote:
> > >>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
> > >>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
> > >>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
> > >>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
> > >>>>> mentioned that the changes look necessary.
> > >>>>>
> > >>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
> > >>>>
> > >>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
> > >>>>>>> calculated as is done by the underlying kernel
> > >>
> > >> Aha! Nothing to do with which-bits-are-pfn in the tables...
> > >>
> > >> You need to know if the top level PGD is 512bytes or bigger. As we use a
> > >> kmem-cache the adjacent data could be some else's page tables.
> > >>
> > >> Is this really a problem though? You can't pull the user-space pgd pointers out
> > >> of no-where, you must have walked some task_struct and struct_mm's to find them.
> > >> In which case you would have the VMAs on hand to tell you if its in the mapped
> > >> user range.
> > >>
> > >> It would be good to avoid putting something arch-specific in here if we can at
> > >> all help it.
> >
> > >>>>>>> (see
> > >>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
> > >>>>>>>
> > >>>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
> > >>>>
> > >>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
> > >>>> It is used for pgd_index() also in makedumpfile to walk page tables.
> > >>>>
> > >>>> /* to find an entry in a page-table-directory */
> > >>>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
> > >>>
> > >>> Since Dave mentioned crash tool does not need it, but crash should also
> > >>> travel the pg tables.
> > >
> > > The crash utility is always invoked with vmlinux, so it can read the
> > > vabits_user variable directly from vmcore, but makedumpfile can not.
> >
> > (This sounds fragile. That symbol's name may change, it may disappear
> > completely! ... but I guess crash changes with every kernel release anyway)
> >
> >
> > >>> If this is really necessary it would be good to describe what will
> > >>> happen without the patch, eg. some user visible error from an actual test etc.
> > >>
> > >> Yes please, it would really help if there was a specific example we could discuss.
> > >
> > > With 52-bit user space and 48-bit kernel space configuration,
> > > makedumpfile will not be able to convert a virtual kernel address
> > > to a physical address, and fail to capture a dumpfile, because the
> > > pgd_index() will return a wrong index.
> >
> > Got it, thanks!
> > (all this user stuff had me thinking it was user-space you were trying to walk).
> >
> > Yes, this is because of commit e842dfb5a2d3 ("arm64: mm: Offset TTBR1 to allow
> > 52-bit PTRS_PER_PGD"). The kernel has offset the ttbr1 value, if you try and
> > walk it without knowing the offset you get junk.
> >
> > Ideally we tell you the offset with some 'ttbr1_offset=' in vmcoreinfo, but if
> > the offsetting code disappears, the kernel would still have to provide
> > 'ttbr1_offset=0' for user-space to keep working.
> >
> > I'd like to find something future-proof that always has an unambiguous meaning,
> > and isn't a problem if the kernel variable/symbol/kconfig names change.
> >
> > With pointer-auth in use too you can't guess which bits are address and which
> > bits are data.
> >
> > Taking arch-specific to its extreme, we could expose TCR_EL1, but this is a
> > problem if we ever switch that per task (some new bits may turn up with a new
> > feature). Some of those bits vary per cpu too, so we'd have to mask them out in
> > case user-space tries to conclude something from them.
> >
> >
> > My current best suggestion is to export:
> > from core code:
> > * USER_MMAP_END, the maximum value a user-space can try and mmap().
> > This would normally be TASK_SIZE, but x86 and powerpc also have support for
> > larger VA space, and its plumbed into mm slightly differently. We should have
> > one arch-independent property that covers all these. On arm64 this would be the
> > runtime va bits for user-space's TTBR. (This assumes the value isn't per-task)
> >
> > arch specific:
> > * ARM64_TCR.T1SZ, the va bits mapped by the kernel's TTBR. (We can assume we'll
> > never flip user/kernel space). This has to be arch specific, it will always have
> > a value and its meaning comes from the ARM-ARM (so linux can't change it in the
> > future). It should be the same on every CPU.
> > * ARM64_TTBR1.BADDR, the pa of the kernel page tables, which implicitly has the
> > offset. Again this always has a value, and its meaning comes from the ARM-ARM.
> > If we ever get clever with different page-tables/TCR values on different CPUs,
> > these two should come from the same CPU.
> >
> >
> > I think this gives you what you need if user/kernel may both be using
> > pointer-auth and both may be using 52-bit va. I'm pretty sure the 48:52 bits can
> > be picked at boot time depending on the kernel kconfig and the hardware support.
> >
> > Does anyone have a better idea? (or a corner where this won't work?)
> 
> I am not sure you got a chance to look at the two regression cases I
> reported here:
> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
> 
> Unfortunately the above suggestion doesn't provide any fix for
> ARMv8.2-LPA regression (see text under heading '
> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
> 
> After going through the regression reports, I think exporting
> 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo is sufficient
> for the above regressions (without over-complicating the stuff) as
> ARM64_TCR.T1SZ and friends seem to arch specific as compared to
> VA_BITS + 'MAX_USER_VA_BITS' .
> 

For MAX_USER_VA_BITS, IIUC you are just after a value of PTRS_PER_PGD?
Why not just add PTRS_PER_PGD to the vmcoreinfo?

FWIW it is possible in vaddr_to_paddr_arm64 to detect a zero pgd entry
then try again with another ptrs_per_pgd value (granted this is a little
hacky).

Cheers,
Kazuhito Hagio Feb. 19, 2019, 8:47 p.m. UTC | #23
Hi Bhupesh,

-----Original Message-----
> I am not sure you got a chance to look at the two regression cases I
> reported here:
> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
> 
> Unfortunately the above suggestion doesn't provide any fix for
> ARMv8.2-LPA regression (see text under heading '
> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')

As for MAX_PHYSMEM_BITS, I realized that ppc64 makedumpfile can detect
it because there is only one SECTION_SIZE_BITS for ppc64. I think we
can use the same way as set_ppc64_max_physmem_bits() does also for
arm64 for now. I'm going to write it for kernels not having
NUMBER(MAX_PHYSMEM_BITS) in vmcoreinfo.

Thanks,
Kazu

> 
> After going through the regression reports, I think exporting
> 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo is sufficient
> for the above regressions (without over-complicating the stuff) as
> ARM64_TCR.T1SZ and friends seem to arch specific as compared to
> VA_BITS + 'MAX_USER_VA_BITS' .
> 
> Thanks,
> Bhupesh
Bhupesh Sharma Feb. 21, 2019, 4:08 p.m. UTC | #24
Hi Steve,

On 02/18/2019 08:57 PM, Steve Capper wrote:
> Hi Bhupesh,
> 
> Sorry for joining this thread late...
> 
> On Fri, Feb 15, 2019 at 11:31:56PM +0530, Bhupesh Sharma wrote:
>> Hi James,
>>
>> On Fri, Feb 15, 2019 at 11:04 PM James Morse <james.morse@arm.com> wrote:
>>>
>>> Hi guys,
>>>
>>> (CC: +Steve, +Kristina) "What's the best way of letting user-space know the MMU
>>> config when 52-bit VA and pointer-auth may be in use?"
>>>
>>> On 13/02/2019 19:52, Kazuhito Hagio wrote:
>>>> On 2/13/2019 1:22 PM, James Morse wrote:
>>>>> On 13/02/2019 11:15, Dave Young wrote:
>>>>>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>>>>>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>>>>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>>>>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>>>>>>> mentioned that the changes look necessary.
>>>>>>>>
>>>>>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>>>>>>
>>>>>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
>>>>>>>>>> calculated as is done by the underlying kernel
>>>>>
>>>>> Aha! Nothing to do with which-bits-are-pfn in the tables...
>>>>>
>>>>> You need to know if the top level PGD is 512bytes or bigger. As we use a
>>>>> kmem-cache the adjacent data could be some else's page tables.
>>>>>
>>>>> Is this really a problem though? You can't pull the user-space pgd pointers out
>>>>> of no-where, you must have walked some task_struct and struct_mm's to find them.
>>>>> In which case you would have the VMAs on hand to tell you if its in the mapped
>>>>> user range.
>>>>>
>>>>> It would be good to avoid putting something arch-specific in here if we can at
>>>>> all help it.
>>>
>>>>>>>>>> (see
>>>>>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>>>>>>>>>>
>>>>>>>>>> #define PTRS_PER_PGD          (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>>>>>>
>>>>>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>>>>>>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>>>>>>
>>>>>>> /* to find an entry in a page-table-directory */
>>>>>>> #define pgd_index(addr)         (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>>>>>>
>>>>>> Since Dave mentioned crash tool does not need it, but crash should also
>>>>>> travel the pg tables.
>>>>
>>>> The crash utility is always invoked with vmlinux, so it can read the
>>>> vabits_user variable directly from vmcore, but makedumpfile can not.
>>>
>>> (This sounds fragile. That symbol's name may change, it may disappear
>>> completely! ... but I guess crash changes with every kernel release anyway)
>>>
>>>
>>>>>> If this is really necessary it would be good to describe what will
>>>>>> happen without the patch, eg. some user visible error from an actual test etc.
>>>>>
>>>>> Yes please, it would really help if there was a specific example we could discuss.
>>>>
>>>> With 52-bit user space and 48-bit kernel space configuration,
>>>> makedumpfile will not be able to convert a virtual kernel address
>>>> to a physical address, and fail to capture a dumpfile, because the
>>>> pgd_index() will return a wrong index.
>>>
>>> Got it, thanks!
>>> (all this user stuff had me thinking it was user-space you were trying to walk).
>>>
>>> Yes, this is because of commit e842dfb5a2d3 ("arm64: mm: Offset TTBR1 to allow
>>> 52-bit PTRS_PER_PGD"). The kernel has offset the ttbr1 value, if you try and
>>> walk it without knowing the offset you get junk.
>>>
>>> Ideally we tell you the offset with some 'ttbr1_offset=' in vmcoreinfo, but if
>>> the offsetting code disappears, the kernel would still have to provide
>>> 'ttbr1_offset=0' for user-space to keep working.
>>>
>>> I'd like to find something future-proof that always has an unambiguous meaning,
>>> and isn't a problem if the kernel variable/symbol/kconfig names change.
>>>
>>> With pointer-auth in use too you can't guess which bits are address and which
>>> bits are data.
>>>
>>> Taking arch-specific to its extreme, we could expose TCR_EL1, but this is a
>>> problem if we ever switch that per task (some new bits may turn up with a new
>>> feature). Some of those bits vary per cpu too, so we'd have to mask them out in
>>> case user-space tries to conclude something from them.
>>>
>>>
>>> My current best suggestion is to export:
>>> from core code:
>>> * USER_MMAP_END, the maximum value a user-space can try and mmap().
>>> This would normally be TASK_SIZE, but x86 and powerpc also have support for
>>> larger VA space, and its plumbed into mm slightly differently. We should have
>>> one arch-independent property that covers all these. On arm64 this would be the
>>> runtime va bits for user-space's TTBR. (This assumes the value isn't per-task)
>>>
>>> arch specific:
>>> * ARM64_TCR.T1SZ, the va bits mapped by the kernel's TTBR. (We can assume we'll
>>> never flip user/kernel space). This has to be arch specific, it will always have
>>> a value and its meaning comes from the ARM-ARM (so linux can't change it in the
>>> future). It should be the same on every CPU.
>>> * ARM64_TTBR1.BADDR, the pa of the kernel page tables, which implicitly has the
>>> offset. Again this always has a value, and its meaning comes from the ARM-ARM.
>>> If we ever get clever with different page-tables/TCR values on different CPUs,
>>> these two should come from the same CPU.
>>>
>>>
>>> I think this gives you what you need if user/kernel may both be using
>>> pointer-auth and both may be using 52-bit va. I'm pretty sure the 48:52 bits can
>>> be picked at boot time depending on the kernel kconfig and the hardware support.
>>>
>>> Does anyone have a better idea? (or a corner where this won't work?)
>>
>> I am not sure you got a chance to look at the two regression cases I
>> reported here:
>> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
>>
>> Unfortunately the above suggestion doesn't provide any fix for
>> ARMv8.2-LPA regression (see text under heading '
>> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
>>
>> After going through the regression reports, I think exporting
>> 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo is sufficient
>> for the above regressions (without over-complicating the stuff) as
>> ARM64_TCR.T1SZ and friends seem to arch specific as compared to
>> VA_BITS + 'MAX_USER_VA_BITS' .
>>
> 
> For MAX_USER_VA_BITS, IIUC you are just after a value of PTRS_PER_PGD?
> Why not just add PTRS_PER_PGD to the vmcoreinfo?

That's a good suggestion. I will re-spin the v2 with the same.

> FWIW it is possible in vaddr_to_paddr_arm64 to detect a zero pgd entry
> then try again with another ptrs_per_pgd value (granted this is a little
> hacky).

Right, but having this hack replicated across various user-space tools 
is perhaps not the ideal portable solution, when we can simply add a 
valid hint in the vmcoreinfo itself.

Thanks,
Bhupesh
Bhupesh Sharma Feb. 21, 2019, 4:20 p.m. UTC | #25
Hi Kazu,

On 02/20/2019 02:17 AM, Kazuhito Hagio wrote:
> Hi Bhupesh,
> 
> -----Original Message-----
>> I am not sure you got a chance to look at the two regression cases I
>> reported here:
>> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
>>
>> Unfortunately the above suggestion doesn't provide any fix for
>> ARMv8.2-LPA regression (see text under heading '
>> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
> 
> As for MAX_PHYSMEM_BITS, I realized that ppc64 makedumpfile can detect
> it because there is only one SECTION_SIZE_BITS for ppc64. I think we
> can use the same way as set_ppc64_max_physmem_bits() does also for
> arm64 for now. I'm going to write it for kernels not having
> NUMBER(MAX_PHYSMEM_BITS) in vmcoreinfo.

I see two drawbacks with the above approach:

a). This means that other user-space tools like crash-utility would 
still be broken and would probably need to find MAX_PHYSMEM_BITS for 
arm64 via a similar (hack'ish ?) approach.

b). I am looking at the makedumpfile code for 'MAX_PHYSMEM_BITS' 
determination for two archs as an example:

ppc
---

int
set_ppc64_max_physmem_bits(void)
{
         long array_len = ARRAY_LENGTH(mem_section);
         /*
          * The older ppc64 kernels uses _MAX_PHYSMEM_BITS as 42 and the
          * newer kernels 3.7 onwards uses 46 bits.
          */

         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG ;
         if ((array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT_EXTREME()))
                 || (array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT())))
                 return TRUE;

         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_3_7;
         if ((array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT_EXTREME()))
                 || (array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT())))
                 return TRUE;

         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_19;
         if ((array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT_EXTREME()))
                 || (array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT())))
                 return TRUE;

         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_20;
         if ((array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT_EXTREME()))
                 || (array_len == (NR_MEM_SECTIONS() / 
_SECTIONS_PER_ROOT())))
                 return TRUE;

         return FALSE;
}

x86_64:
------

int
get_versiondep_info_x86_64(void)
{
     /*
      * On linux-2.6.26, MAX_PHYSMEM_BITS is changed to 44 from 40.
      */
     if (info->kernel_version < KERNEL_VERSION(2, 6, 26))
         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG;
     else if (info->kernel_version < KERNEL_VERSION(2, 6, 31))
         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_26;
     else if(check_5level_paging())
         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_5LEVEL;
     else
         info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_31;

     ...
}

Looking at the above, two questions come to my mind:

- Do we really need all the above complexity in user-space code, to hoop 
across various kernel versions and perform allocations for something 
that can be so easily exported via vmcoreinfo? Also we need to see how 
portable is the above code for a new kernel version - IMO, it will need 
another fix patch when we update to a new kernel version in near future.

- Also do we need to replicate the above implementations across 
user-space tools when they can also utilize the vmcoreinfo information 
to determine the PA_BITS range without any additional arch/kernel 
version specific details as the single point of obtaining this 
information from the kernel?

So, in view of the above, I would still advocate that we use a 
vmcoreinfo export for 'MAX_PHYSMEM_BITS' as well to have a uniform 
interface for the same across all user-land applications.

Thanks,
Bhupesh
Dave Anderson Feb. 21, 2019, 4:42 p.m. UTC | #26
----- Original Message -----
> Hi Kazu,
> 
> On 02/20/2019 02:17 AM, Kazuhito Hagio wrote:
> > Hi Bhupesh,
> > 
> > -----Original Message-----
> >> I am not sure you got a chance to look at the two regression cases I
> >> reported here:
> >> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
> >>
> >> Unfortunately the above suggestion doesn't provide any fix for
> >> ARMv8.2-LPA regression (see text under heading '
> >> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
> > 
> > As for MAX_PHYSMEM_BITS, I realized that ppc64 makedumpfile can detect
> > it because there is only one SECTION_SIZE_BITS for ppc64. I think we
> > can use the same way as set_ppc64_max_physmem_bits() does also for
> > arm64 for now. I'm going to write it for kernels not having
> > NUMBER(MAX_PHYSMEM_BITS) in vmcoreinfo.
> 
> I see two drawbacks with the above approach:
> 
> a). This means that other user-space tools like crash-utility would
> still be broken and would probably need to find MAX_PHYSMEM_BITS for
> arm64 via a similar (hack'ish ?) approach.
> 
> b). I am looking at the makedumpfile code for 'MAX_PHYSMEM_BITS'
> determination for two archs as an example:
> 
> ppc
> ---
> 
> int
> set_ppc64_max_physmem_bits(void)
> {
>          long array_len = ARRAY_LENGTH(mem_section);
>          /*
>           * The older ppc64 kernels uses _MAX_PHYSMEM_BITS as 42 and the
>           * newer kernels 3.7 onwards uses 46 bits.
>           */
> 
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG ;
>          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>                  return TRUE;
> 
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_3_7;
>          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>                  return TRUE;
> 
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_19;
>          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>                  return TRUE;
> 
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_20;
>          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>                  || (array_len == (NR_MEM_SECTIONS() /  _SECTIONS_PER_ROOT())))
>                  return TRUE;
> 
>          return FALSE;
> }
> 
> x86_64:
> ------
> 
> int
> get_versiondep_info_x86_64(void)
> {
>      /*
>       * On linux-2.6.26, MAX_PHYSMEM_BITS is changed to 44 from 40.
>       */
>      if (info->kernel_version < KERNEL_VERSION(2, 6, 26))
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG;
>      else if (info->kernel_version < KERNEL_VERSION(2, 6, 31))
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_26;
>      else if(check_5level_paging())
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_5LEVEL;
>      else
>          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_31;
> 
>      ...
> }
> 
> Looking at the above, two questions come to my mind:
> 
> - Do we really need all the above complexity in user-space code, to hoop
> across various kernel versions and perform allocations for something
> that can be so easily exported via vmcoreinfo? Also we need to see how
> portable is the above code for a new kernel version - IMO, it will need
> another fix patch when we update to a new kernel version in near future.

I agree -- not to mention that the "kernel version" way of determining things 
does not account for distribution-specific backports.

> 
> - Also do we need to replicate the above implementations across
> user-space tools when they can also utilize the vmcoreinfo information
> to determine the PA_BITS range without any additional arch/kernel
> version specific details as the single point of obtaining this
> information from the kernel?
> 
> So, in view of the above, I would still advocate that we use a
> vmcoreinfo export for 'MAX_PHYSMEM_BITS' as well to have a uniform
> interface for the same across all user-land applications.

Again, totally agree.

Dave

 
> Thanks,
> Bhupesh
> 
>
Kazuhito Hagio Feb. 21, 2019, 7:02 p.m. UTC | #27
-----Original Message-----
> ----- Original Message -----
> > Hi Kazu,
> >
> > On 02/20/2019 02:17 AM, Kazuhito Hagio wrote:
> > > Hi Bhupesh,
> > >
> > > -----Original Message-----
> > >> I am not sure you got a chance to look at the two regression cases I
> > >> reported here:
> > >> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
> > >>
> > >> Unfortunately the above suggestion doesn't provide any fix for
> > >> ARMv8.2-LPA regression (see text under heading '
> > >> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
> > >
> > > As for MAX_PHYSMEM_BITS, I realized that ppc64 makedumpfile can detect
> > > it because there is only one SECTION_SIZE_BITS for ppc64. I think we
> > > can use the same way as set_ppc64_max_physmem_bits() does also for
> > > arm64 for now. I'm going to write it for kernels not having
> > > NUMBER(MAX_PHYSMEM_BITS) in vmcoreinfo.
> >
> > I see two drawbacks with the above approach:
> >
> > a). This means that other user-space tools like crash-utility would
> > still be broken and would probably need to find MAX_PHYSMEM_BITS for
> > arm64 via a similar (hack'ish ?) approach.
> >
> > b). I am looking at the makedumpfile code for 'MAX_PHYSMEM_BITS'
> > determination for two archs as an example:
> >
> > ppc
> > ---
> >
> > int
> > set_ppc64_max_physmem_bits(void)
> > {
> >          long array_len = ARRAY_LENGTH(mem_section);
> >          /*
> >           * The older ppc64 kernels uses _MAX_PHYSMEM_BITS as 42 and the
> >           * newer kernels 3.7 onwards uses 46 bits.
> >           */
> >
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG ;
> >          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
> >                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
> >                  return TRUE;
> >
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_3_7;
> >          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
> >                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
> >                  return TRUE;
> >
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_19;
> >          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
> >                  || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
> >                  return TRUE;
> >
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_20;
> >          if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
> >                  || (array_len == (NR_MEM_SECTIONS() /  _SECTIONS_PER_ROOT())))
> >                  return TRUE;
> >
> >          return FALSE;
> > }
> >
> > x86_64:
> > ------
> >
> > int
> > get_versiondep_info_x86_64(void)
> > {
> >      /*
> >       * On linux-2.6.26, MAX_PHYSMEM_BITS is changed to 44 from 40.
> >       */
> >      if (info->kernel_version < KERNEL_VERSION(2, 6, 26))
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG;
> >      else if (info->kernel_version < KERNEL_VERSION(2, 6, 31))
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_26;
> >      else if(check_5level_paging())
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_5LEVEL;
> >      else
> >          info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_31;
> >
> >      ...
> > }
> >
> > Looking at the above, two questions come to my mind:
> >
> > - Do we really need all the above complexity in user-space code, to hoop
> > across various kernel versions and perform allocations for something
> > that can be so easily exported via vmcoreinfo? Also we need to see how
> > portable is the above code for a new kernel version - IMO, it will need
> > another fix patch when we update to a new kernel version in near future.
> 
> I agree -- not to mention that the "kernel version" way of determining things
> does not account for distribution-specific backports.
> 
> >
> > - Also do we need to replicate the above implementations across
> > user-space tools when they can also utilize the vmcoreinfo information
> > to determine the PA_BITS range without any additional arch/kernel
> > version specific details as the single point of obtaining this
> > information from the kernel?
> >
> > So, in view of the above, I would still advocate that we use a
> > vmcoreinfo export for 'MAX_PHYSMEM_BITS' as well to have a uniform
> > interface for the same across all user-land applications.
> 
> Again, totally agree.

I also agree that we should do so. Then it will be better to have
it in kernel core code, not in arch-specific code.

Although makedumpfile may have to have the kludge for kernels that
support 52-bit PA and don't have the exported MAX_PHYSMEM_BITS
sooner or later..

Thanks,
Kazu


> 
> Dave
> 
> 
> > Thanks,
> > Bhupesh
> >
> >
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Bhupesh Sharma March 1, 2019, 4:01 a.m. UTC | #28
Hi Kazu,

On 02/22/2019 12:32 AM, Kazuhito Hagio wrote:
> -----Original Message-----
>> ----- Original Message -----
>>> Hi Kazu,
>>>
>>> On 02/20/2019 02:17 AM, Kazuhito Hagio wrote:
>>>> Hi Bhupesh,
>>>>
>>>> -----Original Message-----
>>>>> I am not sure you got a chance to look at the two regression cases I
>>>>> reported here:
>>>>> <http://lists.infradead.org/pipermail/kexec/2019-February/022449.html>
>>>>>
>>>>> Unfortunately the above suggestion doesn't provide any fix for
>>>>> ARMv8.2-LPA regression (see text under heading '
>>>>> (1). Regression Case 1 (ARMv8.2-LPA enabled kernel)')
>>>>
>>>> As for MAX_PHYSMEM_BITS, I realized that ppc64 makedumpfile can detect
>>>> it because there is only one SECTION_SIZE_BITS for ppc64. I think we
>>>> can use the same way as set_ppc64_max_physmem_bits() does also for
>>>> arm64 for now. I'm going to write it for kernels not having
>>>> NUMBER(MAX_PHYSMEM_BITS) in vmcoreinfo.
>>>
>>> I see two drawbacks with the above approach:
>>>
>>> a). This means that other user-space tools like crash-utility would
>>> still be broken and would probably need to find MAX_PHYSMEM_BITS for
>>> arm64 via a similar (hack'ish ?) approach.
>>>
>>> b). I am looking at the makedumpfile code for 'MAX_PHYSMEM_BITS'
>>> determination for two archs as an example:
>>>
>>> ppc
>>> ---
>>>
>>> int
>>> set_ppc64_max_physmem_bits(void)
>>> {
>>>           long array_len = ARRAY_LENGTH(mem_section);
>>>           /*
>>>            * The older ppc64 kernels uses _MAX_PHYSMEM_BITS as 42 and the
>>>            * newer kernels 3.7 onwards uses 46 bits.
>>>            */
>>>
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG ;
>>>           if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>>>                   || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>>>                   return TRUE;
>>>
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_3_7;
>>>           if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>>>                   || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>>>                   return TRUE;
>>>
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_19;
>>>           if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>>>                   || (array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT())))
>>>                   return TRUE;
>>>
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_4_20;
>>>           if ((array_len == (NR_MEM_SECTIONS() / _SECTIONS_PER_ROOT_EXTREME()))
>>>                   || (array_len == (NR_MEM_SECTIONS() /  _SECTIONS_PER_ROOT())))
>>>                   return TRUE;
>>>
>>>           return FALSE;
>>> }
>>>
>>> x86_64:
>>> ------
>>>
>>> int
>>> get_versiondep_info_x86_64(void)
>>> {
>>>       /*
>>>        * On linux-2.6.26, MAX_PHYSMEM_BITS is changed to 44 from 40.
>>>        */
>>>       if (info->kernel_version < KERNEL_VERSION(2, 6, 26))
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_ORIG;
>>>       else if (info->kernel_version < KERNEL_VERSION(2, 6, 31))
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_26;
>>>       else if(check_5level_paging())
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_5LEVEL;
>>>       else
>>>           info->max_physmem_bits  = _MAX_PHYSMEM_BITS_2_6_31;
>>>
>>>       ...
>>> }
>>>
>>> Looking at the above, two questions come to my mind:
>>>
>>> - Do we really need all the above complexity in user-space code, to hoop
>>> across various kernel versions and perform allocations for something
>>> that can be so easily exported via vmcoreinfo? Also we need to see how
>>> portable is the above code for a new kernel version - IMO, it will need
>>> another fix patch when we update to a new kernel version in near future.
>>
>> I agree -- not to mention that the "kernel version" way of determining things
>> does not account for distribution-specific backports.
>>
>>>
>>> - Also do we need to replicate the above implementations across
>>> user-space tools when they can also utilize the vmcoreinfo information
>>> to determine the PA_BITS range without any additional arch/kernel
>>> version specific details as the single point of obtaining this
>>> information from the kernel?
>>>
>>> So, in view of the above, I would still advocate that we use a
>>> vmcoreinfo export for 'MAX_PHYSMEM_BITS' as well to have a uniform
>>> interface for the same across all user-land applications.
>>
>> Again, totally agree.
> 
> I also agree that we should do so. Then it will be better to have
> it in kernel core code, not in arch-specific code.
> 
> Although makedumpfile may have to have the kludge for kernels that
> support 52-bit PA and don't have the exported MAX_PHYSMEM_BITS
> sooner or later..

I was waiting for maintainers Cc'ed in this thread to share their 
opinions on the above suggestion.

We especially need comments from x86 maintainers on this (Boris), since 
x86 and ppc user-space utilities are currently not broken (no regression 
reported) due to a lacking export of 'MAX_PHYSMEM_BITS' via vmcore, 
whereas as I shared earlier arm64 user-space utilities are broken right 
now with 52-bit address space changes.

Thanks,
Bhupesh
diff mbox series

Patch

diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index ca4c3e12d8c5..ad231be5c0d8 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -10,6 +10,8 @@ 
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(VA_BITS);
+	VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
+	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
 	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
 	vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
 						kimage_voffset);