
[5/5] arm64: kdump: Don't defer the reservation of crash high memory

Message ID 20220613080932.663-6-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Series arm64: kdump: Function supplement and performance optimization

Commit Message

Zhen Lei June 13, 2022, 8:09 a.m. UTC
If the crashkernel has both high memory above DMA zones and low memory
in DMA zones, kexec always loads the content such as Image and dtb to the
high memory instead of the low memory. This means that only high memory
requires write protection based on page-level mapping. The allocation of
high memory does not depend on the DMA boundary. So we can reserve the
high memory first even if the crashkernel reservation is deferred.

This means that the block mapping can still be performed on other kernel
linear address spaces, the TLB miss rate can be reduced and the system
performance will be improved.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 6 deletions(-)
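
For clarity, here is a small standalone sketch that only encodes the reservation
phases described above (and the decision table added by the patch further down):
which part is reserved in the early pass, before the DMA limit is known, and which
in the deferred pass. The names are illustrative, not kernel APIs; compiling and
running it just prints the eight table rows.

#include <stdio.h>
#include <stdbool.h>

enum dma_phase { DMA_LIMIT_UNKNOWN, DMA_LIMIT_KNOWN };

/* Which reservation is performed in a given pass, per the table in the patch. */
static const char *action(bool dma_enabled, bool high_used, enum dma_phase phase)
{
	if (!high_used)
		/* crashkernel=X[@offset]: reserve once the usable limit is settled */
		return (dma_enabled == (phase == DMA_LIMIT_KNOWN)) ? "reserve low" : "nop";
	if (!dma_enabled)
		/* no DMA zones: both high and low parts can be reserved early */
		return (phase == DMA_LIMIT_UNKNOWN) ? "reserve high (and low)" : "nop";
	/* DMA zones present and crashkernel=,high: split across the two passes */
	return (phase == DMA_LIMIT_UNKNOWN) ? "reserve high" : "reserve low";
}

int main(void)
{
	static const char *names[] = { "unknown", "known" };
	int dma, high, p;

	for (dma = 0; dma <= 1; dma++)
		for (high = 0; high <= 1; high++)
			for (p = 0; p <= 1; p++)
				printf("DMA=%d X,high=%d limit=%-7s -> %s\n",
				       dma, high, names[p], action(dma, high, p));
	return 0;
}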

Comments

Baoquan He June 21, 2022, 5:33 a.m. UTC | #1
Hi,

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> If the crashkernel has both high memory above DMA zones and low memory
> in DMA zones, kexec always loads the content such as Image and dtb to the
> high memory instead of the low memory. This means that only high memory
> requires write protection based on page-level mapping. The allocation of
> high memory does not depend on the DMA boundary. So we can reserve the
> high memory first even if the crashkernel reservation is deferred.
> 
> This means that the block mapping can still be performed on other kernel
> linear address spaces, the TLB miss rate can be reduced and the system
> performance will be improved.

Ugh, this looks a little ugly, honestly.

If it's for sure that arm64 can't split large page mappings of the linear
region, this patch is one way to optimize the linear mapping. Given that a
kdump setup is necessary on arm64 servers, the boot speed is truly
impacted heavily.

However, I would suggest leaving it as is, for the reasons below:

1) The change will complicate the crashkernel reservation code, which
is already difficult to understand.
2) It only optimizes two cases: the first is CONFIG_ZONE_DMA|DMA32
  disabled, the other is crashkernel=,high being specified. Both are
  corner cases; most systems have CONFIG_ZONE_DMA|DMA32 enabled, and
  most systems use crashkernel=xM, which is enough. Optimizing these
  cases won't bring a benefit to most systems.
3) Besides, crashkernel=,high can be handled earlier because arm64
  always has memblock.bottom_up == false currently, thus we don't need
  to worry about the lower limit of the crashkernel,high reservation
  for now. If memblock.bottom_up is set to true in the future, this
  patch won't work any more.


...
        crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                               crash_base, crash_max);

So, in my opinion, we can leave the current NON_BLOCK|SECT mapping
caused by crashkernel reserving as is, since it brings no regression,
and meanwhile check whether there is any way to make the contiguous
linear mapping work with later splitting. Patches 4 and 5 in this
patchset don't make much sense to me, frankly speaking.

Thanks
Baoquan

> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
>  	char *cmdline = boot_command_line;
>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
> -	int ret;
> +	int ret, skip_res = 0, skip_low_res = 0;
>  	bool fixed_base;
>  
>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
>  		return;
>  
> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
> -		return;
> +	/*
> +	 * In the following table:
> +	 * X,high  means crashkernel=X,high
> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
> +	 *
> +	 * The first two columns indicate the status, and the last two
> +	 * columns indicate the phase in which crash high or low memory
> +	 * needs to be reserved.
> +	 *  ---------------------------------------------------
> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> +	 *  ---------------------------------------------------
> +	 * |      N            N       |    low    |    NOP    |
> +	 * |      Y            N       |    NOP    |    low    |
> +	 * |      N            Y       |  high/low |    NOP    |
> +	 * |      Y            Y       |    high   |    low    |
> +	 *  ---------------------------------------------------
> +	 *
> +	 * But in this function, the crash high memory allocation of
> +	 * crashkernel=Y,high and the crash low memory allocation of
> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
> +	 * So the table above need to be adjusted as below:
> +	 *  ---------------------------------------------------
> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> +	 *  ---------------------------------------------------
> +	 * |      N            N       |    res    |    NOP    |
> +	 * |      Y            N       |    NOP    |    res    |
> +	 * |      N            Y       |res/low_res|    NOP    |
> +	 * |      Y            Y       |    res    |  low_res  |
> +	 *  ---------------------------------------------------
> +	 *
> +	 */
>  
>  	/* crashkernel=X[@offset] */
>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
>  		else if (ret)
>  			return;
>  
> +		/* See the third row of the second table above, NOP */
> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
> +			return;
> +
> +		/* See the fourth row of the second table above */
> +		if (dma_enabled) {
> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
> +				skip_low_res = 1;
> +			else
> +				skip_res = 1;
> +		}
> +
>  		crash_max = CRASH_ADDR_HIGH_MAX;
>  	} else if (ret || !crash_size) {
>  		/* The specified value is invalid */
>  		return;
> +	} else {
> +		/* See the 1-2 rows of the second table above, NOP */
> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
> +			return;
> +	}
> +
> +	if (skip_res) {
> +		crash_base = crashk_res.start;
> +		crash_size = crashk_res.end - crashk_res.start + 1;
> +		goto check_low;
>  	}
>  
>  	fixed_base = !!crash_base;
> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
>  		return;
>  	}
>  
> +	crashk_res.start = crash_base;
> +	crashk_res.end = crash_base + crash_size - 1;
> +
> +check_low:
> +	if (skip_low_res)
> +		return;
> +
>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
>  		memblock_phys_free(crash_base, crash_size);
> +		crashk_res.start = 0;
> +		crashk_res.end = 0;
>  		return;
>  	}
>  
> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
>  	if (crashk_low_res.end)
>  		kmemleak_ignore_phys(crashk_low_res.start);
>  
> -	crashk_res.start = crash_base;
> -	crashk_res.end = crash_base + crash_size - 1;
>  	insert_resource(&iomem_resource, &crashk_res);
>  }
>  
> -- 
> 2.25.1
>
Kefeng Wang June 21, 2022, 6:24 a.m. UTC | #2
On 2022/6/21 13:33, Baoquan He wrote:
> Hi,
>
> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>> If the crashkernel has both high memory above DMA zones and low memory
>> in DMA zones, kexec always loads the content such as Image and dtb to the
>> high memory instead of the low memory. This means that only high memory
>> requires write protection based on page-level mapping. The allocation of
>> high memory does not depend on the DMA boundary. So we can reserve the
>> high memory first even if the crashkernel reservation is deferred.
>>
>> This means that the block mapping can still be performed on other kernel
>> linear address spaces, the TLB miss rate can be reduced and the system
>> performance will be improved.
> Ugh, this looks a little ugly, honestly.
>
> If that's for sure arm64 can't split large page mapping of linear
> region, this patch is one way to optimize linear mapping. Given kdump
> setting is necessary on arm64 server, the booting speed is truly
> impacted heavily.

Is there some conclusion or discussion showing that arm64 can't split large
page mappings?

Could the crashkernel reservation (and the KFENCE pool) be split dynamically?

I found Mark's reply in "arm64: remove page granularity limitation from
KFENCE" [1]:

   "We also avoid live changes from block<->table mappings, since the
   archtitecture gives us very weak guarantees there and generally requires
   a Break-Before-Make sequence (though IIRC this was tightened up
   somewhat, so maybe going one way is supposed to work). Unless it's
   really necessary, I'd rather not split these block mappings while
   they're live."

Hi Mark and Catalin, could you give some comments? Many thanks.

[1] 
https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381


>
> However, I would suggest letting it as is with below reasons:
>
> 1) The code will complicate the crashkernel reservatoin code which
> is already difficult to understand.
> 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32
>    disabled, the other is crashkernel=,high is specified. While both
>    two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32
>    enabled, and most of systems have crashkernel=xM which is enough.
>    Having them optimized won't bring benefit to most of systems.
> 3) Besides, the crashkernel=,high can be handled earlier because
>    arm64 alwasys have memblock.bottom_up == false currently, thus we
>    don't need worry arbout the lower limit of crashkernel,high
>    reservation for now. If memblock.bottom_up is set true in the future,
>    this patch doesn't work any more.
>
>
> ...
>          crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
>                                                 crash_base, crash_max);
>
> So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as
> is caused by crashkernel reserving, since no regression is brought.
> And meantime, turning to check if there's any way to make the contiguous
> linear mapping and later splitting work. The patch 4, 5 in this patchset
> doesn't make much sense to me, frankly speaking.
>
> Thanks
> Baoquan
Zhen Lei June 21, 2022, 7:56 a.m. UTC | #3
On 2022/6/21 13:33, Baoquan He wrote:
> Hi,
> 
> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>> If the crashkernel has both high memory above DMA zones and low memory
>> in DMA zones, kexec always loads the content such as Image and dtb to the
>> high memory instead of the low memory. This means that only high memory
>> requires write protection based on page-level mapping. The allocation of
>> high memory does not depend on the DMA boundary. So we can reserve the
>> high memory first even if the crashkernel reservation is deferred.
>>
>> This means that the block mapping can still be performed on other kernel
>> linear address spaces, the TLB miss rate can be reduced and the system
>> performance will be improved.
> 
> Ugh, this looks a little ugly, honestly.
> 
> If that's for sure arm64 can't split large page mapping of linear
> region, this patch is one way to optimize linear mapping. Given kdump
> setting is necessary on arm64 server, the booting speed is truly
> impacted heavily.

There is also a performance impact when running.

> 
> However, I would suggest letting it as is with below reasons:
> 
> 1) The code will complicate the crashkernel reservatoin code which
> is already difficult to understand. 

Yeah, I feel it, too.

> 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32
>   disabled, the other is crashkernel=,high is specified. While both
>   two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32
>   enabled, and most of systems have crashkernel=xM which is enough.
>   Having them optimized won't bring benefit to most of systems.

The case of CONFIG_ZONE_DMA|DMA32 disabled has been resolved by
commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones").
Currently, the performance problem to be optimized is the case where DMA is enabled.


> 3) Besides, the crashkernel=,high can be handled earlier because 
>   arm64 alwasys have memblock.bottom_up == false currently, thus we
>   don't need worry arbout the lower limit of crashkernel,high
>   reservation for now. If memblock.bottom_up is set true in the future,
>   this patch doesn't work any more.
> 
> 
> ...
>         crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
>                                                crash_base, crash_max);
> 
> So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as
> is caused by crashkernel reserving, since no regression is brought.
> And meantime, turning to check if there's any way to make the contiguous
> linear mapping and later splitting work. The patch 4, 5 in this patchset
> doesn't make much sense to me, frankly speaking.

OK. As discussed earlier, I can rethink whether there is a better way for patches 4-5,
and focus on patches 1-2 this time. In this way, all the functionality is complete,
and only the optimization is left.

> 
> Thanks
> Baoquan
> 
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 65 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
>>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
>>  	char *cmdline = boot_command_line;
>>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
>> -	int ret;
>> +	int ret, skip_res = 0, skip_low_res = 0;
>>  	bool fixed_base;
>>  
>>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
>>  		return;
>>  
>> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
>> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
>> -		return;
>> +	/*
>> +	 * In the following table:
>> +	 * X,high  means crashkernel=X,high
>> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
>> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
>> +	 *
>> +	 * The first two columns indicate the status, and the last two
>> +	 * columns indicate the phase in which crash high or low memory
>> +	 * needs to be reserved.
>> +	 *  ---------------------------------------------------
>> +	 * | DMA enabled | X,high used |  unknown  |   known   |
>> +	 *  ---------------------------------------------------
>> +	 * |      N            N       |    low    |    NOP    |
>> +	 * |      Y            N       |    NOP    |    low    |
>> +	 * |      N            Y       |  high/low |    NOP    |
>> +	 * |      Y            Y       |    high   |    low    |
>> +	 *  ---------------------------------------------------
>> +	 *
>> +	 * But in this function, the crash high memory allocation of
>> +	 * crashkernel=Y,high and the crash low memory allocation of
>> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
>> +	 * So the table above need to be adjusted as below:
>> +	 *  ---------------------------------------------------
>> +	 * | DMA enabled | X,high used |  unknown  |   known   |
>> +	 *  ---------------------------------------------------
>> +	 * |      N            N       |    res    |    NOP    |
>> +	 * |      Y            N       |    NOP    |    res    |
>> +	 * |      N            Y       |res/low_res|    NOP    |
>> +	 * |      Y            Y       |    res    |  low_res  |
>> +	 *  ---------------------------------------------------
>> +	 *
>> +	 */
>>  
>>  	/* crashkernel=X[@offset] */
>>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
>> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
>>  		else if (ret)
>>  			return;
>>  
>> +		/* See the third row of the second table above, NOP */
>> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
>> +			return;
>> +
>> +		/* See the fourth row of the second table above */
>> +		if (dma_enabled) {
>> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
>> +				skip_low_res = 1;
>> +			else
>> +				skip_res = 1;
>> +		}
>> +
>>  		crash_max = CRASH_ADDR_HIGH_MAX;
>>  	} else if (ret || !crash_size) {
>>  		/* The specified value is invalid */
>>  		return;
>> +	} else {
>> +		/* See the 1-2 rows of the second table above, NOP */
>> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
>> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
>> +			return;
>> +	}
>> +
>> +	if (skip_res) {
>> +		crash_base = crashk_res.start;
>> +		crash_size = crashk_res.end - crashk_res.start + 1;
>> +		goto check_low;
>>  	}
>>  
>>  	fixed_base = !!crash_base;
>> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
>>  		return;
>>  	}
>>  
>> +	crashk_res.start = crash_base;
>> +	crashk_res.end = crash_base + crash_size - 1;
>> +
>> +check_low:
>> +	if (skip_low_res)
>> +		return;
>> +
>>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
>>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
>>  		memblock_phys_free(crash_base, crash_size);
>> +		crashk_res.start = 0;
>> +		crashk_res.end = 0;
>>  		return;
>>  	}
>>  
>> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
>>  	if (crashk_low_res.end)
>>  		kmemleak_ignore_phys(crashk_low_res.start);
>>  
>> -	crashk_res.start = crash_base;
>> -	crashk_res.end = crash_base + crash_size - 1;
>>  	insert_resource(&iomem_resource, &crashk_res);
>>  }
>>  
>> -- 
>> 2.25.1
>>
> 
> .
>
Baoquan He June 21, 2022, 9:27 a.m. UTC | #4
On 06/21/22 at 02:24pm, Kefeng Wang wrote:
> 
> On 2022/6/21 13:33, Baoquan He wrote:
> > Hi,
> > 
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > If the crashkernel has both high memory above DMA zones and low memory
> > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > high memory instead of the low memory. This means that only high memory
> > > requires write protection based on page-level mapping. The allocation of
> > > high memory does not depend on the DMA boundary. So we can reserve the
> > > high memory first even if the crashkernel reservation is deferred.
> > > 
> > > This means that the block mapping can still be performed on other kernel
> > > linear address spaces, the TLB miss rate can be reduced and the system
> > > performance will be improved.
> > Ugh, this looks a little ugly, honestly.
> > 
> > If that's for sure arm64 can't split large page mapping of linear
> > region, this patch is one way to optimize linear mapping. Given kdump
> > setting is necessary on arm64 server, the booting speed is truly
> > impacted heavily.
> 
> Is there some conclusion or discussion that arm64 can't split large page
> mapping?

Yes, please see the log of the commit below:
commit d27cfa1fc823 ("arm64: mm: set the contiguous bit for kernel mappings where appropriate")

> 
> Could the crashkernel reservation (and Kfence pool) be splited dynamically?

For the crashkernel region, we have arch_kexec_protect_crashkres() to secure
the region, and crash_shrink_memory() can be called to shrink it.
Since the crashkernel region could cross part of a block or section
mapping, that mapping would need to be split, which will cause TLB conflicts.
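
As a rough sketch of what that protection does and why it needs page
granularity: it is essentially a validity flip on the loaded segments, page by
page, along the lines of arm64's arch_kexec_protect_crashkres() (simplified
here; see arch/arm64/kernel/machine_kexec.c for the real code, details vary by
kernel version):

void arch_kexec_protect_crashkres(void)
{
	int i;

	/* Invalidate the loaded segments in the linear map one page at a
	 * time; this only works if the region is mapped at page granularity. */
	for (i = 0; i < kexec_crash_image->nr_segments; i++)
		set_memory_valid(
			__phys_to_virt(kexec_crash_image->segment[i].mem),
			kexec_crash_image->segment[i].memsz >> PAGE_SHIFT, 0);
}
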

> 
> I found Mark replay "arm64: remove page granularity limitation from
> KFENCE"[1],
> 
>   "We also avoid live changes from block<->table mappings, since the
>   archtitecture gives us very weak guarantees there and generally requires
>   a Break-Before-Make sequence (though IIRC this was tightened up
>   somewhat, so maybe going one way is supposed to work). Unless it's
>   really necessary, I'd rather not split these block mappings while
>   they're live."
> 
> Hi Mark and Catalin,  could you give some comment,  many thanks.
> 
> [1] https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381
> 
> 
> > 
> > However, I would suggest letting it as is with below reasons:
> > 
> > 1) The code will complicate the crashkernel reservatoin code which
> > is already difficult to understand.
> > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32
> >    disabled, the other is crashkernel=,high is specified. While both
> >    two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32
> >    enabled, and most of systems have crashkernel=xM which is enough.
> >    Having them optimized won't bring benefit to most of systems.
> > 3) Besides, the crashkernel=,high can be handled earlier because
> >    arm64 alwasys have memblock.bottom_up == false currently, thus we
> >    don't need worry arbout the lower limit of crashkernel,high
> >    reservation for now. If memblock.bottom_up is set true in the future,
> >    this patch doesn't work any more.
> > 
> > 
> > ...
> >          crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
> >                                                 crash_base, crash_max);
> > 
> > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as
> > is caused by crashkernel reserving, since no regression is brought.
> > And meantime, turning to check if there's any way to make the contiguous
> > linear mapping and later splitting work. The patch 4, 5 in this patchset
> > doesn't make much sense to me, frankly speaking.
> > 
> > Thanks
> > Baoquan
>
Baoquan He June 21, 2022, 9:35 a.m. UTC | #5
On 06/21/22 at 03:56pm, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/6/21 13:33, Baoquan He wrote:
> > Hi,
> > 
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> >> If the crashkernel has both high memory above DMA zones and low memory
> >> in DMA zones, kexec always loads the content such as Image and dtb to the
> >> high memory instead of the low memory. This means that only high memory
> >> requires write protection based on page-level mapping. The allocation of
> >> high memory does not depend on the DMA boundary. So we can reserve the
> >> high memory first even if the crashkernel reservation is deferred.
> >>
> >> This means that the block mapping can still be performed on other kernel
> >> linear address spaces, the TLB miss rate can be reduced and the system
> >> performance will be improved.
> > 
> > Ugh, this looks a little ugly, honestly.
> > 
> > If that's for sure arm64 can't split large page mapping of linear
> > region, this patch is one way to optimize linear mapping. Given kdump
> > setting is necessary on arm64 server, the booting speed is truly
> > impacted heavily.
> 
> There is also a performance impact when running.

Yes, indeed, the TLB flush will happen more often.

> 
> > 
> > However, I would suggest letting it as is with below reasons:
> > 
> > 1) The code will complicate the crashkernel reservatoin code which
> > is already difficult to understand. 
> 
> Yeah, I feel it, too.
> 
> > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32
> >   disabled, the other is crashkernel=,high is specified. While both
> >   two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32
> >   enabled, and most of systems have crashkernel=xM which is enough.
> >   Having them optimized won't bring benefit to most of systems.
> 
> The case of CONFIG_ZONE_DMA|DMA32 disabled have been resolved by
> commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones").
> Currently the performance problem to be optimized is that DMA is enabled.

Yes, the disabled CONFIG_ZONE_DMA|DMA32 case has avoided the problem, since
its boundary is already decided at that time. Crashkernel=,high can also
avoid it, benefiting from top-down memblock allocation. However,
crashkernel=xM, which now gets the fallback support and is the main syntax
we will use, still has the problem.

> 
> 
> > 3) Besides, the crashkernel=,high can be handled earlier because 
> >   arm64 alwasys have memblock.bottom_up == false currently, thus we
> >   don't need worry arbout the lower limit of crashkernel,high
> >   reservation for now. If memblock.bottom_up is set true in the future,
> >   this patch doesn't work any more.
> > 
> > 
> > ...
> >         crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
> >                                                crash_base, crash_max);
> > 
> > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as
> > is caused by crashkernel reserving, since no regression is brought.
> > And meantime, turning to check if there's any way to make the contiguous
> > linear mapping and later splitting work. The patch 4, 5 in this patchset
> > doesn't make much sense to me, frankly speaking.
> 
> OK. As discussed earlier, I can rethink if there is a better way to patch 4-5,
> and this time focus on patch 1-2. In this way, all the functions are complete,
> and only optimization is left.

Sounds nice, thx.

> > 
> >>
> >> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> >> ---
> >>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 65 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
> >>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
> >>  	char *cmdline = boot_command_line;
> >>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
> >> -	int ret;
> >> +	int ret, skip_res = 0, skip_low_res = 0;
> >>  	bool fixed_base;
> >>  
> >>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
> >>  		return;
> >>  
> >> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
> >> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
> >> -		return;
> >> +	/*
> >> +	 * In the following table:
> >> +	 * X,high  means crashkernel=X,high
> >> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
> >> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
> >> +	 *
> >> +	 * The first two columns indicate the status, and the last two
> >> +	 * columns indicate the phase in which crash high or low memory
> >> +	 * needs to be reserved.
> >> +	 *  ---------------------------------------------------
> >> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> >> +	 *  ---------------------------------------------------
> >> +	 * |      N            N       |    low    |    NOP    |
> >> +	 * |      Y            N       |    NOP    |    low    |
> >> +	 * |      N            Y       |  high/low |    NOP    |
> >> +	 * |      Y            Y       |    high   |    low    |
> >> +	 *  ---------------------------------------------------
> >> +	 *
> >> +	 * But in this function, the crash high memory allocation of
> >> +	 * crashkernel=Y,high and the crash low memory allocation of
> >> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
> >> +	 * So the table above need to be adjusted as below:
> >> +	 *  ---------------------------------------------------
> >> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> >> +	 *  ---------------------------------------------------
> >> +	 * |      N            N       |    res    |    NOP    |
> >> +	 * |      Y            N       |    NOP    |    res    |
> >> +	 * |      N            Y       |res/low_res|    NOP    |
> >> +	 * |      Y            Y       |    res    |  low_res  |
> >> +	 *  ---------------------------------------------------
> >> +	 *
> >> +	 */
> >>  
> >>  	/* crashkernel=X[@offset] */
> >>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
> >> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
> >>  		else if (ret)
> >>  			return;
> >>  
> >> +		/* See the third row of the second table above, NOP */
> >> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
> >> +			return;
> >> +
> >> +		/* See the fourth row of the second table above */
> >> +		if (dma_enabled) {
> >> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
> >> +				skip_low_res = 1;
> >> +			else
> >> +				skip_res = 1;
> >> +		}
> >> +
> >>  		crash_max = CRASH_ADDR_HIGH_MAX;
> >>  	} else if (ret || !crash_size) {
> >>  		/* The specified value is invalid */
> >>  		return;
> >> +	} else {
> >> +		/* See the 1-2 rows of the second table above, NOP */
> >> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
> >> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
> >> +			return;
> >> +	}
> >> +
> >> +	if (skip_res) {
> >> +		crash_base = crashk_res.start;
> >> +		crash_size = crashk_res.end - crashk_res.start + 1;
> >> +		goto check_low;
> >>  	}
> >>  
> >>  	fixed_base = !!crash_base;
> >> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
> >>  		return;
> >>  	}
> >>  
> >> +	crashk_res.start = crash_base;
> >> +	crashk_res.end = crash_base + crash_size - 1;
> >> +
> >> +check_low:
> >> +	if (skip_low_res)
> >> +		return;
> >> +
> >>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
> >>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
> >>  		memblock_phys_free(crash_base, crash_size);
> >> +		crashk_res.start = 0;
> >> +		crashk_res.end = 0;
> >>  		return;
> >>  	}
> >>  
> >> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
> >>  	if (crashk_low_res.end)
> >>  		kmemleak_ignore_phys(crashk_low_res.start);
> >>  
> >> -	crashk_res.start = crash_base;
> >> -	crashk_res.end = crash_base + crash_size - 1;
> >>  	insert_resource(&iomem_resource, &crashk_res);
> >>  }
> >>  
> >> -- 
> >> 2.25.1
> >>
> > 
> > .
> > 
> 
> -- 
> Regards,
>   Zhen Lei
>
Catalin Marinas June 21, 2022, 6:04 p.m. UTC | #6
On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> On 2022/6/21 13:33, Baoquan He wrote:
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > If the crashkernel has both high memory above DMA zones and low memory
> > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > high memory instead of the low memory. This means that only high memory
> > > requires write protection based on page-level mapping. The allocation of
> > > high memory does not depend on the DMA boundary. So we can reserve the
> > > high memory first even if the crashkernel reservation is deferred.
> > > 
> > > This means that the block mapping can still be performed on other kernel
> > > linear address spaces, the TLB miss rate can be reduced and the system
> > > performance will be improved.
> > 
> > Ugh, this looks a little ugly, honestly.
> > 
> > If that's for sure arm64 can't split large page mapping of linear
> > region, this patch is one way to optimize linear mapping. Given kdump
> > setting is necessary on arm64 server, the booting speed is truly
> > impacted heavily.
> 
> Is there some conclusion or discussion that arm64 can't split large page
> mapping?
> 
> Could the crashkernel reservation (and Kfence pool) be splited dynamically?
> 
> I found Mark replay "arm64: remove page granularity limitation from
> KFENCE"[1],
> 
>   "We also avoid live changes from block<->table mappings, since the
>   archtitecture gives us very weak guarantees there and generally requires
>   a Break-Before-Make sequence (though IIRC this was tightened up
>   somewhat, so maybe going one way is supposed to work). Unless it's
>   really necessary, I'd rather not split these block mappings while
>   they're live."

The problem with splitting is that you can end up with two entries in
the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
abort (but can be worse like loss of coherency).

Prior to FEAT_BBM (added in ARMv8.4), such a scenario was not allowed at
all, the software would have to unmap the range, TLBI, remap. With
FEAT_BBM (level 2), we can do this without tearing the mapping down but
we still need to handle the potential TLB conflict abort. The handler
only needs a TLBI but if it touches the memory range being changed it
risks faulting again. With vmap stacks and the kernel image mapped in
the vmalloc space, we have a small window where this could be handled
but we probably can't go into the C part of the exception handling
(tracing etc. may access a kmalloc'ed object for example).

Another option is to do a stop_machine() (if multi-processor at that
point), disable the MMUs, modify the page tables, re-enable the MMU but
it's also complicated.
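
For reference, the break-before-make sequence above, applied to splitting one
2MB linear-map block into 4KB pages, looks roughly like the sketch below. This
is only an illustration of the ordering (unmap, TLBI, remap) using common arm64
page-table helpers, not the real mmu.c code; it assumes a pre-allocated PTE
page and that nothing touches this VA range while it is unmapped, which is
exactly what is hard to guarantee for the linear map.

#include <linux/mm.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/* Illustrative only: BBM split of one PMD block into PTEs. */
static void split_block_bbm(pmd_t *pmdp, pte_t *ptep, unsigned long addr,
			    phys_addr_t phys, pgprot_t prot)
{
	int i;

	pmd_clear(pmdp);                                /* break: remove the block entry */
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);  /* drop any cached walk of it */

	for (i = 0; i < PTRS_PER_PTE; i++)              /* make: same range, page size */
		set_pte(ptep + i,
			pfn_pte(__phys_to_pfn(phys + i * PAGE_SIZE), prot));

	__pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE); /* point the PMD at the table */
}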
Baoquan He June 22, 2022, 8:35 a.m. UTC | #7
Hi Catalin,

On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > On 2022/6/21 13:33, Baoquan He wrote:
> > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > high memory instead of the low memory. This means that only high memory
> > > > requires write protection based on page-level mapping. The allocation of
> > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > high memory first even if the crashkernel reservation is deferred.
> > > > 
> > > > This means that the block mapping can still be performed on other kernel
> > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > performance will be improved.
> > > 
> > > Ugh, this looks a little ugly, honestly.
> > > 
> > > If that's for sure arm64 can't split large page mapping of linear
> > > region, this patch is one way to optimize linear mapping. Given kdump
> > > setting is necessary on arm64 server, the booting speed is truly
> > > impacted heavily.
> > 
> > Is there some conclusion or discussion that arm64 can't split large page
> > mapping?
> > 
> > Could the crashkernel reservation (and Kfence pool) be splited dynamically?
> > 
> > I found Mark replay "arm64: remove page granularity limitation from
> > KFENCE"[1],
> > 
> >   "We also avoid live changes from block<->table mappings, since the
> >   archtitecture gives us very weak guarantees there and generally requires
> >   a Break-Before-Make sequence (though IIRC this was tightened up
> >   somewhat, so maybe going one way is supposed to work). Unless it's
> >   really necessary, I'd rather not split these block mappings while
> >   they're live."
> 
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).

Thanks for this explanation. Is this a drawback of the arm64 design? The x86
code does the same thing without issue; is there a way to overcome this on
arm64 from the hardware or software side?

I once had an arm64 server with huge memory, and the boot-up time differed
with and without the crashkernel setting. And more frequent TLB misses and
flushes will cost performance. It is really a pity if we have a very powerful
arm64 CPU and system capacity but are bottlenecked by this drawback.

> 
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).
> 
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
> 
> -- 
> Catalin
>
Kefeng Wang June 22, 2022, 12:03 p.m. UTC | #8
On 2022/6/22 2:04, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
>> On 2022/6/21 13:33, Baoquan He wrote:
>>> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>>>> If the crashkernel has both high memory above DMA zones and low memory
>>>> in DMA zones, kexec always loads the content such as Image and dtb to the
>>>> high memory instead of the low memory. This means that only high memory
>>>> requires write protection based on page-level mapping. The allocation of
>>>> high memory does not depend on the DMA boundary. So we can reserve the
>>>> high memory first even if the crashkernel reservation is deferred.
>>>>
>>>> This means that the block mapping can still be performed on other kernel
>>>> linear address spaces, the TLB miss rate can be reduced and the system
>>>> performance will be improved.
>>> Ugh, this looks a little ugly, honestly.
>>>
>>> If that's for sure arm64 can't split large page mapping of linear
>>> region, this patch is one way to optimize linear mapping. Given kdump
>>> setting is necessary on arm64 server, the booting speed is truly
>>> impacted heavily.
>> Is there some conclusion or discussion that arm64 can't split large page
>> mapping?
>>
>> Could the crashkernel reservation (and Kfence pool) be splited dynamically?
>>
>> I found Mark replay "arm64: remove page granularity limitation from
>> KFENCE"[1],
>>
>>    "We also avoid live changes from block<->table mappings, since the
>>    archtitecture gives us very weak guarantees there and generally requires
>>    a Break-Before-Make sequence (though IIRC this was tightened up
>>    somewhat, so maybe going one way is supposed to work). Unless it's
>>    really necessary, I'd rather not split these block mappings while
>>    they're live."
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).
Thanks for your explanation,
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).

So without FEAT_BBM, we can only guarantee the BBM sequence via
"unmap the range, TLBI, remap" or the option quoted below; and with
FEAT_BBM (level 2), we could have an easy way to avoid TLB conflicts for
some vmalloc space, but it is still hard to deal with other scenarios?


>
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
>
Catalin Marinas June 23, 2022, 10:27 a.m. UTC | #9
On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote:
> On 2022/6/22 2:04, Catalin Marinas wrote:
> > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > > On 2022/6/21 13:33, Baoquan He wrote:
> > > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > > high memory instead of the low memory. This means that only high memory
> > > > > requires write protection based on page-level mapping. The allocation of
> > > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > > high memory first even if the crashkernel reservation is deferred.
> > > > > 
> > > > > This means that the block mapping can still be performed on other kernel
> > > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > > performance will be improved.
> > > > Ugh, this looks a little ugly, honestly.
> > > > 
> > > > If that's for sure arm64 can't split large page mapping of linear
> > > > region, this patch is one way to optimize linear mapping. Given kdump
> > > > setting is necessary on arm64 server, the booting speed is truly
> > > > impacted heavily.
> > > Is there some conclusion or discussion that arm64 can't split large page
> > > mapping?
> > > 
> > > Could the crashkernel reservation (and Kfence pool) be splited dynamically?
> > > 
> > > I found Mark replay "arm64: remove page granularity limitation from
> > > KFENCE"[1],
> > > 
> > >    "We also avoid live changes from block<->table mappings, since the
> > >    archtitecture gives us very weak guarantees there and generally requires
> > >    a Break-Before-Make sequence (though IIRC this was tightened up
> > >    somewhat, so maybe going one way is supposed to work). Unless it's
> > >    really necessary, I'd rather not split these block mappings while
> > >    they're live."
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but can be worse like loss of coherency).
> Thanks for your explanation,
> > Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> > all, the software would have to unmap the range, TLBI, remap. With
> > FEAT_BBM (level 2), we can do this without tearing the mapping down but
> > we still need to handle the potential TLB conflict abort. The handler
> > only needs a TLBI but if it touches the memory range being changed it
> > risks faulting again. With vmap stacks and the kernel image mapped in
> > the vmalloc space, we have a small window where this could be handled
> > but we probably can't go into the C part of the exception handling
> > (tracing etc. may access a kmalloc'ed object for example).
> 
> So if without FEAT_BBM,we can only guarantee BBM sequence via
> "unmap the range, TLBI, remap" or the following option,

Yes, that's the break-before-make sequence.

> and with FEAT_BBM (level 2), we could have easy way to avoid TLB
> conflict for some vmalloc space, but still hard to deal with other
> scence?

It's not too hard in theory. Basically there's a small risk of getting a
TLB conflict abort for the mappings you change without a BBM sequence (I
think it's nearly non-existent when going from a large block to smaller
pages, though the architecture states that it's still possible). Since
we only want to do this for the linear map and the kernel and stack are
in the vmalloc space, we can handle such a trap as a safety measure (it
just needs a TLBI). It may help to tweak a model to force it to generate
such conflict aborts, otherwise we'd not be able to test the code.

It's possible that such a trap is raised at EL2 if a guest caused the
conflict abort (the architecture left this as IMP DEF). The hypervisors
may need to be taught to do a TLBI VMALLS12E1 instead of killing the
guest. I haven't checked what KVM does.
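
To make the "it just needs a TLBI" point concrete, below is a purely
hypothetical sketch of such a fixup. Mainline currently routes the "TLB
conflict abort" fault status (FSC 0x30) to do_bad() in arch/arm64/mm/fault.c;
the handler name and its wiring here are invented for illustration only.

/* Hypothetical: what the fault_info[] entry for FSC 0x30 could point at. */
static int do_tlb_conflict_abort(unsigned long far, unsigned long esr,
				 struct pt_regs *regs)
{
	/*
	 * The conflicting cached walks for the changed range are stale;
	 * invalidating the local TLB resolves the conflict, and the
	 * faulting access is then simply retried.
	 */
	local_flush_tlb_all();
	return 0;
}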
Catalin Marinas June 23, 2022, 2:07 p.m. UTC | #10
On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but can be worse like loss of coherency).
> 
> Thanks for this explanation. Is this a drawback of arm64 design? X86
> code do the same thing w/o issue, is there way to overcome this on
> arm64 from hardware or software side?

It is a drawback of the arm64 implementations. Having multiple TLB
entries for the same VA would need additional logic in hardware to
detect, so the microarchitects have pushed back. In ARMv8.4, some
balance was reached with FEAT_BBM so that the only visible side-effect
is a potential TLB conflict abort that could be resolved by software.

> I ever got a arm64 server with huge memory, w or w/o crashkernel setting 
> have different bootup time. And the more often TLB miss and flush will
> cause performance cost. It is really a pity if we have very powerful
> arm64 cpu and system capacity, but bottlenecked by this drawback.

Is it only the boot time affected or the runtime performance as well?
Kefeng Wang June 23, 2022, 2:23 p.m. UTC | #11
On 2022/6/23 18:27, Catalin Marinas wrote:
> On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote:
>> On 2022/6/22 2:04, Catalin Marinas wrote:
>>> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
>>>> On 2022/6/21 13:33, Baoquan He wrote:
>>>>> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>>>>>> If the crashkernel has both high memory above DMA zones and low memory
>>>>>> in DMA zones, kexec always loads the content such as Image and dtb to the
>>>>>> high memory instead of the low memory. This means that only high memory
>>>>>> requires write protection based on page-level mapping. The allocation of
>>>>>> high memory does not depend on the DMA boundary. So we can reserve the
>>>>>> high memory first even if the crashkernel reservation is deferred.
>>>>>>
>>>>>> This means that the block mapping can still be performed on other kernel
>>>>>> linear address spaces, the TLB miss rate can be reduced and the system
>>>>>> performance will be improved.
>>>>> Ugh, this looks a little ugly, honestly.
>>>>>
>>>>> If that's for sure arm64 can't split large page mapping of linear
>>>>> region, this patch is one way to optimize linear mapping. Given kdump
>>>>> setting is necessary on arm64 server, the booting speed is truly
>>>>> impacted heavily.
>>>> Is there some conclusion or discussion that arm64 can't split large page
>>>> mapping?
>>>>
>>>> Could the crashkernel reservation (and Kfence pool) be splited dynamically?
>>>>
>>>> I found Mark replay "arm64: remove page granularity limitation from
>>>> KFENCE"[1],
>>>>
>>>>     "We also avoid live changes from block<->table mappings, since the
>>>>     archtitecture gives us very weak guarantees there and generally requires
>>>>     a Break-Before-Make sequence (though IIRC this was tightened up
>>>>     somewhat, so maybe going one way is supposed to work). Unless it's
>>>>     really necessary, I'd rather not split these block mappings while
>>>>     they're live."
>>> The problem with splitting is that you can end up with two entries in
>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>> abort (but can be worse like loss of coherency).
>> Thanks for your explanation,
>>> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
>>> all, the software would have to unmap the range, TLBI, remap. With
>>> FEAT_BBM (level 2), we can do this without tearing the mapping down but
>>> we still need to handle the potential TLB conflict abort. The handler
>>> only needs a TLBI but if it touches the memory range being changed it
>>> risks faulting again. With vmap stacks and the kernel image mapped in
>>> the vmalloc space, we have a small window where this could be handled
>>> but we probably can't go into the C part of the exception handling
>>> (tracing etc. may access a kmalloc'ed object for example).
>> So if without FEAT_BBM,we can only guarantee BBM sequence via
>> "unmap the range, TLBI, remap" or the following option,
> Yes, that's the break-before-make sequence.
>
>> and with FEAT_BBM (level 2), we could have easy way to avoid TLB
>> conflict for some vmalloc space, but still hard to deal with other
>> scence?
> It's not too hard in theory. Basically there's a small risk of getting a
> TLB conflict abort for the mappings you change without a BBM sequence (I
> think it's nearly non-existed when going from large block to smaller
> pages, though the architecture states that it's still possible). Since
> we only want to do this for the linear map and the kernel and stack are
> in the vmalloc space, we can handle such trap as an safety measure (it
> just needs a TLBI). It may help to tweak a model to force it to generate
> such conflict aborts, otherwise we'd not be able to test the code.
>
> It's possible that such trap is raised at EL2 if a guest caused the
> conflict abort (the architecture left this as IMP DEF). The hypervisors
> may need to be taught to do a TLBI VMALLS12E1 instead of killing the
> guest. I haven't checked what KVM does.
Got it, many thanks.
Baoquan He June 27, 2022, 2:52 a.m. UTC | #12
On 06/23/22 at 03:07pm, Catalin Marinas wrote:
> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> > On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > > The problem with splitting is that you can end up with two entries in
> > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > > abort (but can be worse like loss of coherency).
> > 
> > Thanks for this explanation. Is this a drawback of arm64 design? X86
> > code do the same thing w/o issue, is there way to overcome this on
> > arm64 from hardware or software side?
> 
> It is a drawback of the arm64 implementations. Having multiple TLB
> entries for the same VA would need additional logic in hardware to
> detect, so the microarchitects have pushed back. In ARMv8.4, some
> balanced was reached with FEAT_BBM so that the only visible side-effect
> is a potential TLB conflict abort that could be resolved by software.

I see, thx.

> 
> > I ever got a arm64 server with huge memory, w or w/o crashkernel setting 
> > have different bootup time. And the more often TLB miss and flush will
> > cause performance cost. It is really a pity if we have very powerful
> > arm64 cpu and system capacity, but bottlenecked by this drawback.
> 
> Is it only the boot time affected or the runtime performance as well?

Sorry for the late reply. What I observed is serious boot-time latency
with huge memory. Since the timestamp is not available at that time,
we can't tell the number. I didn't notice the runtime performance.
Zhen Lei June 27, 2022, 9:17 a.m. UTC | #13
On 2022/6/27 10:52, Baoquan He wrote:
> On 06/23/22 at 03:07pm, Catalin Marinas wrote:
>> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
>>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
>>>> The problem with splitting is that you can end up with two entries in
>>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>>> abort (but can be worse like loss of coherency).
>>>
>>> Thanks for this explanation. Is this a drawback of arm64 design? X86
>>> code do the same thing w/o issue, is there way to overcome this on
>>> arm64 from hardware or software side?
>>
>> It is a drawback of the arm64 implementations. Having multiple TLB
>> entries for the same VA would need additional logic in hardware to
>> detect, so the microarchitects have pushed back. In ARMv8.4, some
>> balanced was reached with FEAT_BBM so that the only visible side-effect
>> is a potential TLB conflict abort that could be resolved by software.
> 
> I see, thx.
> 
>>
>>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting 
>>> have different bootup time. And the more often TLB miss and flush will
>>> cause performance cost. It is really a pity if we have very powerful
>>> arm64 cpu and system capacity, but bottlenecked by this drawback.
>>
>> Is it only the boot time affected or the runtime performance as well?
> 
> Sorry for late reply. What I observerd is the boot time serious latecy
> with huge memory. Since the timestamp is not available at that time,
> we can't tell the number. I didn't notice the runtime performance.

There's some data here, and I see you're not on the cc list.

https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/

> 
> .
>
Baoquan He June 27, 2022, 10:17 a.m. UTC | #14
On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/6/27 10:52, Baoquan He wrote:
> > On 06/23/22 at 03:07pm, Catalin Marinas wrote:
> >> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> >>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> >>>> The problem with splitting is that you can end up with two entries in
> >>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> >>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> >>>> abort (but can be worse like loss of coherency).
> >>>
> >>> Thanks for this explanation. Is this a drawback of arm64 design? X86
> >>> code do the same thing w/o issue, is there way to overcome this on
> >>> arm64 from hardware or software side?
> >>
> >> It is a drawback of the arm64 implementations. Having multiple TLB
> >> entries for the same VA would need additional logic in hardware to
> >> detect, so the microarchitects have pushed back. In ARMv8.4, some
> >> balanced was reached with FEAT_BBM so that the only visible side-effect
> >> is a potential TLB conflict abort that could be resolved by software.
> > 
> > I see, thx.
> > 
> >>
> >>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting 
> >>> have different bootup time. And the more often TLB miss and flush will
> >>> cause performance cost. It is really a pity if we have very powerful
> >>> arm64 cpu and system capacity, but bottlenecked by this drawback.
> >>
> >> Is it only the boot time affected or the runtime performance as well?
> > 
> > Sorry for late reply. What I observerd is the boot time serious latecy
> > with huge memory. Since the timestamp is not available at that time,
> > we can't tell the number. I didn't notice the runtime performance.
> 
> There's some data here, and I see you're not on the cc list.
> 
> https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/

Thanks, Zhen Lei. I also saw the patch. That seems to be a good way,
since there's only one process running at that time. I'm not sure if there's
still a risk of multiple TLB entries for the same VA existing.
Zhen Lei June 27, 2022, 11:11 a.m. UTC | #15
On 2022/6/27 18:17, Baoquan He wrote:
> On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/6/27 10:52, Baoquan He wrote:
>>> On 06/23/22 at 03:07pm, Catalin Marinas wrote:
>>>> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
>>>>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
>>>>>> The problem with splitting is that you can end up with two entries in
>>>>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>>>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>>>>> abort (but can be worse like loss of coherency).
>>>>>
>>>>> Thanks for this explanation. Is this a drawback of arm64 design? X86
>>>>> code do the same thing w/o issue, is there way to overcome this on
>>>>> arm64 from hardware or software side?
>>>>
>>>> It is a drawback of the arm64 implementations. Having multiple TLB
>>>> entries for the same VA would need additional logic in hardware to
>>>> detect, so the microarchitects have pushed back. In ARMv8.4, some
>>>> balanced was reached with FEAT_BBM so that the only visible side-effect
>>>> is a potential TLB conflict abort that could be resolved by software.
>>>
>>> I see, thx.
>>>
>>>>
>>>>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting 
>>>>> have different bootup time. And the more often TLB miss and flush will
>>>>> cause performance cost. It is really a pity if we have very powerful
>>>>> arm64 cpu and system capacity, but bottlenecked by this drawback.
>>>>
>>>> Is it only the boot time affected or the runtime performance as well?
>>>
>>> Sorry for late reply. What I observerd is the boot time serious latecy
>>> with huge memory. Since the timestamp is not available at that time,
>>> we can't tell the number. I didn't notice the runtime performance.
>>
>> There's some data here, and I see you're not on the cc list.
>>
>> https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/
> 
> Thanks, Zhen Lei. I also saw the patch. That seems to be a good way,

Yes.

> since there's only one process running at that time. Not sure if there's
> still risk of multiple TLB entries for the same VA existing.
> 
> .
>

Patch

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -141,15 +141,44 @@  static void __init reserve_crashkernel(int dma_state)
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
 	char *cmdline = boot_command_line;
 	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
-	int ret;
+	int ret, skip_res = 0, skip_low_res = 0;
 	bool fixed_base;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
 
-	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
-	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
-		return;
+	/*
+	 * In the following table:
+	 * X,high  means crashkernel=X,high
+	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
+	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
+	 *
+	 * The first two columns indicate the status, and the last two
+	 * columns indicate the phase in which crash high or low memory
+	 * needs to be reserved.
+	 *  ---------------------------------------------------
+	 * | DMA enabled | X,high used |  unknown  |   known   |
+	 *  ---------------------------------------------------
+	 * |      N            N       |    low    |    NOP    |
+	 * |      Y            N       |    NOP    |    low    |
+	 * |      N            Y       |  high/low |    NOP    |
+	 * |      Y            Y       |    high   |    low    |
+	 *  ---------------------------------------------------
+	 *
+	 * But in this function, the crash high memory allocation of
+	 * crashkernel=Y,high and the crash low memory allocation of
+	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
+	 * So the table above need to be adjusted as below:
+	 *  ---------------------------------------------------
+	 * | DMA enabled | X,high used |  unknown  |   known   |
+	 *  ---------------------------------------------------
+	 * |      N            N       |    res    |    NOP    |
+	 * |      Y            N       |    NOP    |    res    |
+	 * |      N            Y       |res/low_res|    NOP    |
+	 * |      Y            Y       |    res    |  low_res  |
+	 *  ---------------------------------------------------
+	 *
+	 */
 
 	/* crashkernel=X[@offset] */
 	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
@@ -169,10 +198,33 @@  static void __init reserve_crashkernel(int dma_state)
 		else if (ret)
 			return;
 
+		/* See the third row of the second table above, NOP */
+		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
+			return;
+
+		/* See the fourth row of the second table above */
+		if (dma_enabled) {
+			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
+				skip_low_res = 1;
+			else
+				skip_res = 1;
+		}
+
 		crash_max = CRASH_ADDR_HIGH_MAX;
 	} else if (ret || !crash_size) {
 		/* The specified value is invalid */
 		return;
+	} else {
+		/* See the 1-2 rows of the second table above, NOP */
+		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
+		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
+			return;
+	}
+
+	if (skip_res) {
+		crash_base = crashk_res.start;
+		crash_size = crashk_res.end - crashk_res.start + 1;
+		goto check_low;
 	}
 
 	fixed_base = !!crash_base;
@@ -202,9 +254,18 @@  static void __init reserve_crashkernel(int dma_state)
 		return;
 	}
 
+	crashk_res.start = crash_base;
+	crashk_res.end = crash_base + crash_size - 1;
+
+check_low:
+	if (skip_low_res)
+		return;
+
 	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
 	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
 		memblock_phys_free(crash_base, crash_size);
+		crashk_res.start = 0;
+		crashk_res.end = 0;
 		return;
 	}
 
@@ -219,8 +280,6 @@  static void __init reserve_crashkernel(int dma_state)
 	if (crashk_low_res.end)
 		kmemleak_ignore_phys(crashk_low_res.start);
 
-	crashk_res.start = crash_base;
-	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
 }