diff mbox series

[v4] arm64: mm: fix linear mem mapping access performance degradation

Message ID 1656777473-73887-1-git-send-email-guanghuifeng@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series [v4] arm64: mm: fix linear mem mapping access performance degradation | expand

Commit Message

guanghui.fgh July 2, 2022, 3:57 p.m. UTC
arm64 can build 2M/1G block/section mappings. When the DMA/DMA32 zones are
in use (crashkernel enabled, rodata full disabled, kfence disabled), the
linear memory mapping falls back to non block/section (page) mappings,
because the crashkernel region must be shrinkable at page granularity. This
degrades performance for large contiguous memory accesses in the kernel
(memcpy/memmove, etc).

There are many changes and discussions:
commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
platforms with no DMA memory zones")
commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into
mem_init()")
commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel
reservation is required")

This patch lets the linear memory mapping keep block/section mappings even
when crashkernel is enabled. First, build block/section mappings (normally
2M or 1G) for all available memory, then reserve the crashkernel memory.
After that, walk the page table and split the block/section mappings into
page mappings (normally 4K) [[[only]]] for the crashkernel range. The
linear mapping therefore uses block/section mappings as much as possible,
which reduces CPU dTLB misses conspicuously and improves memory access
performance by about 10-20%.
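
For illustration, a minimal sketch of the intended sequence (the helper
name and exact call sites below are assumptions for illustration, not the
literal code of this patch):

/* Sketch only: map everything with blocks first, then split just the
 * crashkernel range down to 4K pages once its bounds are known. */
void __init paging_init_sketch(void)
{
	map_mem(swapper_pg_dir);	/* 2M/1G block/section mappings */
	/* ... */
	reserve_crashkernel();		/* crashk_res bounds are now known */

	if (crashk_res.end > crashk_res.start)
		/* hypothetical helper: re-walk the linear map over this
		 * range and rebuild it with 4K mappings, using
		 * break-before-make on the affected block entries */
		remap_crashkernel_to_pages(crashk_res.start,
					   resource_size(&crashk_res));
}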

I have tested it with pft (Page Fault Test) and fio and obtained a
significant performance improvement.

For fio test:
1.prepare ramdisk
  modprobe -r brd
  modprobe brd rd_nr=1 rd_size=67108864
  dmsetup remove_all
  wipefs -a --force /dev/ram0
  mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0
  mkdir -p /fs/ram0
  mount -t ext4 /dev/ram0 /fs/ram0

2.prepare fio parameters in an x.fio file:
[global]
bs=4k
ioengine=psync
iodepth=128
size=32G
direct=1
invalidate=1
group_reporting
thread=1
rw=read
directory=/fs/ram0
numjobs=1

[task_0]
cpus_allowed=16
stonewall=1

3.run testcase:
perf stat -e dTLB-load-misses fio x.fio

4.contrast
------------------------
			without patch		with patch
fio READ		aggrb=1493.2MB/s	aggrb=1775.3MB/s
dTLB-load-misses	1,818,320,693		438,729,774
time elapsed(s)		70.500326434		62.877316408
user(s)			15.926332000		15.684721000
sys(s)			54.211939000		47.046165000

5.conclusion
Using this patch reduces dTLB misses and improves performance greatly.

Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
---
 arch/arm64/include/asm/mmu.h |   1 +
 arch/arm64/mm/init.c         |   8 +-
 arch/arm64/mm/mmu.c          | 176 +++++++++++++++++++++++++++++++------------
 3 files changed, 132 insertions(+), 53 deletions(-)

Comments

Will Deacon July 4, 2022, 10:35 a.m. UTC | #1
On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> use non block/section mapping(for crashkernel requires to shrink the region
> in page granularity). But it will degrade performance when doing larging
> continuous mem access in kernel(memcpy/memmove, etc).

Hmm. It seems a bit silly to me that we take special care to unmap the
crashkernel from the linear map even when can_set_direct_map() is false, as
we won't be protecting the main kernel at all!

Why don't we just leave the crashkernel mapped if !can_set_direct_map()
and then this problem just goes away?

Will
guanghui.fgh July 4, 2022, 10:58 a.m. UTC | #2
在 2022/7/4 18:35, Will Deacon 写道:
> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>> use non block/section mapping(for crashkernel requires to shrink the region
>> in page granularity). But it will degrade performance when doing larging
>> continuous mem access in kernel(memcpy/memmove, etc).
> 
> Hmm. It seems a bit silly to me that we take special care to unmap the
> crashkernel from the linear map even when can_set_direct_map() is false, as
> we won't be protecting the main kernel at all!
> 
> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> and then this problem just goes away?
> 
> Will

This question was asked last week.


1.Quoted messages from arch/arm64/mm/init.c

"Memory reservation for crash kernel either done early or deferred
depending on DMA memory zones configs (ZONE_DMA) --

In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
here instead of max_zone_phys().  This lets early reservation of
crash kernel memory which has a dependency on arm64_dma_phys_limit.
Reserving memory early for crash kernel allows linear creation of block
mappings (greater than page-granularity) for all the memory bank rangs.
In this scheme a comparatively quicker boot is observed.

If ZONE_DMA configs are defined, crash kernel memory reservation
is delayed until DMA zone memory range size initialization performed in
zone_sizes_init().  The defer is necessary to steer clear of DMA zone
memory range to avoid overlap allocation.

[[[
So crash kernel memory boundaries are not known when mapping all bank 
memory ranges, which otherwise means not possible to exclude crash 
kernel range from creating block mappings so page-granularity mappings 
are created for the entire memory range.
]]]"

Namely, the init order is: memblock init ---> linear mem mapping (4k
mappings, because the crashkernel requires page-granularity changes) --->
zone DMA limit ---> reserve crashkernel.
So when ZONE_DMA is enabled and crashkernel is in use, the linear mapping
uses 4k mappings.


2.As mentioned above, when the linear mapping simply uses 4k mappings,
there is a high dTLB miss rate (degraded performance).
This patch uses block/section mappings as far as possible, improving
performance.

3.This patch reserves the crashkernel in the same order as before (ZONE_DMA
before crashkernel reservation) and only changes the linear memory mapping
to block/section mappings.
Init order: memblock init ---> linear mem mapping (block/section mappings)
---> zone DMA limit ---> reserve crashkernel ---> [[[only]]] rebuild 4k
page mappings for the crashkernel mem.

With this method, block/section mappings are used as far as possible.
Will Deacon July 4, 2022, 11:14 a.m. UTC | #3
On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
> 
> 
> 在 2022/7/4 18:35, Will Deacon 写道:
> > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> > > use non block/section mapping(for crashkernel requires to shrink the region
> > > in page granularity). But it will degrade performance when doing larging
> > > continuous mem access in kernel(memcpy/memmove, etc).
> > 
> > Hmm. It seems a bit silly to me that we take special care to unmap the
> > crashkernel from the linear map even when can_set_direct_map() is false, as
> > we won't be protecting the main kernel at all!
> > 
> > Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> > and then this problem just goes away?
> > 
> > Will
> 
> This question had been asked lask week.

Sorry, I didn't spot that. Please could you link me to the conversation, as
I'm still unable to find it in my inbox?

> 1.Quoted messages from arch/arm64/mm/init.c
> 
> "Memory reservation for crash kernel either done early or deferred
> depending on DMA memory zones configs (ZONE_DMA) --
> 
> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> here instead of max_zone_phys().  This lets early reservation of
> crash kernel memory which has a dependency on arm64_dma_phys_limit.
> Reserving memory early for crash kernel allows linear creation of block
> mappings (greater than page-granularity) for all the memory bank rangs.
> In this scheme a comparatively quicker boot is observed.
> 
> If ZONE_DMA configs are defined, crash kernel memory reservation
> is delayed until DMA zone memory range size initialization performed in
> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
> memory range to avoid overlap allocation.
> 
> [[[
> So crash kernel memory boundaries are not known when mapping all bank memory
> ranges, which otherwise means not possible to exclude crash kernel range
> from creating block mappings so page-granularity mappings are created for
> the entire memory range.
> ]]]"
> 
> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> crashkernel, requirinig page-granularity changing))--->zone dma
> limit--->reserve crashkernel.
> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> mapping.

Yes, I understand that is how things work today but I'm saying that we may
as well leave the crashkernel mapped (at block granularity) if
!can_set_direct_map() and then I think your patch becomes a lot simpler.

Will
guanghui.fgh July 4, 2022, 12:05 p.m. UTC | #4
在 2022/7/4 19:14, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 18:35, Will Deacon 写道:
>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>> in page granularity). But it will degrade performance when doing larging
>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>
>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>> we won't be protecting the main kernel at all!
>>>
>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>> and then this problem just goes away?
>>>
>>> Will
>>
>> This question had been asked lask week.
> 
> Sorry, I didn't spot that. Please could you link me to the conversation, as
> I'm still unable to find it in my inbox?

Please access this link:
https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/

> 
>> 1.Quoted messages from arch/arm64/mm/init.c
>>
>> "Memory reservation for crash kernel either done early or deferred
>> depending on DMA memory zones configs (ZONE_DMA) --
>>
>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>> here instead of max_zone_phys().  This lets early reservation of
>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>> Reserving memory early for crash kernel allows linear creation of block
>> mappings (greater than page-granularity) for all the memory bank rangs.
>> In this scheme a comparatively quicker boot is observed.
>>
>> If ZONE_DMA configs are defined, crash kernel memory reservation
>> is delayed until DMA zone memory range size initialization performed in
>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>> memory range to avoid overlap allocation.
>>
>> [[[
>> So crash kernel memory boundaries are not known when mapping all bank memory
>> ranges, which otherwise means not possible to exclude crash kernel range
>> from creating block mappings so page-granularity mappings are created for
>> the entire memory range.
>> ]]]"
>>
>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>> crashkernel, requirinig page-granularity changing))--->zone dma
>> limit--->reserve crashkernel.
>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>> mapping.
> 
> Yes, I understand that is how things work today but I'm saying that we may
> as well leave the crashkernel mapped (at block granularity) if
> !can_set_direct_map() and then I think your patch becomes a lot simpler.
> 
> Will

But page-granularity mappings are necessary for the crash kernel memory
range so that its size can be shrunk via the /sys/kernel/kexec_crash_size
interface (quoted from arch/arm64/mm/init.c).
So this patch splits the block/section mappings into 4k page-granularity
mappings for the crashkernel mem.
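
For reference, the generic shrink path frees the trimmed part of the
reservation page by page. The following is only an illustrative sketch of
that behaviour, not the exact kernel code (details differ between
versions):

/* Illustrative sketch: writing a smaller value to
 * /sys/kernel/kexec_crash_size ends up in crash_shrink_memory(), which
 * returns the trimmed tail of crashk_res to the page allocator one 4K
 * page at a time. */
static void shrink_tail_sketch(unsigned long begin, unsigned long end)
{
	unsigned long addr;

	for (addr = begin; addr < end; addr += PAGE_SIZE)
		free_reserved_page(phys_to_page(addr));
}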

Thanks.
Will Deacon July 4, 2022, 1:15 p.m. UTC | #5
On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> 
> 
> 在 2022/7/4 19:14, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
> > > 
> > > 
> > > 在 2022/7/4 18:35, Will Deacon 写道:
> > > > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> > > > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> > > > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> > > > > use non block/section mapping(for crashkernel requires to shrink the region
> > > > > in page granularity). But it will degrade performance when doing larging
> > > > > continuous mem access in kernel(memcpy/memmove, etc).
> > > > 
> > > > Hmm. It seems a bit silly to me that we take special care to unmap the
> > > > crashkernel from the linear map even when can_set_direct_map() is false, as
> > > > we won't be protecting the main kernel at all!
> > > > 
> > > > Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> > > > and then this problem just goes away?
> > > 
> > > This question had been asked lask week.
> > 
> > Sorry, I didn't spot that. Please could you link me to the conversation, as
> > I'm still unable to find it in my inbox?
> 
> Please access this link:
> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/

Sorry, but I read through the thread and I still can't find where the
possibility of leaving the crashkernel mapped was discussed.

> > > 1.Quoted messages from arch/arm64/mm/init.c
> > > 
> > > "Memory reservation for crash kernel either done early or deferred
> > > depending on DMA memory zones configs (ZONE_DMA) --
> > > 
> > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > here instead of max_zone_phys().  This lets early reservation of
> > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > Reserving memory early for crash kernel allows linear creation of block
> > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > In this scheme a comparatively quicker boot is observed.
> > > 
> > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > is delayed until DMA zone memory range size initialization performed in
> > > zone_sizes_init().  The defer is necessary to steer clear of DMA zone
> > > memory range to avoid overlap allocation.
> > > 
> > > [[[
> > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > ranges, which otherwise means not possible to exclude crash kernel range
> > > from creating block mappings so page-granularity mappings are created for
> > > the entire memory range.
> > > ]]]"
> > > 
> > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > limit--->reserve crashkernel.
> > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > mapping.
> > 
> > Yes, I understand that is how things work today but I'm saying that we may
> > as well leave the crashkernel mapped (at block granularity) if
> > !can_set_direct_map() and then I think your patch becomes a lot simpler.
> 
> But Page-granularity mapppings are necessary for crash kernel memory range
> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> arch/arm64/mm/init.c).
> So this patch split block/section mapping to 4k page-granularity mapping for
> crashkernel mem.

Why? I don't see why the mapping granularity is relevant at all if we
always leave the whole thing mapped.

Will
guanghui.fgh July 4, 2022, 1:41 p.m. UTC | #6
Thanks.

在 2022/7/4 21:15, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 19:14, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>>>
>>>>
>>>> 在 2022/7/4 18:35, Will Deacon 写道:
>>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>>>> in page granularity). But it will degrade performance when doing larging
>>>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>>>
>>>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>>>> we won't be protecting the main kernel at all!
>>>>>
>>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>>>> and then this problem just goes away?
>>>>
>>>> This question had been asked lask week.
>>>
>>> Sorry, I didn't spot that. Please could you link me to the conversation, as
>>> I'm still unable to find it in my inbox?
>>
>> Please access this link:
>> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/
> 
> Sorry, but I read through the thread and I still can't find where the
> possibility of leaving the crashkernel mapped was discussed >
>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>
>>>> "Memory reservation for crash kernel either done early or deferred
>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>
>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>> here instead of max_zone_phys().  This lets early reservation of
>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>> Reserving memory early for crash kernel allows linear creation of block
>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>> In this scheme a comparatively quicker boot is observed.
>>>>
>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>> is delayed until DMA zone memory range size initialization performed in
>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>> memory range to avoid overlap allocation.
>>>>
>>>> [[[
>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>> from creating block mappings so page-granularity mappings are created for
>>>> the entire memory range.
>>>> ]]]"
>>>>
>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>> limit--->reserve crashkernel.
>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>> mapping.
>>>
>>> Yes, I understand that is how things work today but I'm saying that we may
>>> as well leave the crashkernel mapped (at block granularity) if
>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>
>> But Page-granularity mapppings are necessary for crash kernel memory range
>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>> arch/arm64/mm/init.c).
>> So this patch split block/section mapping to 4k page-granularity mapping for
>> crashkernel mem.
> 
> Why? I don't see why the mapping granularity is relevant at all if we
> always leave the whole thing mapped.
> 
> Will

I have found commit 06a7f711246b081afc21fff859f1003f1f2a0fbc, which added
/sys/kernel/kexec_crash_size for changing the crashkernel mem size:

"Implement shrinking the reserved memory for crash kernel, if it is more
than enough."

Maybe we could use block/section mappings for the crashkernel mem and
split only the affected part of the crashkernel block/section mappings
when shrinking (i.e. when writing to /sys/kernel/kexec_crash_size, which
is handled by crash_shrink_memory and crash_free_reserved_phys_range).
(Maybe there is no need to split all of the crashkernel block/section
mappings at boot time.)

Thanks.
guanghui.fgh July 4, 2022, 2:11 p.m. UTC | #7
在 2022/7/4 21:15, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 19:14, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>>>
>>>>
>>>> 在 2022/7/4 18:35, Will Deacon 写道:
>>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>>>> in page granularity). But it will degrade performance when doing larging
>>>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>>>
>>>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>>>> we won't be protecting the main kernel at all!
>>>>>
>>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>>>> and then this problem just goes away?
>>>>
>>>> This question had been asked lask week.
>>>
>>> Sorry, I didn't spot that. Please could you link me to the conversation, as
>>> I'm still unable to find it in my inbox?
>>
>> Please access this link:
>> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/
> 
> Sorry, but I read through the thread and I still can't find where the
> possibility of leaving the crashkernel mapped was discussed.
> 
>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>
>>>> "Memory reservation for crash kernel either done early or deferred
>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>
>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>> here instead of max_zone_phys().  This lets early reservation of
>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>> Reserving memory early for crash kernel allows linear creation of block
>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>> In this scheme a comparatively quicker boot is observed.
>>>>
>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>> is delayed until DMA zone memory range size initialization performed in
>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>> memory range to avoid overlap allocation.
>>>>
>>>> [[[
>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>> from creating block mappings so page-granularity mappings are created for
>>>> the entire memory range.
>>>> ]]]"
>>>>
>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>> limit--->reserve crashkernel.
>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>> mapping.
>>>
>>> Yes, I understand that is how things work today but I'm saying that we may
>>> as well leave the crashkernel mapped (at block granularity) if
>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>
>> But Page-granularity mapppings are necessary for crash kernel memory range
>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>> arch/arm64/mm/init.c).
>> So this patch split block/section mapping to 4k page-granularity mapping for
>> crashkernel mem.
> 
> Why? I don't see why the mapping granularity is relevant at all if we
> always leave the whole thing mapped.
> 
> Will

There is another reason.

When crashkernel loading finishes, do_kexec_load uses
arch_kexec_protect_crashkres to invalidate all the page table entries for
the crashkernel mem (to protect the crashkernel mem from access).

arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range

In apply_to_pmd_range there is a check, BUG_ON(pud_huge(*pud)), so if the
crashkernel uses block/section mappings this path hits an error.

Namely, the crashkernel mem needs non block/section mappings before
shrinking.
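
For context, the protect path looks roughly like this (paraphrased from
arch/arm64/kernel/machine_kexec.c and mm/memory.c of that era; details may
differ between kernel versions):

void arch_kexec_protect_crashkres(void)
{
	int i;

	/* Mark the linear-map entries covering the loaded crashkernel
	 * segments invalid, one 4K page at a time. */
	for (i = 0; i < kexec_crash_image->nr_segments; i++)
		set_memory_valid(
			__phys_to_virt(kexec_crash_image->segment[i].mem),
			kexec_crash_image->segment[i].memsz >> PAGE_SHIFT, 0);
}

set_memory_valid() walks the tables via apply_to_page_range(), which
assumes table (non-block) entries at the PUD/PMD levels, hence the
BUG_ON(pud_huge(*pud)) in apply_to_pmd_range().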

Thanks.
Will Deacon July 4, 2022, 2:23 p.m. UTC | #8
On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> 在 2022/7/4 21:15, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> > > > > 1.Quoted messages from arch/arm64/mm/init.c
> > > > > 
> > > > > "Memory reservation for crash kernel either done early or deferred
> > > > > depending on DMA memory zones configs (ZONE_DMA) --
> > > > > 
> > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > > > here instead of max_zone_phys().  This lets early reservation of
> > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > > > Reserving memory early for crash kernel allows linear creation of block
> > > > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > > > In this scheme a comparatively quicker boot is observed.
> > > > > 
> > > > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > > > is delayed until DMA zone memory range size initialization performed in
> > > > > zone_sizes_init().  The defer is necessary to steer clear of DMA zone
> > > > > memory range to avoid overlap allocation.
> > > > > 
> > > > > [[[
> > > > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > > > ranges, which otherwise means not possible to exclude crash kernel range
> > > > > from creating block mappings so page-granularity mappings are created for
> > > > > the entire memory range.
> > > > > ]]]"
> > > > > 
> > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > > > limit--->reserve crashkernel.
> > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > > > mapping.
> > > > 
> > > > Yes, I understand that is how things work today but I'm saying that we may
> > > > as well leave the crashkernel mapped (at block granularity) if
> > > > !can_set_direct_map() and then I think your patch becomes a lot simpler.
> > > 
> > > But Page-granularity mapppings are necessary for crash kernel memory range
> > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> > > arch/arm64/mm/init.c).
> > > So this patch split block/section mapping to 4k page-granularity mapping for
> > > crashkernel mem.
> > 
> > Why? I don't see why the mapping granularity is relevant at all if we
> > always leave the whole thing mapped.
> > 
> There is another reason.
> 
> When loading crashkernel finish, the do_kexec_load will use
> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
> mem(protect crashkernel mem from access).
> 
> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
> 
> In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And
> if the crashkernel use block/section mapping, there will be some error.
> 
> Namely, it's need to use non block/section mapping for crashkernel mem
> before shringking.

Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
that if we're leaving the thing mapped, no?

Will
guanghui.fgh July 4, 2022, 2:34 p.m. UTC | #9
Thanks.

在 2022/7/4 22:23, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>> 在 2022/7/4 21:15, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>>>
>>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>>
>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>>> here instead of max_zone_phys().  This lets early reservation of
>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>>
>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>>>> memory range to avoid overlap allocation.
>>>>>>
>>>>>> [[[
>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>>>> from creating block mappings so page-granularity mappings are created for
>>>>>> the entire memory range.
>>>>>> ]]]"
>>>>>>
>>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>>>> limit--->reserve crashkernel.
>>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>>>> mapping.
>>>>>
>>>>> Yes, I understand that is how things work today but I'm saying that we may
>>>>> as well leave the crashkernel mapped (at block granularity) if
>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>>>
>>>> But Page-granularity mapppings are necessary for crash kernel memory range
>>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>>>> arch/arm64/mm/init.c).
>>>> So this patch split block/section mapping to 4k page-granularity mapping for
>>>> crashkernel mem.
>>>
>>> Why? I don't see why the mapping granularity is relevant at all if we
>>> always leave the whole thing mapped.
>>>
>> There is another reason.
>>
>> When loading crashkernel finish, the do_kexec_load will use
>> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
>> mem(protect crashkernel mem from access).
>>
>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
>>
>> In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And
>> if the crashkernel use block/section mapping, there will be some error.
>>
>> Namely, it's need to use non block/section mapping for crashkernel mem
>> before shringking.
> 
> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> that if we're leaving the thing mapped, no?
> 
> Will

I think we should keep using arch_kexec_[un]protect_crashkres for the
crashkernel mem.

When the crashkernel page table entries are invalidated, there is no
chance of reading/writing the crashkernel mem by mistake.

If we don't use arch_kexec_[un]protect_crashkres to invalidate the
crashkernel page table entries, stray writes to that memory may cause
crashkernel boot errors and vmcore saving errors.

Can we change arch_kexec_[un]protect_crashkres to support block/section
mappings? (But we would also need to remap when shrinking.)

Thanks.
Will Deacon July 4, 2022, 4:38 p.m. UTC | #10
On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> Thanks.
> 
> 在 2022/7/4 22:23, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> > > 在 2022/7/4 21:15, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> > > > > > > 1.Quoted messages from arch/arm64/mm/init.c
> > > > > > > 
> > > > > > > "Memory reservation for crash kernel either done early or deferred
> > > > > > > depending on DMA memory zones configs (ZONE_DMA) --
> > > > > > > 
> > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > > > > > here instead of max_zone_phys().  This lets early reservation of
> > > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > > > > > Reserving memory early for crash kernel allows linear creation of block
> > > > > > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > > > > > In this scheme a comparatively quicker boot is observed.
> > > > > > > 
> > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > > > > > is delayed until DMA zone memory range size initialization performed in
> > > > > > > zone_sizes_init().  The defer is necessary to steer clear of DMA zone
> > > > > > > memory range to avoid overlap allocation.
> > > > > > > 
> > > > > > > [[[
> > > > > > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > > > > > ranges, which otherwise means not possible to exclude crash kernel range
> > > > > > > from creating block mappings so page-granularity mappings are created for
> > > > > > > the entire memory range.
> > > > > > > ]]]"
> > > > > > > 
> > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > > > > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > > > > > limit--->reserve crashkernel.
> > > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > > > > > mapping.
> > > > > > 
> > > > > > Yes, I understand that is how things work today but I'm saying that we may
> > > > > > as well leave the crashkernel mapped (at block granularity) if
> > > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler.
> > > > > 
> > > > > But Page-granularity mapppings are necessary for crash kernel memory range
> > > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> > > > > arch/arm64/mm/init.c).
> > > > > So this patch split block/section mapping to 4k page-granularity mapping for
> > > > > crashkernel mem.
> > > > 
> > > > Why? I don't see why the mapping granularity is relevant at all if we
> > > > always leave the whole thing mapped.
> > > > 
> > > There is another reason.
> > > 
> > > When loading crashkernel finish, the do_kexec_load will use
> > > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
> > > mem(protect crashkernel mem from access).
> > > 
> > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
> > > 
> > > In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And
> > > if the crashkernel use block/section mapping, there will be some error.
> > > 
> > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > before shringking.
> > 
> > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > that if we're leaving the thing mapped, no?
> > 
> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> 
> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> the crashkernel mem by mistake.
> 
> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> pagetable, there maybe some write operations to these mem by mistake which
> may cause crashkernel boot error and vmcore saving error.

I don't really buy this line of reasoning. The entire main kernel is
writable, so why do we care about protecting the crashkernel so much? The
_code_ to launch the crash kernel is writable! If you care about preventing
writes to memory which should not be writable, then you should use
rodata=full.

Will
Ard Biesheuvel July 4, 2022, 5:09 p.m. UTC | #11
On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > Thanks.
> >
> > 在 2022/7/4 22:23, Will Deacon 写道:
> > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
...
> > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > before shringking.
> > >
> > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > that if we're leaving the thing mapped, no?
> > >
> > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> >
> > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > the crashkernel mem by mistake.
> >
> > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > pagetable, there maybe some write operations to these mem by mistake which
> > may cause crashkernel boot error and vmcore saving error.
>
> I don't really buy this line of reasoning. The entire main kernel is
> writable, so why do we care about protecting the crashkernel so much? The
> _code_ to launch the crash kernel is writable! If you care about preventing
> writes to memory which should not be writable, then you should use
> rodata=full.
>

This is not entirely true - the core kernel text and rodata are
remapped r/o in the linear map, whereas all module code and rodata are
left writable when rodata != full.

But the conclusion is the same, imo: if you can't be bothered to
protect a good chunk of the code and rodata that the kernel relies on,
why should the crashkernel be treated any differently?
guanghui.fgh July 5, 2022, 2:44 a.m. UTC | #12
在 2022/7/5 0:38, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>> Thanks.
>>
>> 在 2022/7/4 22:23, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>>>> 在 2022/7/4 21:15, Will Deacon 写道:
>>>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>>>>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>>>>>
>>>>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>>>>
>>>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>>>>> here instead of max_zone_phys().  This lets early reservation of
>>>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>>>>
>>>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>>>>>> memory range to avoid overlap allocation.
>>>>>>>>
>>>>>>>> [[[
>>>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>>>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>>>>>> from creating block mappings so page-granularity mappings are created for
>>>>>>>> the entire memory range.
>>>>>>>> ]]]"
>>>>>>>>
>>>>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>>>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>>>>>> limit--->reserve crashkernel.
>>>>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>>>>>> mapping.
>>>>>>>
>>>>>>> Yes, I understand that is how things work today but I'm saying that we may
>>>>>>> as well leave the crashkernel mapped (at block granularity) if
>>>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>>>>>
>>>>>> But Page-granularity mapppings are necessary for crash kernel memory range
>>>>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>>>>>> arch/arm64/mm/init.c).
>>>>>> So this patch split block/section mapping to 4k page-granularity mapping for
>>>>>> crashkernel mem.
>>>>>
>>>>> Why? I don't see why the mapping granularity is relevant at all if we
>>>>> always leave the whole thing mapped.
>>>>>
>>>> There is another reason.
>>>>
>>>> When loading crashkernel finish, the do_kexec_load will use
>>>> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
>>>> mem(protect crashkernel mem from access).
>>>>
>>>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
>>>>
>>>> In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And
>>>> if the crashkernel use block/section mapping, there will be some error.
>>>>
>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>> before shringking.
>>>
>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>> that if we're leaving the thing mapped, no?
>>>
>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>
>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>> the crashkernel mem by mistake.
>>
>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>> pagetable, there maybe some write operations to these mem by mistake which
>> may cause crashkernel boot error and vmcore saving error.
> 
> I don't really buy this line of reasoning. The entire main kernel is
> writable, so why do we care about protecting the crashkernel so much? The
> _code_ to launch the crash kernel is writable! If you care about preventing
> writes to memory which should not be writable, then you should use
> rodata=full.
> 
> Will
Thanks.

1.I think the normal kernel mem can be writeable or readonly.

2.But the normal kernel shouldn't access (read/write) the crashkernel mem
after the crashkernel has been loaded into it (regardless of the normal
kernel's read/write protection).
So invalidating the page table entries for the crashkernel mem is needed.

With this method, the usability of the crashkernel is guaranteed when an
emergency occurs and a vmcore is generated.
Baoquan He July 5, 2022, 8:35 a.m. UTC | #13
On 07/04/22 at 07:09pm, Ard Biesheuvel wrote:
> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
> >
> > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > Thanks.
> > >
> > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> ...
> > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > before shringking.
> > > >
> > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > that if we're leaving the thing mapped, no?
> > > >
> > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > >
> > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > the crashkernel mem by mistake.
> > >
> > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > pagetable, there maybe some write operations to these mem by mistake which
> > > may cause crashkernel boot error and vmcore saving error.
> >
> > I don't really buy this line of reasoning. The entire main kernel is
> > writable, so why do we care about protecting the crashkernel so much? The
> > _code_ to launch the crash kernel is writable! If you care about preventing
> > writes to memory which should not be writable, then you should use
> > rodata=full.
> >
> 
> This is not entirely true - the core kernel text and rodata are
> remapped r/o in the linear map, whereas all module code and rodata are
> left writable when rodata != full.
> 
> But the conclusion is the same, imo: if you can't be bothered to
> protect a good chunk of the code and rodata that the kernel relies on,
> why should the crashkernel be treated any differently?

Kernel text and rodata are remapped r/o in the linear map, whereas
module code and rodata are left writable; that is a different concept
from the crashkernel region being mapped r/o.

If it were technically doable to remap module code and rodata r/o, and
stomping on those regions would corrupt the entire system, we should do
that too. However, kdump is a system error diagnosing mechanism which is
very important and helpful on servers and in some application scenarios,
e.g. cloud. Stomping on the crashkernel region makes it useless.

I am not against removing arch_kexec_[un]protect_crashkres on arm64.
It is a balance:

Protecting the crashkernel region causes severe performance degradation.
This is always felt, since we usually don't specify rodata=full or enable
kfence.

Taking off the protection of the crashkernel region improves performance
very much, while wrong code may stomp on the crashkernel region and fail
kdump. That could happen one in a million times. Once it happens, it's a
nightmare for kernel developers.
Will Deacon July 5, 2022, 9:52 a.m. UTC | #14
On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
> >
> > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > Thanks.
> > >
> > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> ...
> > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > before shringking.
> > > >
> > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > that if we're leaving the thing mapped, no?
> > > >
> > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > >
> > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > the crashkernel mem by mistake.
> > >
> > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > pagetable, there maybe some write operations to these mem by mistake which
> > > may cause crashkernel boot error and vmcore saving error.
> >
> > I don't really buy this line of reasoning. The entire main kernel is
> > writable, so why do we care about protecting the crashkernel so much? The
> > _code_ to launch the crash kernel is writable! If you care about preventing
> > writes to memory which should not be writable, then you should use
> > rodata=full.
> >
> 
> This is not entirely true - the core kernel text and rodata are
> remapped r/o in the linear map, whereas all module code and rodata are
> left writable when rodata != full.

Yes, sorry, you're quite right. The kernel text is only writable if
rodata=off.

But I still think it makes sense to protect the crashkernel only if
rodata=full (which is the default on arm64) as this allows us to rely
on page mappings and I think fits well with what we do for modules.

> But the conclusion is the same, imo: if you can't be bothered to
> protect a good chunk of the code and rodata that the kernel relies on,
> why should the crashkernel be treated any differently?

Thanks.

Will
guanghui.fgh July 5, 2022, 12:07 p.m. UTC | #15
在 2022/7/5 17:52, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
>>>
>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>>>> Thanks.
>>>>
>>>> 在 2022/7/4 22:23, Will Deacon 写道:
>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>> ...
>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>>>> before shringking.
>>>>>
>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>>>> that if we're leaving the thing mapped, no?
>>>>>
>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>>>
>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>>>> the crashkernel mem by mistake.
>>>>
>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>>>> pagetable, there maybe some write operations to these mem by mistake which
>>>> may cause crashkernel boot error and vmcore saving error.
>>>
>>> I don't really buy this line of reasoning. The entire main kernel is
>>> writable, so why do we care about protecting the crashkernel so much? The
>>> _code_ to launch the crash kernel is writable! If you care about preventing
>>> writes to memory which should not be writable, then you should use
>>> rodata=full.
>>>
>>
>> This is not entirely true - the core kernel text and rodata are
>> remapped r/o in the linear map, whereas all module code and rodata are
>> left writable when rodata != full.
> 
> Yes, sorry, you're quite right. The kernel text is only writable if
> rodata=off.
> 
> But I still think it makes sense to protect the crashkernel only if
> rodata=full (which is the default on arm64) as this allows us to rely
> on page mappings and I think fits well with what we do for modules.
> 
>> But the conclusion is the same, imo: if you can't be bothered to
>> protect a good chunk of the code and rodata that the kernel relies on,
>> why should the crashkernel be treated any differently?
> 
> Thanks.
> 
> Will
Thanks.

1.rodata=full harms performance and has been disabled in-house.

2.When using crashkernel with rodata not set to full, the kernel still
uses non block/section mappings, which causes a high d-TLB miss rate and
degrades performance greatly.
This patch fixes that by using block/section mappings as far as possible.

bool can_set_direct_map(void)
{
	return rodata_full || debug_pagealloc_enabled();
}

map_mem:
if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

3.When rodata=full is disabled, the crashkernel still needs protection
(keep using arch_kexec_[un]protect_crashkres).
I think the crashkernel shouldn't depend on rodata=full (maybe other
architectures don't support rodata=full yet).

Thanks.
Will Deacon July 5, 2022, 12:11 p.m. UTC | #16
On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
> 
> 
> 在 2022/7/5 17:52, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
> > > > 
> > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > > > Thanks.
> > > > > 
> > > > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> > > ...
> > > > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > > > before shringking.
> > > > > > 
> > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > > > that if we're leaving the thing mapped, no?
> > > > > > 
> > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > > > > 
> > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > > > the crashkernel mem by mistake.
> > > > > 
> > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > > > pagetable, there maybe some write operations to these mem by mistake which
> > > > > may cause crashkernel boot error and vmcore saving error.
> > > > 
> > > > I don't really buy this line of reasoning. The entire main kernel is
> > > > writable, so why do we care about protecting the crashkernel so much? The
> > > > _code_ to launch the crash kernel is writable! If you care about preventing
> > > > writes to memory which should not be writable, then you should use
> > > > rodata=full.
> > > > 
> > > 
> > > This is not entirely true - the core kernel text and rodata are
> > > remapped r/o in the linear map, whereas all module code and rodata are
> > > left writable when rodata != full.
> > 
> > Yes, sorry, you're quite right. The kernel text is only writable if
> > rodata=off.
> > 
> > But I still think it makes sense to protect the crashkernel only if
> > rodata=full (which is the default on arm64) as this allows us to rely
> > on page mappings and I think fits well with what we do for modules.
> > 
> > > But the conclusion is the same, imo: if you can't be bothered to
> > > protect a good chunk of the code and rodata that the kernel relies on,
> > > why should the crashkernel be treated any differently?
> > 
> > Thanks.
> > 
> > Will
> Thanks.
> 
> 1.The rodata full is harm to the performance and has been disabled in-house.
> 
> 2.When using crashkernel with rodata non full, the kernel also will use non
> block/section mapping which cause high d-TLB miss and degrade performance
> greatly.
> This patch fix it to use block/section mapping as far as possible.
> 
> bool can_set_direct_map(void)
> {
> 	return rodata_full || debug_pagealloc_enabled();
> }
> 
> map_mem:
> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> 	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> 
> 3.When rodata full is disabled, crashkernel also need protect(keep
> arch_kexec_[un]protect_crashkres using).
> I think crashkernel should't depend on radata full(Maybe other architecture
> don't support radata full now).

I think this is going round in circles :/

As a first step, can we please leave the crashkernel mapped unless
rodata=full? It should be a much simpler patch to write, review and maintain
and it gives you the performance you want when crashkernel is being used.

Will
guanghui.fgh July 5, 2022, 12:27 p.m. UTC | #17
在 2022/7/5 20:11, Will Deacon 写道:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/5 17:52, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
>>>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote:
>>>>>
>>>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2022/7/4 22:23, Will Deacon 写道:
>>>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>>>> ...
>>>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>>>>>> before shringking.
>>>>>>>
>>>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>>>>>> that if we're leaving the thing mapped, no?
>>>>>>>
>>>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>>>>>
>>>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>>>>>> the crashkernel mem by mistake.
>>>>>>
>>>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>>>>>> pagetable, there maybe some write operations to these mem by mistake which
>>>>>> may cause crashkernel boot error and vmcore saving error.
>>>>>
>>>>> I don't really buy this line of reasoning. The entire main kernel is
>>>>> writable, so why do we care about protecting the crashkernel so much? The
>>>>> _code_ to launch the crash kernel is writable! If you care about preventing
>>>>> writes to memory which should not be writable, then you should use
>>>>> rodata=full.
>>>>>
>>>>
>>>> This is not entirely true - the core kernel text and rodata are
>>>> remapped r/o in the linear map, whereas all module code and rodata are
>>>> left writable when rodata != full.
>>>
>>> Yes, sorry, you're quite right. The kernel text is only writable if
>>> rodata=off.
>>>
>>> But I still think it makes sense to protect the crashkernel only if
>>> rodata=full (which is the default on arm64) as this allows us to rely
>>> on page mappings and I think fits well with what we do for modules.
>>>
>>>> But the conclusion is the same, imo: if you can't be bothered to
>>>> protect a good chunk of the code and rodata that the kernel relies on,
>>>> why should the crashkernel be treated any differently?
>>>
>>> Thanks.
>>>
>>> Will
>> Thanks.
>>
>> 1.The rodata full is harm to the performance and has been disabled in-house.
>>
>> 2.When using crashkernel with rodata non full, the kernel also will use non
>> block/section mapping which cause high d-TLB miss and degrade performance
>> greatly.
>> This patch fix it to use block/section mapping as far as possible.
>>
>> bool can_set_direct_map(void)
>> {
>> 	return rodata_full || debug_pagealloc_enabled();
>> }
>>
>> map_mem:
>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>> 	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> 3.When rodata full is disabled, crashkernel also need protect(keep
>> arch_kexec_[un]protect_crashkres using).
>> I think crashkernel should't depend on radata full(Maybe other architecture
>> don't support radata full now).
> 
> I think this is going round in circles :/
> 
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.
> 
> Will

Thanks.

There is a circular dependency here.

1.When rodata is not full, arch_kexec_[un]protect_crashkres() currently 
hits BUG_ON(pud_huge(*pud)), so the crashkernel mem still needs to be 
mapped without block/section mappings.

2.In other words, we could instead change 
arch_kexec_[un]protect_crashkres() to cope with block/section mappings, 
which would let the crashkernel keep its block/section mapping.

But when we shrink the crashkernel mem, we still have to split some 
block/section mappings (part of the mem stays with the crashkernel, the 
rest goes back to the normal kernel).
As a result, building the crashkernel mem with non-block/section 
mappings, as this patch does, seems the more appropriate choice.
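
For reference, the current arm64 arch_kexec_protect_crashkres() is 
(roughly -- see machine_kexec.c for the exact code) a loop over the crash 
image segments calling set_memory_valid(), which walks the range with 
apply_to_page_range() and so only works on page-granular mappings:

	void arch_kexec_protect_crashkres(void)
	{
		int i;

		for (i = 0; i < kexec_crash_image->nr_segments; i++)
			set_memory_valid(
				__phys_to_virt(kexec_crash_image->segment[i].mem),
				kexec_crash_image->segment[i].memsz >> PAGE_SHIFT, 0);
	}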
Mike Rapoport July 5, 2022, 12:56 p.m. UTC | #18
On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
> > 
> > 1.The rodata full is harm to the performance and has been disabled in-house.
> > 
> > 2.When using crashkernel with rodata non full, the kernel also will use non
> > block/section mapping which cause high d-TLB miss and degrade performance
> > greatly.
> > This patch fix it to use block/section mapping as far as possible.
> > 
> > bool can_set_direct_map(void)
> > {
> > 	return rodata_full || debug_pagealloc_enabled();
> > }
> > 
> > map_mem:
> > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > 	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > 
> > 3.When rodata full is disabled, crashkernel also need protect(keep
> > arch_kexec_[un]protect_crashkres using).
> > I think crashkernel should't depend on radata full(Maybe other architecture
> > don't support radata full now).
> 
> I think this is going round in circles :/
> 
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.

Since we are talking about large systems, what do you think about letting
them set CRASH_ALIGN to PUD_SIZE, then

	unmap(crashkernel);
	__create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS);

should be enough to make crash kernel mapped with base pages.
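
Spelled out with the helpers already used in mmu.c (just a sketch of the 
idea, not a tested diff -- the exact flags are a guess), that would be 
something like:

	/* assumes crashk_res start and size are PUD_SIZE aligned */
	unmap_hotplug_range(__phys_to_virt(crashk_res.start),
			    __phys_to_virt(crashk_res.end + 1), false, NULL);
	__create_pgd_mapping(swapper_pg_dir, crashk_res.start,
			     __phys_to_virt(crashk_res.start),
			     resource_size(&crashk_res), PAGE_KERNEL,
			     early_pgtable_alloc,
			     NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);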
 
> Will
guanghui.fgh July 5, 2022, 1:17 p.m. UTC | #19
在 2022/7/5 20:56, Mike Rapoport 写道:
> On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
>> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
>>>
>>> 1.The rodata full is harm to the performance and has been disabled in-house.
>>>
>>> 2.When using crashkernel with rodata non full, the kernel also will use non
>>> block/section mapping which cause high d-TLB miss and degrade performance
>>> greatly.
>>> This patch fix it to use block/section mapping as far as possible.
>>>
>>> bool can_set_direct_map(void)
>>> {
>>> 	return rodata_full || debug_pagealloc_enabled();
>>> }
>>>
>>> map_mem:
>>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>> 	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>
>>> 3.When rodata full is disabled, crashkernel also need protect(keep
>>> arch_kexec_[un]protect_crashkres using).
>>> I think crashkernel should't depend on radata full(Maybe other architecture
>>> don't support radata full now).
>>
>> I think this is going round in circles :/
>>
>> As a first step, can we please leave the crashkernel mapped unless
>> rodata=full? It should be a much simpler patch to write, review and maintain
>> and it gives you the performance you want when crashkernel is being used.
> 
> Since we are talking about large systems, what do you think about letting
> them set CRASH_ALIGN to PUD_SIZE, then
> 
> 	unmap(crashkernel);
> 	__create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS);
> 
> should be enough to make crash kernel mapped with base pages.
>   
>> Will
> 
Thanks.

1.When the kernel boots with a crashkernel, the parameter is typically 
given in the range format:
0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M, which adapts the 
reservation size to the system (e.g. a machine with 512G of RAM gets a 
320M crashkernel).

2.As mentioned above, the crashkernel mem size may therefore be smaller 
than PUD_SIZE (and not a multiple of PUD_SIZE).
So we would still need some non-block/section mappings.
Mike Rapoport July 5, 2022, 3:02 p.m. UTC | #20
On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
> > 
> > 3.When rodata full is disabled, crashkernel also need protect(keep
> > arch_kexec_[un]protect_crashkres using).
> > I think crashkernel should't depend on radata full(Maybe other architecture
> > don't support radata full now).
> 
> I think this is going round in circles :/
> 
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.

As it seems I failed to communicate my thoughts about reusing the existing
unmap_hotplug_range() to remap the crash kernel, let's try a more formal
approach ;-)

This is what I came up with and it does not look too complex. There are a
couple of extra #ifdefs that can be removed if we toss some code around in
a preparation patch.

From 5adbfcbe370da0f09cd917e73aaac7ba8c6b45df Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Sat, 2 Jul 2022 23:57:53 +0800
Subject: [PATCH] arm64/mm: remap crash kernel with base pages even if
 rodata_full disabled

For server systems it is important to protect crash kernel memory for
post-mortem analysis. In order to protect this memory it should be mapped
at PTE level.

When CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled, usage of crash kernel
essentially forces mapping of the entire linear map with base pages even if
rodata_full is not set (commit 2687275a5843 ("arm64: Force
NO_BLOCK_MAPPINGS if crashkernel reservation is required")) and this causes
performance degradation.

To reduce the performance degradation, postpone reservation of the crash
kernel memory to bootmem_init() regardless of CONFIG_ZONE_DMA or
CONFIG_ZONE_DMA32 and enable remapping of the crash kernel memory at PTE
level.

Co-developed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/arm64/include/asm/mmu.h |  1 +
 arch/arm64/mm/init.c         |  8 +---
 arch/arm64/mm/mmu.c          | 91 +++++++++++++++++++-----------------
 3 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466a4be9..f4eb2f61dd0d 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern bool kaslr_requires_kpti(void);
+extern void remap_crashkernel(void);
 
 #define INIT_MM_CONTEXT(name)	\
 	.pgd = init_pg_dir,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84e5a61..51f8329931f8 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
 	crashk_res.start = crash_base;
 	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
+	remap_crashkernel();
 }
 
 /*
@@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
 	}
 
 	early_init_fdt_scan_reserved_mem();
-
-	if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
-
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
 }
 
@@ -438,8 +435,7 @@ void __init bootmem_init(void)
 	 * request_standard_resources() depends on crashkernel's memory being
 	 * reserved, so do it here.
 	 */
-	if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
+	reserve_crashkernel();
 
 	memblock_dump_all();
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32873c6..e0b5769bfc9f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -483,21 +483,6 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
-static bool crash_mem_map __initdata;
-
-static int __init enable_crash_mem_map(char *arg)
-{
-	/*
-	 * Proper parameter parsing is done by reserve_crashkernel(). We only
-	 * need to know if the linear map has to avoid block mappings so that
-	 * the crashkernel reservations can be unmapped later.
-	 */
-	crash_mem_map = true;
-
-	return 0;
-}
-early_param("crashkernel", enable_crash_mem_map);
-
 static void __init map_mem(pgd_t *pgdp)
 {
 	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
@@ -527,17 +512,6 @@ static void __init map_mem(pgd_t *pgdp)
 	 */
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
 
-#ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map) {
-		if (IS_ENABLED(CONFIG_ZONE_DMA) ||
-		    IS_ENABLED(CONFIG_ZONE_DMA32))
-			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
-		else if (crashk_res.end)
-			memblock_mark_nomap(crashk_res.start,
-			    resource_size(&crashk_res));
-	}
-#endif
-
 	/* map all the memory banks */
 	for_each_mem_range(i, &start, &end) {
 		if (start >= end)
@@ -570,19 +544,6 @@ static void __init map_mem(pgd_t *pgdp)
 	 * in page granularity and put back unused memory to buddy system
 	 * through /sys/kernel/kexec_crash_size interface.
 	 */
-#ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map &&
-	    !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
-	}
-#endif
 }
 
 void mark_rodata_ro(void)
@@ -827,7 +788,7 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(pte));
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_KEXEC_CORE)
 static void free_hotplug_page_range(struct page *page, size_t size,
 				    struct vmem_altmap *altmap)
 {
@@ -839,6 +800,7 @@ static void free_hotplug_page_range(struct page *page, size_t size,
 	}
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 static void free_hotplug_pgtable_page(struct page *page)
 {
 	free_hotplug_page_range(page, PAGE_SIZE, NULL);
@@ -862,6 +824,7 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end,
 		return false;
 	return true;
 }
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
 static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
@@ -994,7 +957,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
 	} while (addr = next, addr < end);
 }
+#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 				 unsigned long end, unsigned long floor,
 				 unsigned long ceiling)
@@ -1148,7 +1113,7 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 		free_empty_p4d_table(pgdp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 }
-#endif
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #if !ARM64_KERNEL_USES_PMD_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1213,7 +1178,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 	unmap_hotplug_range(start, end, true, altmap);
 	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */
 
 static inline pud_t *fixmap_pud(unsigned long addr)
 {
@@ -1677,3 +1642,45 @@ static int __init prevent_bootmem_remove_init(void)
 }
 early_initcall(prevent_bootmem_remove_init);
 #endif
+
+void __init remap_crashkernel(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+	phys_addr_t start, end, size;
+	phys_addr_t aligned_start, aligned_end;
+
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
+	    return;
+
+	if (!crashk_res.end)
+	    return;
+
+	start = crashk_res.start & PAGE_MASK;
+	end = PAGE_ALIGN(crashk_res.end);
+
+	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
+	aligned_end = ALIGN(end, PUD_SIZE);
+
+	/* Clear PUDs containing crash kernel memory */
+	unmap_hotplug_range(__phys_to_virt(aligned_start),
+			    __phys_to_virt(aligned_end), false, NULL);
+
+	/* map area from PUD start to start of crash kernel with large pages */
+	size = start - aligned_start;
+	__create_pgd_mapping(swapper_pg_dir, aligned_start,
+			     __phys_to_virt(aligned_start),
+			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
+
+	/* map crash kernel memory with base pages */
+	size = end - start;
+	__create_pgd_mapping(swapper_pg_dir, start,  __phys_to_virt(start),
+			     size, PAGE_KERNEL, early_pgtable_alloc,
+			     NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS |
+			     NO_CONT_MAPPINGS);
+
+	/* map area from end of crash kernel to PUD end with large pages */
+	size = aligned_end - end;
+	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
+			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
+#endif
+}
Catalin Marinas July 5, 2022, 3:34 p.m. UTC | #21
On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> +void __init remap_crashkernel(void)
> +{
> +#ifdef CONFIG_KEXEC_CORE
> +	phys_addr_t start, end, size;
> +	phys_addr_t aligned_start, aligned_end;
> +
> +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> +	    return;
> +
> +	if (!crashk_res.end)
> +	    return;
> +
> +	start = crashk_res.start & PAGE_MASK;
> +	end = PAGE_ALIGN(crashk_res.end);
> +
> +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> +	aligned_end = ALIGN(end, PUD_SIZE);
> +
> +	/* Clear PUDs containing crash kernel memory */
> +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> +			    __phys_to_virt(aligned_end), false, NULL);

What I don't understand is what happens if there's valid kernel data
between aligned_start and crashk_res.start (or the other end of the
range).
Mike Rapoport July 5, 2022, 3:57 p.m. UTC | #22
On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > +void __init remap_crashkernel(void)
> > +{
> > +#ifdef CONFIG_KEXEC_CORE
> > +	phys_addr_t start, end, size;
> > +	phys_addr_t aligned_start, aligned_end;
> > +
> > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > +	    return;
> > +
> > +	if (!crashk_res.end)
> > +	    return;
> > +
> > +	start = crashk_res.start & PAGE_MASK;
> > +	end = PAGE_ALIGN(crashk_res.end);
> > +
> > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > +	aligned_end = ALIGN(end, PUD_SIZE);
> > +
> > +	/* Clear PUDs containing crash kernel memory */
> > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > +			    __phys_to_virt(aligned_end), false, NULL);
> 
> What I don't understand is what happens if there's valid kernel data
> between aligned_start and crashk_res.start (or the other end of the
> range).

Data shouldn't go anywhere :)

There is 

+	/* map area from PUD start to start of crash kernel with large pages */
+	size = start - aligned_start;
+	__create_pgd_mapping(swapper_pg_dir, aligned_start,
+			     __phys_to_virt(aligned_start),
+			     size, PAGE_KERNEL, early_pgtable_alloc, 0);

and 

+	/* map area from end of crash kernel to PUD end with large pages */
+	size = aligned_end - end;
+	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
+			     size, PAGE_KERNEL, early_pgtable_alloc, 0);

after the unmap, so after we tear down a part of a linear map we
immediately recreate it, just with a different page size.

This all happens before SMP, so there is no concurrency at that point.

> -- 
> Catalin
Catalin Marinas July 5, 2022, 5:05 p.m. UTC | #23
On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > +void __init remap_crashkernel(void)
> > > +{
> > > +#ifdef CONFIG_KEXEC_CORE
> > > +	phys_addr_t start, end, size;
> > > +	phys_addr_t aligned_start, aligned_end;
> > > +
> > > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > +	    return;
> > > +
> > > +	if (!crashk_res.end)
> > > +	    return;
> > > +
> > > +	start = crashk_res.start & PAGE_MASK;
> > > +	end = PAGE_ALIGN(crashk_res.end);
> > > +
> > > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > +	aligned_end = ALIGN(end, PUD_SIZE);
> > > +
> > > +	/* Clear PUDs containing crash kernel memory */
> > > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > +			    __phys_to_virt(aligned_end), false, NULL);
> > 
> > What I don't understand is what happens if there's valid kernel data
> > between aligned_start and crashk_res.start (or the other end of the
> > range).
> 
> Data shouldn't go anywhere :)
> 
> There is 
> 
> +	/* map area from PUD start to start of crash kernel with large pages */
> +	size = start - aligned_start;
> +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
> +			     __phys_to_virt(aligned_start),
> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> 
> and 
> 
> +	/* map area from end of crash kernel to PUD end with large pages */
> +	size = aligned_end - end;
> +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> 
> after the unmap, so after we tear down a part of a linear map we
> immediately recreate it, just with a different page size.
> 
> This all happens before SMP, so there is no concurrency at that point.

That brief period of unmap worries me. The kernel text, data and stack
are all in the vmalloc space but any other (memblock) allocation to this
point may be in the unmapped range before and after the crashkernel
reservation. The interrupts are off, so I think the only allocation and
potential access that may go in this range is the page table itself. But
it looks fragile to me.
Mike Rapoport July 5, 2022, 8:45 p.m. UTC | #24
On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > +void __init remap_crashkernel(void)
> > > > +{
> > > > +#ifdef CONFIG_KEXEC_CORE
> > > > +	phys_addr_t start, end, size;
> > > > +	phys_addr_t aligned_start, aligned_end;
> > > > +
> > > > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > +	    return;
> > > > +
> > > > +	if (!crashk_res.end)
> > > > +	    return;
> > > > +
> > > > +	start = crashk_res.start & PAGE_MASK;
> > > > +	end = PAGE_ALIGN(crashk_res.end);
> > > > +
> > > > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > +	aligned_end = ALIGN(end, PUD_SIZE);
> > > > +
> > > > +	/* Clear PUDs containing crash kernel memory */
> > > > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > +			    __phys_to_virt(aligned_end), false, NULL);
> > > 
> > > What I don't understand is what happens if there's valid kernel data
> > > between aligned_start and crashk_res.start (or the other end of the
> > > range).
> > 
> > Data shouldn't go anywhere :)
> > 
> > There is 
> > 
> > +	/* map area from PUD start to start of crash kernel with large pages */
> > +	size = start - aligned_start;
> > +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
> > +			     __phys_to_virt(aligned_start),
> > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > 
> > and 
> > 
> > +	/* map area from end of crash kernel to PUD end with large pages */
> > +	size = aligned_end - end;
> > +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > 
> > after the unmap, so after we tear down a part of a linear map we
> > immediately recreate it, just with a different page size.
> > 
> > This all happens before SMP, so there is no concurrency at that point.
> 
> That brief period of unmap worries me. The kernel text, data and stack
> are all in the vmalloc space but any other (memblock) allocation to this
> point may be in the unmapped range before and after the crashkernel
> reservation. The interrupts are off, so I think the only allocation and
> potential access that may go in this range is the page table itself. But
> it looks fragile to me.

I agree there are chances there will be an allocation from the unmapped
range. 

We can make sure this won't happen, though. We can cap the memblock
allocations with memblock_set_current_limit(aligned_end) or
memblock_reserve(aligned_start, aligned_end) until the mappings are
restored.
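
Roughly, inside remap_crashkernel() around the unmap (a sketch; the exact 
helper for dropping the temporary reservations afterwards is a detail to 
check):

	/* keep new allocations, incl. page-table pages, out of the window */
	memblock_reserve(aligned_start, start - aligned_start);
	memblock_reserve(end, aligned_end - end);

	unmap_hotplug_range(__phys_to_virt(aligned_start),
			    __phys_to_virt(aligned_end), false, NULL);
	/* ... the three __create_pgd_mapping() calls from the patch ... */

	/* drop the temporary reservations, crashk_res itself stays reserved */
	memblock_phys_free(aligned_start, start - aligned_start);
	memblock_phys_free(end, aligned_end - end);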
 
> -- 
> Catalin
guanghui.fgh July 6, 2022, 2:49 a.m. UTC | #25
Thanks.

在 2022/7/6 4:45, Mike Rapoport 写道:
> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>> +void __init remap_crashkernel(void)
>>>>> +{
>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>> +	phys_addr_t start, end, size;
>>>>> +	phys_addr_t aligned_start, aligned_end;
>>>>> +
>>>>> +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>> +	    return;
>>>>> +
>>>>> +	if (!crashk_res.end)
>>>>> +	    return;
>>>>> +
>>>>> +	start = crashk_res.start & PAGE_MASK;
>>>>> +	end = PAGE_ALIGN(crashk_res.end);
>>>>> +
>>>>> +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>> +	aligned_end = ALIGN(end, PUD_SIZE);
>>>>> +
>>>>> +	/* Clear PUDs containing crash kernel memory */
>>>>> +	unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>> +			    __phys_to_virt(aligned_end), false, NULL);
>>>>
>>>> What I don't understand is what happens if there's valid kernel data
>>>> between aligned_start and crashk_res.start (or the other end of the
>>>> range).
>>>
>>> Data shouldn't go anywhere :)
>>>
>>> There is
>>>
>>> +	/* map area from PUD start to start of crash kernel with large pages */
>>> +	size = start - aligned_start;
>>> +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
>>> +			     __phys_to_virt(aligned_start),
>>> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>
>>> and
>>>
>>> +	/* map area from end of crash kernel to PUD end with large pages */
>>> +	size = aligned_end - end;
>>> +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>
>>> after the unmap, so after we tear down a part of a linear map we
>>> immediately recreate it, just with a different page size.
>>>
>>> This all happens before SMP, so there is no concurrency at that point.
>>
>> That brief period of unmap worries me. The kernel text, data and stack
>> are all in the vmalloc space but any other (memblock) allocation to this
>> point may be in the unmapped range before and after the crashkernel
>> reservation. The interrupts are off, so I think the only allocation and
>> potential access that may go in this range is the page table itself. But
>> it looks fragile to me.
> 
> I agree there are chances there will be an allocation from the unmapped
> range.
> 
> We can make sure this won't happen, though. We can cap the memblock
> allocations with memblock_set_current_limit(aligned_end) or
> memblock_reserve(algined_start, aligned_end) until the mappings are
> restored.
>   
>> -- 
>> Catalin
> 
I think there is no need to worry about vmalloc mem.

1.As mentioned above, when reserving the crashkernel and remapping the 
linear mem mapping, only the boot cpu is running; no other cpu/thread 
runs at the same time.

2.Although vmalloc may have allocated mem from the unmapped area, we 
rebuild the mapping at pte level, which keeps every virtual address 
pointing at the same physical address
(and at that time no other cpu/thread is accessing vmalloc mem).

As a result, this has no effect on vmalloc mem.
Catalin Marinas July 6, 2022, 7:43 a.m. UTC | #26
On Wed, Jul 06, 2022 at 10:49:43AM +0800, guanghui.fgh wrote:
> 在 2022/7/6 4:45, Mike Rapoport 写道:
> > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > > +void __init remap_crashkernel(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > > +	phys_addr_t start, end, size;
> > > > > > +	phys_addr_t aligned_start, aligned_end;
> > > > > > +
> > > > > > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > > +	    return;
> > > > > > +
> > > > > > +	if (!crashk_res.end)
> > > > > > +	    return;
> > > > > > +
> > > > > > +	start = crashk_res.start & PAGE_MASK;
> > > > > > +	end = PAGE_ALIGN(crashk_res.end);
> > > > > > +
> > > > > > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > > +	aligned_end = ALIGN(end, PUD_SIZE);
> > > > > > +
> > > > > > +	/* Clear PUDs containing crash kernel memory */
> > > > > > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > > +			    __phys_to_virt(aligned_end), false, NULL);
> > > > > 
> > > > > What I don't understand is what happens if there's valid kernel data
> > > > > between aligned_start and crashk_res.start (or the other end of the
> > > > > range).
> > > > 
> > > > Data shouldn't go anywhere :)
> > > > 
> > > > There is
> > > > 
> > > > +	/* map area from PUD start to start of crash kernel with large pages */
> > > > +	size = start - aligned_start;
> > > > +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > > +			     __phys_to_virt(aligned_start),
> > > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > > 
> > > > and
> > > > 
> > > > +	/* map area from end of crash kernel to PUD end with large pages */
> > > > +	size = aligned_end - end;
> > > > +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > > 
> > > > after the unmap, so after we tear down a part of a linear map we
> > > > immediately recreate it, just with a different page size.
> > > > 
> > > > This all happens before SMP, so there is no concurrency at that point.
> > > 
> > > That brief period of unmap worries me. The kernel text, data and stack
> > > are all in the vmalloc space but any other (memblock) allocation to this
> > > point may be in the unmapped range before and after the crashkernel
> > > reservation. The interrupts are off, so I think the only allocation and
> > > potential access that may go in this range is the page table itself. But
> > > it looks fragile to me.
> > 
> > I agree there are chances there will be an allocation from the unmapped
> > range.
> > 
> > We can make sure this won't happen, though. We can cap the memblock
> > allocations with memblock_set_current_limit(aligned_end) or
> > memblock_reserve(algined_start, aligned_end) until the mappings are
> > restored.
> 
> I think there is no need to worry about vmalloc mem.

That's not what I'm worried about. It's about memblock allocations that
are accessed through the linear map.
Catalin Marinas July 6, 2022, 10:04 a.m. UTC | #27
On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > +void __init remap_crashkernel(void)
> > > > > +{
> > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > +	phys_addr_t start, end, size;
> > > > > +	phys_addr_t aligned_start, aligned_end;
> > > > > +
> > > > > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > +	    return;
> > > > > +
> > > > > +	if (!crashk_res.end)
> > > > > +	    return;
> > > > > +
> > > > > +	start = crashk_res.start & PAGE_MASK;
> > > > > +	end = PAGE_ALIGN(crashk_res.end);
> > > > > +
> > > > > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > +	aligned_end = ALIGN(end, PUD_SIZE);
> > > > > +
> > > > > +	/* Clear PUDs containing crash kernel memory */
> > > > > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > +			    __phys_to_virt(aligned_end), false, NULL);
> > > > 
> > > > What I don't understand is what happens if there's valid kernel data
> > > > between aligned_start and crashk_res.start (or the other end of the
> > > > range).
> > > 
> > > Data shouldn't go anywhere :)
> > > 
> > > There is 
> > > 
> > > +	/* map area from PUD start to start of crash kernel with large pages */
> > > +	size = start - aligned_start;
> > > +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > +			     __phys_to_virt(aligned_start),
> > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > 
> > > and 
> > > 
> > > +	/* map area from end of crash kernel to PUD end with large pages */
> > > +	size = aligned_end - end;
> > > +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > 
> > > after the unmap, so after we tear down a part of a linear map we
> > > immediately recreate it, just with a different page size.
> > > 
> > > This all happens before SMP, so there is no concurrency at that point.
> > 
> > That brief period of unmap worries me. The kernel text, data and stack
> > are all in the vmalloc space but any other (memblock) allocation to this
> > point may be in the unmapped range before and after the crashkernel
> > reservation. The interrupts are off, so I think the only allocation and
> > potential access that may go in this range is the page table itself. But
> > it looks fragile to me.
> 
> I agree there are chances there will be an allocation from the unmapped
> range. 
> 
> We can make sure this won't happen, though. We can cap the memblock
> allocations with memblock_set_current_limit(aligned_end) or
> memblock_reserve(algined_start, aligned_end) until the mappings are
> restored. 

We can reserve the region just before unmapping to avoid new allocations
for the page tables but we can't do much about pages already allocated
prior to calling remap_crashkernel().
Mike Rapoport July 6, 2022, 1:54 p.m. UTC | #28
On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > > +void __init remap_crashkernel(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > > +	phys_addr_t start, end, size;
> > > > > > +	phys_addr_t aligned_start, aligned_end;
> > > > > > +
> > > > > > +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > > +	    return;
> > > > > > +
> > > > > > +	if (!crashk_res.end)
> > > > > > +	    return;
> > > > > > +
> > > > > > +	start = crashk_res.start & PAGE_MASK;
> > > > > > +	end = PAGE_ALIGN(crashk_res.end);
> > > > > > +
> > > > > > +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > > +	aligned_end = ALIGN(end, PUD_SIZE);
> > > > > > +
> > > > > > +	/* Clear PUDs containing crash kernel memory */
> > > > > > +	unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > > +			    __phys_to_virt(aligned_end), false, NULL);
> > > > > 
> > > > > What I don't understand is what happens if there's valid kernel data
> > > > > between aligned_start and crashk_res.start (or the other end of the
> > > > > range).
> > > > 
> > > > Data shouldn't go anywhere :)
> > > > 
> > > > There is 
> > > > 
> > > > +	/* map area from PUD start to start of crash kernel with large pages */
> > > > +	size = start - aligned_start;
> > > > +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > > +			     __phys_to_virt(aligned_start),
> > > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > > 
> > > > and 
> > > > 
> > > > +	/* map area from end of crash kernel to PUD end with large pages */
> > > > +	size = aligned_end - end;
> > > > +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > > +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > > 
> > > > after the unmap, so after we tear down a part of a linear map we
> > > > immediately recreate it, just with a different page size.
> > > > 
> > > > This all happens before SMP, so there is no concurrency at that point.
> > > 
> > > That brief period of unmap worries me. The kernel text, data and stack
> > > are all in the vmalloc space but any other (memblock) allocation to this
> > > point may be in the unmapped range before and after the crashkernel
> > > reservation. The interrupts are off, so I think the only allocation and
> > > potential access that may go in this range is the page table itself. But
> > > it looks fragile to me.
> > 
> > I agree there are chances there will be an allocation from the unmapped
> > range. 
> > 
> > We can make sure this won't happen, though. We can cap the memblock
> > allocations with memblock_set_current_limit(aligned_end) or
> > memblock_reserve(algined_start, aligned_end) until the mappings are
> > restored. 
> 
> We can reserve the region just before unmapping to avoid new allocations
> for the page tables but we can't do much about pages already allocated
> prior to calling remap_crashkernel().

Right, this was bothering me too after I re-read your previous email.

One thing I can think of is to only remap the crash kernel memory if it is
part of an allocation that exactly fits into one or more PUDs.

Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
succeeds, we remap the entire area that now contains only memory allocated
in reserve_crashkernel() and free the extra memory after remapping is done.
If the large allocation fails, we fall back to the original size and
alignment and don't allow unmapping crash kernel memory in
arch_kexec_protect_crashkres().
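
Something along these lines, with names following reserve_crashkernel() 
(sketch only; crashk_can_unmap is a made-up flag that 
arch_kexec_protect_crashkres() would check, and the tail is freed inline 
here for brevity rather than after the remapping):

	/* try to grab whole PUDs so nothing else shares them */
	crash_base = memblock_phys_alloc_range(ALIGN(crash_size, PUD_SIZE),
					       PUD_SIZE, crash_base, crash_max);
	if (crash_base) {
		/* give back the rounded-up tail */
		memblock_phys_free(crash_base + crash_size,
				   ALIGN(crash_size, PUD_SIZE) - crash_size);
		crashk_can_unmap = true;
	} else {
		/* fall back: original size/alignment, keep it always mapped */
		crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
						       crash_base, crash_max);
	}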
 
> -- 
> Catalin
guanghui.fgh July 6, 2022, 3:18 p.m. UTC | #29
Thanks.

在 2022/7/6 21:54, Mike Rapoport 写道:
> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>>>> +void __init remap_crashkernel(void)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>>>> +	phys_addr_t start, end, size;
>>>>>>> +	phys_addr_t aligned_start, aligned_end;
>>>>>>> +
>>>>>>> +	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>>>> +	    return;
>>>>>>> +
>>>>>>> +	if (!crashk_res.end)
>>>>>>> +	    return;
>>>>>>> +
>>>>>>> +	start = crashk_res.start & PAGE_MASK;
>>>>>>> +	end = PAGE_ALIGN(crashk_res.end);
>>>>>>> +
>>>>>>> +	aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>>>> +	aligned_end = ALIGN(end, PUD_SIZE);
>>>>>>> +
>>>>>>> +	/* Clear PUDs containing crash kernel memory */
>>>>>>> +	unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>>>> +			    __phys_to_virt(aligned_end), false, NULL);
>>>>>>
>>>>>> What I don't understand is what happens if there's valid kernel data
>>>>>> between aligned_start and crashk_res.start (or the other end of the
>>>>>> range).
>>>>>
>>>>> Data shouldn't go anywhere :)
>>>>>
>>>>> There is
>>>>>
>>>>> +	/* map area from PUD start to start of crash kernel with large pages */
>>>>> +	size = start - aligned_start;
>>>>> +	__create_pgd_mapping(swapper_pg_dir, aligned_start,
>>>>> +			     __phys_to_virt(aligned_start),
>>>>> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>
>>>>> and
>>>>>
>>>>> +	/* map area from end of crash kernel to PUD end with large pages */
>>>>> +	size = aligned_end - end;
>>>>> +	__create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>>>> +			     size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>
>>>>> after the unmap, so after we tear down a part of a linear map we
>>>>> immediately recreate it, just with a different page size.
>>>>>
>>>>> This all happens before SMP, so there is no concurrency at that point.
>>>>
>>>> That brief period of unmap worries me. The kernel text, data and stack
>>>> are all in the vmalloc space but any other (memblock) allocation to this
>>>> point may be in the unmapped range before and after the crashkernel
>>>> reservation. The interrupts are off, so I think the only allocation and
>>>> potential access that may go in this range is the page table itself. But
>>>> it looks fragile to me.
>>>
>>> I agree there are chances there will be an allocation from the unmapped
>>> range.
>>>
>>> We can make sure this won't happen, though. We can cap the memblock
>>> allocations with memblock_set_current_limit(aligned_end) or
>>> memblock_reserve(algined_start, aligned_end) until the mappings are
>>> restored.
>>
>> We can reserve the region just before unmapping to avoid new allocations
>> for the page tables but we can't do much about pages already allocated
>> prior to calling remap_crashkernel().
> 
> Right, this was bothering me too after I re-read you previous email.
> 
> One thing I can think of is to only remap the crash kernel memory if it is
> a part of an allocation that exactly fits into one ore more PUDs.
> 
> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
> succeeds, we remap the entire area that now contains only memory allocated
> in reserve_crashkernel() and free the extra memory after remapping is done.
> If the large allocation fails, we fall back to the original size and
> alignment and don't allow unmapping crash kernel memory in
> arch_kexec_protect_crashkres().
>   
>> -- 
>> Catalin
> 
Thanks.

There is a new method.
I think we should use the patch v3 approach (similar, but it needs some 
changes). A rough sketch of what the replacement step could look like in 
code follows below.

1.We can walk the crashkernel block/section pagetable, 
[[[keeping the original block/section mapping valid]]], 
build a new pte level page mapping for the crashkernel mem, and 
rebuild the left & right margin mem (which shares the block/section 
mapping but lies outside the crashkernel mem) with block/section mappings.

2.Then 'replace' the original block/section mapping with the newly built 
mapping, iteratively.

With this method, all of the mem mapping stays valid the whole time.

3.The patch v3 link (it still needs some changes):
https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/
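
For a single 2M block entry, the replacement step would look roughly like 
this (sketch only; it skips the phys-to-descriptor conversion helpers and 
preserving the original permissions, and whether this needs a full 
break-before-make sequence is a separate question):

	pmd_t pmd = READ_ONCE(*pmdp);
	unsigned long pfn = pmd_pfn(pmd);
	phys_addr_t pte_phys;
	pte_t *ptep;
	int i;

	/* build a fully populated pte table for the 2M range first */
	pte_phys = early_pgtable_alloc(PAGE_SHIFT);
	ptep = (pte_t *)__phys_to_virt(pte_phys);
	for (i = 0; i < PTRS_PER_PTE; i++)
		set_pte(ptep + i, pfn_pte(pfn + i, PAGE_KERNEL));

	/* then make the pmd point at the table instead of the old block */
	set_pmd(pmdp, __pmd(pte_phys | PMD_TYPE_TABLE));
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);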
guanghui.fgh July 6, 2022, 3:30 p.m. UTC | #30
在 2022/7/6 23:18, guanghui.fgh 写道:
> Thanks.
> 
> 在 2022/7/6 21:54, Mike Rapoport 写道:
>> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
>>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
>>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>>>>> +void __init remap_crashkernel(void)
>>>>>>>> +{
>>>>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>>>>> +    phys_addr_t start, end, size;
>>>>>>>> +    phys_addr_t aligned_start, aligned_end;
>>>>>>>> +
>>>>>>>> +    if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    if (!crashk_res.end)
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    start = crashk_res.start & PAGE_MASK;
>>>>>>>> +    end = PAGE_ALIGN(crashk_res.end);
>>>>>>>> +
>>>>>>>> +    aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>>>>> +    aligned_end = ALIGN(end, PUD_SIZE);
>>>>>>>> +
>>>>>>>> +    /* Clear PUDs containing crash kernel memory */
>>>>>>>> +    unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>>>>> +                __phys_to_virt(aligned_end), false, NULL);
>>>>>>>
>>>>>>> What I don't understand is what happens if there's valid kernel data
>>>>>>> between aligned_start and crashk_res.start (or the other end of the
>>>>>>> range).
>>>>>>
>>>>>> Data shouldn't go anywhere :)
>>>>>>
>>>>>> There is
>>>>>>
>>>>>> +    /* map area from PUD start to start of crash kernel with 
>>>>>> large pages */
>>>>>> +    size = start - aligned_start;
>>>>>> +    __create_pgd_mapping(swapper_pg_dir, aligned_start,
>>>>>> +                 __phys_to_virt(aligned_start),
>>>>>> +                 size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>>
>>>>>> and
>>>>>>
>>>>>> +    /* map area from end of crash kernel to PUD end with large 
>>>>>> pages */
>>>>>> +    size = aligned_end - end;
>>>>>> +    __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>>>>> +                 size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>>
>>>>>> after the unmap, so after we tear down a part of a linear map we
>>>>>> immediately recreate it, just with a different page size.
>>>>>>
>>>>>> This all happens before SMP, so there is no concurrency at that 
>>>>>> point.
>>>>>
>>>>> That brief period of unmap worries me. The kernel text, data and stack
>>>>> are all in the vmalloc space but any other (memblock) allocation to 
>>>>> this
>>>>> point may be in the unmapped range before and after the crashkernel
>>>>> reservation. The interrupts are off, so I think the only allocation 
>>>>> and
>>>>> potential access that may go in this range is the page table 
>>>>> itself. But
>>>>> it looks fragile to me.
>>>>
>>>> I agree there are chances there will be an allocation from the unmapped
>>>> range.
>>>>
>>>> We can make sure this won't happen, though. We can cap the memblock
>>>> allocations with memblock_set_current_limit(aligned_end) or
>>>> memblock_reserve(algined_start, aligned_end) until the mappings are
>>>> restored.
>>>
>>> We can reserve the region just before unmapping to avoid new allocations
>>> for the page tables but we can't do much about pages already allocated
>>> prior to calling remap_crashkernel().
>>
>> Right, this was bothering me too after I re-read you previous email.
>>
>> One thing I can think of is to only remap the crash kernel memory if 
>> it is
>> a part of an allocation that exactly fits into one ore more PUDs.
>>
>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>> succeeds, we remap the entire area that now contains only memory 
>> allocated
>> in reserve_crashkernel() and free the extra memory after remapping is 
>> done.
>> If the large allocation fails, we fall back to the original size and
>> alignment and don't allow unmapping crash kernel memory in
>> arch_kexec_protect_crashkres().
>>> -- 
>>> Catalin
>>
> Thanks.
> 
> There is a new method.
> I think we should use the patch v3(similar but need add some changes)
> 
> 1.We can walk crashkernle block/section pagetable,
> [[[(keep the origin block/section mapping valid]]]
> rebuild the pte level page mapping for the crashkernel mem
> rebuild left & right margin mem(which is in same block/section mapping 
> but out of crashkernel mem) with block/section mapping
> 
> 2.'replace' the origin block/section mapping by new builded mapping 
> iterately
> 
> With this method, all the mem mapping keep valid all the time.
> 
> 3.the patch v3 link:
> https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/ 
> 
> (Need some changes)

Namely,
while rebuilding the crashkernel mem page mapping, the original mapping 
is left untouched.

Only when the new mapping is ready do we replace the old mapping with 
the newly built one.

With this method, all of the mem mapping stays valid the whole time.
Catalin Marinas July 6, 2022, 3:40 p.m. UTC | #31
On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
> 在 2022/7/6 21:54, Mike Rapoport 写道:
> > One thing I can think of is to only remap the crash kernel memory if it is
> > a part of an allocation that exactly fits into one ore more PUDs.
> > 
> > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
> > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
> > succeeds, we remap the entire area that now contains only memory allocated
> > in reserve_crashkernel() and free the extra memory after remapping is done.
> > If the large allocation fails, we fall back to the original size and
> > alignment and don't allow unmapping crash kernel memory in
> > arch_kexec_protect_crashkres().
> 
> There is a new method.
> I think we should use the patch v3(similar but need add some changes)
> 
> 1.We can walk crashkernle block/section pagetable,
> [[[(keep the origin block/section mapping valid]]]
> rebuild the pte level page mapping for the crashkernel mem
> rebuild left & right margin mem(which is in same block/section mapping but
> out of crashkernel mem) with block/section mapping
> 
> 2.'replace' the origin block/section mapping by new builded mapping
> iterately
> 
> With this method, all the mem mapping keep valid all the time.

As I already commented on one of your previous patches, this is not
allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
think), the worst that can happen is a TLB conflict abort and the
handler should invalidate the TLBs and restart the faulting instruction,
assuming the handler won't try to access the same conflicting virtual
address. Prior to FEAT_BBM, that's not possible as the architecture does
not describe a precise behaviour of conflicting TLB entries (you might
as well get the TLB output of multiple entries being or'ed together).
guanghui.fgh July 7, 2022, 5:02 p.m. UTC | #32
Thanks.

在 2022/7/6 23:40, Catalin Marinas 写道:
> On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
>> 在 2022/7/6 21:54, Mike Rapoport 写道:
>>> One thing I can think of is to only remap the crash kernel memory if it is
>>> a part of an allocation that exactly fits into one ore more PUDs.
>>>
>>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>>> succeeds, we remap the entire area that now contains only memory allocated
>>> in reserve_crashkernel() and free the extra memory after remapping is done.
>>> If the large allocation fails, we fall back to the original size and
>>> alignment and don't allow unmapping crash kernel memory in
>>> arch_kexec_protect_crashkres().
>>
>> There is a new method.
>> I think we should use the patch v3(similar but need add some changes)
>>
>> 1.We can walk crashkernle block/section pagetable,
>> [[[(keep the origin block/section mapping valid]]]
>> rebuild the pte level page mapping for the crashkernel mem
>> rebuild left & right margin mem(which is in same block/section mapping but
>> out of crashkernel mem) with block/section mapping
>>
>> 2.'replace' the origin block/section mapping by new builded mapping
>> iterately
>>
>> With this method, all the mem mapping keep valid all the time.
> 
> As I already commented on one of your previous patches, this is not
> allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
> think), the worst that can happen is a TLB conflict abort and the
> handler should invalidate the TLBs and restart the faulting instruction,
> assuming the handler won't try to access the same conflicting virtual
> address. Prior to FEAT_BBM, that's not possible as the architecture does
> not describe a precise behaviour of conflicting TLB entries (you might
> as well get the TLB output of multiple entries being or'ed together).
> 
I think there is another way to handle it.

1.We can rebuild the crashkernel mem mapping first, 
but [[[without changing the original linear mapping]]].

2.Afterwards, we can reuse the idmap_pg_dir and switch to it.
While running on idmap_pg_dir we change the linear mapping, which 
complies with the TLB break-before-make requirement.
guanghui.fgh July 8, 2022, 12:28 p.m. UTC | #33
Thanks.

在 2022/7/6 23:40, Catalin Marinas 写道:
> On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
>> 在 2022/7/6 21:54, Mike Rapoport 写道:
>>> One thing I can think of is to only remap the crash kernel memory if it is
>>> a part of an allocation that exactly fits into one ore more PUDs.
>>>
>>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>>> succeeds, we remap the entire area that now contains only memory allocated
>>> in reserve_crashkernel() and free the extra memory after remapping is done.
>>> If the large allocation fails, we fall back to the original size and
>>> alignment and don't allow unmapping crash kernel memory in
>>> arch_kexec_protect_crashkres().
>>
>> There is a new method.
>> I think we should use the patch v3(similar but need add some changes)
>>
>> 1.We can walk crashkernle block/section pagetable,
>> [[[(keep the origin block/section mapping valid]]]
>> rebuild the pte level page mapping for the crashkernel mem
>> rebuild left & right margin mem(which is in same block/section mapping but
>> out of crashkernel mem) with block/section mapping
>>
>> 2.'replace' the origin block/section mapping by new builded mapping
>> iterately
>>
>> With this method, all the mem mapping keep valid all the time.
> 
> As I already commented on one of your previous patches, this is not
> allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
> think), the worst that can happen is a TLB conflict abort and the
> handler should invalidate the TLBs and restart the faulting instruction,
> assuming the handler won't try to access the same conflicting virtual
> address. Prior to FEAT_BBM, that's not possible as the architecture does
> not describe a precise behaviour of conflicting TLB entries (you might
> as well get the TLB output of multiple entries being or'ed together).
> 

The cpu can generate a TLB conflict abort if it detects that the address 
being looked up in the TLB hits multiple entries.

(1).I think when gathering small pages into a block/section mapping, 
there may be a tlb conflict if BBM is not followed.

Namely:
a.Map a 4KB page (address X)
  Touch that page, in order to get the translation cached in the TLB

b.Modify the translation tables, 
  replacing the mapping for address X with a 2MB mapping - DO NOT 
INVALIDATE the TLB

c.Touch "X + 4KB"
  This will/should miss in the TLB, causing a new walk returning the 
2MB mapping

d.Touch X
  Assuming they've not been evicted, you'll hit both the 4KB and the 2MB 
mapping - as both cover address X.

That is a tlb conflict.
(link: 
https://community.arm.com/support-forums/f/dev-platforms-forum/13583/tlb-conflict-abort)


(2).But when splitting a large block/section mapping into smaller 
granularity, there may be no tlb conflict.

Namely:
a.Rebuild the pte level mapping without any change to the original 
pagetable
  (the virtual-to-physical relation stays exactly the same)

b.Modify the 1G mapping to point at the new pte level table in 
[[[memory]]], without a tlb flush

c.When the cpu accesses the 1G range (anywhere in it):
  If the 1G tlb entry is still cached, the whole 1G range is accessed 
through it (no new tlb entry is loaded, no conflict)

  If the 1G tlb entry has been evicted, the walker reads the pagetable 
from memory (whether it sees the old 1G or the new 4K descriptor, the 
access still succeeds and only one entry is loaded, so no conflict)

d.Afterwards, we flush the tlb and force the cpu to use the new 
pagetable (no conflict)

So it seems there are never two tlb entries for the same virtual address 
in the tlb when splitting a large block/section mapping.



(3).At the same time, I think we can use yet another way.
Since the system linear mapping is built with init_pg_dir, we could also 
reuse init_pg_dir while splitting the block/section mapping.
init_pg_dir covers all kernel text/data accesses, so we can comply with 
the BBM requirement; a rough sketch of this sequence follows below.

a.Rebuild the new pte level mapping without any change to the old 
mapping (the cpu cannot walk into the new page mapping yet, it is isolated)

b.Switch to init_pg_dir

c.Clear the old 1G block mapping and flush the tlb

d.Modify the linear mapping to use the new pte level page mapping while 
still running on init_pg_dir (respecting TLB BBM)

e.Switch back to swapper_pg_dir
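
Put together for one 1G entry it would be roughly (sketch only; 
pmd_table_phys stands for the pre-built next-level table, and whether 
init_pg_dir is still intact by the time reserve_crashkernel() runs would 
need checking):

	/* a. the new lower-level tables for the 1G range are already built */

	/* b. run on init_pg_dir so swapper_pg_dir is not live */
	cpu_replace_ttbr1(lm_alias(init_pg_dir));

	/* c. break: clear the old 1G block entry and flush */
	pud_clear(pudp);
	flush_tlb_kernel_range(addr, addr + PUD_SIZE);

	/* d. make: install the pre-built table while still on init_pg_dir */
	set_pud(pudp, __pud(pmd_table_phys | PUD_TYPE_TABLE));

	/* e. back to swapper_pg_dir */
	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));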


Could you give me some advice?

Thanks.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466..1a46b81 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -63,6 +63,7 @@  static inline bool arm64_kernel_unmapped_at_el0(void)
 extern void arm64_memblock_init(void);
 extern void paging_init(void);
 extern void bootmem_init(void);
+extern void map_crashkernel(void);
 extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
 extern void init_mem_pgprot(void);
 extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84..241d27e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,7 @@  static void __init reserve_crashkernel(void)
 	crashk_res.start = crash_base;
 	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
+	map_crashkernel();
 }
 
 /*
@@ -388,10 +389,6 @@  void __init arm64_memblock_init(void)
 	}
 
 	early_init_fdt_scan_reserved_mem();
-
-	if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
-
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
 }
 
@@ -438,8 +435,7 @@  void __init bootmem_init(void)
 	 * request_standard_resources() depends on crashkernel's memory being
 	 * reserved, so do it here.
 	 */
-	if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
+	reserve_crashkernel();
 
 	memblock_dump_all();
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32..76a4ff0 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -65,6 +65,10 @@ 
 
 static DEFINE_SPINLOCK(swapper_pgdir_lock);
 static DEFINE_MUTEX(fixmap_lock);
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool free_mapped, struct vmem_altmap *altmap,
+				pgprot_t prot,
+				phys_addr_t (*pgtable_alloc)(int), int flags);
 
 void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
@@ -483,20 +487,49 @@  void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
-static bool crash_mem_map __initdata;
+#ifdef CONFIG_KEXEC_CORE
+static phys_addr_t __init early_crashkernel_pgtable_alloc(int shift)
+{
+	phys_addr_t phys;
+	void *ptr;
+
+	phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
+					 MEMBLOCK_ALLOC_NOLEAKTRACE);
+	if (!phys)
+		panic("Failed to allocate page table page\n");
+
+	ptr = (void *)__phys_to_virt(phys);
+	memset(ptr, 0, PAGE_SIZE);
+	return phys;
+}
 
-static int __init enable_crash_mem_map(char *arg)
+void __init map_crashkernel(void)
 {
-	/*
-	 * Proper parameter parsing is done by reserve_crashkernel(). We only
-	 * need to know if the linear map has to avoid block mappings so that
-	 * the crashkernel reservations can be unmapped later.
-	 */
-	crash_mem_map = true;
+	phys_addr_t start, end, size;
 
-	return 0;
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
+	    return;
+
+	if (!crashk_res.end)
+	    return;
+
+	start = crashk_res.start & PAGE_MASK;
+	end = PAGE_ALIGN(crashk_res.end);
+	size = end - start;
+
+	unmap_hotplug_range(__phys_to_virt(start), __phys_to_virt(end), false,
+			    NULL, PAGE_KERNEL, early_crashkernel_pgtable_alloc,
+			    NO_EXEC_MAPPINGS);
+	__create_pgd_mapping(swapper_pg_dir, crashk_res.start,
+			     __phys_to_virt(crashk_res.start),
+			     size, PAGE_KERNEL,
+			     early_crashkernel_pgtable_alloc,
+			     NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS |
+			     NO_CONT_MAPPINGS);
 }
-early_param("crashkernel", enable_crash_mem_map);
+#else
+void __init map_crashkernel(void) {}
+#endif
 
 static void __init map_mem(pgd_t *pgdp)
 {
@@ -527,17 +560,6 @@  static void __init map_mem(pgd_t *pgdp)
 	 */
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
 
-#ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map) {
-		if (IS_ENABLED(CONFIG_ZONE_DMA) ||
-		    IS_ENABLED(CONFIG_ZONE_DMA32))
-			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
-		else if (crashk_res.end)
-			memblock_mark_nomap(crashk_res.start,
-			    resource_size(&crashk_res));
-	}
-#endif
-
 	/* map all the memory banks */
 	for_each_mem_range(i, &start, &end) {
 		if (start >= end)
@@ -570,19 +592,6 @@  static void __init map_mem(pgd_t *pgdp)
 	 * in page granularity and put back unused memory to buddy system
 	 * through /sys/kernel/kexec_crash_size interface.
 	 */
-#ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map &&
-	    !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
-	}
-#endif
 }
 
 void mark_rodata_ro(void)
@@ -827,7 +836,6 @@  int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(pte));
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 static void free_hotplug_page_range(struct page *page, size_t size,
 				    struct vmem_altmap *altmap)
 {
@@ -863,9 +871,25 @@  static bool pgtable_range_aligned(unsigned long start, unsigned long end,
 	return true;
 }
 
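+/*
+ * If *ptep carries the contiguous hint, rewrite every entry of the naturally
+ * aligned CONT_PTES group without the hint, so that a single entry of the
+ * group can then be cleared safely.
+ */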
+static void pte_clear_cont(pte_t *ptep)
+{
+	int i = 0;
+	pte_t pte = READ_ONCE(*ptep);
+	if (pte_none(pte) || !pte_cont(pte))
+		return;
+	ptep -= ((u64)ptep / sizeof(pte_t)) &
+		((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1);
+	do {
+		pte = pte_mknoncont(READ_ONCE(*ptep));
+		set_pte(ptep, pte);
+	} while (++ptep, ++i < CONT_PTES);
+}
+
 static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	pte_t *ptep, pte;
 
@@ -876,6 +900,8 @@  static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pte_present(pte));
+
+		pte_clear_cont(ptep);
 		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 		if (free_mapped)
@@ -884,9 +910,26 @@  static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 	} while (addr += PAGE_SIZE, addr < end);
 }
 
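+/*
+ * PMD counterpart of pte_clear_cont(): drop the contiguous hint from every
+ * section entry in the naturally aligned CONT_PMDS group before one of them
+ * is cleared.
+ */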
+static void pmd_clear_cont(pmd_t *pmdp)
+{
+	int i = 0;
+	pmd_t pmd = READ_ONCE(*pmdp);
+	if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd))
+		return;
+	pmdp -= ((u64)pmdp / sizeof(pmd_t)) &
+		((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1);
+	do {
+		pmd = READ_ONCE(*pmdp);
+		pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd)));
+		set_pmd(pmdp, pmd);
+	} while (++pmdp, ++i < CONT_PMDS);
+}
+
 static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	pmd_t *pmdp, pmd;
@@ -900,6 +943,8 @@  static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 
 		WARN_ON(!pmd_present(pmd));
 		if (pmd_sect(pmd)) {
+			/* Drop the contiguous hint from the whole group first */
+			pmd_clear_cont(pmdp);
 			pmd_clear(pmdp);
 
 			/*
@@ -907,19 +952,36 @@  static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 			 * range is mapped with a single block entry.
 			 */
 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
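+			/*
+			 * The cleared block spanned a whole PMD-aligned
+			 * region; re-map, at page granularity, the leading
+			 * and trailing parts of it that are not covered by
+			 * [addr, next).
+			 */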
+			if (addr & ~PMD_MASK)
+				alloc_init_cont_pte(pmdp, addr & PMD_MASK,
+						    addr, __virt_to_phys(addr &
+						    PMD_MASK), prot,
+						    pgtable_alloc, flags);
+
+			if (next & ~PMD_MASK)
+				alloc_init_cont_pte(pmdp, next, ALIGN(next,
+						    PMD_SIZE),
+						    __virt_to_phys(next),
+						    prot, pgtable_alloc,
+						    flags);
+
 			if (free_mapped)
 				free_hotplug_page_range(pmd_page(pmd),
 							PMD_SIZE, altmap);
 			continue;
 		}
 		WARN_ON(!pmd_table(pmd));
-		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	pud_t *pudp, pud;
@@ -940,19 +1002,36 @@  static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 			 * range is mapped with a single block entry.
 			 */
 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
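+			/*
+			 * Mirror unmap_hotplug_pmd_range(): re-build mappings
+			 * for the parts of the cleared PUD block that lie
+			 * outside [addr, next).
+			 */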
+			if (addr & (~PUD_MASK))
+				alloc_init_cont_pmd(pudp, addr & PUD_MASK,
+						    addr, __virt_to_phys(addr &
+						    PUD_MASK), prot,
+						    pgtable_alloc, flags);
+
+			if (next & (~PUD_MASK))
+				alloc_init_cont_pmd(pudp, next,
+						    ALIGN(next, PUD_SIZE),
+						    __virt_to_phys(next),
+						    prot, pgtable_alloc,
+						    flags);
+
 			if (free_mapped)
 				free_hotplug_page_range(pud_page(pud),
 							PUD_SIZE, altmap);
 			continue;
 		}
 		WARN_ON(!pud_table(pud));
-		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	p4d_t *p4dp, p4d;
@@ -965,12 +1044,15 @@  static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!p4d_present(p4d));
-		unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_range(unsigned long addr, unsigned long end,
-				bool free_mapped, struct vmem_altmap *altmap)
+				bool free_mapped, struct vmem_altmap *altmap,
+				pgprot_t prot,
+				phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
 	pgd_t *pgdp, pgd;
@@ -991,7 +1073,8 @@  static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 			continue;
 
 		WARN_ON(!pgd_present(pgd));
-		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
+		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
@@ -1148,7 +1231,6 @@  static void free_empty_tables(unsigned long addr, unsigned long end,
 		free_empty_p4d_table(pgdp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 }
-#endif
 
 #if !ARM64_KERNEL_USES_PMD_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1210,7 +1292,7 @@  void vmemmap_free(unsigned long start, unsigned long end,
 {
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
-	unmap_hotplug_range(start, end, true, altmap);
+	unmap_hotplug_range(start, end, true, altmap, __pgprot(0), NULL, 0);
 	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -1474,7 +1556,7 @@  static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 	WARN_ON(pgdir != init_mm.pgd);
 	WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
 
-	unmap_hotplug_range(start, end, false, NULL);
+	unmap_hotplug_range(start, end, false, NULL, __pgprot(0), NULL, 0);
 	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
 }