Message ID | 1656777473-73887-1-git-send-email-guanghuifeng@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v4] arm64: mm: fix linear mem mapping access performance degradation |
On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > use non block/section mapping(for crashkernel requires to shrink the region > in page granularity). But it will degrade performance when doing larging > continuous mem access in kernel(memcpy/memmove, etc). Hmm. It seems a bit silly to me that we take special care to unmap the crashkernel from the linear map even when can_set_direct_map() is false, as we won't be protecting the main kernel at all! Why don't we just leave the crashkernel mapped if !can_set_direct_map() and then this problem just goes away? Will
On 2022/7/4 18:35, Will Deacon wrote: > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >> use non block/section mapping(for crashkernel requires to shrink the region >> in page granularity). But it will degrade performance when doing larging >> continuous mem access in kernel(memcpy/memmove, etc). > > Hmm. It seems a bit silly to me that we take special care to unmap the > crashkernel from the linear map even when can_set_direct_map() is false, as > we won't be protecting the main kernel at all! > > Why don't we just leave the crashkernel mapped if !can_set_direct_map() > and then this problem just goes away? > > Will

This question was asked last week.

1. Quoted from arch/arm64/mm/init.c:

"Memory reservation for crash kernel either done early or deferred depending on DMA memory zones configs (ZONE_DMA) -- In absence of ZONE_DMA configs arm64_dma_phys_limit initialized here instead of max_zone_phys(). This lets early reservation of crash kernel memory which has a dependency on arm64_dma_phys_limit. Reserving memory early for crash kernel allows linear creation of block mappings (greater than page-granularity) for all the memory bank rangs. In this scheme a comparatively quicker boot is observed. If ZONE_DMA configs are defined, crash kernel memory reservation is delayed until DMA zone memory range size initialization performed in zone_sizes_init(). The defer is necessary to steer clear of DMA zone memory range to avoid overlap allocation. [[[ So crash kernel memory boundaries are not known when mapping all bank memory ranges, which otherwise means not possible to exclude crash kernel range from creating block mappings so page-granularity mappings are created for the entire memory range. ]]]"

Namely, the init order is: memblock init ---> linear mem mapping (4K mappings for the crashkernel, which requires page-granularity changes) ---> zone dma limit ---> reserve crashkernel. So when ZONE_DMA is enabled and a crashkernel is used, the linear map uses 4K mappings.

2. As mentioned above, when the linear map simply uses 4K mappings, the dTLB miss rate is high and performance degrades. This patch uses block/section mappings as far as possible, which improves performance.

3. This patch reserves the crashkernel in the same order as before (ZONE_DMA, then crashkernel reservation), and only changes the linear mem mapping to block/section mappings. Init order: memblock init ---> linear mem mapping (block/section mappings for the linear map) ---> zone dma limit ---> reserve crashkernel ---> [[[only]]] rebuild 4K-page mappings for the crashkernel mem.

With this method, block/section mappings are used as far as possible.
On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: > > > 在 2022/7/4 18:35, Will Deacon 写道: > > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: > > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > > > use non block/section mapping(for crashkernel requires to shrink the region > > > in page granularity). But it will degrade performance when doing larging > > > continuous mem access in kernel(memcpy/memmove, etc). > > > > Hmm. It seems a bit silly to me that we take special care to unmap the > > crashkernel from the linear map even when can_set_direct_map() is false, as > > we won't be protecting the main kernel at all! > > > > Why don't we just leave the crashkernel mapped if !can_set_direct_map() > > and then this problem just goes away? > > > > Will > > This question had been asked lask week. Sorry, I didn't spot that. Please could you link me to the conversation, as I'm still unable to find it in my inbox? > 1.Quoted messages from arch/arm64/mm/init.c > > "Memory reservation for crash kernel either done early or deferred > depending on DMA memory zones configs (ZONE_DMA) -- > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > here instead of max_zone_phys(). This lets early reservation of > crash kernel memory which has a dependency on arm64_dma_phys_limit. > Reserving memory early for crash kernel allows linear creation of block > mappings (greater than page-granularity) for all the memory bank rangs. > In this scheme a comparatively quicker boot is observed. > > If ZONE_DMA configs are defined, crash kernel memory reservation > is delayed until DMA zone memory range size initialization performed in > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > memory range to avoid overlap allocation. 
> > [[[ > So crash kernel memory boundaries are not known when mapping all bank memory > ranges, which otherwise means not possible to exclude crash kernel range > from creating block mappings so page-granularity mappings are created for > the entire memory range. > ]]]" > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > crashkernel, requirinig page-granularity changing))--->zone dma > limit--->reserve crashkernel. > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > mapping. Yes, I understand that is how things work today but I'm saying that we may as well leave the crashkernel mapped (at block granularity) if !can_set_direct_map() and then I think your patch becomes a lot simpler. Will
On 2022/7/4 19:14, Will Deacon wrote: > On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: >> >> >> 在 2022/7/4 18:35, Will Deacon 写道: >>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >>>> use non block/section mapping(for crashkernel requires to shrink the region >>>> in page granularity). But it will degrade performance when doing larging >>>> continuous mem access in kernel(memcpy/memmove, etc). >>> >>> Hmm. It seems a bit silly to me that we take special care to unmap the >>> crashkernel from the linear map even when can_set_direct_map() is false, as >>> we won't be protecting the main kernel at all! >>> >>> Why don't we just leave the crashkernel mapped if !can_set_direct_map() >>> and then this problem just goes away? >>> >>> Will >> >> This question had been asked lask week. > > Sorry, I didn't spot that. Please could you link me to the conversation, as > I'm still unable to find it in my inbox? Please access this link: https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/ > >> 1.Quoted messages from arch/arm64/mm/init.c >> >> "Memory reservation for crash kernel either done early or deferred >> depending on DMA memory zones configs (ZONE_DMA) -- >> >> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >> here instead of max_zone_phys(). This lets early reservation of >> crash kernel memory which has a dependency on arm64_dma_phys_limit. >> Reserving memory early for crash kernel allows linear creation of block >> mappings (greater than page-granularity) for all the memory bank rangs. >> In this scheme a comparatively quicker boot is observed. >> >> If ZONE_DMA configs are defined, crash kernel memory reservation >> is delayed until DMA zone memory range size initialization performed in >> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >> memory range to avoid overlap allocation. 
The defer is necessary to steer clear of DMA zone >> memory range to avoid overlap allocation. >> >> [[[ >> So crash kernel memory boundaries are not known when mapping all bank memory >> ranges, which otherwise means not possible to exclude crash kernel range >> from creating block mappings so page-granularity mappings are created for >> the entire memory range. >> ]]]" >> >> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >> crashkernel, requirinig page-granularity changing))--->zone dma >> limit--->reserve crashkernel. >> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >> mapping. > > Yes, I understand that is how things work today but I'm saying that we may > as well leave the crashkernel mapped (at block granularity) if > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > Will

But page-granularity mappings are necessary for the crash kernel memory range so that its size can be shrunk via the /sys/kernel/kexec_crash_size interface (quoted from arch/arm64/mm/init.c). So this patch splits the block/section mappings into 4K page-granularity mappings for the crashkernel mem. Thanks.
On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: > > > 在 2022/7/4 19:14, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: > > > > > > > > > 在 2022/7/4 18:35, Will Deacon 写道: > > > > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: > > > > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > > > > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > > > > > use non block/section mapping(for crashkernel requires to shrink the region > > > > > in page granularity). But it will degrade performance when doing larging > > > > > continuous mem access in kernel(memcpy/memmove, etc). > > > > > > > > Hmm. It seems a bit silly to me that we take special care to unmap the > > > > crashkernel from the linear map even when can_set_direct_map() is false, as > > > > we won't be protecting the main kernel at all! > > > > > > > > Why don't we just leave the crashkernel mapped if !can_set_direct_map() > > > > and then this problem just goes away? > > > > > > This question had been asked lask week. > > > > Sorry, I didn't spot that. Please could you link me to the conversation, as > > I'm still unable to find it in my inbox? > > Please access this link: > https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/ Sorry, but I read through the thread and I still can't find where the possibility of leaving the crashkernel mapped was discussed. > > > 1.Quoted messages from arch/arm64/mm/init.c > > > > > > "Memory reservation for crash kernel either done early or deferred > > > depending on DMA memory zones configs (ZONE_DMA) -- > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > > > here instead of max_zone_phys(). This lets early reservation of > > > crash kernel memory which has a dependency on arm64_dma_phys_limit. 
> > > Reserving memory early for crash kernel allows linear creation of block > > > mappings (greater than page-granularity) for all the memory bank rangs. > > > In this scheme a comparatively quicker boot is observed. > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation > > > is delayed until DMA zone memory range size initialization performed in > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > > > memory range to avoid overlap allocation. > > > > > > [[[ > > > So crash kernel memory boundaries are not known when mapping all bank memory > > > ranges, which otherwise means not possible to exclude crash kernel range > > > from creating block mappings so page-granularity mappings are created for > > > the entire memory range. > > > ]]]" > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > > > crashkernel, requirinig page-granularity changing))--->zone dma > > > limit--->reserve crashkernel. > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > > > mapping. > > > > Yes, I understand that is how things work today but I'm saying that we may > > as well leave the crashkernel mapped (at block granularity) if > > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > But Page-granularity mapppings are necessary for crash kernel memory range > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from > arch/arm64/mm/init.c). > So this patch split block/section mapping to 4k page-granularity mapping for > crashkernel mem. Why? I don't see why the mapping granularity is relevant at all if we always leave the whole thing mapped. Will
Thanks. On 2022/7/4 21:15, Will Deacon wrote: > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >> >> >> 在 2022/7/4 19:14, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: >>>> >>>> >>>> 在 2022/7/4 18:35, Will Deacon 写道: >>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >>>>>> use non block/section mapping(for crashkernel requires to shrink the region >>>>>> in page granularity). But it will degrade performance when doing larging >>>>>> continuous mem access in kernel(memcpy/memmove, etc). >>>>> >>>>> Hmm. It seems a bit silly to me that we take special care to unmap the >>>>> crashkernel from the linear map even when can_set_direct_map() is false, as >>>>> we won't be protecting the main kernel at all! >>>>> >>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map() >>>>> and then this problem just goes away? >>>> >>>> This question had been asked lask week. >>> >>> Sorry, I didn't spot that. Please could you link me to the conversation, as >>> I'm still unable to find it in my inbox? >> >> Please access this link: >> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/ > > Sorry, but I read through the thread and I still can't find where the > possibility of leaving the crashkernel mapped was discussed. > >>>> 1.Quoted messages from arch/arm64/mm/init.c >>>> >>>> "Memory reservation for crash kernel either done early or deferred >>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>> >>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>> here instead of max_zone_phys(). This lets early reservation of >>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. 
>>>> Reserving memory early for crash kernel allows linear creation of block >>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>> In this scheme a comparatively quicker boot is observed. >>>> >>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>> is delayed until DMA zone memory range size initialization performed in >>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>> memory range to avoid overlap allocation. >>>> >>>> [[[ >>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>> ranges, which otherwise means not possible to exclude crash kernel range >>>> from creating block mappings so page-granularity mappings are created for >>>> the entire memory range. >>>> ]]]" >>>> >>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >>>> crashkernel, requirinig page-granularity changing))--->zone dma >>>> limit--->reserve crashkernel. >>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >>>> mapping. >>> >>> Yes, I understand that is how things work today but I'm saying that we may >>> as well leave the crashkernel mapped (at block granularity) if >>> !can_set_direct_map() and then I think your patch becomes a lot simpler. >> >> But Page-granularity mapppings are necessary for crash kernel memory range >> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from >> arch/arm64/mm/init.c). >> So this patch split block/section mapping to 4k page-granularity mapping for >> crashkernel mem. > > Why? I don't see why the mapping granularity is relevant at all if we > always leave the whole thing mapped. > > Will

I found commit 06a7f711246b081afc21fff859f1003f1f2a0fbc, which added /sys/kernel/kexec_crash_size for changing the crashkernel mem size: "Implement shrinking the reserved memory for crash kernel, if it is more than enough." 
Maybe we could use block/section mappings for the crashkernel mem, and split only the affected part of the crashkernel block/section mapping when shrinking (on a write to /sys/kernel/kexec_crash_size, handled by crash_shrink_memory and crash_free_reserved_phys_range). (Maybe there is no need to split all the crashkernel block/section mappings at boot time.) Thanks.
On 2022/7/4 21:15, Will Deacon wrote: > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >> >> >> 在 2022/7/4 19:14, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: >>>> >>>> >>>> 在 2022/7/4 18:35, Will Deacon 写道: >>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >>>>>> use non block/section mapping(for crashkernel requires to shrink the region >>>>>> in page granularity). But it will degrade performance when doing larging >>>>>> continuous mem access in kernel(memcpy/memmove, etc). >>>>> >>>>> Hmm. It seems a bit silly to me that we take special care to unmap the >>>>> crashkernel from the linear map even when can_set_direct_map() is false, as >>>>> we won't be protecting the main kernel at all! >>>>> >>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map() >>>>> and then this problem just goes away? >>>> >>>> This question had been asked lask week. >>> >>> Sorry, I didn't spot that. Please could you link me to the conversation, as >>> I'm still unable to find it in my inbox? >> >> Please access this link: >> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/ > > Sorry, but I read through the thread and I still can't find where the > possibility of leaving the crashkernel mapped was discussed. > >>>> 1.Quoted messages from arch/arm64/mm/init.c >>>> >>>> "Memory reservation for crash kernel either done early or deferred >>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>> >>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>> here instead of max_zone_phys(). This lets early reservation of >>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. 
>>>> Reserving memory early for crash kernel allows linear creation of block >>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>> In this scheme a comparatively quicker boot is observed. >>>> >>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>> is delayed until DMA zone memory range size initialization performed in >>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>> memory range to avoid overlap allocation. >>>> >>>> [[[ >>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>> ranges, which otherwise means not possible to exclude crash kernel range >>>> from creating block mappings so page-granularity mappings are created for >>>> the entire memory range. >>>> ]]]" >>>> >>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >>>> crashkernel, requirinig page-granularity changing))--->zone dma >>>> limit--->reserve crashkernel. >>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >>>> mapping. >>> >>> Yes, I understand that is how things work today but I'm saying that we may >>> as well leave the crashkernel mapped (at block granularity) if >>> !can_set_direct_map() and then I think your patch becomes a lot simpler. >> >> But Page-granularity mapppings are necessary for crash kernel memory range >> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from >> arch/arm64/mm/init.c). >> So this patch split block/section mapping to 4k page-granularity mapping for >> crashkernel mem. > > Why? I don't see why the mapping granularity is relevant at all if we > always leave the whole thing mapped. > > Will

There is another reason. When loading the crashkernel finishes, do_kexec_load uses arch_kexec_protect_crashkres to invalidate all the pagetable entries for the crashkernel mem (protecting the crashkernel mem from access). 
arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range

In apply_to_pmd_range there is a check: BUG_ON(pud_huge(*pud)). So if the crashkernel used block/section mappings, this would trigger an error. Namely, the crashkernel mem needs non-block/section mappings before shrinking. Thanks.
On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > 在 2022/7/4 21:15, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: > > > > > 1.Quoted messages from arch/arm64/mm/init.c > > > > > > > > > > "Memory reservation for crash kernel either done early or deferred > > > > > depending on DMA memory zones configs (ZONE_DMA) -- > > > > > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > > > > > here instead of max_zone_phys(). This lets early reservation of > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit. > > > > > Reserving memory early for crash kernel allows linear creation of block > > > > > mappings (greater than page-granularity) for all the memory bank rangs. > > > > > In this scheme a comparatively quicker boot is observed. > > > > > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation > > > > > is delayed until DMA zone memory range size initialization performed in > > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > > > > > memory range to avoid overlap allocation. > > > > > > > > > > [[[ > > > > > So crash kernel memory boundaries are not known when mapping all bank memory > > > > > ranges, which otherwise means not possible to exclude crash kernel range > > > > > from creating block mappings so page-granularity mappings are created for > > > > > the entire memory range. > > > > > ]]]" > > > > > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > > > > > crashkernel, requirinig page-granularity changing))--->zone dma > > > > > limit--->reserve crashkernel. > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > > > > > mapping. 
> > > > > > > > Yes, I understand that is how things work today but I'm saying that we may > > > > as well leave the crashkernel mapped (at block granularity) if > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > > > > > But Page-granularity mapppings are necessary for crash kernel memory range > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from > > > arch/arm64/mm/init.c). > > > So this patch split block/section mapping to 4k page-granularity mapping for > > > crashkernel mem. > > > > Why? I don't see why the mapping granularity is relevant at all if we > > always leave the whole thing mapped. > > > There is another reason. > > When loading crashkernel finish, the do_kexec_load will use > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel > mem(protect crashkernel mem from access). > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range > > In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And > if the crashkernel use block/section mapping, there will be some error. > > Namely, it's need to use non block/section mapping for crashkernel mem > before shringking. Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do that if we're leaving the thing mapped, no? Will
Thanks. On 2022/7/4 22:23, Will Deacon wrote: > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >> 在 2022/7/4 21:15, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >>>>>> 1.Quoted messages from arch/arm64/mm/init.c >>>>>> >>>>>> "Memory reservation for crash kernel either done early or deferred >>>>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>>>> >>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>>>> here instead of max_zone_phys(). This lets early reservation of >>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. >>>>>> Reserving memory early for crash kernel allows linear creation of block >>>>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>>>> In this scheme a comparatively quicker boot is observed. >>>>>> >>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>>>> is delayed until DMA zone memory range size initialization performed in >>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>>>> memory range to avoid overlap allocation. >>>>>> >>>>>> [[[ >>>>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>>>> ranges, which otherwise means not possible to exclude crash kernel range >>>>>> from creating block mappings so page-granularity mappings are created for >>>>>> the entire memory range. >>>>>> ]]]" >>>>>> >>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >>>>>> crashkernel, requirinig page-granularity changing))--->zone dma >>>>>> limit--->reserve crashkernel. >>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >>>>>> mapping. >>>>> >>>>> Yes, I understand that is how things work today but I'm saying that we may >>>>> as well leave the crashkernel mapped (at block granularity) if >>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler. 
>>>> >>>> But Page-granularity mapppings are necessary for crash kernel memory range >>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from >>>> arch/arm64/mm/init.c). >>>> So this patch split block/section mapping to 4k page-granularity mapping for >>>> crashkernel mem. >>> >>> Why? I don't see why the mapping granularity is relevant at all if we >>> always leave the whole thing mapped. >>> >> There is another reason. >> >> When loading crashkernel finish, the do_kexec_load will use >> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel >> mem(protect crashkernel mem from access). >> >> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range >> >> In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And >> if the crashkernel use block/section mapping, there will be some error. >> >> Namely, it's need to use non block/section mapping for crashkernel mem >> before shringking. > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > that if we're leaving the thing mapped, no? > > Will

I think we should keep using arch_kexec_[un]protect_crashkres for the crashkernel mem. When the crashkernel mem pagetable entries are invalidated, there is no chance of reading/writing the crashkernel mem by mistake.

If we don't use arch_kexec_[un]protect_crashkres to invalidate the crashkernel mem pagetable entries, there may be stray writes to this mem which could cause crashkernel boot errors and vmcore saving errors.

Can we change arch_kexec_[un]protect_crashkres to support block/section mappings? (But we would also need to remap when shrinking.) Thanks.
On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > Thanks. > > 在 2022/7/4 22:23, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > > > 在 2022/7/4 21:15, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: > > > > > > > 1.Quoted messages from arch/arm64/mm/init.c > > > > > > > > > > > > > > "Memory reservation for crash kernel either done early or deferred > > > > > > > depending on DMA memory zones configs (ZONE_DMA) -- > > > > > > > > > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > > > > > > > here instead of max_zone_phys(). This lets early reservation of > > > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit. > > > > > > > Reserving memory early for crash kernel allows linear creation of block > > > > > > > mappings (greater than page-granularity) for all the memory bank rangs. > > > > > > > In this scheme a comparatively quicker boot is observed. > > > > > > > > > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation > > > > > > > is delayed until DMA zone memory range size initialization performed in > > > > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > > > > > > > memory range to avoid overlap allocation. > > > > > > > > > > > > > > [[[ > > > > > > > So crash kernel memory boundaries are not known when mapping all bank memory > > > > > > > ranges, which otherwise means not possible to exclude crash kernel range > > > > > > > from creating block mappings so page-granularity mappings are created for > > > > > > > the entire memory range. > > > > > > > ]]]" > > > > > > > > > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > > > > > > > crashkernel, requirinig page-granularity changing))--->zone dma > > > > > > > limit--->reserve crashkernel. 
> > > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > > > > > > > mapping. > > > > > > > > > > > > Yes, I understand that is how things work today but I'm saying that we may > > > > > > as well leave the crashkernel mapped (at block granularity) if > > > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > > > > > > > > > But Page-granularity mapppings are necessary for crash kernel memory range > > > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from > > > > > arch/arm64/mm/init.c). > > > > > So this patch split block/section mapping to 4k page-granularity mapping for > > > > > crashkernel mem. > > > > > > > > Why? I don't see why the mapping granularity is relevant at all if we > > > > always leave the whole thing mapped. > > > > > > > There is another reason. > > > > > > When loading crashkernel finish, the do_kexec_load will use > > > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel > > > mem(protect crashkernel mem from access). > > > > > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range > > > > > > In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And > > > if the crashkernel use block/section mapping, there will be some error. > > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > before shringking. > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > that if we're leaving the thing mapped, no? > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > the crashkernel mem by mistake. > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > pagetable, there maybe some write operations to these mem by mistake which > may cause crashkernel boot error and vmcore saving error. 
I don't really buy this line of reasoning. The entire main kernel is writable, so why do we care about protecting the crashkernel so much? The _code_ to launch the crash kernel is writable! If you care about preventing writes to memory which should not be writable, then you should use rodata=full. Will
On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > Thanks. > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: ... > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > before shringking. > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > that if we're leaving the thing mapped, no? > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > the crashkernel mem by mistake. > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > pagetable, there maybe some write operations to these mem by mistake which > > may cause crashkernel boot error and vmcore saving error. > > I don't really buy this line of reasoning. The entire main kernel is > writable, so why do we care about protecting the crashkernel so much? The > _code_ to launch the crash kernel is writable! If you care about preventing > writes to memory which should not be writable, then you should use > rodata=full. > This is not entirely true - the core kernel text and rodata are remapped r/o in the linear map, whereas all module code and rodata are left writable when rodata != full. But the conclusion is the same, imo: if you can't be bothered to protect a good chunk of the code and rodata that the kernel relies on, why should the crashkernel be treated any differently?
在 2022/7/5 0:38, Will Deacon 写道: > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >> Thanks. >> >> 在 2022/7/4 22:23, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >>>> 在 2022/7/4 21:15, Will Deacon 写道: >>>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >>>>>>>> 1.Quoted messages from arch/arm64/mm/init.c >>>>>>>> >>>>>>>> "Memory reservation for crash kernel either done early or deferred >>>>>>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>>>>>> >>>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>>>>>> here instead of max_zone_phys(). This lets early reservation of >>>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. >>>>>>>> Reserving memory early for crash kernel allows linear creation of block >>>>>>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>>>>>> In this scheme a comparatively quicker boot is observed. >>>>>>>> >>>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>>>>>> is delayed until DMA zone memory range size initialization performed in >>>>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>>>>>> memory range to avoid overlap allocation. >>>>>>>> >>>>>>>> [[[ >>>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>>>>>> ranges, which otherwise means not possible to exclude crash kernel range >>>>>>>> from creating block mappings so page-granularity mappings are created for >>>>>>>> the entire memory range. >>>>>>>> ]]]" >>>>>>>> >>>>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >>>>>>>> crashkernel, requirinig page-granularity changing))--->zone dma >>>>>>>> limit--->reserve crashkernel. >>>>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >>>>>>>> mapping. 
>>>>>>> >>>>>>> Yes, I understand that is how things work today but I'm saying that we may >>>>>>> as well leave the crashkernel mapped (at block granularity) if >>>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler. >>>>>> >>>>>> But Page-granularity mapppings are necessary for crash kernel memory range >>>>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from >>>>>> arch/arm64/mm/init.c). >>>>>> So this patch split block/section mapping to 4k page-granularity mapping for >>>>>> crashkernel mem. >>>>> >>>>> Why? I don't see why the mapping granularity is relevant at all if we >>>>> always leave the whole thing mapped. >>>>> >>>> There is another reason. >>>> >>>> When loading crashkernel finish, the do_kexec_load will use >>>> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel >>>> mem(protect crashkernel mem from access). >>>> >>>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range >>>> >>>> In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And >>>> if the crashkernel use block/section mapping, there will be some error. >>>> >>>> Namely, it's need to use non block/section mapping for crashkernel mem >>>> before shringking. >>> >>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>> that if we're leaving the thing mapped, no? >>> >> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >> >> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr >> the crashkernel mem by mistake. >> >> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem >> pagetable, there maybe some write operations to these mem by mistake which >> may cause crashkernel boot error and vmcore saving error. > > I don't really buy this line of reasoning. The entire main kernel is > writable, so why do we care about protecting the crashkernel so much? 
The > _code_ to launch the crash kernel is writable! If you care about preventing > writes to memory which should not be writable, then you should use > rodata=full. > > Will Thanks. 1. I think the normal kernel mem can be writable or read-only. 2. But the normal kernel shouldn't access (read/write) crashkernel mem after the crashkernel is loaded into it (despite the normal kernel's own write/read access protection). So invalidating the pagetable for crashkernel mem is needed. With this method, it can guarantee the usability of the crashkernel when an emergency occurs and a vmcore is generated.
On 07/04/22 at 07:09pm, Ard Biesheuvel wrote: > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > Thanks. > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > ... > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > before shringking. > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > that if we're leaving the thing mapped, no? > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > the crashkernel mem by mistake. > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > pagetable, there maybe some write operations to these mem by mistake which > > > may cause crashkernel boot error and vmcore saving error. > > > > I don't really buy this line of reasoning. The entire main kernel is > > writable, so why do we care about protecting the crashkernel so much? The > > _code_ to launch the crash kernel is writable! If you care about preventing > > writes to memory which should not be writable, then you should use > > rodata=full. > > > > This is not entirely true - the core kernel text and rodata are > remapped r/o in the linear map, whereas all module code and rodata are > left writable when rodata != full. > > But the conclusion is the same, imo: if you can't be bothered to > protect a good chunk of the code and rodata that the kernel relies on, > why should the crashkernel be treated any differently? Kernel text and rodata are remapped r/o in linear map, whereas module code and rodata are left writable, it's different concept than crashkernel region being mapped r/o. 
If it's technically doable to remap module code and rodata r/o, and stamping into those regions will corrupt the entire system, we should do it too. However, kdump is a system error diagnosing mechanism which is very important and helpful on servers and in some application scenarios, e.g. cloud. Stamping into the crashkernel region will make it useless. I am not against removing the arch_kexec_[un]protect_crashkres on arm64. It is a balance: protecting the crashkernel region causes severe performance degradation. This is always felt, since we usually don't specify rodata=full or enable kfence. Taking off the protection of the crashkernel region improves performance very much, while wrong code may stamp into the crashkernel region and fail kdump. That could happen one in a million. Once it happens, it's a nightmare for kernel developers.
On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > Thanks. > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > ... > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > before shringking. > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > that if we're leaving the thing mapped, no? > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > the crashkernel mem by mistake. > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > pagetable, there maybe some write operations to these mem by mistake which > > > may cause crashkernel boot error and vmcore saving error. > > > > I don't really buy this line of reasoning. The entire main kernel is > > writable, so why do we care about protecting the crashkernel so much? The > > _code_ to launch the crash kernel is writable! If you care about preventing > > writes to memory which should not be writable, then you should use > > rodata=full. > > > > This is not entirely true - the core kernel text and rodata are > remapped r/o in the linear map, whereas all module code and rodata are > left writable when rodata != full. Yes, sorry, you're quite right. The kernel text is only writable if rodata=off. But I still think it makes sense to protect the crashkernel only if rodata=full (which is the default on arm64) as this allows us to rely on page mappings and I think fits well with what we do for modules. 
> But the conclusion is the same, imo: if you can't be bothered to > protect a good chunk of the code and rodata that the kernel relies on, > why should the crashkernel be treated any differently? Thanks. Will
在 2022/7/5 17:52, Will Deacon 写道: > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: >> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: >>> >>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >>>> Thanks. >>>> >>>> 在 2022/7/4 22:23, Will Deacon 写道: >>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >> ... >>>>>> Namely, it's need to use non block/section mapping for crashkernel mem >>>>>> before shringking. >>>>> >>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>>>> that if we're leaving the thing mapped, no? >>>>> >>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >>>> >>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr >>>> the crashkernel mem by mistake. >>>> >>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem >>>> pagetable, there maybe some write operations to these mem by mistake which >>>> may cause crashkernel boot error and vmcore saving error. >>> >>> I don't really buy this line of reasoning. The entire main kernel is >>> writable, so why do we care about protecting the crashkernel so much? The >>> _code_ to launch the crash kernel is writable! If you care about preventing >>> writes to memory which should not be writable, then you should use >>> rodata=full. >>> >> >> This is not entirely true - the core kernel text and rodata are >> remapped r/o in the linear map, whereas all module code and rodata are >> left writable when rodata != full. > > Yes, sorry, you're quite right. The kernel text is only writable if > rodata=off. > > But I still think it makes sense to protect the crashkernel only if > rodata=full (which is the default on arm64) as this allows us to rely > on page mappings and I think fits well with what we do for modules. 
> >> But the conclusion is the same, imo: if you can't be bothered to >> protect a good chunk of the code and rodata that the kernel relies on, >> why should the crashkernel be treated any differently? > > Thanks. > > Will Thanks. 1. rodata=full harms performance and has been disabled in-house. 2. When using crashkernel with rodata non-full, the kernel will also use non-block/section mappings, which cause high d-TLB miss rates and degrade performance greatly. This patch fixes it to use block/section mappings as far as possible. bool can_set_direct_map(void) { return rodata_full || debug_pagealloc_enabled(); } map_mem: if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; 3. When rodata=full is disabled, the crashkernel still needs protection (keep using arch_kexec_[un]protect_crashkres). I think the crashkernel shouldn't depend on rodata=full (maybe other architectures don't support rodata=full now). Thanks.
On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > 在 2022/7/5 17:52, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: > > > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > > > Thanks. > > > > > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > > > ... > > > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > > > before shringking. > > > > > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > > > that if we're leaving the thing mapped, no? > > > > > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > > > the crashkernel mem by mistake. > > > > > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > > > pagetable, there maybe some write operations to these mem by mistake which > > > > > may cause crashkernel boot error and vmcore saving error. > > > > > > > > I don't really buy this line of reasoning. The entire main kernel is > > > > writable, so why do we care about protecting the crashkernel so much? The > > > > _code_ to launch the crash kernel is writable! If you care about preventing > > > > writes to memory which should not be writable, then you should use > > > > rodata=full. > > > > > > > > > > This is not entirely true - the core kernel text and rodata are > > > remapped r/o in the linear map, whereas all module code and rodata are > > > left writable when rodata != full. > > > > Yes, sorry, you're quite right. The kernel text is only writable if > > rodata=off. 
> > > > But I still think it makes sense to protect the crashkernel only if > > rodata=full (which is the default on arm64) as this allows us to rely > > on page mappings and I think fits well with what we do for modules. > > > > > But the conclusion is the same, imo: if you can't be bothered to > > > protect a good chunk of the code and rodata that the kernel relies on, > > > why should the crashkernel be treated any differently? > > > > Thanks. > > > > Will > Thanks. > > 1.The rodata full is harm to the performance and has been disabled in-house. > > 2.When using crashkernel with rodata non full, the kernel also will use non > block/section mapping which cause high d-TLB miss and degrade performance > greatly. > This patch fix it to use block/section mapping as far as possible. > > bool can_set_direct_map(void) > { > return rodata_full || debug_pagealloc_enabled(); > } > > map_mem: > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > > 3.When rodata full is disabled, crashkernel also need protect(keep > arch_kexec_[un]protect_crashkres using). > I think crashkernel should't depend on radata full(Maybe other architecture > don't support radata full now). I think this is going round in circles :/ As a first step, can we please leave the crashkernel mapped unless rodata=full? It should be a much simpler patch to write, review and maintain and it gives you the performance you want when crashkernel is being used. Will
在 2022/7/5 20:11, Will Deacon 写道: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: >> >> >> 在 2022/7/5 17:52, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: >>>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: >>>>> >>>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >>>>>> Thanks. >>>>>> >>>>>> 在 2022/7/4 22:23, Will Deacon 写道: >>>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >>>> ... >>>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem >>>>>>>> before shringking. >>>>>>> >>>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>>>>>> that if we're leaving the thing mapped, no? >>>>>>> >>>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >>>>>> >>>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr >>>>>> the crashkernel mem by mistake. >>>>>> >>>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem >>>>>> pagetable, there maybe some write operations to these mem by mistake which >>>>>> may cause crashkernel boot error and vmcore saving error. >>>>> >>>>> I don't really buy this line of reasoning. The entire main kernel is >>>>> writable, so why do we care about protecting the crashkernel so much? The >>>>> _code_ to launch the crash kernel is writable! If you care about preventing >>>>> writes to memory which should not be writable, then you should use >>>>> rodata=full. >>>>> >>>> >>>> This is not entirely true - the core kernel text and rodata are >>>> remapped r/o in the linear map, whereas all module code and rodata are >>>> left writable when rodata != full. >>> >>> Yes, sorry, you're quite right. The kernel text is only writable if >>> rodata=off. 
>>> >>> But I still think it makes sense to protect the crashkernel only if >>> rodata=full (which is the default on arm64) as this allows us to rely >>> on page mappings and I think fits well with what we do for modules. >>> >>>> But the conclusion is the same, imo: if you can't be bothered to >>>> protect a good chunk of the code and rodata that the kernel relies on, >>>> why should the crashkernel be treated any differently? >>> >>> Thanks. >>> >>> Will >> Thanks. >> >> 1.The rodata full is harm to the performance and has been disabled in-house. >> >> 2.When using crashkernel with rodata non full, the kernel also will use non >> block/section mapping which cause high d-TLB miss and degrade performance >> greatly. >> This patch fix it to use block/section mapping as far as possible. >> >> bool can_set_direct_map(void) >> { >> return rodata_full || debug_pagealloc_enabled(); >> } >> >> map_mem: >> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; >> >> 3.When rodata full is disabled, crashkernel also need protect(keep >> arch_kexec_[un]protect_crashkres using). >> I think crashkernel should't depend on radata full(Maybe other architecture >> don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. > > Will Thanks. There is a circle. 1.When the rodata is non full, there will be some error when calling arch_kexec_[un]protect_crashkres(BUG_ON(pud_huge(*pud))) now. It's also need non-block/section mapping for crashkernel mem. 2.In other words, maybe we should change arch_kexec_[un]protect_crashkres to support block/section mapping which can leave crashkernel block/section mapping. 
But when we shrink the crashkernel mem, we also need to split some block/section mappings (part of the mem stays with the crashkernel, the rest goes back to the normal kernel). As a result, building the crashkernel mem with non-block/section mappings may be the appropriate choice (as this patch does).
On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > > 1.The rodata full is harm to the performance and has been disabled in-house. > > > > 2.When using crashkernel with rodata non full, the kernel also will use non > > block/section mapping which cause high d-TLB miss and degrade performance > > greatly. > > This patch fix it to use block/section mapping as far as possible. > > > > bool can_set_direct_map(void) > > { > > return rodata_full || debug_pagealloc_enabled(); > > } > > > > map_mem: > > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > > > > 3.When rodata full is disabled, crashkernel also need protect(keep > > arch_kexec_[un]protect_crashkres using). > > I think crashkernel should't depend on radata full(Maybe other architecture > > don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. Since we are talking about large systems, what do you think about letting them set CRASH_ALIGN to PUD_SIZE, then unmap(crashkernel); __create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS); should be enough to make crash kernel mapped with base pages. > Will
On 2022/7/5 20:56, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: >> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: >>> >>> 1.The rodata full is harm to the performance and has been disabled in-house. >>> >>> 2.When using crashkernel with rodata non full, the kernel also will use non >>> block/section mapping which cause high d-TLB miss and degrade performance >>> greatly. >>> This patch fix it to use block/section mapping as far as possible. >>> >>> bool can_set_direct_map(void) >>> { >>> return rodata_full || debug_pagealloc_enabled(); >>> } >>> >>> map_mem: >>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; >>> >>> 3.When rodata full is disabled, crashkernel also need protect(keep >>> arch_kexec_[un]protect_crashkres using). >>> I think crashkernel should't depend on radata full(Maybe other architecture >>> don't support radata full now). >> >> I think this is going round in circles :/ >> >> As a first step, can we please leave the crashkernel mapped unless >> rodata=full? It should be a much simpler patch to write, review and maintain >> and it gives you the performance you want when crashkernel is being used. > > Since we are talking about large systems, what do you think about letting > them set CRASH_ALIGN to PUD_SIZE, then > > unmap(crashkernel); > __create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS); > > should be enough to make crash kernel mapped with base pages. > >> Will > Thanks. 1. When the kernel boots with crashkernel, the crashkernel parameter format is: 0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M, which self-adapts across multiple systems. 2. As mentioned above, the crashkernel mem size may be less than PUD_SIZE (not a multiple of PUD_SIZE). So maybe we also need to use some non-block/section mappings.
On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > > 3.When rodata full is disabled, crashkernel also need protect(keep > > arch_kexec_[un]protect_crashkres using). > > I think crashkernel should't depend on radata full(Maybe other architecture > > don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. As it seems I failed to communicate my thoughts about reusing the existing unmap_hotplug_range() to remap the crash kernel, let's try a more formal approach ;-) This is what I came up with and it does not look too complex. There are a couple of extra #ifdefs that can be removed if we toss some code around in a preparation patch. From 5adbfcbe370da0f09cd917e73aaac7ba8c6b45df Mon Sep 17 00:00:00 2001 From: Mike Rapoport <rppt@linux.ibm.com> Date: Sat, 2 Jul 2022 23:57:53 +0800 Subject: [PATCH] arm64/mm: remap crash kernel with base pages even if rodata_full disabled For server systems it is important to protect crash kernel memory for post-mortem analysis. In order to protect this memory it should be mapped at PTE level. When CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled, usage of crash kernel essentially forces mapping of the entire linear map with base pages even if rodata_full is not set (commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required")) and this causes performance degradation. To reduce the performance degradation, postpone reservation of the crash kernel memory to bootmem_init() regardless of CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 and enable remapping of the crash kernel memory at PTE level. 
Co-developed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> --- arch/arm64/include/asm/mmu.h | 1 + arch/arm64/mm/init.c | 8 +--- arch/arm64/mm/mmu.c | 91 +++++++++++++++++++----------------- 3 files changed, 52 insertions(+), 48 deletions(-) diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index 48f8466a4be9..f4eb2f61dd0d 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot); extern void mark_linear_text_alias_ro(void); extern bool kaslr_requires_kpti(void); +extern void remap_crashkernel(void); #define INIT_MM_CONTEXT(name) \ .pgd = init_pg_dir, diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 339ee84e5a61..51f8329931f8 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; insert_resource(&iomem_resource, &crashk_res); + remap_crashkernel(); } /* @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) } early_init_fdt_scan_reserved_mem(); - - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); - high_memory = __va(memblock_end_of_DRAM() - 1) + 1; } @@ -438,8 +435,7 @@ void __init bootmem_init(void) * request_standard_resources() depends on crashkernel's memory being * reserved, so do it here. 
*/ - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); + reserve_crashkernel(); memblock_dump_all(); } diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 626ec32873c6..e0b5769bfc9f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -483,21 +483,6 @@ void __init mark_linear_text_alias_ro(void) PAGE_KERNEL_RO); } -static bool crash_mem_map __initdata; - -static int __init enable_crash_mem_map(char *arg) -{ - /* - * Proper parameter parsing is done by reserve_crashkernel(). We only - * need to know if the linear map has to avoid block mappings so that - * the crashkernel reservations can be unmapped later. - */ - crash_mem_map = true; - - return 0; -} -early_param("crashkernel", enable_crash_mem_map); - static void __init map_mem(pgd_t *pgdp) { static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN); @@ -527,17 +512,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) @@ -570,19 +544,6 @@ static void __init map_mem(pgd_t *pgdp) * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface. 
*/ -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map && - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { - if (crashk_res.end) { - __map_memblock(pgdp, crashk_res.start, - crashk_res.end + 1, - PAGE_KERNEL, - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); - memblock_clear_nomap(crashk_res.start, - resource_size(&crashk_res)); - } - } -#endif } void mark_rodata_ro(void) @@ -827,7 +788,7 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } -#ifdef CONFIG_MEMORY_HOTPLUG +#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_KEXEC_CORE) static void free_hotplug_page_range(struct page *page, size_t size, struct vmem_altmap *altmap) { @@ -839,6 +800,7 @@ static void free_hotplug_page_range(struct page *page, size_t size, } } +#ifdef CONFIG_MEMORY_HOTPLUG static void free_hotplug_pgtable_page(struct page *page) { free_hotplug_page_range(page, PAGE_SIZE, NULL); @@ -862,6 +824,7 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end, return false; return true; } +#endif /* CONFIG_MEMORY_HOTPLUG */ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, unsigned long end, bool free_mapped, @@ -994,7 +957,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end, unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); } while (addr = next, addr < end); } +#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */ +#ifdef CONFIG_MEMORY_HOTPLUG static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) @@ -1148,7 +1113,7 @@ static void free_empty_tables(unsigned long addr, unsigned long end, free_empty_p4d_table(pgdp, addr, next, floor, ceiling); } while (addr = next, addr < end); } -#endif +#endif /* CONFIG_MEMORY_HOTPLUG */ #if !ARM64_KERNEL_USES_PMD_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -1213,7 +1178,7 @@ void vmemmap_free(unsigned long start, unsigned long end, 
unmap_hotplug_range(start, end, true, altmap); free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); } -#endif /* CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */ static inline pud_t *fixmap_pud(unsigned long addr) { @@ -1677,3 +1642,45 @@ static int __init prevent_bootmem_remove_init(void) } early_initcall(prevent_bootmem_remove_init); #endif + +void __init remap_crashkernel(void) +{ +#ifdef CONFIG_KEXEC_CORE + phys_addr_t start, end, size; + phys_addr_t aligned_start, aligned_end; + + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; + + if (!crashk_res.end) + return; + + start = crashk_res.start & PAGE_MASK; + end = PAGE_ALIGN(crashk_res.end); + + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); + aligned_end = ALIGN(end, PUD_SIZE); + + /* Clear PUDs containing crash kernel memory */ + unmap_hotplug_range(__phys_to_virt(aligned_start), + __phys_to_virt(aligned_end), false, NULL); + + /* map area from PUD start to start of crash kernel with large pages */ + size = start - aligned_start; + __create_pgd_mapping(swapper_pg_dir, aligned_start, + __phys_to_virt(aligned_start), + size, PAGE_KERNEL, early_pgtable_alloc, 0); + + /* map crash kernel memory with base pages */ + size = end - start; + __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), + size, PAGE_KERNEL, early_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS | + NO_CONT_MAPPINGS); + + /* map area from end of crash kernel to PUD end with large pages */ + size = aligned_end - end; + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), + size, PAGE_KERNEL, early_pgtable_alloc, 0); +#endif +}
On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > +void __init remap_crashkernel(void) > +{ > +#ifdef CONFIG_KEXEC_CORE > + phys_addr_t start, end, size; > + phys_addr_t aligned_start, aligned_end; > + > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > + > + if (!crashk_res.end) > + return; > + > + start = crashk_res.start & PAGE_MASK; > + end = PAGE_ALIGN(crashk_res.end); > + > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > + aligned_end = ALIGN(end, PUD_SIZE); > + > + /* Clear PUDs containing crash kernel memory */ > + unmap_hotplug_range(__phys_to_virt(aligned_start), > + __phys_to_virt(aligned_end), false, NULL); What I don't understand is what happens if there's valid kernel data between aligned_start and crashk_res.start (or the other end of the range).
On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > +void __init remap_crashkernel(void) > > +{ > > +#ifdef CONFIG_KEXEC_CORE > > + phys_addr_t start, end, size; > > + phys_addr_t aligned_start, aligned_end; > > + > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > + return; > > + > > + if (!crashk_res.end) > > + return; > > + > > + start = crashk_res.start & PAGE_MASK; > > + end = PAGE_ALIGN(crashk_res.end); > > + > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > + aligned_end = ALIGN(end, PUD_SIZE); > > + > > + /* Clear PUDs containing crash kernel memory */ > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > + __phys_to_virt(aligned_end), false, NULL); > > What I don't understand is what happens if there's valid kernel data > between aligned_start and crashk_res.start (or the other end of the > range). Data shouldn't go anywhere :) There is + /* map area from PUD start to start of crash kernel with large pages */ + size = start - aligned_start; + __create_pgd_mapping(swapper_pg_dir, aligned_start, + __phys_to_virt(aligned_start), + size, PAGE_KERNEL, early_pgtable_alloc, 0); and + /* map area from end of crash kernel to PUD end with large pages */ + size = aligned_end - end; + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), + size, PAGE_KERNEL, early_pgtable_alloc, 0); after the unmap, so after we tear down a part of a linear map we immediately recreate it, just with a different page size. This all happens before SMP, so there is no concurrency at that point. > -- > Catalin
On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > +void __init remap_crashkernel(void) > > > +{ > > > +#ifdef CONFIG_KEXEC_CORE > > > + phys_addr_t start, end, size; > > > + phys_addr_t aligned_start, aligned_end; > > > + > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > + return; > > > + > > > + if (!crashk_res.end) > > > + return; > > > + > > > + start = crashk_res.start & PAGE_MASK; > > > + end = PAGE_ALIGN(crashk_res.end); > > > + > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > + > > > + /* Clear PUDs containing crash kernel memory */ > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > + __phys_to_virt(aligned_end), false, NULL); > > > > What I don't understand is what happens if there's valid kernel data > > between aligned_start and crashk_res.start (or the other end of the > > range). > > Data shouldn't go anywhere :) > > There is > > + /* map area from PUD start to start of crash kernel with large pages */ > + size = start - aligned_start; > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > + __phys_to_virt(aligned_start), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > and > > + /* map area from end of crash kernel to PUD end with large pages */ > + size = aligned_end - end; > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > after the unmap, so after we tear down a part of a linear map we > immediately recreate it, just with a different page size. > > This all happens before SMP, so there is no concurrency at that point. That brief period of unmap worries me. 
The kernel text, data and stack are all in the vmalloc space but any other (memblock) allocation to this point may be in the unmapped range before and after the crashkernel reservation. The interrupts are off, so I think the only allocation and potential access that may go in this range is the page table itself. But it looks fragile to me.
On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > +void __init remap_crashkernel(void) > > > > +{ > > > > +#ifdef CONFIG_KEXEC_CORE > > > > + phys_addr_t start, end, size; > > > > + phys_addr_t aligned_start, aligned_end; > > > > + > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > + return; > > > > + > > > > + if (!crashk_res.end) > > > > + return; > > > > + > > > > + start = crashk_res.start & PAGE_MASK; > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > + > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > + > > > > + /* Clear PUDs containing crash kernel memory */ > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > What I don't understand is what happens if there's valid kernel data > > > between aligned_start and crashk_res.start (or the other end of the > > > range). > > > > Data shouldn't go anywhere :) > > > > There is > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > + size = start - aligned_start; > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > + __phys_to_virt(aligned_start), > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > and > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > + size = aligned_end - end; > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > after the unmap, so after we tear down a part of a linear map we > > immediately recreate it, just with a different page size. > > > > This all happens before SMP, so there is no concurrency at that point. > > That brief period of unmap worries me. 
The kernel text, data and stack > are all in the vmalloc space but any other (memblock) allocation to this > point may be in the unmapped range before and after the crashkernel > reservation. The interrupts are off, so I think the only allocation and > potential access that may go in this range is the page table itself. But > it looks fragile to me. I agree there are chances there will be an allocation from the unmapped range. We can make sure this won't happen, though. We can cap the memblock allocations with memblock_set_current_limit(aligned_end) or memblock_reserve(aligned_start, aligned_end) until the mappings are restored. > -- > Catalin
Thanks. 在 2022/7/6 4:45, Mike Rapoport 写道: > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>> +void __init remap_crashkernel(void) >>>>> +{ >>>>> +#ifdef CONFIG_KEXEC_CORE >>>>> + phys_addr_t start, end, size; >>>>> + phys_addr_t aligned_start, aligned_end; >>>>> + >>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>> + return; >>>>> + >>>>> + if (!crashk_res.end) >>>>> + return; >>>>> + >>>>> + start = crashk_res.start & PAGE_MASK; >>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>> + >>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>> + >>>>> + /* Clear PUDs containing crash kernel memory */ >>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>> + __phys_to_virt(aligned_end), false, NULL); >>>> >>>> What I don't understand is what happens if there's valid kernel data >>>> between aligned_start and crashk_res.start (or the other end of the >>>> range). >>> >>> Data shouldn't go anywhere :) >>> >>> There is >>> >>> + /* map area from PUD start to start of crash kernel with large pages */ >>> + size = start - aligned_start; >>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>> + __phys_to_virt(aligned_start), >>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>> >>> and >>> >>> + /* map area from end of crash kernel to PUD end with large pages */ >>> + size = aligned_end - end; >>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>> >>> after the unmap, so after we tear down a part of a linear map we >>> immediately recreate it, just with a different page size. >>> >>> This all happens before SMP, so there is no concurrency at that point. >> >> That brief period of unmap worries me. 
The kernel text, data and stack >> are all in the vmalloc space but any other (memblock) allocation to this >> point may be in the unmapped range before and after the crashkernel >> reservation. The interrupts are off, so I think the only allocation and >> potential access that may go in this range is the page table itself. But >> it looks fragile to me. > > I agree there are chances there will be an allocation from the unmapped > range. > > We can make sure this won't happen, though. We can cap the memblock > allocations with memblock_set_current_limit(aligned_end) or > memblock_reserve(algined_start, aligned_end) until the mappings are > restored. > >> -- >> Catalin > I think there is no need to worry about vmalloc memory. 1. As mentioned above, when reserving the crashkernel and remapping the linear mapping, only the boot CPU is running; there is no other CPU/thread running at the same time. 2. Although vmalloc may allocate memory from the unmapped area, we rebuild the mapping with pte-level entries that keep each virtual address mapped to the same physical address (and in the meantime no other CPU/thread accesses vmalloc memory). As a result, it has no effect on vmalloc memory.
On Wed, Jul 06, 2022 at 10:49:43AM +0800, guanghui.fgh wrote: > 在 2022/7/6 4:45, Mike Rapoport 写道: > > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > > +void __init remap_crashkernel(void) > > > > > > +{ > > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > > + phys_addr_t start, end, size; > > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > > + > > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > > + return; > > > > > > + > > > > > > + if (!crashk_res.end) > > > > > > + return; > > > > > > + > > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > > + > > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > > + > > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > > between aligned_start and crashk_res.start (or the other end of the > > > > > range). 
> > > > > > > > Data shouldn't go anywhere :) > > > > > > > > There is > > > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > > + size = start - aligned_start; > > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > > + __phys_to_virt(aligned_start), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > and > > > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > > + size = aligned_end - end; > > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > after the unmap, so after we tear down a part of a linear map we > > > > immediately recreate it, just with a different page size. > > > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > > > That brief period of unmap worries me. The kernel text, data and stack > > > are all in the vmalloc space but any other (memblock) allocation to this > > > point may be in the unmapped range before and after the crashkernel > > > reservation. The interrupts are off, so I think the only allocation and > > > potential access that may go in this range is the page table itself. But > > > it looks fragile to me. > > > > I agree there are chances there will be an allocation from the unmapped > > range. > > > > We can make sure this won't happen, though. We can cap the memblock > > allocations with memblock_set_current_limit(aligned_end) or > > memblock_reserve(algined_start, aligned_end) until the mappings are > > restored. > > I think there is no need to worry about vmalloc mem. That's not what I'm worried about. It's about memblock allocations that are accessed through the linear map.
On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > +void __init remap_crashkernel(void) > > > > > +{ > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > + phys_addr_t start, end, size; > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > + > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > + return; > > > > > + > > > > > + if (!crashk_res.end) > > > > > + return; > > > > > + > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > + > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > + > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > between aligned_start and crashk_res.start (or the other end of the > > > > range). 
> > > > > > Data shouldn't go anywhere :) > > > > > > There is > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > + size = start - aligned_start; > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > + __phys_to_virt(aligned_start), > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > and > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > + size = aligned_end - end; > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > after the unmap, so after we tear down a part of a linear map we > > > immediately recreate it, just with a different page size. > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > That brief period of unmap worries me. The kernel text, data and stack > > are all in the vmalloc space but any other (memblock) allocation to this > > point may be in the unmapped range before and after the crashkernel > > reservation. The interrupts are off, so I think the only allocation and > > potential access that may go in this range is the page table itself. But > > it looks fragile to me. > > I agree there are chances there will be an allocation from the unmapped > range. > > We can make sure this won't happen, though. We can cap the memblock > allocations with memblock_set_current_limit(aligned_end) or > memblock_reserve(algined_start, aligned_end) until the mappings are > restored. We can reserve the region just before unmapping to avoid new allocations for the page tables but we can't do much about pages already allocated prior to calling remap_crashkernel().
On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: > > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > > +void __init remap_crashkernel(void) > > > > > > +{ > > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > > + phys_addr_t start, end, size; > > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > > + > > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > > + return; > > > > > > + > > > > > > + if (!crashk_res.end) > > > > > > + return; > > > > > > + > > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > > + > > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > > + > > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > > between aligned_start and crashk_res.start (or the other end of the > > > > > range). 
> > > > > > > > Data shouldn't go anywhere :) > > > > > > > > There is > > > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > > + size = start - aligned_start; > > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > > + __phys_to_virt(aligned_start), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > and > > > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > > + size = aligned_end - end; > > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > after the unmap, so after we tear down a part of a linear map we > > > > immediately recreate it, just with a different page size. > > > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > > > That brief period of unmap worries me. The kernel text, data and stack > > > are all in the vmalloc space but any other (memblock) allocation to this > > > point may be in the unmapped range before and after the crashkernel > > > reservation. The interrupts are off, so I think the only allocation and > > > potential access that may go in this range is the page table itself. But > > > it looks fragile to me. > > > > I agree there are chances there will be an allocation from the unmapped > > range. > > > > We can make sure this won't happen, though. We can cap the memblock > > allocations with memblock_set_current_limit(aligned_end) or > > memblock_reserve(algined_start, aligned_end) until the mappings are > > restored. > > We can reserve the region just before unmapping to avoid new allocations > for the page tables but we can't do much about pages already allocated > prior to calling remap_crashkernel(). Right, this was bothering me too after I re-read you previous email. 
One thing I can think of is to only remap the crash kernel memory if it is a part of an allocation that exactly fits into one or more PUDs. Say, in reserve_crashkernel() we try the memblock_phys_alloc() with PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation succeeds, we remap the entire area that now contains only memory allocated in reserve_crashkernel() and free the extra memory after remapping is done. If the large allocation fails, we fall back to the original size and alignment and don't allow unmapping crash kernel memory in arch_kexec_protect_crashkres(). > -- > Catalin
Thanks. 在 2022/7/6 21:54, Mike Rapoport 写道: > On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: >> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: >>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>>>> +void __init remap_crashkernel(void) >>>>>>> +{ >>>>>>> +#ifdef CONFIG_KEXEC_CORE >>>>>>> + phys_addr_t start, end, size; >>>>>>> + phys_addr_t aligned_start, aligned_end; >>>>>>> + >>>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>>>> + return; >>>>>>> + >>>>>>> + if (!crashk_res.end) >>>>>>> + return; >>>>>>> + >>>>>>> + start = crashk_res.start & PAGE_MASK; >>>>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>>>> + >>>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>>>> + >>>>>>> + /* Clear PUDs containing crash kernel memory */ >>>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>>>> + __phys_to_virt(aligned_end), false, NULL); >>>>>> >>>>>> What I don't understand is what happens if there's valid kernel data >>>>>> between aligned_start and crashk_res.start (or the other end of the >>>>>> range). 
>>>>> >>>>> Data shouldn't go anywhere :) >>>>> >>>>> There is >>>>> >>>>> + /* map area from PUD start to start of crash kernel with large pages */ >>>>> + size = start - aligned_start; >>>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>>>> + __phys_to_virt(aligned_start), >>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>> >>>>> and >>>>> >>>>> + /* map area from end of crash kernel to PUD end with large pages */ >>>>> + size = aligned_end - end; >>>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>> >>>>> after the unmap, so after we tear down a part of a linear map we >>>>> immediately recreate it, just with a different page size. >>>>> >>>>> This all happens before SMP, so there is no concurrency at that point. >>>> >>>> That brief period of unmap worries me. The kernel text, data and stack >>>> are all in the vmalloc space but any other (memblock) allocation to this >>>> point may be in the unmapped range before and after the crashkernel >>>> reservation. The interrupts are off, so I think the only allocation and >>>> potential access that may go in this range is the page table itself. But >>>> it looks fragile to me. >>> >>> I agree there are chances there will be an allocation from the unmapped >>> range. >>> >>> We can make sure this won't happen, though. We can cap the memblock >>> allocations with memblock_set_current_limit(aligned_end) or >>> memblock_reserve(algined_start, aligned_end) until the mappings are >>> restored. >> >> We can reserve the region just before unmapping to avoid new allocations >> for the page tables but we can't do much about pages already allocated >> prior to calling remap_crashkernel(). > > Right, this was bothering me too after I re-read you previous email. > > One thing I can think of is to only remap the crash kernel memory if it is > a part of an allocation that exactly fits into one ore more PUDs. 
> > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation > succeeds, we remap the entire area that now contains only memory allocated > in reserve_crashkernel() and free the extra memory after remapping is done. > If the large allocation fails, we fall back to the original size and > alignment and don't allow unmapping crash kernel memory in > arch_kexec_protect_crashkres(). > >> -- >> Catalin > Thanks. There is a new method. I think we should use patch v3 (similar, but it needs some changes). 1. We can walk the crashkernel block/section page tables, [[[keeping the original block/section mapping valid]]]; rebuild a pte-level page mapping for the crashkernel memory; and rebuild the left & right margin memory (which shares the same block/section mapping but lies outside the crashkernel memory) with block/section mappings. 2. 'Replace' the original block/section mapping with the newly built mapping, iteratively. With this method, all the memory mappings stay valid all the time. 3. The patch v3 link: https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/ (needs some changes)
在 2022/7/6 23:18, guanghui.fgh 写道: > Thanks. > > 在 2022/7/6 21:54, Mike Rapoport 写道: >> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: >>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: >>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>>>>> +void __init remap_crashkernel(void) >>>>>>>> +{ >>>>>>>> +#ifdef CONFIG_KEXEC_CORE >>>>>>>> + phys_addr_t start, end, size; >>>>>>>> + phys_addr_t aligned_start, aligned_end; >>>>>>>> + >>>>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + if (!crashk_res.end) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + start = crashk_res.start & PAGE_MASK; >>>>>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>>>>> + >>>>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>>>>> + >>>>>>>> + /* Clear PUDs containing crash kernel memory */ >>>>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>>>>> + __phys_to_virt(aligned_end), false, NULL); >>>>>>> >>>>>>> What I don't understand is what happens if there's valid kernel data >>>>>>> between aligned_start and crashk_res.start (or the other end of the >>>>>>> range). 
>>>>>> >>>>>> Data shouldn't go anywhere :) >>>>>> >>>>>> There is >>>>>> >>>>>> + /* map area from PUD start to start of crash kernel with >>>>>> large pages */ >>>>>> + size = start - aligned_start; >>>>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>>>>> + __phys_to_virt(aligned_start), >>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>>> >>>>>> and >>>>>> >>>>>> + /* map area from end of crash kernel to PUD end with large >>>>>> pages */ >>>>>> + size = aligned_end - end; >>>>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>>> >>>>>> after the unmap, so after we tear down a part of a linear map we >>>>>> immediately recreate it, just with a different page size. >>>>>> >>>>>> This all happens before SMP, so there is no concurrency at that >>>>>> point. >>>>> >>>>> That brief period of unmap worries me. The kernel text, data and stack >>>>> are all in the vmalloc space but any other (memblock) allocation to >>>>> this >>>>> point may be in the unmapped range before and after the crashkernel >>>>> reservation. The interrupts are off, so I think the only allocation >>>>> and >>>>> potential access that may go in this range is the page table >>>>> itself. But >>>>> it looks fragile to me. >>>> >>>> I agree there are chances there will be an allocation from the unmapped >>>> range. >>>> >>>> We can make sure this won't happen, though. We can cap the memblock >>>> allocations with memblock_set_current_limit(aligned_end) or >>>> memblock_reserve(algined_start, aligned_end) until the mappings are >>>> restored. >>> >>> We can reserve the region just before unmapping to avoid new allocations >>> for the page tables but we can't do much about pages already allocated >>> prior to calling remap_crashkernel(). >> >> Right, this was bothering me too after I re-read you previous email. 
>> >> One thing I can think of is to only remap the crash kernel memory if >> it is >> a part of an allocation that exactly fits into one ore more PUDs. >> >> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with >> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation >> succeeds, we remap the entire area that now contains only memory >> allocated >> in reserve_crashkernel() and free the extra memory after remapping is >> done. >> If the large allocation fails, we fall back to the original size and >> alignment and don't allow unmapping crash kernel memory in >> arch_kexec_protect_crashkres(). >>> -- >>> Catalin >> > Thanks. > > There is a new method. > I think we should use the patch v3(similar but need add some changes) > > 1.We can walk crashkernle block/section pagetable, > [[[(keep the origin block/section mapping valid]]] > rebuild the pte level page mapping for the crashkernel mem > rebuild left & right margin mem(which is in same block/section mapping > but out of crashkernel mem) with block/section mapping > > 2.'replace' the origin block/section mapping by new builded mapping > iterately > > With this method, all the mem mapping keep valid all the time. > > 3.the patch v3 link: > https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/ > > (Need some changes) Namely, when rebuilding the crashkernel memory page mapping, the original mapping is left unchanged. Only when the new mapping is ready do we replace the old mapping with the newly built one. With this method, all the memory mappings stay valid all the time.
On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote: > 在 2022/7/6 21:54, Mike Rapoport 写道: > > One thing I can think of is to only remap the crash kernel memory if it is > > a part of an allocation that exactly fits into one ore more PUDs. > > > > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with > > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation > > succeeds, we remap the entire area that now contains only memory allocated > > in reserve_crashkernel() and free the extra memory after remapping is done. > > If the large allocation fails, we fall back to the original size and > > alignment and don't allow unmapping crash kernel memory in > > arch_kexec_protect_crashkres(). > > There is a new method. > I think we should use the patch v3(similar but need add some changes) > > 1.We can walk crashkernle block/section pagetable, > [[[(keep the origin block/section mapping valid]]] > rebuild the pte level page mapping for the crashkernel mem > rebuild left & right margin mem(which is in same block/section mapping but > out of crashkernel mem) with block/section mapping > > 2.'replace' the origin block/section mapping by new builded mapping > iterately > > With this method, all the mem mapping keep valid all the time. As I already commented on one of your previous patches, this is not allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I think), the worst that can happen is a TLB conflict abort and the handler should invalidate the TLBs and restart the faulting instruction, assuming the handler won't try to access the same conflicting virtual address. Prior to FEAT_BBM, that's not possible as the architecture does not describe a precise behaviour of conflicting TLB entries (you might as well get the TLB output of multiple entries being or'ed together).
Thanks. 在 2022/7/6 23:40, Catalin Marinas 写道: > On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote: >> 在 2022/7/6 21:54, Mike Rapoport 写道: >>> One thing I can think of is to only remap the crash kernel memory if it is >>> a part of an allocation that exactly fits into one ore more PUDs. >>> >>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with >>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation >>> succeeds, we remap the entire area that now contains only memory allocated >>> in reserve_crashkernel() and free the extra memory after remapping is done. >>> If the large allocation fails, we fall back to the original size and >>> alignment and don't allow unmapping crash kernel memory in >>> arch_kexec_protect_crashkres(). >> >> There is a new method. >> I think we should use the patch v3(similar but need add some changes) >> >> 1.We can walk crashkernle block/section pagetable, >> [[[(keep the origin block/section mapping valid]]] >> rebuild the pte level page mapping for the crashkernel mem >> rebuild left & right margin mem(which is in same block/section mapping but >> out of crashkernel mem) with block/section mapping >> >> 2.'replace' the origin block/section mapping by new builded mapping >> iterately >> >> With this method, all the mem mapping keep valid all the time. > > As I already commented on one of your previous patches, this is not > allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I > think), the worst that can happen is a TLB conflict abort and the > handler should invalidate the TLBs and restart the faulting instruction, > assuming the handler won't try to access the same conflicting virtual > address. Prior to FEAT_BBM, that's not possible as the architecture does > not describe a precise behaviour of conflicting TLB entries (you might > as well get the TLB output of multiple entries being or'ed together). > I think there is another way to handle it. 
1. We can rebuild the crashkernel memory mapping first, but [[[don't
   change the original linear mapping]]].

2. Afterwards, we can reuse idmap_pg_dir and switch to it. We use
   idmap_pg_dir to change the linear mapping, which complies with the
   TLB BBM requirement.
Thanks.

On 2022/7/6 23:40, Catalin Marinas wrote:
> [...]
> Prior to FEAT_BBM, that's not possible as the architecture does
> not describe a precise behaviour of conflicting TLB entries (you might
> as well get the TLB output of multiple entries being or'ed together).
> The cpu can generate a TLB conflict abort if it detects that the
> address being looked up in the TLB hits multiple entries.

(1) I think when gathering small-page mappings into a block/section
mapping, there may be a TLB conflict if BBM is not followed.

Namely:
a. Map a 4KB page (address X) and touch that page, so that the
   translation gets cached in the TLB.
b. Modify the translation tables, replacing the mapping for address X
   with a 2MB mapping - DO NOT INVALIDATE the TLB.
c. Touch "X + 4KB". This will/should miss in the TLB, causing a new
   walk that returns the 2MB mapping.
d. Touch X. Assuming they've not been evicted, you'll hit both the 4KB
   and the 2MB entry, as both cover address X. There is a TLB conflict.

(link: https://community.arm.com/support-forums/f/dev-platforms-forum/13583/tlb-conflict-abort)

(2) But when splitting a large block/section mapping into a smaller
granularity, there may be no TLB conflict.

Namely:
a. Rebuild the pte-level mapping without any change to the original
   page table (the virtual-to-physical relation stays the same).
b. Modify the 1G mapping to use the new pte-level mapping in the
   [[[mem]]] without a TLB flush.
c. When the cpu accesses the 1G memory (anywhere):
   - If the 1G TLB entry is already cached, all accesses to the 1G
     memory succeed (no new TLB entry is loaded, no conflict).
   - If the 1G TLB entry has been evicted, the walker reads the page
     table in memory; whichever it "catches" - the old (1G) or the new
     (4K) mapping - all accesses to the 1G memory succeed (a new TLB
     entry is loaded, no conflict).
d. Afterwards, we flush the TLB and force the cpu to use the new page
   table (no conflict).

It seems there are never two TLB entries for the same virtual address
in the TLB cache when splitting a large block/section mapping.

(3) At the same time, I think we can use another way. As the system
linear mapping is built with init_pg_dir, we can also reuse init_pg_dir
to split the block/section mapping when needed. As init_pg_dir covers
all kernel text/data accesses, we can comply with the BBM requirement:
a. Rebuild the new pte-level mapping without any change to the old
   mapping (the cpu can't walk/access the new page mapping; it's
   isolated).
b. Change to use init_pg_dir.
c. Clear the old 1G block mapping and flush the TLB.
d. Modify the linear mapping to use the new pte-level page mapping
   with init_pg_dir (TLB BBM).
e. Switch to swapper_pg_dir.

Could you give me some advice?

Thanks.
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466..1a46b81 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
 extern void arm64_memblock_init(void);
 extern void paging_init(void);
 extern void bootmem_init(void);
+extern void map_crashkernel(void);
 extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
 extern void init_mem_pgprot(void);
 extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84..241d27e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
 	crashk_res.start = crash_base;
 	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
+	map_crashkernel();
 }
 
 /*
@@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
 	}
 
 	early_init_fdt_scan_reserved_mem();
-
-	if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
-
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
 }
 
@@ -438,8 +435,7 @@ void __init bootmem_init(void)
 	 * request_standard_resources() depends on crashkernel's memory being
 	 * reserved, so do it here.
 	 */
-	if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
+	reserve_crashkernel();
 
 	memblock_dump_all();
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32..76a4ff0 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -65,6 +65,10 @@ static DEFINE_SPINLOCK(swapper_pgdir_lock);
 static DEFINE_MUTEX(fixmap_lock);
 
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool free_mapped, struct vmem_altmap *altmap,
+				pgprot_t prot,
+				phys_addr_t (*pgtable_alloc)(int), int flags);
 
 void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
@@ -483,20 +487,49 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
-static bool crash_mem_map __initdata;
+#ifdef CONFIG_KEXEC_CORE
+static phys_addr_t __init early_crashkernel_pgtable_alloc(int shift)
+{
+	phys_addr_t phys;
+	void *ptr;
+
+	phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
+					 MEMBLOCK_ALLOC_NOLEAKTRACE);
+	if (!phys)
+		panic("Failed to allocate page table page\n");
+
+	ptr = (void *)__phys_to_virt(phys);
+	memset(ptr, 0, PAGE_SIZE);
+	return phys;
+}
 
-static int __init enable_crash_mem_map(char *arg)
+void __init map_crashkernel(void)
 {
-	/*
-	 * Proper parameter parsing is done by reserve_crashkernel(). We only
-	 * need to know if the linear map has to avoid block mappings so that
-	 * the crashkernel reservations can be unmapped later.
- */ - crash_mem_map = true; + phys_addr_t start, end, size; - return 0; + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; + + if (!crashk_res.end) + return; + + start = crashk_res.start & PAGE_MASK; + end = PAGE_ALIGN(crashk_res.end); + size = end - start; + + unmap_hotplug_range(__phys_to_virt(start), __phys_to_virt(end), false, + NULL, PAGE_KERNEL, early_crashkernel_pgtable_alloc, + NO_EXEC_MAPPINGS); + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, + __phys_to_virt(crashk_res.start), + size, PAGE_KERNEL, + early_crashkernel_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS | + NO_CONT_MAPPINGS); } -early_param("crashkernel", enable_crash_mem_map); +#else +void __init mapping_crashkernel(void) {} +#endif static void __init map_mem(pgd_t *pgdp) { @@ -527,17 +560,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) @@ -570,19 +592,6 @@ static void __init map_mem(pgd_t *pgdp) * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface. 
 	 */
-#ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map &&
-	    !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
-	}
-#endif
 }
 
 void mark_rodata_ro(void)
@@ -827,7 +836,6 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(pte));
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 static void free_hotplug_page_range(struct page *page, size_t size,
 				    struct vmem_altmap *altmap)
 {
@@ -863,9 +871,25 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end,
 	return true;
 }
 
+static void pte_clear_cont(pte_t *ptep)
+{
+	int i = 0;
+	pte_t pte = READ_ONCE(*ptep);
+
+	if (pte_none(pte) || !pte_cont(pte))
+		return;
+
+	ptep -= ((u64)ptep / sizeof(pte_t)) &
+		((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1);
+	do {
+		pte = pte_mknoncont(READ_ONCE(*ptep));
+		set_pte(ptep, pte);
+	} while (++ptep, ++i < CONT_PTES);
+}
+
 static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	pte_t *ptep, pte;
 
@@ -876,6 +900,8 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pte_present(pte));
+
+		pte_clear_cont(ptep);
 		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 		if (free_mapped)
@@ -884,9 +910,26 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 	} while (addr += PAGE_SIZE, addr < end);
 }
 
+static void pmd_clear_cont(pmd_t *pmdp)
+{
+	int i = 0;
+	pmd_t pmd = READ_ONCE(*pmdp);
+
+	if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd))
+		return;
+
+	pmdp -= ((u64)pmdp / sizeof(pmd_t)) &
+		((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1);
+	do {
+		pmd = READ_ONCE(*pmdp);
+		pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd)));
+		set_pmd(pmdp, pmd);
+	} while (++pmdp, ++i < CONT_PMDS);
+}
+
 static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	pmd_t *pmdp, pmd;
 
@@ -900,6 +943,8 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 		WARN_ON(!pmd_present(pmd));
 		if (pmd_sect(pmd)) {
+			/* clear CONT flags */
+			pmd_clear_cont(pmdp);
 			pmd_clear(pmdp);
 
 			/*
@@ -907,19 +952,36 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 			 * range is mapped with a single block entry.
 			 */
 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+			if (addr & ~PMD_MASK)
+				alloc_init_cont_pte(pmdp, addr & PMD_MASK,
+						    addr,
+						    __virt_to_phys(addr & PMD_MASK),
+						    prot, pgtable_alloc, flags);
+
+			if (next & ~PMD_MASK)
+				alloc_init_cont_pte(pmdp, next,
+						    ALIGN(next, PMD_SIZE),
+						    __virt_to_phys(next),
+						    prot, pgtable_alloc,
+						    flags);
+
 			if (free_mapped)
 				free_hotplug_page_range(pmd_page(pmd),
 							PMD_SIZE, altmap);
 			continue;
 		}
 		WARN_ON(!pmd_table(pmd));
-		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	pud_t *pudp, pud;
 
@@ -940,19 +1002,36 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 			 * range is mapped with a single block entry.
 			 */
 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+			if (addr & (~PUD_MASK))
+				alloc_init_cont_pmd(pudp, addr & PUD_MASK,
+						    addr,
+						    __virt_to_phys(addr & PUD_MASK),
+						    prot, pgtable_alloc, flags);
+
+			if (next & (~PUD_MASK))
+				alloc_init_cont_pmd(pudp, next,
+						    ALIGN(next, PUD_SIZE),
+						    __virt_to_phys(next),
+						    prot, pgtable_alloc,
+						    flags);
+
 			if (free_mapped)
 				free_hotplug_page_range(pud_page(pud),
 							PUD_SIZE, altmap);
 			continue;
 		}
 		WARN_ON(!pud_table(pud));
-		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 				    unsigned long end, bool free_mapped,
-				    struct vmem_altmap *altmap)
+				    struct vmem_altmap *altmap, pgprot_t prot,
+				    phys_addr_t (*pgtable_alloc)(int),
+				    int flags)
 {
 	unsigned long next;
 	p4d_t *p4dp, p4d;
 
@@ -965,12 +1044,15 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!p4d_present(p4d));
-		unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap);
+		unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
 static void unmap_hotplug_range(unsigned long addr, unsigned long end,
-				bool free_mapped, struct vmem_altmap *altmap)
+				bool free_mapped, struct vmem_altmap *altmap,
+				pgprot_t prot,
+				phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
 	pgd_t *pgdp, pgd;
 
@@ -991,7 +1073,8 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 			continue;
 
 		WARN_ON(!pgd_present(pgd));
-		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
+		unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap,
+					prot, pgtable_alloc, flags);
 	} while (addr = next, addr < end);
 }
 
@@ -1148,7 +1231,6 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 			free_empty_p4d_table(pgdp, addr, next, floor,
 					     ceiling);
 	} while (addr = next, addr < end);
 }
-#endif
 
 #if !ARM64_KERNEL_USES_PMD_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1210,7 +1292,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 {
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
-	unmap_hotplug_range(start, end, true, altmap);
+	unmap_hotplug_range(start, end, true, altmap, __pgprot(0), NULL, 0);
 	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -1474,7 +1556,7 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 	WARN_ON(pgdir != init_mm.pgd);
 	WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
 
-	unmap_hotplug_range(start, end, false, NULL);
+	unmap_hotplug_range(start, end, false, NULL, __pgprot(0), NULL, 0);
 	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
 }
The arm64 kernel can build 2M/1G block/section mappings. When using the
DMA/DMA32 zones (crashkernel enabled, rodata full disabled, kfence
disabled), the linear map will use non-block/section mappings (because
the crashkernel requires shrinking the region at page granularity). But
this degrades performance when doing large contiguous memory accesses
in the kernel (memcpy/memmove, etc).

There have been many changes and discussions:
commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
platforms with no DMA memory zones")
commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into
mem_init()")
commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel
reservation is required")

This patch makes the linear map use block/section mappings even with a
crashkernel. First, build block/section mappings (normally 2M or 1G)
for all available memory when creating the linear map, and reserve the
crashkernel memory. Then walk the page table and split the
block/section mappings into non-block/section mappings (normally 4K)
[[[only]]] for the crashkernel memory. So the linear memory mapping
uses block/section mappings as much as possible. This reduces cpu dTLB
misses conspicuously and accelerates memory accesses by about 10-20%.

I have tested it with pft (Page Fault Test) and fio and obtained a
great performance improvement.
For the fio test:

1. Prepare a ramdisk:

modprobe -r brd
modprobe brd rd_nr=1 rd_size=67108864
dmsetup remove_all
wipefs -a --force /dev/ram0
mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0
mkdir -p /fs/ram0
mount -t ext4 /dev/ram0 /fs/ram0

2. Prepare the fio parameters in an x.fio file:

[global]
bs=4k
ioengine=psync
iodepth=128
size=32G
direct=1
invalidate=1
group_reporting
thread=1
rw=read
directory=/fs/ram0
numjobs=1

[task_0]
cpus_allowed=16
stonewall=1

3. Run the testcase:

perf stat -e dTLB-load-misses fio x.fio

4. Contrast:

------------------------  without patch     with patch
fio READ                  aggrb=1493.2MB/s  aggrb=1775.3MB/s
dTLB-load-misses          1,818,320,693     438,729,774
time elapsed(s)           70.500326434      62.877316408
user(s)                   15.926332000      15.684721000
sys(s)                    54.211939000      47.046165000

5. Conclusion:

Using this patch reduces dTLB misses and improves performance greatly.

Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
---
 arch/arm64/include/asm/mmu.h |   1 +
 arch/arm64/mm/init.c         |   8 +-
 arch/arm64/mm/mmu.c          | 176 +++++++++++++++++++++++++++++++------------
 3 files changed, 132 insertions(+), 53 deletions(-)