[v18,02/17] x86/setup: Move xen_pv_domain() check and insert_resource() to setup_arch()

Message ID 20211222130820.1754-3-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Series support reserving crashkernel above 4G on arm64 kdump

Commit Message

Leizhen (ThunderTown) Dec. 22, 2021, 1:08 p.m. UTC
From: Chen Zhou <chenzhou10@huawei.com>

We are going to make reserve_crashkernel() generic. The
xen_pv_domain() check in reserve_crashkernel() is relevant only to
x86, and so is insert_resource() in reserve_crashkernel[_low]().
So move the xen_pv_domain() check and insert_resource() to setup_arch()
to keep them in x86.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Co-developed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/x86/kernel/setup.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

Comments

Borislav Petkov Dec. 23, 2021, 5:26 p.m. UTC | #1
On Wed, Dec 22, 2021 at 09:08:05PM +0800, Zhen Lei wrote:
> From: Chen Zhou <chenzhou10@huawei.com>
> 
> We will make the functions reserve_crashkernel() as generic, the
> xen_pv_domain() check in reserve_crashkernel() is relevant only to
> x86,

Why is that so? Is Xen-PV x86-only?

> the same as insert_resource() in reserve_crashkernel[_low]().

Why?

Looking at

  0212f9159694 ("x86: Add Crash kernel low reservation")

it *surprisingly* explains why that resources thing is being added:

    We need to add another range in /proc/iomem like "Crash kernel low",
    so kexec-tools could find that info and append to kdump kernel
    command line.

Then,

  157752d84f5d ("kexec: use Crash kernel for Crash kernel low")

renamed it because, as it states, kexec-tools was taught to handle
multiple resources of the same name.

So why does kexec-tools on arm *not* need those iomem resources? How
does it parse the ranges there? Questions over questions...

So last time I told you to sit down and take your time with this cleanup.
From reading this here, it doesn't look like it. Rather, it looks like
hastily done in a hurry and hurrying stuff doesn't help you one bit - it
actually makes it worse.

Your commit messages need to explain *why* a change is being done and
why is that ok. This one doesn't.

> @@ -1120,7 +1109,17 @@ void __init setup_arch(char **cmdline_p)
>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>  	 * won't consume hotpluggable memory.
>  	 */
> -	reserve_crashkernel();
> +#ifdef CONFIG_KEXEC_CORE
> +	if (xen_pv_domain())
> +		pr_info("Ignoring crashkernel for a Xen PV domain\n");

This is wrong - the check is currently being done inside
reserve_crashkernel(), *after* it has parsed a crashkernel= cmdline
correctly - and not before.

Your change would print on Xen PV, regardless of whether it has received
crashkernel= on the cmdline or not.

This is exactly why I say that making those functions generic and shared
might not be such a good idea, after all, because then you'd have to
sprinkle around arch-specific stuff.

One of the ways how to address this particular case here would be:

1. Add a x86-specific wrapper around parse_crashkernel() which does
all the parsing. When that wrapper finishes, you should have parsed
everything that has crashkernel= on the cmdline.

2. At the end of that wrapper, you do arch-specific checks and setup
like the xen_pv_domain() one.

3. Now, you do reserve_crashkernel(), if those checks pass.

The question is, whether the flow on arm64 can do the same. Probably but
it needs careful auditing.
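
The three steps above can be sketched as a small user-space model. This is only an illustration of the ordering (parse first, then arch-specific checks, then reserve), not kernel code: `parse_crashkernel_model()`, `arch_reserve_crashkernel_model()` and `struct crash_request` are hypothetical names, the parser only understands `crashkernel=<N>M`, and the Xen PV state is passed in as a plain flag.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the parsed crashkernel= request. */
struct crash_request {
	bool requested;			/* a crashkernel= option was found */
	unsigned long long size;	/* requested size in bytes */
};

/* Step 1: parse first. Toy parser: only recognises "crashkernel=<N>M". */
static struct crash_request parse_crashkernel_model(const char *cmdline)
{
	struct crash_request req = { false, 0 };
	const char *p = strstr(cmdline, "crashkernel=");
	unsigned long long mib;

	if (p && sscanf(p, "crashkernel=%lluM", &mib) == 1 && mib > 0) {
		req.requested = true;
		req.size = mib << 20;
	}
	return req;
}

/*
 * Steps 2 and 3: the arch-specific check runs only after a successful
 * parse, and the reservation runs only if that check passes.
 * Returns the number of bytes that would be reserved; *warned reports
 * whether the "Ignoring crashkernel" message was printed.
 */
static unsigned long long arch_reserve_crashkernel_model(const char *cmdline,
							 bool is_xen_pv,
							 bool *warned)
{
	struct crash_request req = parse_crashkernel_model(cmdline);

	*warned = false;
	if (!req.requested)
		return 0;		/* nothing was asked for: stay silent */

	if (is_xen_pv) {
		printf("Ignoring crashkernel for a Xen PV domain\n");
		*warned = true;
		return 0;
	}

	return req.size;		/* stand-in for reserve_crashkernel() */
}
```

With this ordering, a Xen PV guest booted without crashkernel= prints nothing, which is the behaviour the existing x86 code has and the proposed setup_arch() hunk loses.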
Leizhen (ThunderTown) Dec. 24, 2021, 6:36 a.m. UTC | #2
On 2021/12/24 1:26, Borislav Petkov wrote:
> On Wed, Dec 22, 2021 at 09:08:05PM +0800, Zhen Lei wrote:
>> From: Chen Zhou <chenzhou10@huawei.com>
>>
>> We will make the functions reserve_crashkernel() as generic, the
>> xen_pv_domain() check in reserve_crashkernel() is relevant only to
>> x86,
> 
> Why is that so? Is Xen-PV x86-only?
> 
>> the same as insert_resource() in reserve_crashkernel[_low]().
> 
> Why?
> 
> Looking at
> 
>   0212f9159694 ("x86: Add Crash kernel low reservation")
> 
> it *surprisingly* explains why that resources thing is being added:
> 
>     We need to add another range in /proc/iomem like "Crash kernel low",
>     so kexec-tools could find that info and append to kdump kernel
>     command line.
> 
> Then,
> 
>   157752d84f5d ("kexec: use Crash kernel for Crash kernel low")
> 
> renamed it because, as it states, kexec-tools was taught to handle
> multiple resources of the same name.
> 
> So why does kexec-tools on arm *not* need those iomem resources? How
> does it parse the ranges there? Questions over questions...

https://lkml.org/lkml/2019/4/4/1758

Chen Zhou has explained this before, see below. I'll analyze why x86 and arm64 need
to process iomem resources at different times.

 < This very reminds what x86 does. Any chance some of the code can be reused
 < rather than duplicated?
As I said in the comment, I ported reserve_crashkernel_low() from x86_64. There are minor
differences. On arm64, we don't need to call insert_resource(); we call request_resource()
in request_standard_resources() later.

> 
> So last time I told you to sit down and take your time with this cleanup.
> From reading this here, it doesn't look like it. Rather, it looks like
> hastily done in a hurry and hurrying stuff doesn't help you one bit - it
> actually makes it worse.
> 
> Your commit messages need to explain *why* a change is being done and
> why is that ok. This one doesn't.

OK, I'll do this in follow-up patches.

> 
>> @@ -1120,7 +1109,17 @@ void __init setup_arch(char **cmdline_p)
>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>  	 * won't consume hotpluggable memory.
>>  	 */
>> -	reserve_crashkernel();
>> +#ifdef CONFIG_KEXEC_CORE
>> +	if (xen_pv_domain())
>> +		pr_info("Ignoring crashkernel for a Xen PV domain\n");
> 
> This is wrong - the check is currently being done inside
> reserve_crashkernel(), *after* it has parsed a crashkernel= cmdline
> correctly - and not before.
> 
> Your change would print on Xen PV, regardless of whether it has received
> crashkernel= on the cmdline or not.

Yes, you're right. There are changes in code logic, but the print doesn't
seem to cause any misunderstanding.

> 
> This is exactly why I say that making those functions generic and shared
> might not be such a good idea, after all, because then you'd have to
> sprinkle around arch-specific stuff.

Yes, I'm thinking about that too. Perhaps they are not suitable for full
code sharing, but it looks like there's some code that can be shared.
For example, the function parse_crashkernel_in_order() that I extracted
based on your suggestion (it could also be named parse_crashkernel_high_low()),
or the function reserve_crashkernel_low().

There are two ways to reserve memory above 4G:
1. Use crashkernel=X,high, with or without crashkernel=X,low
2. Use crashkernel=X[@offset], but try low memory first. If that fails,
   try high memory, and then also reserve at least 256M of low memory.

I plan to implement only 2 in the next version so that there are fewer
changes, and then implement 1 after 2 is applied.
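
As a rough illustration of option 2, the fallback order can be modelled in user space as below. This is a sketch under stated assumptions, not the proposed kernel code: `struct mem_model`, `take()` and `reserve_with_fallback()` are hypothetical, free memory is reduced to two byte counters (below and above 4G), the low reservation is fixed at exactly 256M, and giving the high region back when the low reservation fails is an assumption of this model.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_256M	(256ULL << 20)

/* Hypothetical model of free memory: one pool below 4G, one above. */
struct mem_model {
	uint64_t low_free;	/* bytes still free below 4G */
	uint64_t high_free;	/* bytes still free above 4G */
};

struct crash_result {
	bool ok;		/* the reservation succeeded */
	bool high;		/* the main region ended up above 4G */
	uint64_t low_size;	/* extra low memory reserved with a high region */
};

/* Take size bytes from a pool if it is large enough. */
static bool take(uint64_t *pool, uint64_t size)
{
	if (*pool < size)
		return false;
	*pool -= size;
	return true;
}

static struct crash_result reserve_with_fallback(struct mem_model *m,
						 uint64_t size)
{
	struct crash_result r = { false, false, 0 };

	/* Try low memory first. */
	if (take(&m->low_free, size)) {
		r.ok = true;
		return r;
	}

	/* Low memory failed: fall back to high memory. */
	if (!take(&m->high_free, size))
		return r;

	/* A high region still needs some low memory next to it. */
	if (!take(&m->low_free, SZ_256M)) {
		m->high_free += size;	/* model assumption: undo the high part */
		return r;
	}

	r.ok = true;
	r.high = true;
	r.low_size = SZ_256M;
	return r;
}
```

In this model a request that fits below 4G never touches high memory, while a larger request ends up with two regions: the main one above 4G and a 256M one below it.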

> 
> One of the ways how to address this particular case here would be:
> 
> 1. Add a x86-specific wrapper around parse_crashkernel() which does
> all the parsing. When that wrapper finishes, you should have parsed
> everything that has crashkernel= on the cmdline.
> 
> 2. At the end of that wrapper, you do arch-specific checks and setup
> like the xen_pv_domain() one.
> 
> 3. Now, you do reserve_crashkernel(), if those checks pass.
> 
> The question is, whether the flow on arm64 can do the same. Probably but
> it needs careful auditing.
>
Leizhen (ThunderTown) Dec. 25, 2021, 1:53 a.m. UTC | #3
On 2021/12/24 14:36, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/12/24 1:26, Borislav Petkov wrote:
>> On Wed, Dec 22, 2021 at 09:08:05PM +0800, Zhen Lei wrote:
>>> From: Chen Zhou <chenzhou10@huawei.com>
>>>
>>> We will make the functions reserve_crashkernel() as generic, the
>>> xen_pv_domain() check in reserve_crashkernel() is relevant only to
>>> x86,
>>
>> Why is that so? Is Xen-PV x86-only?
>>
>>> the same as insert_resource() in reserve_crashkernel[_low]().
>>
>> Why?
>>
>> Looking at
>>
>>   0212f9159694 ("x86: Add Crash kernel low reservation")
>>
>> it *surprisingly* explains why that resources thing is being added:
>>
>>     We need to add another range in /proc/iomem like "Crash kernel low",
>>     so kexec-tools could find that info and append to kdump kernel
>>     command line.
>>
>> Then,
>>
>>   157752d84f5d ("kexec: use Crash kernel for Crash kernel low")
>>
>> renamed it because, as it states, kexec-tools was taught to handle
>> multiple resources of the same name.
>>
>> So why does kexec-tools on arm *not* need those iomem resources? How
>> does it parse the ranges there? Questions over questions...

It's a good question worth figuring out. I'm going to dig into this.
I admire your rigorous style and sharp vision.

> 
> https://lkml.org/lkml/2019/4/4/1758
> 
> Chen Zhou has explained before, see below. I'll analyze why x86 and arm64 need
> to process iomem resources at different times.
> 
>  < This very reminds what x86 does. Any chance some of the code can be reused
>  < rather than duplicated?
> As i said in the comment, i transport reserve_crashkernel_low() from x86_64. There are minor
> differences. In arm64, we don't need to do insert_resource(), we do request_resource()
> in request_standard_resources() later.
> 
>>
>> So last time I told you to sit down and take your time with this cleanup.
>> From reading this here, it doesn't look like it. Rather, it looks like
>> hastily done in a hurry and hurrying stuff doesn't help you one bit - it
>> actually makes it worse.
>>
>> Your commit messages need to explain *why* a change is being done and
>> why is that ok. This one doesn't.
> 
> OK, I'll do this in follow-up patches.
> 
>>
>>> @@ -1120,7 +1109,17 @@ void __init setup_arch(char **cmdline_p)
>>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>>  	 * won't consume hotpluggable memory.
>>>  	 */
>>> -	reserve_crashkernel();
>>> +#ifdef CONFIG_KEXEC_CORE
>>> +	if (xen_pv_domain())
>>> +		pr_info("Ignoring crashkernel for a Xen PV domain\n");

Right, these two lines of code do not need to be moved. xen_pv_domain() is
a friendly macro function.

>>
>> This is wrong - the check is currently being done inside
>> reserve_crashkernel(), *after* it has parsed a crashkernel= cmdline
>> correctly - and not before.
>>
>> Your change would print on Xen PV, regardless of whether it has received
>> crashkernel= on the cmdline or not.
> 
> Yes, you're right. There are changes in code logic, but the print doesn't
> seem to cause any misunderstanding.
> 
>>
>> This is exactly why I say that making those functions generic and shared
>> might not be such a good idea, after all, because then you'd have to
>> sprinkle around arch-specific stuff.
> 
> Yes, I'm thinking about that too. Perhaps they are not suitable for full
> code sharing, but it looks like there's some code that can be shared.
> For example, the function parse_crashkernel_in_order() that I extracted
> based on your suggestion, it could also be parse_crashkernel_high_low().
> Or the function reserve_crashkernel_low().
> 
> There are two ways to reserve memory above 4G:
> 1. Use crashkernel=X,high, with or without crashkernel=X,low
> 2. Use crashkernel=X,[offset], but try low memory first. If failed, then
>    try high memory, and retry at least 256M low memory.
> 
> I plan to only implement 2 in the next version so that there can be fewer
> changes. Then implement 1 after 2 is applied.

I tried it yesterday and it didn't work. I still have to deal with the
problem of adjusting insert_resource().

How about I isolate some cleanup patches first? Strive for them to be
merged into v5.17. This way, we can focus on the core changes in the
next version. And I can also save some repetitive rebase workload.

> 
>>
>> One of the ways how to address this particular case here would be:
>>
>> 1. Add a x86-specific wrapper around parse_crashkernel() which does
>> all the parsing. When that wrapper finishes, you should have parsed
>> everything that has crashkernel= on the cmdline.
>>
>> 2. At the end of that wrapper, you do arch-specific checks and setup
>> like the xen_pv_domain() one.
>>
>> 3. Now, you do reserve_crashkernel(), if those checks pass.
>>
>> The question is, whether the flow on arm64 can do the same. Probably but
>> it needs careful auditing.
>>
Leizhen (ThunderTown) Dec. 25, 2021, 10:16 a.m. UTC | #4
On 2021/12/25 9:53, Leizhen (ThunderTown) wrote:
>>> This is exactly why I say that making those functions generic and shared
>>> might not be such a good idea, after all, because then you'd have to
>>> sprinkle around arch-specific stuff.

Hi Borislav and all:
  Merry Christmas!

  I have a new idea now. It helps us get around all the arguments and
minimizes changes to x86 (and also to arm64).
  Previously, Chen Zhou and I tried to share the entire function
reserve_crashkernel(), which led to the following series of problems:
1. reserve_crashkernel() is also defined on other architectures, so we should
   add build option ARCH_WANT_RESERVE_CRASH_KERNEL to avoid conflicts.
2. Move xen_pv_domain() check out of reserve_crashkernel().
3. Move insert_resource() out of reserve_crashkernel()

Others:
4. start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
                                                  crash_base + crash_size);
   Change SZ_1M to CRASH_ALIGN, or leave it unchanged.
   The current conclusion is to leave it unchanged. But I think adding a new
   macro CRASH_FIXED_ALIGN is also an option. 2M alignment allows page tables
   to use block mappings on most architectures.
5. if (crash_base >= (1ULL << 32) && reserve_crashkernel_low())
   Change (1ULL << 32) to CRASH_ADDR_LOW_MAX, or leave it unchanged.
   I reanalyzed it, and this doesn't need to be changed.

So for 1-3, why not add a new function reserve_crashkernel_mem() and rename
reserve_crashkernel_low() to reserve_crashkernel_mem_low()?
On x86:
static void __init reserve_crashkernel(void)
{
	/*
	 * Parse all "crashkernel=" configurations in priority order until
	 * a valid combination is found. Or return upon failure.
	 */

	if (xen_pv_domain()) {
		pr_info("Ignoring crashkernel for a Xen PV domain\n");
		return;
	}

	/*
	 * Call reserve_crashkernel_mem() to reserve crashkernel memory, it will
	 * call reserve_crashkernel_mem_low() if needed.
	 */

	if (crashk_low_res.end)
		insert_resource(&iomem_resource, &crashk_low_res);
	insert_resource(&iomem_resource, &crashk_res);
}

On arm64:
static void __init reserve_crashkernel(void)
{
	/*
	 * Parse all "crashkernel=" configurations in priority order until
	 * a valid combination is found. Or return upon failure.
	 */

	/*
	 * Call reserve_crashkernel_mem() to reserve crashkernel memory, it will
	 * call reserve_crashkernel_mem_low() if needed.
	 */
}


1. reserve_crashkernel() is still static, so there is no
   need to add ARCH_WANT_RESERVE_CRASH_KERNEL.
2. The xen_pv_domain() check is not affected in any way.
   Hi Borislav:
     As you mentioned, this check may also be needed on arm64. But it may be
   better not to add it until the problem is actually triggered on arm64.
3. insert_resource() is not moved outside reserve_crashkernel() on x86.
   Hi Borislav:
     I haven't yet figured out why request_resource() can't be replaced
   with insert_resource() on arm64, but I have a hunch that the kexec tool may
   be involved. The cost of modification on arm64 is definitely higher than
   on x86. Other architectures that want to use reserve_crashkernel_mem() may
   also face the same problem. So it's probably better that
   reserve_crashkernel_mem() doesn't invoke insert_resource().

I guess you have a long Christmas holiday. So I'm going to send the next version
without waiting for your response.


>> Yes, I'm thinking about that too. Perhaps they are not suitable for full
>> code sharing, but it looks like there's some code that can be shared.
>> For example, the function parse_crashkernel_in_order() that I extracted
>> based on your suggestion, it could also be parse_crashkernel_high_low().
>> Or the function reserve_crashkernel_low().
>>
>> There are two ways to reserve memory above 4G:
>> 1. Use crashkernel=X,high, with or without crashkernel=X,low
>> 2. Use crashkernel=X,[offset], but try low memory first. If failed, then
>>    try high memory, and retry at least 256M low memory.
>>
>> I plan to only implement 2 in the next version so that there can be fewer
>> changes. Then implement 1 after 2 is applied.
> I tried it yesterday and it didn't work. I still have to deal with the
> problem of adjusting insert_resource().
> 
> How about I isolate some cleanup patches first? Strive for them to be
> merged into v5.17. This way, we can focus on the core changes in the
> next version. And I can also save some repetitive rebase workload.
>
Leizhen (ThunderTown) Jan. 7, 2022, 8:13 a.m. UTC | #5
On 2021/12/25 9:53, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/12/24 14:36, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2021/12/24 1:26, Borislav Petkov wrote:
>>> On Wed, Dec 22, 2021 at 09:08:05PM +0800, Zhen Lei wrote:
>>>> From: Chen Zhou <chenzhou10@huawei.com>
>>>>
>>>> We will make the functions reserve_crashkernel() as generic, the
>>>> xen_pv_domain() check in reserve_crashkernel() is relevant only to
>>>> x86,
>>>
>>> Why is that so? Is Xen-PV x86-only?
>>>
>>>> the same as insert_resource() in reserve_crashkernel[_low]().
>>>
>>> Why?
>>>
>>> Looking at
>>>
>>>   0212f9159694 ("x86: Add Crash kernel low reservation")
>>>
>>> it *surprisingly* explains why that resources thing is being added:
>>>
>>>     We need to add another range in /proc/iomem like "Crash kernel low",
>>>     so kexec-tools could find that info and append to kdump kernel
>>>     command line.
>>>
>>> Then,
>>>
>>>   157752d84f5d ("kexec: use Crash kernel for Crash kernel low")
>>>
>>> renamed it because, as it states, kexec-tools was taught to handle
>>> multiple resources of the same name.
>>>
>>> So why does kexec-tools on arm *not* need those iomem resources? How
>>> does it parse the ranges there? Questions over questions...

Hi Borislav:
  The reason why insert_resource() cannot be used in reserve_crashkernel[_low]()
on arm64 is now clear. The parent resource node of crashk[_low]_res is added by
request_resource() in request_standard_resources(), so the two would conflict.
All request_resource() calls in request_standard_resources() would have to be
changed to insert_resource() before insert_resource() could be used in
reserve_crashkernel[_low]().

  I found commit e25e6e7593ca ("kdump, x86: Process multiple Crash kernel in /proc/iomem")
in kexec-tools. I'm trying to port it to arm64, or make it generic.
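
The conflict described above can be illustrated with a toy user-space model of a resource tree. This is a deliberately simplified sketch, not the kernel's resource code: `struct res_tree`, `request_model()` and `insert_model()` are hypothetical, only top-level entries are checked, failures return -1 rather than a kernel error code, and the insert path handles only the case where the new range fully contains the entries it conflicts with.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RES 8

struct res {
	const char *name;
	uint64_t start, end;	/* inclusive range */
	int parent;		/* index of the parent entry, -1 for top level */
};

struct res_tree {
	struct res r[MAX_RES];
	size_t n;
};

static bool overlaps(const struct res *a, uint64_t start, uint64_t end)
{
	return a->start <= end && start <= a->end;
}

/* request_resource()-like: fail if the range overlaps any top-level entry. */
static int request_model(struct res_tree *t, const char *name,
			 uint64_t start, uint64_t end)
{
	if (t->n == MAX_RES)
		return -1;
	for (size_t i = 0; i < t->n; i++)
		if (t->r[i].parent == -1 && overlaps(&t->r[i], start, end))
			return -1;
	t->r[t->n++] = (struct res){ name, start, end, -1 };
	return 0;
}

/*
 * insert_resource()-like (simplified): a new range that fully contains the
 * top-level entries it overlaps becomes their parent. Any other overlap
 * fails in this model.
 */
static int insert_model(struct res_tree *t, const char *name,
			uint64_t start, uint64_t end)
{
	int self = (int)t->n;

	if (t->n == MAX_RES)
		return -1;
	for (size_t i = 0; i < t->n; i++)
		if (t->r[i].parent == -1 && overlaps(&t->r[i], start, end) &&
		    !(start <= t->r[i].start && t->r[i].end <= end))
			return -1;
	for (size_t i = 0; i < t->n; i++)
		if (t->r[i].parent == -1 && overlaps(&t->r[i], start, end))
			t->r[i].parent = self;	/* adopt as child */
	t->r[t->n++] = (struct res){ name, start, end, -1 };
	return 0;
}
```

If a "Crash kernel" range is inserted first and a surrounding "System RAM" range is registered afterwards, `request_model()` rejects the later registration, whereas `insert_model()` accepts it and makes "System RAM" the parent of "Crash kernel".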

  Thanks.

> 
> It's a good question worth figuring out. I'm going to dig into this.
> I admire your rigorous style and sharp vision.
>
Leizhen (ThunderTown) Jan. 7, 2022, 1:09 p.m. UTC | #6
On 2022/1/7 16:13, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/12/25 9:53, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2021/12/24 14:36, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2021/12/24 1:26, Borislav Petkov wrote:
>>>> On Wed, Dec 22, 2021 at 09:08:05PM +0800, Zhen Lei wrote:
>>>>> From: Chen Zhou <chenzhou10@huawei.com>
>>>>>
>>>>> We will make the functions reserve_crashkernel() as generic, the
>>>>> xen_pv_domain() check in reserve_crashkernel() is relevant only to
>>>>> x86,
>>>>
>>>> Why is that so? Is Xen-PV x86-only?
>>>>
>>>>> the same as insert_resource() in reserve_crashkernel[_low]().
>>>>
>>>> Why?
>>>>
>>>> Looking at
>>>>
>>>>   0212f9159694 ("x86: Add Crash kernel low reservation")
>>>>
>>>> it *surprisingly* explains why that resources thing is being added:
>>>>
>>>>     We need to add another range in /proc/iomem like "Crash kernel low",
>>>>     so kexec-tools could find that info and append to kdump kernel
>>>>     command line.
>>>>
>>>> Then,
>>>>
>>>>   157752d84f5d ("kexec: use Crash kernel for Crash kernel low")
>>>>
>>>> renamed it because, as it states, kexec-tools was taught to handle
>>>> multiple resources of the same name.
>>>>
>>>> So why does kexec-tools on arm *not* need those iomem resources? How
>>>> does it parse the ranges there? Questions over questions...
> 
> Hi Borislav:
>   The reason why insert_resource() cannot be used in reserve_crashkernel[_low]()
> on arm64 is clear. The parent resource node of crashk[_low]_res is added by
> request_resource() in request_standard_resources(), so that it will be conflicted.
> All request_resource() in request_standard_resources() should be changed to
> insert_resource(), to make insert_resource() can be used in reserve_crashkernel[_low]().
> 
>   I found commit e25e6e7593ca ("kdump, x86: Process multiple Crash kernel in /proc/iomem")
> in kexec-tools. I'm trying to port it to arm64, or make it generic.

Chen Zhou has done this before. But "Crash kernel (low)" can really be eliminated. Chen
Zhou only used it to distinguish whether a crashkernel memory range is the crashkernel load
range or not. We can use get_crash_kernel_load_range() to get and check the load range.


> 
>   Thanks.
> 
>>
>> It's a good question worth figuring out. I'm going to dig into this.
>> I admire your rigorous style and sharp vision.
>>
> 
>
Patch

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ae8f63661363e25..acf2f2eedfe3415 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -434,7 +434,6 @@  static int __init reserve_crashkernel_low(void)
 
 	crashk_low_res.start = low_base;
 	crashk_low_res.end   = low_base + low_size - 1;
-	insert_resource(&iomem_resource, &crashk_low_res);
 #endif
 	return 0;
 }
@@ -458,11 +457,6 @@  static void __init reserve_crashkernel(void)
 		high = true;
 	}
 
-	if (xen_pv_domain()) {
-		pr_info("Ignoring crashkernel for a Xen PV domain\n");
-		return;
-	}
-
 	/* 0 means: find the address automatically */
 	if (!crash_base) {
 		/*
@@ -508,11 +502,6 @@  static void __init reserve_crashkernel(void)
 
 	crashk_res.start = crash_base;
 	crashk_res.end   = crash_base + crash_size - 1;
-	insert_resource(&iomem_resource, &crashk_res);
-}
-#else
-static void __init reserve_crashkernel(void)
-{
 }
 #endif
 
@@ -1120,7 +1109,17 @@  void __init setup_arch(char **cmdline_p)
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
 	 */
-	reserve_crashkernel();
+#ifdef CONFIG_KEXEC_CORE
+	if (xen_pv_domain())
+		pr_info("Ignoring crashkernel for a Xen PV domain\n");
+	else {
+		reserve_crashkernel();
+		if (crashk_res.end > crashk_res.start)
+			insert_resource(&iomem_resource, &crashk_res);
+		if (crashk_low_res.end > crashk_low_res.start)
+			insert_resource(&iomem_resource, &crashk_low_res);
+	}
+#endif
 
 	memblock_find_dma_reserve();