diff mbox series

[05/22] x86/srat: vmap the pages for acpi_slit

Message ID 20221216114853.8227-6-julien@xen.org (mailing list archive)
State New, archived
Headers show
Series Remove the directmap | expand

Commit Message

Julien Grall Dec. 16, 2022, 11:48 a.m. UTC
From: Hongyan Xia <hongyxia@amazon.com>

This avoids the assumption that boot pages are in the direct map.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>

---

    Changes since Hongyan's version:
        * vmap_boot_pages() was renamed to vmap_contig_pages()
---
 xen/arch/x86/srat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jan Beulich Dec. 20, 2022, 3:30 p.m. UTC | #1
On 16.12.2022 12:48, Julien Grall wrote:
> From: Hongyan Xia <hongyxia@amazon.com>
> 
> This avoids the assumption that boot pages are in the direct map.
> 
> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

However, ...

> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>  		return;
>  	}
>  	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));

... with the increased use of vmap space the VA range used will need
growing. And that's perhaps better done ahead of time than late.

> +	BUG_ON(!acpi_slit);

Similarly relevant for the earlier patch: It would be nice if boot
failure for optional things like NUMA data could be avoided. But I
understand this is somewhat orthogonal to this series (the more that
alloc_boot_pages() itself is also affected). Yet not entirely so,
since previously there was no mapping failure possible here.

Jan
Julien Grall Dec. 23, 2022, 11:31 a.m. UTC | #2
Hi Jan,

On 20/12/2022 15:30, Jan Beulich wrote:
> On 16.12.2022 12:48, Julien Grall wrote:
>> From: Hongyan Xia <hongyxia@amazon.com>
>>
>> This avoids the assumption that boot pages are in the direct map.
>>
>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>> Signed-off-by: Julien Grall <jgrall@amazon.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> However, ...
> 
>> --- a/xen/arch/x86/srat.c
>> +++ b/xen/arch/x86/srat.c
>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>   		return;
>>   	}
>>   	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
>> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
> 
> ... with the increased use of vmap space the VA range used will need
> growing. And that's perhaps better done ahead of time than late.

I will have a look at increasing the vmap area.

> 
>> +	BUG_ON(!acpi_slit);
> 
> Similarly relevant for the earlier patch: It would be nice if boot
> failure for optional things like NUMA data could be avoided.

If you can't map (or allocate) the memory, then you are probably in a 
very bad situation, because neither should really fail at boot.

So I think it is correct to crash early, because the admin will be able 
to look at what went wrong. Otherwise, it may be missed in the noise.

>  But I
> understand this is somewhat orthogonal to this series (the more that
> alloc_boot_pages() itself is also affected). Yet not entirely so,
> since previously there was no mapping failure possible here.

See above. I don't see a problem with adding a potential mapping failure 
here and in the earlier patch.

Cheers,
Jan Beulich Jan. 4, 2023, 10:23 a.m. UTC | #3
On 23.12.2022 12:31, Julien Grall wrote:
> On 20/12/2022 15:30, Jan Beulich wrote:
>> On 16.12.2022 12:48, Julien Grall wrote:
>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>
>>> This avoids the assumption that boot pages are in the direct map.
>>>
>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> However, ...
>>
>>> --- a/xen/arch/x86/srat.c
>>> +++ b/xen/arch/x86/srat.c
>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>   		return;
>>>   	}
>>>   	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
>>> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>
>> ... with the increased use of vmap space the VA range used will need
>> growing. And that's perhaps better done ahead of time than late.
> 
> I will have a look at increasing the vmap area.
> 
>>
>>> +	BUG_ON(!acpi_slit);
>>
>> Similarly relevant for the earlier patch: It would be nice if boot
>> failure for optional things like NUMA data could be avoided.
> 
> If you can't map (or allocate the memory), then you are probably in a 
> very bad situation because both should really not fail at boot.
> 
> So I think this is correct to crash early because the admin will be able 
> to look what went wrong. Otherwise, it may be missed in the noise.

Well, I certainly can see one taking this view. However, at least in
principle allocation (or mapping) may fail _because_ of NUMA issues.
At which point it would be better to boot with NUMA support turned off.

Jan
Julien Grall Jan. 12, 2023, 11:15 p.m. UTC | #4
Hi,

On 04/01/2023 10:23, Jan Beulich wrote:
> On 23.12.2022 12:31, Julien Grall wrote:
>> On 20/12/2022 15:30, Jan Beulich wrote:
>>> On 16.12.2022 12:48, Julien Grall wrote:
>>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>>
>>>> This avoids the assumption that boot pages are in the direct map.
>>>>
>>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> However, ...
>>>
>>>> --- a/xen/arch/x86/srat.c
>>>> +++ b/xen/arch/x86/srat.c
>>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>>    		return;
>>>>    	}
>>>>    	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>>> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
>>>> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>>
>>> ... with the increased use of vmap space the VA range used will need
>>> growing. And that's perhaps better done ahead of time than late.
>>
>> I will have a look at increasing the vmap area.
>>
>>>
>>>> +	BUG_ON(!acpi_slit);
>>>
>>> Similarly relevant for the earlier patch: It would be nice if boot
>>> failure for optional things like NUMA data could be avoided.
>>
>> If you can't map (or allocate) the memory, then you are probably in a
>> very bad situation, because neither should really fail at boot.
>> 
>> So I think it is correct to crash early, because the admin will be able
>> to look at what went wrong. Otherwise, it may be missed in the noise.
> 
> Well, I certainly can see one taking this view. However, at least in
> principle allocation (or mapping) may fail _because_ of NUMA issues.

Right. I read this as the user will likely want to add "numa=off" on the 
command line.

> At which point it would be better to boot with NUMA support turned off.

I have to disagree with "better" here. This may work for a user with a 
handful of hosts, but for a large-scale setup you will really want an 
early failure rather than a host booting with an expected feature 
disabled (the NUMA issue may be broken HW).

It is better to fail and then ask the user to specify "numa=off". At 
least then the person made a conscious decision to turn off the feature.

I am curious to hear the opinion from the others.

Cheers,
Jan Beulich Jan. 13, 2023, 9:16 a.m. UTC | #5
On 13.01.2023 00:15, Julien Grall wrote:
> Hi,
> 
> On 04/01/2023 10:23, Jan Beulich wrote:
>> On 23.12.2022 12:31, Julien Grall wrote:
>>> On 20/12/2022 15:30, Jan Beulich wrote:
>>>> On 16.12.2022 12:48, Julien Grall wrote:
>>>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>>>
>>>>> This avoids the assumption that boot pages are in the direct map.
>>>>>
>>>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>>
>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> However, ...
>>>>
>>>>> --- a/xen/arch/x86/srat.c
>>>>> +++ b/xen/arch/x86/srat.c
>>>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>>>    		return;
>>>>>    	}
>>>>>    	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>>>> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
>>>>> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>>>
>>>> ... with the increased use of vmap space the VA range used will need
>>>> growing. And that's perhaps better done ahead of time than late.
>>>
>>> I will have a look at increasing the vmap area.
>>>
>>>>
>>>>> +	BUG_ON(!acpi_slit);
>>>>
>>>> Similarly relevant for the earlier patch: It would be nice if boot
>>>> failure for optional things like NUMA data could be avoided.
>>>
>>> If you can't map (or allocate) the memory, then you are probably in a
>>> very bad situation, because neither should really fail at boot.
>>>
>>> So I think it is correct to crash early, because the admin will be able
>>> to look at what went wrong. Otherwise, it may be missed in the noise.
>>
>> Well, I certainly can see one taking this view. However, at least in
>> principle allocation (or mapping) may fail _because_ of NUMA issues.
> 
> Right. I read this as the user will likely want to add "numa=off" on the 
> command line.
> 
>> At which point it would be better to boot with NUMA support turned off.
> I have to disagree with "better" here. This may work for a user with a 
> handful of hosts, but for a large-scale setup you will really want an 
> early failure rather than a host booting with an expected feature 
> disabled (the NUMA issue may be broken HW).
> 
> It is better to fail and then ask the user to specify "numa=off". At 
> least then the person made a conscious decision to turn off the feature.

Yet how would the observing admin make the connection between the BUG_ON()
that triggered and the need to add "numa=off" to the command line,
without knowing Xen internals?

> I am curious to hear the opinion from the others.

So am I.

Jan
Julien Grall Jan. 13, 2023, 9:17 a.m. UTC | #6
On 13/01/2023 09:16, Jan Beulich wrote:
> On 13.01.2023 00:15, Julien Grall wrote:
>> Hi,
>>
>> On 04/01/2023 10:23, Jan Beulich wrote:
>>> On 23.12.2022 12:31, Julien Grall wrote:
>>>> On 20/12/2022 15:30, Jan Beulich wrote:
>>>>> On 16.12.2022 12:48, Julien Grall wrote:
>>>>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>>>>
>>>>>> This avoids the assumption that boot pages are in the direct map.
>>>>>>
>>>>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>>>
>>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>>>
>>>>> However, ...
>>>>>
>>>>>> --- a/xen/arch/x86/srat.c
>>>>>> +++ b/xen/arch/x86/srat.c
>>>>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>>>>     		return;
>>>>>>     	}
>>>>>>     	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>>>>> -	acpi_slit = mfn_to_virt(mfn_x(mfn));
>>>>>> +	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>>>>
>>>>> ... with the increased use of vmap space the VA range used will need
>>>>> growing. And that's perhaps better done ahead of time than late.
>>>>
>>>> I will have a look at increasing the vmap area.
>>>>
>>>>>
>>>>>> +	BUG_ON(!acpi_slit);
>>>>>
>>>>> Similarly relevant for the earlier patch: It would be nice if boot
>>>>> failure for optional things like NUMA data could be avoided.
>>>>
>>>> If you can't map (or allocate) the memory, then you are probably in a
>>>> very bad situation, because neither should really fail at boot.
>>>>
>>>> So I think it is correct to crash early, because the admin will be able
>>>> to look at what went wrong. Otherwise, it may be missed in the noise.
>>>
>>> Well, I certainly can see one taking this view. However, at least in
>>> principle allocation (or mapping) may fail _because_ of NUMA issues.
>>
>> Right. I read this as the user will likely want to add "numa=off" on the
>> command line.
>>
>>> At which point it would be better to boot with NUMA support turned off.
>> I have to disagree with "better" here. This may work for a user with a
>> handful of hosts, but for a large-scale setup you will really want an
>> early failure rather than a host booting with an expected feature
>> disabled (the NUMA issue may be broken HW).
>>
>> It is better to fail and then ask the user to specify "numa=off". At
>> least then the person made a conscious decision to turn off the feature.
> 
> Yet how would the observing admin make the connection between the BUG_ON()
> that triggered and the need to add "numa=off" to the command line,
> without knowing Xen internals?

I am happy to switch to a panic() that suggests turning off NUMA.
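
Something along these lines (just a sketch; the exact message is TBD):

	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
	if ( !acpi_slit )
		/* Suggest a workaround instead of a bare BUG_ON(). */
		panic("Unable to map the ACPI SLIT - try booting with numa=off\n");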

> 
>> I am curious to hear the opinion from the others.
> 
> So am I.
> 
> Jan
Julien Grall Jan. 30, 2023, 7:27 p.m. UTC | #7
Hi Jan,

On 23/12/2022 11:31, Julien Grall wrote:
> On 20/12/2022 15:30, Jan Beulich wrote:
>> On 16.12.2022 12:48, Julien Grall wrote:
>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>
>>> This avoids the assumption that boot pages are in the direct map.
>>>
>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> However, ...
>>
>>> --- a/xen/arch/x86/srat.c
>>> +++ b/xen/arch/x86/srat.c
>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct 
>>> acpi_table_slit *slit)
>>>           return;
>>>       }
>>>       mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>> -    acpi_slit = mfn_to_virt(mfn_x(mfn));
>>> +    acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>
>> ... with the increased use of vmap space the VA range used will need
>> growing. And that's perhaps better done ahead of time than late.
> 
> I will have a look at increasing the vmap area.

I have started to look at this. The current size of VMAP is 64GB.

At least in the setup I have, I didn't see any particular issue with the 
existing size of the vmap. Looking through the history, it was last 
bumped by one of your commits (see b0581b9214d2), but it is not clear 
what the setup was.

Given I don't have a setup where the VMAP is exhausted, it is not clear 
to me what an acceptable bump would be.

AFAICT, in PML4 slot 261, we still have 62GB reserved for future use. So 
I was thinking of adding an extra 32GB, which would bring the VMAP to 
96GB. This is just a number that doesn't use all the reserved space but 
is still a power of two.

Are you fine with that?

Cheers,
Jan Beulich Jan. 31, 2023, 9:11 a.m. UTC | #8
On 30.01.2023 20:27, Julien Grall wrote:
> Hi Jan,
> 
> On 23/12/2022 11:31, Julien Grall wrote:
>> On 20/12/2022 15:30, Jan Beulich wrote:
>>> On 16.12.2022 12:48, Julien Grall wrote:
>>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>>
>>>> This avoids the assumption that boot pages are in the direct map.
>>>>
>>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> However, ...
>>>
>>>> --- a/xen/arch/x86/srat.c
>>>> +++ b/xen/arch/x86/srat.c
>>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct 
>>>> acpi_table_slit *slit)
>>>>           return;
>>>>       }
>>>>       mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>>> -    acpi_slit = mfn_to_virt(mfn_x(mfn));
>>>> +    acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>>
>>> ... with the increased use of vmap space the VA range used will need
>>> growing. And that's perhaps better done ahead of time than late.
>>
>> I will have a look at increasing the vmap area.
> 
> I have started to look at this. The current size of VMAP is 64GB.
> 
> At least in the setup I have, I didn't see any particular issue with the 
> existing size of the vmap. Looking through the history, it was last 
> bumped by one of your commits (see b0581b9214d2), but it is not clear 
> what the setup was.
> 
> Given I don't have a setup where the VMAP is exhausted, it is not clear 
> to me what an acceptable bump would be.
> 
> AFAICT, in PML4 slot 261, we still have 62GB reserved for future use. So 
> I was thinking of adding an extra 32GB, which would bring the VMAP to 
> 96GB. This is just a number that doesn't use all the reserved space but 
> is still a power of two.
> 
> Are you fine with that?

Hmm. Leaving aside that 96GB isn't a power of two, my comment saying
"ahead of time" was under the (wrong, as it now looks) impression that
the goal of your series was to truly do away with the directmap. I was
therefore expecting a much larger bump in size, perhaps moving the
vmap area into space presently occupied by the directmap. IOW for the
time being, with no _significant_ increase of space consumption, we
may well be fine with the 64GB range.

Jan
Julien Grall Jan. 31, 2023, 9:37 p.m. UTC | #9
Hi Jan,

On 31/01/2023 09:11, Jan Beulich wrote:
> On 30.01.2023 20:27, Julien Grall wrote:
>> Hi Jan,
>>
>> On 23/12/2022 11:31, Julien Grall wrote:
>>> On 20/12/2022 15:30, Jan Beulich wrote:
>>>> On 16.12.2022 12:48, Julien Grall wrote:
>>>>> From: Hongyan Xia <hongyxia@amazon.com>
>>>>>
>>>>> This avoids the assumption that boot pages are in the direct map.
>>>>>
>>>>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>>
>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> However, ...
>>>>
>>>>> --- a/xen/arch/x86/srat.c
>>>>> +++ b/xen/arch/x86/srat.c
>>>>> @@ -139,7 +139,8 @@ void __init acpi_numa_slit_init(struct
>>>>> acpi_table_slit *slit)
>>>>>            return;
>>>>>        }
>>>>>        mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
>>>>> -    acpi_slit = mfn_to_virt(mfn_x(mfn));
>>>>> +    acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
>>>>
>>>> ... with the increased use of vmap space the VA range used will need
>>>> growing. And that's perhaps better done ahead of time than late.
>>>
>>> I will have a look at increasing the vmap area.
>>
>> I have started to look at this. The current size of VMAP is 64GB.
>>
>> At least in the setup I have, I didn't see any particular issue with the
>> existing size of the vmap. Looking through the history, it was last
>> bumped by one of your commits (see b0581b9214d2), but it is not clear
>> what the setup was.
>>
>> Given I don't have a setup where the VMAP is exhausted, it is not clear
>> to me what an acceptable bump would be.
>>
>> AFAICT, in PML4 slot 261, we still have 62GB reserved for future use. So
>> I was thinking of adding an extra 32GB, which would bring the VMAP to
>> 96GB. This is just a number that doesn't use all the reserved space but
>> is still a power of two.
>>
>> Are you fine with that?
> 
> Hmm. Leaving aside that 96GB isn't a power of two, my comment saying
> "ahead of time" was under the (wrong, as it now looks) impression that
> the goal of your series was to truly do away with the directmap.

Yes, the directmap is still present with this series. There is more 
work needed to completely remove the directmap (see the cover letter for 
some details) and I would prefer to keep that separate from this series.


> I was
> therefore expecting a much larger bump in size, perhaps moving the
> vmap area into space presently occupied by the directmap. IOW for the
> time being, with no _significant_ increase of space consumption, we
> may well be fine with the 64GB range.

Ok. I will keep it in mind when working on completely removing the 
directmap.

Cheers,
diff mbox series

Patch

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 56749ddca526..1fd178e89d28 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -139,7 +139,8 @@  void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 		return;
 	}
 	mfn = alloc_boot_pages(PFN_UP(slit->header.length), 1);
-	acpi_slit = mfn_to_virt(mfn_x(mfn));
+	acpi_slit = vmap_contig_pages(mfn, PFN_UP(slit->header.length));
+	BUG_ON(!acpi_slit);
 	memcpy(acpi_slit, slit, slit->header.length);
 }
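
For reference, vmap_contig_pages() itself is introduced by an earlier patch
in this series and is not part of this diff. A minimal sketch of what such a
helper could look like on top of Xen's __vmap() (illustrative only; the
actual helper in the series may differ):

void *vmap_contig_pages(mfn_t mfn, unsigned int nr_pages)
{
	/*
	 * A single entry with granularity nr_pages describes one physically
	 * contiguous run of nr_pages frames starting at mfn; __vmap() then
	 * allocates VA space from the vmap area and maps the range there.
	 */
	return __vmap(&mfn, nr_pages, 1, 1, PAGE_HYPERVISOR, VMAP_DEFAULT);
}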