diff mbox series

mm/page_alloc: consider pfn holes after pfn_valid() in __pageblock_pfn_to_page()

Message ID 62e231a8f2e50c04dcadc7a0cfaa6dea5ce1ec05.1681296022.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New
Headers show
Series mm/page_alloc: consider pfn holes after pfn_valid() in __pageblock_pfn_to_page() | expand

Commit Message

Baolin Wang April 12, 2023, 10:45 a.m. UTC
Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
which checks whether the given zone contains holes, and uses pfn_valid()
to check if the end pfn is valid. However pfn_valid() can not make sure
the end pfn is not a hole if the size of a pageblock is larger than the
size of a sub-mem_section, since the struct page getting by pfn_to_page()
may represent a hole or an unusable page frame, which may cause incorrect
zone contiguous is set.

Though another user of pageblock_pfn_to_page() in compaction seems work
well now, it is better to avoid scanning or touching these offline pfns.
So like commit 2d070eab2e82 ("mm: consider zone which is not fully
populated to have holes"), we should also use pfn_to_online_page() for
the end pfn to make sure it is a valid pfn with usable page frame.
Meanwhile the pfn_valid() for end pfn can be dropped now.

Moreover we've already used pfn_to_online_page() for start pfn to make
sure it is online and valid, so the pfn_valid() for the start pfn is
unnecessary, drop it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/page_alloc.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Michal Hocko April 12, 2023, 11:15 a.m. UTC | #1
On Wed 12-04-23 18:45:31, Baolin Wang wrote:
> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
> which checks whether the given zone contains holes, and uses pfn_valid()
> to check if the end pfn is valid. However pfn_valid() can not make sure
> the end pfn is not a hole if the size of a pageblock is larger than the
> size of a sub-mem_section, since the struct page getting by pfn_to_page()
> may represent a hole or an unusable page frame, which may cause incorrect
> zone contiguous is set.
> 
> Though another user of pageblock_pfn_to_page() in compaction seems work
> well now, it is better to avoid scanning or touching these offline pfns.
> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
> populated to have holes"), we should also use pfn_to_online_page() for
> the end pfn to make sure it is a valid pfn with usable page frame.
> Meanwhile the pfn_valid() for end pfn can be dropped now.
> 
> Moreover we've already used pfn_to_online_page() for start pfn to make
> sure it is online and valid, so the pfn_valid() for the start pfn is
> unnecessary, drop it.

Is this a theoretical problem or something you have encountered on a
real machine? Could you provide more details please?
David Hildenbrand April 12, 2023, 11:25 a.m. UTC | #2
On 12.04.23 12:45, Baolin Wang wrote:
> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
> which checks whether the given zone contains holes, and uses pfn_valid()
> to check if the end pfn is valid. However pfn_valid() can not make sure
> the end pfn is not a hole if the size of a pageblock is larger than the
> size of a sub-mem_section, since the struct page getting by pfn_to_page()
> may represent a hole or an unusable page frame, which may cause incorrect
> zone contiguous is set.
> 
> Though another user of pageblock_pfn_to_page() in compaction seems work
> well now, it is better to avoid scanning or touching these offline pfns.
> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
> populated to have holes"), we should also use pfn_to_online_page() for
> the end pfn to make sure it is a valid pfn with usable page frame.
> Meanwhile the pfn_valid() for end pfn can be dropped now.
> 
> Moreover we've already used pfn_to_online_page() for start pfn to make
> sure it is online and valid, so the pfn_valid() for the start pfn is
> unnecessary, drop it.

pageblocks are supposed to fall into a single memory section, so in most 
cases, if the start is online, so is the end.

The exception to this rule is when we have a mixture of ZONE_DEVICE and 
ZONE_* within the same section.

Then, indeed the end might not be online.

BUT, if the end is valid (-> ZONE_DEVICE), then the zone_id will differ. 
[let's ignore any races for now, up to this point they are mostly of 
theoretical nature]

So I don't think this change actually fixes something.


Getting rid of the pfn_valid(start_pfn)  makes sense. Replacing the 
pfn_valid(end_pfn) by a pfn_to_online_page(end_pfn) could make that 
function less efficient.

> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/page_alloc.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0eb280ec7e4..8076f519c572 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1512,9 +1512,6 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>   	/* end_pfn is one past the range we are checking */
>   	end_pfn--;
>   
> -	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
> -		return NULL;
> -
>   	start_page = pfn_to_online_page(start_pfn);
>   	if (!start_page)
>   		return NULL;
> @@ -1522,7 +1519,9 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>   	if (page_zone(start_page) != zone)
>   		return NULL;
>   
> -	end_page = pfn_to_page(end_pfn);
> +	end_page = pfn_to_online_page(end_pfn);
> +	if (!end_page)
> +		return NULL;
>   
>   	/* This gives a shorter code than deriving page_zone(end_page) */
>   	if (page_zone_id(start_page) != page_zone_id(end_page))
Baolin Wang April 12, 2023, 12:16 p.m. UTC | #3
On 4/12/2023 7:25 PM, David Hildenbrand wrote:
> On 12.04.23 12:45, Baolin Wang wrote:
>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>> which checks whether the given zone contains holes, and uses pfn_valid()
>> to check if the end pfn is valid. However pfn_valid() can not make sure
>> the end pfn is not a hole if the size of a pageblock is larger than the
>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>> may represent a hole or an unusable page frame, which may cause incorrect
>> zone contiguous is set.
>>
>> Though another user of pageblock_pfn_to_page() in compaction seems work
>> well now, it is better to avoid scanning or touching these offline pfns.
>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>> populated to have holes"), we should also use pfn_to_online_page() for
>> the end pfn to make sure it is a valid pfn with usable page frame.
>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>
>> Moreover we've already used pfn_to_online_page() for start pfn to make
>> sure it is online and valid, so the pfn_valid() for the start pfn is
>> unnecessary, drop it.
> 
> pageblocks are supposed to fall into a single memory section, so in mos > cases, if the start is online, so is the end.

Yes, the granularity of memory hotplug is a mem_section.

However, suppose the pageblock order is MAX_ORDER-1, and the size of a 
sub-section is 2M, that means a pageblock will fall into 2 sub 
mem-section, and if there is a hole in the zone, that means the 2nd sub 
mem-section can be invalid without setting subsection_map bitmap.

So the start is online can make sure the end pfn of a pageblock is 
online, but a valid start pfn can not make sure the end pfn is valid in 
the bitmap of ms->usage->subsection_map.

> The exception to this rule is when we have a mixture of ZONE_DEVICE and 
> ZONE_* within the same section.
> 
> Then, indeed the end might not be online.
> 
> BUT, if the end is valid (-> ZONE_DEVICE), then the zone_id will differ. 
> [let's ignore any races for now, up to this point they are mostly of 
> theoretical nature]
> 
> So I don't think this change actually fixes something.
> 
> 
> Getting rid of the pfn_valid(start_pfn)  makes sense. Replacing the 

Yes, my motivation is try to optimize the __pageblock_pfn_to_page() 
which is hot when doing compaction, and I saw these pfn_valid() can be 
dropped.

> pfn_valid(end_pfn) by a pfn_to_online_page(end_pfn) could make that 
> function less efficient.
> 
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/page_alloc.c | 7 +++----
>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d0eb280ec7e4..8076f519c572 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1512,9 +1512,6 @@ struct page *__pageblock_pfn_to_page(unsigned 
>> long start_pfn,
>>       /* end_pfn is one past the range we are checking */
>>       end_pfn--;
>> -    if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
>> -        return NULL;
>> -
>>       start_page = pfn_to_online_page(start_pfn);
>>       if (!start_page)
>>           return NULL;
>> @@ -1522,7 +1519,9 @@ struct page *__pageblock_pfn_to_page(unsigned 
>> long start_pfn,
>>       if (page_zone(start_page) != zone)
>>           return NULL;
>> -    end_page = pfn_to_page(end_pfn);
>> +    end_page = pfn_to_online_page(end_pfn);
>> +    if (!end_page)
>> +        return NULL;
>>       /* This gives a shorter code than deriving page_zone(end_page) */
>>       if (page_zone_id(start_page) != page_zone_id(end_page))
>
Baolin Wang April 12, 2023, 12:24 p.m. UTC | #4
On 4/12/2023 7:15 PM, Michal Hocko wrote:
> On Wed 12-04-23 18:45:31, Baolin Wang wrote:
>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>> which checks whether the given zone contains holes, and uses pfn_valid()
>> to check if the end pfn is valid. However pfn_valid() can not make sure
>> the end pfn is not a hole if the size of a pageblock is larger than the
>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>> may represent a hole or an unusable page frame, which may cause incorrect
>> zone contiguous is set.
>>
>> Though another user of pageblock_pfn_to_page() in compaction seems work
>> well now, it is better to avoid scanning or touching these offline pfns.
>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>> populated to have holes"), we should also use pfn_to_online_page() for
>> the end pfn to make sure it is a valid pfn with usable page frame.
>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>
>> Moreover we've already used pfn_to_online_page() for start pfn to make
>> sure it is online and valid, so the pfn_valid() for the start pfn is
>> unnecessary, drop it.
> 
> Is this a theoretical problem or something you have encountered on a
> real machine? Could you provide more details please?

As I replied to David, this is just from code inspection when trying to 
remove the unnecessary pfn_valid() in __pageblock_pfn_to_page().
Vlastimil Babka April 14, 2023, 3:07 p.m. UTC | #5
On 4/12/23 14:16, Baolin Wang wrote:
> 
> 
> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>> On 12.04.23 12:45, Baolin Wang wrote:
>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>> may represent a hole or an unusable page frame, which may cause incorrect
>>> zone contiguous is set.
>>>
>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>> well now, it is better to avoid scanning or touching these offline pfns.
>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>> populated to have holes"), we should also use pfn_to_online_page() for
>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>
>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>> unnecessary, drop it.
>> 
>> pageblocks are supposed to fall into a single memory section, so in mos > cases, if the start is online, so is the end.
> 
> Yes, the granularity of memory hotplug is a mem_section.
> 
> However, suppose the pageblock order is MAX_ORDER-1, and the size of a 
> sub-section is 2M, that means a pageblock will fall into 2 sub 
> mem-section, and if there is a hole in the zone, that means the 2nd sub 
> mem-section can be invalid without setting subsection_map bitmap.

Can that really happen? I think the buddy merging in __free_one_page() would
trip on that?
Baolin Wang April 19, 2023, 6:47 a.m. UTC | #6
On 4/14/2023 11:07 PM, Vlastimil Babka wrote:
> On 4/12/23 14:16, Baolin Wang wrote:
>>
>>
>> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>>> On 12.04.23 12:45, Baolin Wang wrote:
>>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>>> may represent a hole or an unusable page frame, which may cause incorrect
>>>> zone contiguous is set.
>>>>
>>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>>> well now, it is better to avoid scanning or touching these offline pfns.
>>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>>> populated to have holes"), we should also use pfn_to_online_page() for
>>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>>
>>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>>> unnecessary, drop it.
>>>
>>> pageblocks are supposed to fall into a single memory section, so in mos > cases, if the start is online, so is the end.
>>
>> Yes, the granularity of memory hotplug is a mem_section.
>>
>> However, suppose the pageblock order is MAX_ORDER-1, and the size of a
>> sub-section is 2M, that means a pageblock will fall into 2 sub
>> mem-section, and if there is a hole in the zone, that means the 2nd sub
>> mem-section can be invalid without setting subsection_map bitmap.
> 
> Can that really happen? I think the buddy merging in __free_one_page() would
> trip on that?

I do not think so IIUC. The hole pfns will not free to buddy system, and 
the buddy system did not change the ms->usage->subsection_map of the 
hole pfns, which indicates the hole pfns are invalid.
Baolin Wang April 20, 2023, 9:11 a.m. UTC | #7
On 4/20/2023 3:22 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>>> On 12.04.23 12:45, Baolin Wang wrote:
>>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>>> may represent a hole or an unusable page frame, which may cause incorrect
>>>> zone contiguous is set.
>>>>
>>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>>> well now, it is better to avoid scanning or touching these offline pfns.
>>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>>> populated to have holes"), we should also use pfn_to_online_page() for
>>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>>
>>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>>> unnecessary, drop it.
>>> pageblocks are supposed to fall into a single memory section, so in
>>> mos > cases, if the start is online, so is the end.
>>
>> Yes, the granularity of memory hotplug is a mem_section.
>>
>> However, suppose the pageblock order is MAX_ORDER-1, and the size of a
>> sub-section is 2M, that means a pageblock will fall into 2 sub
>> mem-section, and if there is a hole in the zone, that means the 2nd
>> sub mem-section can be invalid without setting subsection_map bitmap.
>>
>> So the start is online can make sure the end pfn of a pageblock is
>> online, but a valid start pfn can not make sure the end pfn is valid
>> in the bitmap of ms->usage->subsection_map.
> 
> arch_add_memory
>    add_pages
>      __add_pages
>        sparse_add_section /* set subsection_map */
> 
> arch_add_memory() is only called by add_memory_resource() and
> pagemap_range() (called add_pages() too).  In add_memory_resource(),
> check_hotplug_memory_range() will enforce a strict hotplug range
> alignment requirement (128 MB on x86_64).  pagemap_range() are used for
> ZONE_DEVICE only.  That is, for normal memory, hotplug granularity is
> much larger than 2MB.
> 
> IIUC, the situation you mentioned above is impossible.  Or do I miss
> something?

Thanks for your input. Your example is correct, but this is not the case 
I want to describe. My case is not about the memory hotplug, instead 
about the early memory holes when initialzing the memory. Let me try to 
describe explicity:

First suppose the pageblock order is MAX_ORDER-1, and see below memory 
layout as an example:

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
[    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]

Focus on the last memory range, and there is a hole for the range [mem 
0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock 
will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the 
pageblock must be 4M aligned. And in this page block, these pfns will 
fall into 2 sub-section (the sub-section size is 2M aligned).

So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 - 
0x1fa7dfffff ) in this pageblock is valid by 
free_area_init()--->subsection_map_init(), but the 2nd sub-section 
(indicates pfn range: 0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock is 
not valid.

The problem is, if we just check the pageblock start of the hole pfn 
(such as 0x1fa7dfffff) to make sure the hole pfn (0x1fa7dfffff) is also 
valid, which is NOT correct. So that is what I mean "the start is online 
can make sure the end pfn of a pageblock is online, but a valid start 
pfn can not make sure the end pfn is valid in the bitmap of 
ms->usage->subsection_map."

Hope I make it clear. Does that make sense to you? Thanks.

>>> THE exception to this rule is when we have a mixture of ZONE_DEVICE
>>> and ZONE_* within the same section.
>>> Then, indeed the end might not be online.
>>> BUT, if the end is valid (-> ZONE_DEVICE), then the zone_id will
>>> differ. [let's ignore any races for now, up to this point they are
>>> mostly of theoretical nature]
>>> So I don't think this change actually fixes something.
>>>
>>> Getting rid of the pfn_valid(start_pfn). makes sense. Replacing the
>>
>> Yes, my motivation is try to optimize the __pageblock_pfn_to_page()
>> which is hot when doing compaction, and I saw these pfn_valid() can be
>> dropped.
>>
>>> pfn_valid(end_pfn) by a pfn_to_online_page(end_pfn) could make that
>>> function less efficient.
>>>
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> . mm/page_alloc.c | 7 +++----
>>>> . 1 file changed, 3 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index d0eb280ec7e4..8076f519c572 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1512,9 +1512,6 @@ struct page *__pageblock_pfn_to_page(unsigned
>>>> long start_pfn,
>>>> . . . /* end_pfn is one past the range we are checking */
>>>> . . . end_pfn--;
>>>> -. . if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
>>>> -. . . . return NULL;
>>>> -
>>>> . . . start_page = pfn_to_online_page(start_pfn);
>>>> . . . if (!start_page)
>>>> . . . . . return NULL;
>>>> @@ -1522,7 +1519,9 @@ struct page *__pageblock_pfn_to_page(unsigned
>>>> long start_pfn,
>>>> . . . if (page_zone(start_page) != zone)
>>>> . . . . . return NULL;
>>>> -. . end_page = pfn_to_page(end_pfn);
>>>> +. . end_page = pfn_to_online_page(end_pfn);
>>>> +. . if (!end_page)
>>>> +. . . . return NULL;
>>>> . . . /* This gives a shorter code than deriving page_zone(end_page) */
>>>> . . . if (page_zone_id(start_page) != page_zone_id(end_page))
>>>
Huang, Ying April 21, 2023, 4:21 a.m. UTC | #8
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 4/20/2023 3:22 PM, Huang, Ying wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>> 
>>> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>>>> On 12.04.23 12:45, Baolin Wang wrote:
>>>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>>>> may represent a hole or an unusable page frame, which may cause incorrect
>>>>> zone contiguous is set.
>>>>>
>>>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>>>> well now, it is better to avoid scanning or touching these offline pfns.
>>>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>>>> populated to have holes"), we should also use pfn_to_online_page() for
>>>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>>>
>>>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>>>> unnecessary, drop it.
>>>> pageblocks are supposed to fall into a single memory section, so in
>>>> mos > cases, if the start is online, so is the end.
>>>
>>> Yes, the granularity of memory hotplug is a mem_section.
>>>
>>> However, suppose the pageblock order is MAX_ORDER-1, and the size of a
>>> sub-section is 2M, that means a pageblock will fall into 2 sub
>>> mem-section, and if there is a hole in the zone, that means the 2nd
>>> sub mem-section can be invalid without setting subsection_map bitmap.
>>>
>>> So the start is online can make sure the end pfn of a pageblock is
>>> online, but a valid start pfn can not make sure the end pfn is valid
>>> in the bitmap of ms->usage->subsection_map.
>> arch_add_memory
>>    add_pages
>>      __add_pages
>>        sparse_add_section /* set subsection_map */
>> arch_add_memory() is only called by add_memory_resource() and
>> pagemap_range() (called add_pages() too).  In add_memory_resource(),
>> check_hotplug_memory_range() will enforce a strict hotplug range
>> alignment requirement (128 MB on x86_64).  pagemap_range() are used for
>> ZONE_DEVICE only.  That is, for normal memory, hotplug granularity is
>> much larger than 2MB.
>> IIUC, the situation you mentioned above is impossible.  Or do I miss
>> something?
>
> Thanks for your input. Your example is correct, but this is not the
> case I want to describe. My case is not about the memory hotplug,
> instead about the early memory holes when initialzing the memory. Let
> me try to describe explicity:
>
> First suppose the pageblock order is MAX_ORDER-1, and see below memory
> layout as an example:
>
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
> [    0.000000]   DMA32    empty
> [    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
> [    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
> [    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
> [    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
> [    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
> [    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
> [    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
> [    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
> [    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]
>
> Focus on the last memory range, and there is a hole for the range [mem
> 0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock 
> will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the
> pageblock must be 4M aligned. And in this page block, these pfns will 
> fall into 2 sub-section (the sub-section size is 2M aligned).
>
> So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 -
> 0x1fa7dfffff ) in this pageblock is valid by 
> free_area_init()--->subsection_map_init(), but the 2nd sub-section
> (indicates pfn range: 0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock
> is not valid.
>
> The problem is, if we just check the pageblock start of the hole pfn
> (such as 0x1fa7dfffff) to make sure the hole pfn (0x1fa7dfffff) is
> also valid, which is NOT correct. So that is what I mean "the start is
> online can make sure the end pfn of a pageblock is online, but a valid
> start pfn can not make sure the end pfn is valid in the bitmap of 
> ms->usage->subsection_map."
>
> Hope I make it clear. Does that make sense to you? Thanks.

Thanks for your detailed description.  You are right, it's possible that
the second subsection of a pageblock is a hole.

It's good to remove unnecessary pfn_valid(start_pfn) check in your
original patch.  But it appears unnecessary to replace
pfn_valid(end_pfn) with pfn_to_online_page(end_pfn).  Yes, it's possible
that there's a hole in a page block.  But it appears that this will not
break anything.  Per my understanding, even if we had fixed this one,
there may be other smaller memory holes in a pageblock represented as
reserved pages.

Best Regards,
Huang, Ying

[snip]
Baolin Wang April 21, 2023, 7:13 a.m. UTC | #9
On 4/21/2023 12:21 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 4/20/2023 3:22 PM, Huang, Ying wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>
>>>> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>>>>> On 12.04.23 12:45, Baolin Wang wrote:
>>>>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>>>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>>>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>>>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>>>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>>>>> may represent a hole or an unusable page frame, which may cause incorrect
>>>>>> zone contiguous is set.
>>>>>>
>>>>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>>>>> well now, it is better to avoid scanning or touching these offline pfns.
>>>>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>>>>> populated to have holes"), we should also use pfn_to_online_page() for
>>>>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>>>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>>>>
>>>>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>>>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>>>>> unnecessary, drop it.
>>>>> pageblocks are supposed to fall into a single memory section, so in
>>>>> mos > cases, if the start is online, so is the end.
>>>>
>>>> Yes, the granularity of memory hotplug is a mem_section.
>>>>
>>>> However, suppose the pageblock order is MAX_ORDER-1, and the size of a
>>>> sub-section is 2M, that means a pageblock will fall into 2 sub
>>>> mem-section, and if there is a hole in the zone, that means the 2nd
>>>> sub mem-section can be invalid without setting subsection_map bitmap.
>>>>
>>>> So the start is online can make sure the end pfn of a pageblock is
>>>> online, but a valid start pfn can not make sure the end pfn is valid
>>>> in the bitmap of ms->usage->subsection_map.
>>> arch_add_memory
>>>     add_pages
>>>       __add_pages
>>>         sparse_add_section /* set subsection_map */
>>> arch_add_memory() is only called by add_memory_resource() and
>>> pagemap_range() (called add_pages() too).  In add_memory_resource(),
>>> check_hotplug_memory_range() will enforce a strict hotplug range
>>> alignment requirement (128 MB on x86_64).  pagemap_range() are used for
>>> ZONE_DEVICE only.  That is, for normal memory, hotplug granularity is
>>> much larger than 2MB.
>>> IIUC, the situation you mentioned above is impossible.  Or do I miss
>>> something?
>>
>> Thanks for your input. Your example is correct, but this is not the
>> case I want to describe. My case is not about the memory hotplug,
>> instead about the early memory holes when initialzing the memory. Let
>> me try to describe explicity:
>>
>> First suppose the pageblock order is MAX_ORDER-1, and see below memory
>> layout as an example:
>>
>> [    0.000000] Zone ranges:
>> [    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
>> [    0.000000]   DMA32    empty
>> [    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
>> [    0.000000] Movable zone start for each node
>> [    0.000000] Early memory node ranges
>> [    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
>> [    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
>> [    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
>> [    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
>> [    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
>> [    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
>> [    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
>> [    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
>> [    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]
>>
>> Focus on the last memory range, and there is a hole for the range [mem
>> 0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock
>> will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the
>> pageblock must be 4M aligned. And in this page block, these pfns will
>> fall into 2 sub-section (the sub-section size is 2M aligned).
>>
>> So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 -
>> 0x1fa7dfffff ) in this pageblock is valid by
>> free_area_init()--->subsection_map_init(), but the 2nd sub-section
>> (indicates pfn range: 0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock
>> is not valid.
>>
>> The problem is, if we just check the pageblock start of the hole pfn
>> (such as 0x1fa7dfffff) to make sure the hole pfn (0x1fa7dfffff) is
>> also valid, which is NOT correct. So that is what I mean "the start is
>> online can make sure the end pfn of a pageblock is online, but a valid
>> start pfn can not make sure the end pfn is valid in the bitmap of
>> ms->usage->subsection_map."
>>
>> Hope I make it clear. Does that make sense to you? Thanks.
> 
> Thanks for your detailed description.  You are right, it's possible that
> the second subsection of a pageblock is a hole.
> 
> It's good to remove unnecessary pfn_valid(start_pfn) check in your
> original patch.  But it appears unnecessary to replace

OK. I will split this into a separate patch.

> pfn_valid(end_pfn) with pfn_to_online_page(end_pfn).  Yes, it's possible
> that there's a hole in a page block.  But it appears that this will not
> break anything.  Per my understanding, even if we had fixed this one,

Yes, it will not break anything now, the worst case is the compaction 
will waste more time to scan unnecessary hole pfns, though I did not 
have a performance report to show this issue.

Another concern is that the zone->contiguous is fragile IMO, and not 
sure if there are pfn walkers will meet the holes though the 
zone->contiguous = 1 in future.

So at least we can add some comments for __pageblock_pfn_to_page() to 
describe this issue? what do you think?

> there may be other smaller memory holes in a pageblock represented as
> reserved pages
Huang, Ying April 21, 2023, 7:44 a.m. UTC | #10
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 4/21/2023 12:21 PM, Huang, Ying wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>> 
>>> On 4/20/2023 3:22 PM, Huang, Ying wrote:
>>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>>
>>>>> On 4/12/2023 7:25 PM, David Hildenbrand wrote:
>>>>>> On 12.04.23 12:45, Baolin Wang wrote:
>>>>>>> Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(),
>>>>>>> which checks whether the given zone contains holes, and uses pfn_valid()
>>>>>>> to check if the end pfn is valid. However pfn_valid() can not make sure
>>>>>>> the end pfn is not a hole if the size of a pageblock is larger than the
>>>>>>> size of a sub-mem_section, since the struct page getting by pfn_to_page()
>>>>>>> may represent a hole or an unusable page frame, which may cause incorrect
>>>>>>> zone contiguous is set.
>>>>>>>
>>>>>>> Though another user of pageblock_pfn_to_page() in compaction seems work
>>>>>>> well now, it is better to avoid scanning or touching these offline pfns.
>>>>>>> So like commit 2d070eab2e82 ("mm: consider zone which is not fully
>>>>>>> populated to have holes"), we should also use pfn_to_online_page() for
>>>>>>> the end pfn to make sure it is a valid pfn with usable page frame.
>>>>>>> Meanwhile the pfn_valid() for end pfn can be dropped now.
>>>>>>>
>>>>>>> Moreover we've already used pfn_to_online_page() for start pfn to make
>>>>>>> sure it is online and valid, so the pfn_valid() for the start pfn is
>>>>>>> unnecessary, drop it.
>>>>>> pageblocks are supposed to fall into a single memory section, so in
>>>>>> mos > cases, if the start is online, so is the end.
>>>>>
>>>>> Yes, the granularity of memory hotplug is a mem_section.
>>>>>
>>>>> However, suppose the pageblock order is MAX_ORDER-1, and the size of a
>>>>> sub-section is 2M, that means a pageblock will fall into 2 sub
>>>>> mem-section, and if there is a hole in the zone, that means the 2nd
>>>>> sub mem-section can be invalid without setting subsection_map bitmap.
>>>>>
>>>>> So the start is online can make sure the end pfn of a pageblock is
>>>>> online, but a valid start pfn can not make sure the end pfn is valid
>>>>> in the bitmap of ms->usage->subsection_map.
>>>> arch_add_memory
>>>>     add_pages
>>>>       __add_pages
>>>>         sparse_add_section /* set subsection_map */
>>>> arch_add_memory() is only called by add_memory_resource() and
>>>> pagemap_range() (called add_pages() too).  In add_memory_resource(),
>>>> check_hotplug_memory_range() will enforce a strict hotplug range
>>>> alignment requirement (128 MB on x86_64).  pagemap_range() are used for
>>>> ZONE_DEVICE only.  That is, for normal memory, hotplug granularity is
>>>> much larger than 2MB.
>>>> IIUC, the situation you mentioned above is impossible.  Or do I miss
>>>> something?
>>>
>>> Thanks for your input. Your example is correct, but this is not the
>>> case I want to describe. My case is not about the memory hotplug,
>>> instead about the early memory holes when initialzing the memory. Let
>>> me try to describe explicity:
>>>
>>> First suppose the pageblock order is MAX_ORDER-1, and see below memory
>>> layout as an example:
>>>
>>> [    0.000000] Zone ranges:
>>> [    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
>>> [    0.000000]   DMA32    empty
>>> [    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
>>> [    0.000000] Movable zone start for each node
>>> [    0.000000] Early memory node ranges
>>> [    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
>>> [    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
>>> [    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]
>>>
>>> Focus on the last memory range, and there is a hole for the range [mem
>>> 0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock
>>> will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the
>>> pageblock must be 4M aligned. And in this page block, these pfns will
>>> fall into 2 sub-section (the sub-section size is 2M aligned).
>>>
>>> So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 -
>>> 0x1fa7dfffff ) in this pageblock is valid by
>>> free_area_init()--->subsection_map_init(), but the 2nd sub-section
>>> (indicates pfn range: 0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock
>>> is not valid.
>>>
>>> The problem is, if we just check the pageblock start of the hole pfn
>>> (such as 0x1fa7dfffff) to make sure the hole pfn (0x1fa7dfffff) is
>>> also valid, which is NOT correct. So that is what I mean "the start is
>>> online can make sure the end pfn of a pageblock is online, but a valid
>>> start pfn can not make sure the end pfn is valid in the bitmap of
>>> ms->usage->subsection_map."
>>>
>>> Hope I make it clear. Does that make sense to you? Thanks.
>> Thanks for your detailed description.  You are right, it's possible
>> that
>> the second subsection of a pageblock is a hole.
>> It's good to remove unnecessary pfn_valid(start_pfn) check in your
>> original patch.  But it appears unnecessary to replace
>
> OK. I will split this into a separate patch.

Thanks!

>> pfn_valid(end_pfn) with pfn_to_online_page(end_pfn).  Yes, it's possible
>> that there's a hole in a page block.  But it appears that this will not
>> break anything.  Per my understanding, even if we had fixed this one,
>
> Yes, it will not break anything now, the worst case is the compaction
> will waste more time to scan unnecessary hole pfns, though I did not 
> have a performance report to show this issue.

I think the scanning should be fast.

> Another concern is that the zone->contiguous is fragile IMO, and not
> sure if there are pfn walkers will meet the holes though the 
> zone->contiguous = 1 in future.

If there's any issue in the future, we can fix it at that time.

> So at least we can add some comments for __pageblock_pfn_to_page() to
> describe this issue? what do you think?

I'm OK to add some comments there.

>> there may be other smaller memory holes in a pageblock represented as
>> reserved pages

Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0eb280ec7e4..8076f519c572 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,9 +1512,6 @@  struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 	/* end_pfn is one past the range we are checking */
 	end_pfn--;
 
-	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
-		return NULL;
-
 	start_page = pfn_to_online_page(start_pfn);
 	if (!start_page)
 		return NULL;
@@ -1522,7 +1519,9 @@  struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 	if (page_zone(start_page) != zone)
 		return NULL;
 
-	end_page = pfn_to_page(end_pfn);
+	end_page = pfn_to_online_page(end_pfn);
+	if (!end_page)
+		return NULL;
 
 	/* This gives a shorter code than deriving page_zone(end_page) */
 	if (page_zone_id(start_page) != page_zone_id(end_page))