[v2] mm/sparse: set section nid for hot-add memory

Message ID 20190618005537.18878-1-richardw.yang@linux.intel.com
State New

Commit Message

Wei Yang June 18, 2019, 12:55 a.m. UTC
When NODE_NOT_IN_PAGE_FLAGS is set, we store each section's node id in
section_to_node_table[]. For hot-added memory, however, this step is missed.
Without this information, page_to_nid() may not return the right node id.
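
A minimal userspace sketch of the problem described above, assuming a toy
layout: PAGES_PER_SECTION and NR_MEM_SECTIONS are made-up values,
toy_page_to_nid() and toy_hot_add_section() are hypothetical stand-ins, and
only section_to_node_table[] and set_section_nid() mirror names used in this
patch. It models the lookup as a plain per-section table and is not the
kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters, chosen for illustration only (not the kernel's values). */
#define PAGES_PER_SECTION 8UL
#define NR_MEM_SECTIONS   4UL

/* Per-section node ids, used when the node id is not kept in page->flags. */
static uint8_t section_to_node_table[NR_MEM_SECTIONS];

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn / PAGES_PER_SECTION;
}

/* Record the node id for one section. */
static void set_section_nid(unsigned long section_nr, int nid)
{
	assert(section_nr < NR_MEM_SECTIONS);
	section_to_node_table[section_nr] = (uint8_t)nid;
}

/* Hypothetical stand-in for page_to_nid(): a table lookup keyed by section. */
static int toy_page_to_nid(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	assert(nr < NR_MEM_SECTIONS);
	return section_to_node_table[nr];
}

/*
 * Hypothetical hot-add helper. with_fix selects whether the node id is
 * recorded, i.e. whether the one-line change from this patch is applied.
 */
static void toy_hot_add_section(unsigned long section_nr, int nid, int with_fix)
{
	if (with_fix)
		set_section_nid(section_nr, nid);
	/* ... mark the section present and initialise its memmap ... */
}
```

In this model, hot-adding section 2 to node 1 without the fix leaves the
table entry at its zero default, so a lookup for any pfn in that section
reports node 0. With the fix the same lookup reports node 1.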

BTW, online_pages() currently works because it uses the nid stored in memory_block.
But the granularity of the node id should be mem_section wide.

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

---
v2:
  * specify that the NODE_NOT_IN_PAGE_FLAGS case is affected.
  * list one of the victims, page_to_nid()

---
 mm/sparse.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Oscar Salvador June 18, 2019, 7:49 a.m. UTC | #1
On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> section_to_node_table[]. While for hot-add memory, this is missed.
> Without this information, page_to_nid() may not give the right node id.
> 
> BTW, current online_pages works because it leverages nid in memory_block.
> But the granularity of node id should be mem_section wide.

I forgot to ask this before, but why do you mention online_pages here?
IMHO, it does not add any value to the changelog, and it does not have much
to do with the matter.

online_pages() works with memblock granularity and not section granularity.
That memblock is just a hot-added range of memory, worth either one section or multiple
sections, depending on the arch or on the size of the current memory.
And we assume that each hot-added memory range belongs entirely to one node.


> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> 
> ---
> v2:
>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>   * list one of the victim page_to_nid()
> 
> ---
>  mm/sparse.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 4012d7f50010..48fa16038cf5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>  	 */
>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>  
> +	set_section_nid(section_nr, nid);
>  	section_mark_present(ms);
>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>  
> -- 
> 2.19.1
>
Wei Yang June 18, 2019, 8:32 a.m. UTC | #2
On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>> section_to_node_table[]. While for hot-add memory, this is missed.
>> Without this information, page_to_nid() may not give the right node id.
>> 
>> BTW, current online_pages works because it leverages nid in memory_block.
>> But the granularity of node id should be mem_section wide.
>
>I forgot to ask this before, but why do you mention online_pages here?
>IMHO, it does not add any value to the changelog, and it does not have much
>to do with the matter.
>

Because I found it a little confusing that we don't set the node info but can
still online memory to the correct node. It turns out we rely on the
information in memblock.

>online_pages() works with memblock granularity and not section granularity.
>That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>sections, depending on the arch or on the size of the current memory.
>And we assume that each hot-added memory all belongs to the same node.
>

So I am not clear about the granularity of the node id: is it section based or
memblock based? Or do we have two cases:

* for initial memory, section wide
* for hot-add memory, mem_block wide

>
>> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> 
>> ---
>> v2:
>>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>>   * list one of the victim page_to_nid()
>> 
>> ---
>>  mm/sparse.c | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 4012d7f50010..48fa16038cf5 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>>  	 */
>>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>>  
>> +	set_section_nid(section_nr, nid);
>>  	section_mark_present(ms);
>>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>>  
>> -- 
>> 2.19.1
>> 
>
>-- 
>Oscar Salvador
>SUSE L3
David Hildenbrand June 18, 2019, 8:40 a.m. UTC | #3
On 18.06.19 10:32, Wei Yang wrote:
> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>> Without this information, page_to_nid() may not give the right node id.
>>>
>>> BTW, current online_pages works because it leverages nid in memory_block.
>>> But the granularity of node id should be mem_section wide.
>>
>> I forgot to ask this before, but why do you mention online_pages here?
>> IMHO, it does not add any value to the changelog, and it does not have much
>> to do with the matter.
>>
> 
> Since to me it is a little confused why we don't set the node info but still
> could online memory to the correct node. It turns out we leverage the
> information in memblock.

I'd also drop the comment here.

> 
>> online_pages() works with memblock granularity and not section granularity.
>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>> sections, depending on the arch or on the size of the current memory.
>> And we assume that each hot-added memory all belongs to the same node.
>>
> 
> So I am not clear about the granularity of node id. section based or memblock
> based. Or we have two cases:
> 
> * for initial memory, section wide
> * for hot-add memory, mem_block wide

It's all a big mess. Right now, you can offline initial memory with
mixed nodes. Also on my list of many ugly things to clean up.

(I even remember that we can have mixed nodes within a section, but I
haven't figured out yet how that is supposed to work in some scenarios)
Michal Hocko June 19, 2019, 6:10 a.m. UTC | #4
On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
> On 18.06.19 10:32, Wei Yang wrote:
> > On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
> >> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> >>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>> Without this information, page_to_nid() may not give the right node id.
> >>>
> >>> BTW, current online_pages works because it leverages nid in memory_block.
> >>> But the granularity of node id should be mem_section wide.
> >>
> >> I forgot to ask this before, but why do you mention online_pages here?
> >> IMHO, it does not add any value to the changelog, and it does not have much
> >> to do with the matter.
> >>
> > 
> > Since to me it is a little confused why we don't set the node info but still
> > could online memory to the correct node. It turns out we leverage the
> > information in memblock.
> 
> I'd also drop the comment here.
> 
> > 
> >> online_pages() works with memblock granularity and not section granularity.
> >> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
> >> sections, depending on the arch or on the size of the current memory.
> >> And we assume that each hot-added memory all belongs to the same node.
> >>
> > 
> > So I am not clear about the granularity of node id. section based or memblock
> > based. Or we have two cases:
> > 
> > * for initial memory, section wide
> > * for hot-add memory, mem_block wide
> 
> It's all a big mess. Right now, you can offline initial memory with
> mixed nodes. Also on my list of many ugly things to clean up.
> 
> (I even remember that we can have mixed nodes within a section, but I
> haven't figured out yet how that is supposed to work in some scenarios)

Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
How to fix this? Well, I do not think we can. Section based granularity
simply doesn't agree with the reality and so we have to live with that.
There is a long way to remove all those section size assumptions from
the code though.
Michal Hocko June 19, 2019, 6:23 a.m. UTC | #5
On Tue 18-06-19 08:55:37, Wei Yang wrote:
> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> section_to_node_table[]. While for hot-add memory, this is missed.
> Without this information, page_to_nid() may not give the right node id.

Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
the hotplugged memory, right? Any idea why nobody has noticed this
so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
unused with the hotplug? page_to_nid providing an incorrect result
sounds quite serious to me.

Could you identify when we have introduced this problem? A Fixes tag
would sound very useful to me.

> BTW, current online_pages works because it leverages nid in memory_block.
> But the granularity of node id should be mem_section wide.

This is not really helpful because nothing except for the hotplug really
cares about mem blocks. The whole MM really does care about page_to_nid(),
which is why it matters much more, so spending a word or two on that
would be more helpful.

> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

The patch itself looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
> 
> ---
> v2:
>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>   * list one of the victim page_to_nid()
> 
> ---
>  mm/sparse.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 4012d7f50010..48fa16038cf5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>  	 */
>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>  
> +	set_section_nid(section_nr, nid);
>  	section_mark_present(ms);
>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>  
> -- 
> 2.19.1
Oscar Salvador June 19, 2019, 7:53 a.m. UTC | #6
On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> > In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> > section_to_node_table[]. While for hot-add memory, this is missed.
> > Without this information, page_to_nid() may not give the right node id.
> 
> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> the hotpluged memory, right? Any idea why nobody has noticed this
> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> unused with the hotplug? page_to_nid providing an incorrect result
> sounds quite serious to me.

The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
space in page->flags to store zone, nid and section.
Currently, even with the largest values (with 5-level page tables), that is not
possible on x86_64.
It is possible, though, that at some point in the future, when the values get larger
(e.g. we add more zones, NODES_SHIFT grows, or we need more space to store
the section), we finally run out of room in the flags.
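
The bit-budget argument above can be sketched as a small check. This is a
simplified model under stated assumptions: struct flags_layout and
node_not_in_page_flags() are hypothetical names, the widths passed in are
illustrative numbers rather than any real configuration, and other consumers
of page->flags bits are ignored.

```c
#include <assert.h>

/* Hypothetical description of a page->flags layout; all widths are in bits. */
struct flags_layout {
	int bits_per_long;   /* size of page->flags */
	int nr_pageflags;    /* bits taken by the individual page flags */
	int sections_width;  /* 0 when the section number is not stored */
	int zones_width;     /* bits needed to encode the zone */
	int nodes_shift;     /* bits needed to encode a node id */
};

/* Returns 1 when the node id cannot be packed into page->flags. */
static int node_not_in_page_flags(const struct flags_layout *l)
{
	int used = l->sections_width + l->zones_width + l->nodes_shift;
	int room = l->bits_per_long - l->nr_pageflags;

	assert(room >= 0);
	if (l->nodes_shift == 0)
		return 0; /* single node: there is nothing to store */
	return used > room;
}
```

With a 64-bit word and no section bits the node id fits easily in this model,
while a 32-bit word that also has to carry section bits can run out of room.
That second kind of configuration is where the table lookup becomes necessary.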

I am not sure about the other arches though, we probably should audit them
and see which ones can fall in there.
David Hildenbrand June 19, 2019, 8:51 a.m. UTC | #7
On 19.06.19 09:53, Oscar Salvador wrote:
> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>> Without this information, page_to_nid() may not give the right node id.
>>
>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>> the hotpluged memory, right? Any idea why nobody has noticed this
>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>> unused with the hotplug? page_to_nid providing an incorrect result
>> sounds quite serious to me.
> 
> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> space in page->flags to store zone, nid and section. 
> Currently, even with the largest values (with pagetable level 5), that is not
> possible on x86_64.
> It is possible though, that somewhere in the future, when the values get larger
> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> the section) we finally run out of room for the flags though.
> 
> I am not sure about the other arches though, we probably should audit them
> and see which ones can fall in there.
> 

I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
David Hildenbrand June 19, 2019, 8:54 a.m. UTC | #8
On 19.06.19 08:10, Michal Hocko wrote:
> On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
>> On 18.06.19 10:32, Wei Yang wrote:
>>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>
>>>>> BTW, current online_pages works because it leverages nid in memory_block.
>>>>> But the granularity of node id should be mem_section wide.
>>>>
>>>> I forgot to ask this before, but why do you mention online_pages here?
>>>> IMHO, it does not add any value to the changelog, and it does not have much
>>>> to do with the matter.
>>>>
>>>
>>> Since to me it is a little confused why we don't set the node info but still
>>> could online memory to the correct node. It turns out we leverage the
>>> information in memblock.
>>
>> I'd also drop the comment here.
>>
>>>
>>>> online_pages() works with memblock granularity and not section granularity.
>>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>>>> sections, depending on the arch or on the size of the current memory.
>>>> And we assume that each hot-added memory all belongs to the same node.
>>>>
>>>
>>> So I am not clear about the granularity of node id. section based or memblock
>>> based. Or we have two cases:
>>>
>>> * for initial memory, section wide
>>> * for hot-add memory, mem_block wide
>>
>> It's all a big mess. Right now, you can offline initial memory with
>> mixed nodes. Also on my list of many ugly things to clean up.
>>
>> (I even remember that we can have mixed nodes within a section, but I
>> haven't figured out yet how that is supposed to work in some scenarios)
> 
> Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
> How to fix this? Well, I do not think we can. Section based granularity
> simply doesn't agree with the reality and so we have to live with that.
> There is a long way to remove all those section size assumptions from
> the code though.
> 

Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
identify who exactly needs that. For memory blocks, we need a different
approach (I have in my head to make ->nid indicate if we are dealing
with mixed nodes. If mixed, disallow onlining/offlining).
Michal Hocko June 19, 2019, 9:01 a.m. UTC | #9
On Wed 19-06-19 10:54:08, David Hildenbrand wrote:
> On 19.06.19 08:10, Michal Hocko wrote:
> > On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
> >> On 18.06.19 10:32, Wei Yang wrote:
> >>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
> >>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> >>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>>>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>>>> Without this information, page_to_nid() may not give the right node id.
> >>>>>
> >>>>> BTW, current online_pages works because it leverages nid in memory_block.
> >>>>> But the granularity of node id should be mem_section wide.
> >>>>
> >>>> I forgot to ask this before, but why do you mention online_pages here?
> >>>> IMHO, it does not add any value to the changelog, and it does not have much
> >>>> to do with the matter.
> >>>>
> >>>
> >>> Since to me it is a little confused why we don't set the node info but still
> >>> could online memory to the correct node. It turns out we leverage the
> >>> information in memblock.
> >>
> >> I'd also drop the comment here.
> >>
> >>>
> >>>> online_pages() works with memblock granularity and not section granularity.
> >>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
> >>>> sections, depending on the arch or on the size of the current memory.
> >>>> And we assume that each hot-added memory all belongs to the same node.
> >>>>
> >>>
> >>> So I am not clear about the granularity of node id. section based or memblock
> >>> based. Or we have two cases:
> >>>
> >>> * for initial memory, section wide
> >>> * for hot-add memory, mem_block wide
> >>
> >> It's all a big mess. Right now, you can offline initial memory with
> >> mixed nodes. Also on my list of many ugly things to clean up.
> >>
> >> (I even remember that we can have mixed nodes within a section, but I
> >> haven't figured out yet how that is supposed to work in some scenarios)
> > 
> > Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
> > How to fix this? Well, I do not think we can. Section based granularity
> > simply doesn't agree with the reality and so we have to live with that.
> > There is a long way to remove all those section size assumptions from
> > the code though.
> > 
> 
> Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
> identify how exactly needs that. For memory blocks, we need a different
> approach (I have in my head to make ->nid indicate if we are dealing
> with mixed nodes. If mixed, disallow onlining/offlining).

Well, I am not sure we really have to care about multi-node memblocks
much. The API is clumsy but does anybody actually care? The vast
majority of hotplug usecases simply do not do that in the first place,
right? And if they do need a smaller granularity to describe their
memory topology then we need a different user API rather than fiddling with
implementation details, I would argue.
David Hildenbrand June 19, 2019, 9:03 a.m. UTC | #10
On 19.06.19 11:01, Michal Hocko wrote:
> On Wed 19-06-19 10:54:08, David Hildenbrand wrote:
>> On 19.06.19 08:10, Michal Hocko wrote:
>>> On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
>>>> On 18.06.19 10:32, Wei Yang wrote:
>>>>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>>>>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>>>
>>>>>>> BTW, current online_pages works because it leverages nid in memory_block.
>>>>>>> But the granularity of node id should be mem_section wide.
>>>>>>
>>>>>> I forgot to ask this before, but why do you mention online_pages here?
>>>>>> IMHO, it does not add any value to the changelog, and it does not have much
>>>>>> to do with the matter.
>>>>>>
>>>>>
>>>>> Since to me it is a little confused why we don't set the node info but still
>>>>> could online memory to the correct node. It turns out we leverage the
>>>>> information in memblock.
>>>>
>>>> I'd also drop the comment here.
>>>>
>>>>>
>>>>>> online_pages() works with memblock granularity and not section granularity.
>>>>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>>>>>> sections, depending on the arch or on the size of the current memory.
>>>>>> And we assume that each hot-added memory all belongs to the same node.
>>>>>>
>>>>>
>>>>> So I am not clear about the granularity of node id. section based or memblock
>>>>> based. Or we have two cases:
>>>>>
>>>>> * for initial memory, section wide
>>>>> * for hot-add memory, mem_block wide
>>>>
>>>> It's all a big mess. Right now, you can offline initial memory with
>>>> mixed nodes. Also on my list of many ugly things to clean up.
>>>>
>>>> (I even remember that we can have mixed nodes within a section, but I
>>>> haven't figured out yet how that is supposed to work in some scenarios)
>>>
>>> Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
>>> How to fix this? Well, I do not think we can. Section based granularity
>>> simply doesn't agree with the reality and so we have to live with that.
>>> There is a long way to remove all those section size assumptions from
>>> the code though.
>>>
>>
>> Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
>> identify how exactly needs that. For memory blocks, we need a different
>> approach (I have in my head to make ->nid indicate if we are dealing
>> with mixed nodes. If mixed, disallow onlining/offlining).
> 
> Well, I am not sure we really have to care about mutli-nodes memblocks
> much. The API is clumsy but does anybody actually care? The vast
> majority of hotplug usecases simply do not do that in the first place
> right?

Yes, AFAIK it could be done, resulting in an unpredictable outcome.

> And if they do need a smaller granularity to describe their
> memory topology then we need a different user API rather the fiddle with
> implementation details I would argue.
> 

It is not about supporting it, it is about properly blocking it.
Michal Hocko June 19, 2019, 9:04 a.m. UTC | #11
On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
> On 19.06.19 09:53, Oscar Salvador wrote:
> > On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> >> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> >>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>> Without this information, page_to_nid() may not give the right node id.
> >>
> >> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> >> the hotpluged memory, right? Any idea why nobody has noticed this
> >> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> >> unused with the hotplug? page_to_nid providing an incorrect result
> >> sounds quite serious to me.
> > 
> > The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> > space in page->flags to store zone, nid and section. 
> > Currently, even with the largest values (with pagetable level 5), that is not
> > possible on x86_64.
> > It is possible though, that somewhere in the future, when the values get larger
> > (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> > the section) we finally run out of room for the flags though.
> > 
> > I am not sure about the other arches though, we probably should audit them
> > and see which ones can fall in there.
> > 
> 
> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.

NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
information is stored. I cannot say how much it is really needed now but
I can see there will be a demand for it in the longer term because
page->flags space is scarce and very interesting storage. So I do not
see it going away, I am afraid.
David Hildenbrand June 19, 2019, 9:07 a.m. UTC | #12
On 19.06.19 11:04, Michal Hocko wrote:
> On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
>> On 19.06.19 09:53, Oscar Salvador wrote:
>>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>
>>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>>>> the hotpluged memory, right? Any idea why nobody has noticed this
>>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>>>> unused with the hotplug? page_to_nid providing an incorrect result
>>>> sounds quite serious to me.
>>>
>>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
>>> space in page->flags to store zone, nid and section. 
>>> Currently, even with the largest values (with pagetable level 5), that is not
>>> possible on x86_64.
>>> It is possible though, that somewhere in the future, when the values get larger
>>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
>>> the section) we finally run out of room for the flags though.
>>>
>>> I am not sure about the other arches though, we probably should audit them
>>> and see which ones can fall in there.
>>>
>>
>> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
> 
> NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
> information is stored.

Yes and no. Storing it per section clearly doesn't allow storing node
information on smaller granularity, like storing in page->flags does.

So no, it is not only an implementation detail.
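
The granularity point can be illustrated with a toy model that keeps both
representations side by side. Everything here is an assumption made for
illustration: the sizes are arbitrary, and nid_per_page[] and
nid_per_section[] are hypothetical arrays standing in for the node bits in
page->flags and for section_to_node_table[] respectively.

```c
#include <assert.h>

/* Toy sizes, for illustration only. */
#define PAGES_PER_SECTION 8UL
#define NR_SECTIONS       2UL
#define NR_PAGES          (PAGES_PER_SECTION * NR_SECTIONS)

static int nid_per_page[NR_PAGES];       /* models node bits in page->flags */
static int nid_per_section[NR_SECTIONS]; /* models section_to_node_table[] */

/* Record a node id for one pfn in both representations. */
static void record_nid(unsigned long pfn, int nid)
{
	assert(pfn < NR_PAGES);
	nid_per_page[pfn] = nid;
	/* One slot per section: a later write replaces an earlier one. */
	nid_per_section[pfn / PAGES_PER_SECTION] = nid;
}

static int lookup_per_page(unsigned long pfn)
{
	assert(pfn < NR_PAGES);
	return nid_per_page[pfn];
}

static int lookup_per_section(unsigned long pfn)
{
	assert(pfn < NR_PAGES);
	return nid_per_section[pfn / PAGES_PER_SECTION];
}
```

If pfn 0 is recorded on node 0 and pfn 5 (same section) on node 1, the
per-page lookup keeps both answers, while the per-section lookup returns a
single node for every pfn in that section.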

> I cannot say how much it is really needed now but
> I can see there will be a demand for it in a longer term because
> page->flags space is scarce and very interesting storage. So I do not
> see it go away I am afraid.

Depends on how performance-critical pfn_to_nid() is. I can't tell.
Michal Hocko June 19, 2019, 9:08 a.m. UTC | #13
On Wed 19-06-19 11:03:49, David Hildenbrand wrote:
> On 19.06.19 11:01, Michal Hocko wrote:
[...]
> > And if they do need a smaller granularity to describe their
> > memory topology then we need a different user API rather the fiddle with
> > implementation details I would argue.
> > 
> 
> It is not about supporting it, it is about properly blocking it.

We already do that in test_pages_in_a_zone, right? Albeit in
MAX_ORDER_NR_PAGES granularity.
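
A rough sketch of what checking a range at a coarse stride looks like,
assuming a toy model: CHUNK_NR_PAGES stands in for MAX_ORDER_NR_PAGES,
zone_of_pfn[] is a hypothetical per-page zone table, and toy_set_zone() and
toy_range_in_one_zone() are made-up helpers. It is not the code of
test_pages_in_a_zone(), only an illustration of sampling one page per chunk.

```c
#include <assert.h>

#define CHUNK_NR_PAGES 4UL  /* stands in for MAX_ORDER_NR_PAGES */
#define NR_PAGES       16UL

/* Hypothetical per-page zone ids; a zone belongs to exactly one node. */
static int zone_of_pfn[NR_PAGES];

/* Assign a zone id to every pfn in [start, end). */
static void toy_set_zone(unsigned long start, unsigned long end, int zone)
{
	unsigned long pfn;

	assert(start <= end && end <= NR_PAGES);
	for (pfn = start; pfn < end; pfn++)
		zone_of_pfn[pfn] = zone;
}

/*
 * Returns 1 if the first page of every chunk in [start, end) is in the
 * same zone, 0 otherwise. Pages inside a chunk are not inspected.
 */
static int toy_range_in_one_zone(unsigned long start, unsigned long end)
{
	unsigned long pfn;
	int zone = -1;

	assert(start <= end && end <= NR_PAGES);
	for (pfn = start; pfn < end; pfn += CHUNK_NR_PAGES) {
		if (zone != -1 && zone_of_pfn[pfn] != zone)
			return 0;
		zone = zone_of_pfn[pfn];
	}
	return 1;
}
```

Because only one page per chunk is sampled in this model, a range that mixes
zones at chunk boundaries is rejected, but a mismatch inside a chunk would go
unnoticed. That is the granularity caveat mentioned above.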
David Hildenbrand June 19, 2019, 9:11 a.m. UTC | #14
On 19.06.19 11:08, Michal Hocko wrote:
> On Wed 19-06-19 11:03:49, David Hildenbrand wrote:
>> On 19.06.19 11:01, Michal Hocko wrote:
> [...]
>>> And if they do need a smaller granularity to describe their
>>> memory topology then we need a different user API rather the fiddle with
>>> implementation details I would argue.
>>>
>>
>> It is not about supporting it, it is about properly blocking it.
> 
> We already do that in test_pages_in_a_zone, right? Albeit in
> MAX_ORDER_NR_PAGES granularity.
> 

Indeed, thanks for pointing that out. I knew that we were checking zones
but had in my head that it was working on zone idx.
Michal Hocko June 19, 2019, 9:16 a.m. UTC | #15
On Wed 19-06-19 11:07:30, David Hildenbrand wrote:
> On 19.06.19 11:04, Michal Hocko wrote:
> > On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
> >> On 19.06.19 09:53, Oscar Salvador wrote:
> >>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> >>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> >>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>>>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>>>> Without this information, page_to_nid() may not give the right node id.
> >>>>
> >>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> >>>> the hotpluged memory, right? Any idea why nobody has noticed this
> >>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> >>>> unused with the hotplug? page_to_nid providing an incorrect result
> >>>> sounds quite serious to me.
> >>>
> >>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> >>> space in page->flags to store zone, nid and section. 
> >>> Currently, even with the largest values (with pagetable level 5), that is not
> >>> possible on x86_64.
> >>> It is possible though, that somewhere in the future, when the values get larger
> >>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> >>> the section) we finally run out of room for the flags though.
> >>>
> >>> I am not sure about the other arches though, we probably should audit them
> >>> and see which ones can fall in there.
> >>>
> >>
> >> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
> > 
> > NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
> > information is stored.
> 
> Yes and no. Storing it per section clearly doesn't allow storing node
> information on smaller granularity, like storing in page->flags does.
> 
> So no, it is not only an implementation detail.

Let me try to put it differently. NODE_NOT_IN_PAGE_FLAGS is not about
storing the mapping per section. You can use whatever other data
structure you like. NODE_NOT_IN_PAGE_FLAGS is in fact about telling that it is
not in page->flags.

> > I cannot say how much it is really needed now but
> > I can see there will be a demand for it in a longer term because
> > page->flags space is scarce and very interesting storage. So I do not
> > see it go away I am afraid.
> Depends on how performance-critical pfn_to_nid() is. I can't tell.

page_to_nid() is used in many important code paths. Not in the hottest
ones, I believe, but many of them are quite hot, I would say.
David Hildenbrand June 19, 2019, 9:30 a.m. UTC | #16
On 19.06.19 11:16, Michal Hocko wrote:
> On Wed 19-06-19 11:07:30, David Hildenbrand wrote:
>> On 19.06.19 11:04, Michal Hocko wrote:
>>> On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
>>>> On 19.06.19 09:53, Oscar Salvador wrote:
>>>>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>>>>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>>
>>>>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>>>>>> the hotpluged memory, right? Any idea why nobody has noticed this
>>>>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>>>>>> unused with the hotplug? page_to_nid providing an incorrect result
>>>>>> sounds quite serious to me.
>>>>>
>>>>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
>>>>> space in page->flags to store zone, nid and section. 
>>>>> Currently, even with the largest values (with pagetable level 5), that is not
>>>>> possible on x86_64.
>>>>> It is possible though, that somewhere in the future, when the values get larger
>>>>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
>>>>> the section) we finally run out of room for the flags though.
>>>>>
>>>>> I am not sure about the other arches though, we probably should audit them
>>>>> and see which ones can fall in there.
>>>>>
>>>>
>>>> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
>>>
>>> NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
>>> information is stored.
>>
>> Yes and no. Storing it per section clearly doesn't allow storing node
>> information on smaller granularity, like storing in page->flags does.
>>
>> So no, it is not only an implementation detail.
> 
> Let me try to put it differently. NODE_NOT_IN_PAGE_FLAGS is not about
> storing the mapping per section. You can do what ever other data
> structure. NODE_NOT_IN_PAGE_FLAGS is in fact about telling that it is
> not in page->flags.

Okay, I get what you are saying. Storing it differently is problematic,
though, if we want to minimize memory consumption and have a fast lookup.

I was also looking into avoiding storing the section number in
page->flags with CONFIG_SPARSEMEM. Especially because the
CONFIG_HAVE_ARCH_PFN_VALID hack is really ugly. But it's tricky :(

Patch

diff --git a/mm/sparse.c b/mm/sparse.c
index 4012d7f50010..48fa16038cf5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -733,6 +733,7 @@  int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	 */
 	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
 
+	set_section_nid(section_nr, nid);
 	section_mark_present(ms);
 	sparse_init_one_section(ms, section_nr, memmap, usemap);