Message ID | 20190618005537.18878-1-richardw.yang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] mm/sparse: set section nid for hot-add memory |
On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> section_to_node_table[]. While for hot-add memory, this is missed.
> Without this information, page_to_nid() may not give the right node id.
>
> BTW, current online_pages works because it leverages nid in memory_block.
> But the granularity of node id should be mem_section wide.

I forgot to ask this before, but why do you mention online_pages here?
IMHO, it does not add any value to the changelog, and it does not have much
to do with the matter.

online_pages() works with memblock granularity and not section granularity.
That memblock is just a hot-added range of memory, worth of either 1 section or multiple
sections, depending on the arch or on the size of the current memory.
And we assume that each hot-added memory all belongs to the same node.

> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>
> ---
> v2:
>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>   * list one of the victim page_to_nid()
>
> ---
>  mm/sparse.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 4012d7f50010..48fa16038cf5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>  	 */
>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>
> +	set_section_nid(section_nr, nid);
>  	section_mark_present(ms);
>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>
> --
> 2.19.1
>
On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>> section_to_node_table[]. While for hot-add memory, this is missed.
>> Without this information, page_to_nid() may not give the right node id.
>>
>> BTW, current online_pages works because it leverages nid in memory_block.
>> But the granularity of node id should be mem_section wide.
>
>I forgot to ask this before, but why do you mention online_pages here?
>IMHO, it does not add any value to the changelog, and it does not have much
>to do with the matter.
>

Since to me it is a little confused why we don't set the node info but still
could online memory to the correct node. It turns out we leverage the
information in memblock.

>online_pages() works with memblock granularity and not section granularity.
>That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>sections, depending on the arch or on the size of the current memory.
>And we assume that each hot-added memory all belongs to the same node.
>

So I am not clear about the granularity of node id. section based or memblock
based. Or we have two cases:

* for initial memory, section wide
* for hot-add memory, mem_block wide

>
>> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>
>> ---
>> v2:
>>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>>   * list one of the victim page_to_nid()
>>
>> ---
>>  mm/sparse.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 4012d7f50010..48fa16038cf5 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>>  	 */
>>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>>
>> +	set_section_nid(section_nr, nid);
>>  	section_mark_present(ms);
>>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>>
>> --
>> 2.19.1
>>
>
>--
>Oscar Salvador
>SUSE L3
On 18.06.19 10:32, Wei Yang wrote:
> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>> Without this information, page_to_nid() may not give the right node id.
>>>
>>> BTW, current online_pages works because it leverages nid in memory_block.
>>> But the granularity of node id should be mem_section wide.
>>
>> I forgot to ask this before, but why do you mention online_pages here?
>> IMHO, it does not add any value to the changelog, and it does not have much
>> to do with the matter.
>>
>
> Since to me it is a little confused why we don't set the node info but still
> could online memory to the correct node. It turns out we leverage the
> information in memblock.

I'd also drop the comment here.

>
>> online_pages() works with memblock granularity and not section granularity.
>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>> sections, depending on the arch or on the size of the current memory.
>> And we assume that each hot-added memory all belongs to the same node.
>>
>
> So I am not clear about the granularity of node id. section based or memblock
> based. Or we have two cases:
>
> * for initial memory, section wide
> * for hot-add memory, mem_block wide

It's all a big mess. Right now, you can offline initial memory with
mixed nodes. Also on my list of many ugly things to clean up.

(I even remember that we can have mixed nodes within a section, but I
haven't figured out yet how that is supposed to work in some scenarios)
On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
> On 18.06.19 10:32, Wei Yang wrote:
> > On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
> >> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> >>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>> Without this information, page_to_nid() may not give the right node id.
> >>>
> >>> BTW, current online_pages works because it leverages nid in memory_block.
> >>> But the granularity of node id should be mem_section wide.
> >>
> >> I forgot to ask this before, but why do you mention online_pages here?
> >> IMHO, it does not add any value to the changelog, and it does not have much
> >> to do with the matter.
> >>
> >
> > Since to me it is a little confused why we don't set the node info but still
> > could online memory to the correct node. It turns out we leverage the
> > information in memblock.
>
> I'd also drop the comment here.
>
> >
> >> online_pages() works with memblock granularity and not section granularity.
> >> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
> >> sections, depending on the arch or on the size of the current memory.
> >> And we assume that each hot-added memory all belongs to the same node.
> >>
> >
> > So I am not clear about the granularity of node id. section based or memblock
> > based. Or we have two cases:
> >
> > * for initial memory, section wide
> > * for hot-add memory, mem_block wide
>
> It's all a big mess. Right now, you can offline initial memory with
> mixed nodes. Also on my list of many ugly things to clean up.
>
> (I even remember that we can have mixed nodes within a section, but I
> haven't figured out yet how that is supposed to work in some scenarios)

Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
How to fix this? Well, I do not think we can. Section based granularity
simply doesn't agree with the reality and so we have to live with that.
There is a long way to remove all those section size assumptions from
the code though.
On Tue 18-06-19 08:55:37, Wei Yang wrote:
> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> section_to_node_table[]. While for hot-add memory, this is missed.
> Without this information, page_to_nid() may not give the right node id.

Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
the hotplugged memory, right? Any idea why nobody has noticed this
so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
unused with the hotplug? page_to_nid providing an incorrect result
sounds quite serious to me.

Could you identify when we have introduced this problem? A Fixes tag
would sound very useful to me.

> BTW, current online_pages works because it leverages nid in memory_block.
> But the granularity of node id should be mem_section wide.

This is not really helpful because nothing except for the hotplug really
cares about mem blocks. The whole MM really does care about page_to_nid
and that is why it matters much more so spending a word or two on that
would be more helpful.

> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

The patch itself looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

>
> ---
> v2:
>   * specify the case NODE_NOT_IN_PAGE_FLAGS is effected.
>   * list one of the victim page_to_nid()
>
> ---
>  mm/sparse.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 4012d7f50010..48fa16038cf5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
>  	 */
>  	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>
> +	set_section_nid(section_nr, nid);
>  	section_mark_present(ms);
>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
>
> --
> 2.19.1
On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> > In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> > section_to_node_table[]. While for hot-add memory, this is missed.
> > Without this information, page_to_nid() may not give the right node id.
>
> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> the hotplugged memory, right? Any idea why nobody has noticed this
> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> unused with the hotplug? page_to_nid providing an incorrect result
> sounds quite serious to me.

The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
space in page->flags to store zone, nid and section.
Currently, even with the largest values (with pagetable level 5), that is not
possible on x86_64.
It is possible though, that somewhere in the future, when the values get larger
(e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
the section) we finally run out of room for the flags though.

I am not sure about the other arches though, we probably should audit them
and see which ones can fall in there.
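The bit-budget argument above can be sketched in userspace C. This is a simplified model, not the kernel's actual layout logic: the struct, the function name and every width used with it are hypothetical, and the real layout may reserve bits for further fields. It only illustrates the condition described in the message, namely that the node id falls out of page->flags once zone, node and section no longer fit next to the page flag bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical field widths, in bits, competing for space in page->flags. */
struct flags_layout {
	unsigned int bits_per_long; /* total width of the flags word */
	unsigned int nr_pageflags;  /* bits taken by the individual page flags */
	unsigned int section_bits;  /* bits needed for the section number */
	unsigned int zone_bits;     /* bits needed for the zone index */
	unsigned int node_bits;     /* bits needed for the node id */
};

/*
 * Returns true when section, zone and node together exceed the bits left
 * over after the page flags, i.e. the node id must be stored elsewhere.
 */
static bool node_not_in_page_flags(const struct flags_layout *l)
{
	unsigned int free_bits;

	assert(l->nr_pageflags <= l->bits_per_long);
	free_bits = l->bits_per_long - l->nr_pageflags;

	return l->node_bits > 0 &&
	       l->section_bits + l->zone_bits + l->node_bits > free_bits;
}
```

With made-up numbers, a 64-bit word with 24 flag bits leaves 40 bits, so 2 zone bits plus 10 node bits fit; a 32-bit word with 24 flag bits leaves only 8, so 4 section bits plus 2 zone bits plus 6 node bits do not.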
On 19.06.19 09:53, Oscar Salvador wrote:
> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>> Without this information, page_to_nid() may not give the right node id.
>>
>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>> the hotplugged memory, right? Any idea why nobody has noticed this
>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>> unused with the hotplug? page_to_nid providing an incorrect result
>> sounds quite serious to me.
>
> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> space in page->flags to store zone, nid and section.
> Currently, even with the largest values (with pagetable level 5), that is not
> possible on x86_64.
> It is possible though, that somewhere in the future, when the values get larger
> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> the section) we finally run out of room for the flags though.
>
> I am not sure about the other arches though, we probably should audit them
> and see which ones can fall in there.
>

I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
On 19.06.19 08:10, Michal Hocko wrote:
> On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
>> On 18.06.19 10:32, Wei Yang wrote:
>>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>
>>>>> BTW, current online_pages works because it leverages nid in memory_block.
>>>>> But the granularity of node id should be mem_section wide.
>>>>
>>>> I forgot to ask this before, but why do you mention online_pages here?
>>>> IMHO, it does not add any value to the changelog, and it does not have much
>>>> to do with the matter.
>>>>
>>>
>>> Since to me it is a little confused why we don't set the node info but still
>>> could online memory to the correct node. It turns out we leverage the
>>> information in memblock.
>>
>> I'd also drop the comment here.
>>
>>>
>>>> online_pages() works with memblock granularity and not section granularity.
>>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>>>> sections, depending on the arch or on the size of the current memory.
>>>> And we assume that each hot-added memory all belongs to the same node.
>>>>
>>>
>>> So I am not clear about the granularity of node id. section based or memblock
>>> based. Or we have two cases:
>>>
>>> * for initial memory, section wide
>>> * for hot-add memory, mem_block wide
>>
>> It's all a big mess. Right now, you can offline initial memory with
>> mixed nodes. Also on my list of many ugly things to clean up.
>>
>> (I even remember that we can have mixed nodes within a section, but I
>> haven't figured out yet how that is supposed to work in some scenarios)
>
> Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
> How to fix this? Well, I do not think we can. Section based granularity
> simply doesn't agree with the reality and so we have to live with that.
> There is a long way to remove all those section size assumptions from
> the code though.
>

Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
identify who exactly needs that. For memory blocks, we need a different
approach (I have in my head to make ->nid indicate if we are dealing
with mixed nodes. If mixed, disallow onlining/offlining).
On Wed 19-06-19 10:54:08, David Hildenbrand wrote:
> On 19.06.19 08:10, Michal Hocko wrote:
> > On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
> >> On 18.06.19 10:32, Wei Yang wrote:
> >>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
> >>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
> >>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>>>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>>>> Without this information, page_to_nid() may not give the right node id.
> >>>>>
> >>>>> BTW, current online_pages works because it leverages nid in memory_block.
> >>>>> But the granularity of node id should be mem_section wide.
> >>>>
> >>>> I forgot to ask this before, but why do you mention online_pages here?
> >>>> IMHO, it does not add any value to the changelog, and it does not have much
> >>>> to do with the matter.
> >>>>
> >>>
> >>> Since to me it is a little confused why we don't set the node info but still
> >>> could online memory to the correct node. It turns out we leverage the
> >>> information in memblock.
> >>
> >> I'd also drop the comment here.
> >>
> >>>
> >>>> online_pages() works with memblock granularity and not section granularity.
> >>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
> >>>> sections, depending on the arch or on the size of the current memory.
> >>>> And we assume that each hot-added memory all belongs to the same node.
> >>>>
> >>>
> >>> So I am not clear about the granularity of node id. section based or memblock
> >>> based. Or we have two cases:
> >>>
> >>> * for initial memory, section wide
> >>> * for hot-add memory, mem_block wide
> >>
> >> It's all a big mess. Right now, you can offline initial memory with
> >> mixed nodes. Also on my list of many ugly things to clean up.
> >>
> >> (I even remember that we can have mixed nodes within a section, but I
> >> haven't figured out yet how that is supposed to work in some scenarios)
> >
> > Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
> > How to fix this? Well, I do not think we can. Section based granularity
> > simply doesn't agree with the reality and so we have to live with that.
> > There is a long way to remove all those section size assumptions from
> > the code though.
> >
>
> Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
> identify who exactly needs that. For memory blocks, we need a different
> approach (I have in my head to make ->nid indicate if we are dealing
> with mixed nodes. If mixed, disallow onlining/offlining).

Well, I am not sure we really have to care about multi-node memblocks
much. The API is clumsy but does anybody actually care? The vast
majority of hotplug usecases simply do not do that in the first place
right? And if they do need a smaller granularity to describe their
memory topology then we need a different user API rather than fiddle with
implementation details I would argue.
On 19.06.19 11:01, Michal Hocko wrote:
> On Wed 19-06-19 10:54:08, David Hildenbrand wrote:
>> On 19.06.19 08:10, Michal Hocko wrote:
>>> On Tue 18-06-19 10:40:06, David Hildenbrand wrote:
>>>> On 18.06.19 10:32, Wei Yang wrote:
>>>>> On Tue, Jun 18, 2019 at 09:49:48AM +0200, Oscar Salvador wrote:
>>>>>> On Tue, Jun 18, 2019 at 08:55:37AM +0800, Wei Yang wrote:
>>>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>>>
>>>>>>> BTW, current online_pages works because it leverages nid in memory_block.
>>>>>>> But the granularity of node id should be mem_section wide.
>>>>>>
>>>>>> I forgot to ask this before, but why do you mention online_pages here?
>>>>>> IMHO, it does not add any value to the changelog, and it does not have much
>>>>>> to do with the matter.
>>>>>>
>>>>>
>>>>> Since to me it is a little confused why we don't set the node info but still
>>>>> could online memory to the correct node. It turns out we leverage the
>>>>> information in memblock.
>>>>
>>>> I'd also drop the comment here.
>>>>
>>>>>
>>>>>> online_pages() works with memblock granularity and not section granularity.
>>>>>> That memblock is just a hot-added range of memory, worth of either 1 section or multiple
>>>>>> sections, depending on the arch or on the size of the current memory.
>>>>>> And we assume that each hot-added memory all belongs to the same node.
>>>>>>
>>>>>
>>>>> So I am not clear about the granularity of node id. section based or memblock
>>>>> based. Or we have two cases:
>>>>>
>>>>> * for initial memory, section wide
>>>>> * for hot-add memory, mem_block wide
>>>>
>>>> It's all a big mess. Right now, you can offline initial memory with
>>>> mixed nodes. Also on my list of many ugly things to clean up.
>>>>
>>>> (I even remember that we can have mixed nodes within a section, but I
>>>> haven't figured out yet how that is supposed to work in some scenarios)
>>>
>>> Yes, that is indeed the case. See 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a.
>>> How to fix this? Well, I do not think we can. Section based granularity
>>> simply doesn't agree with the reality and so we have to live with that.
>>> There is a long way to remove all those section size assumptions from
>>> the code though.
>>>
>>
>> Trying to remove NODE_NOT_IN_PAGE_FLAGS could work, but we would have to
>> identify who exactly needs that. For memory blocks, we need a different
>> approach (I have in my head to make ->nid indicate if we are dealing
>> with mixed nodes. If mixed, disallow onlining/offlining).
>
> Well, I am not sure we really have to care about multi-node memblocks
> much. The API is clumsy but does anybody actually care? The vast
> majority of hotplug usecases simply do not do that in the first place
> right?

Yes, AFAIK it could be done, resulting in unpredictable outcome.

> And if they do need a smaller granularity to describe their
> memory topology then we need a different user API rather than fiddle with
> implementation details I would argue.
>

It is not about supporting it, it is about properly blocking it.
On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
> On 19.06.19 09:53, Oscar Salvador wrote:
> > On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> >> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> >>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>> Without this information, page_to_nid() may not give the right node id.
> >>
> >> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> >> the hotplugged memory, right? Any idea why nobody has noticed this
> >> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> >> unused with the hotplug? page_to_nid providing an incorrect result
> >> sounds quite serious to me.
> >
> > The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> > space in page->flags to store zone, nid and section.
> > Currently, even with the largest values (with pagetable level 5), that is not
> > possible on x86_64.
> > It is possible though, that somewhere in the future, when the values get larger
> > (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> > the section) we finally run out of room for the flags though.
> >
> > I am not sure about the other arches though, we probably should audit them
> > and see which ones can fall in there.
> >
>
> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.

NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
information is stored. I cannot say how much it is really needed now but
I can see there will be a demand for it in a longer term because
page->flags space is scarce and very interesting storage. So I do not
see it go away I am afraid.
On 19.06.19 11:04, Michal Hocko wrote:
> On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
>> On 19.06.19 09:53, Oscar Salvador wrote:
>>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>
>>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>>>> the hotplugged memory, right? Any idea why nobody has noticed this
>>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>>>> unused with the hotplug? page_to_nid providing an incorrect result
>>>> sounds quite serious to me.
>>>
>>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
>>> space in page->flags to store zone, nid and section.
>>> Currently, even with the largest values (with pagetable level 5), that is not
>>> possible on x86_64.
>>> It is possible though, that somewhere in the future, when the values get larger
>>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
>>> the section) we finally run out of room for the flags though.
>>>
>>> I am not sure about the other arches though, we probably should audit them
>>> and see which ones can fall in there.
>>>
>>
>> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
>
> NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
> information is stored.

Yes and no. Storing it per section clearly doesn't allow storing node
information on smaller granularity, like storing in page->flags does.

So no, it is not only an implementation detail.

> I cannot say how much it is really needed now but
> I can see there will be a demand for it in a longer term because
> page->flags space is scarce and very interesting storage. So I do not
> see it go away I am afraid.

Depends on how performance-critical pfn_to_nid() is. I can't tell.
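The granularity point can be illustrated with a small userspace sketch of the in-flags alternative: when the node id is encoded in each page's own flags word, two neighbouring pages can carry different node ids, which a single per-section entry cannot express. The struct, the helper names, the 6-bit width and the choice of the topmost bits are all assumptions made for illustration; the kernel's real field positions differ.

```c
#include <assert.h>

/* Illustrative width and position only; not the kernel's actual layout. */
#define NODES_SHIFT   6U
#define NODES_PGSHIFT (sizeof(unsigned long) * 8 - NODES_SHIFT)
#define NODES_MASK    ((1UL << NODES_SHIFT) - 1)

/* Stand-in for struct page; only the flags word matters here. */
struct page_model {
	unsigned long flags;
};

/* Store the node id in the top bits, leaving the other bits untouched. */
static void set_page_node_model(struct page_model *page, unsigned long nid)
{
	assert(nid <= NODES_MASK);
	page->flags &= ~(NODES_MASK << NODES_PGSHIFT);
	page->flags |= (nid & NODES_MASK) << NODES_PGSHIFT;
}

/* Per-page lookup: a shift and a mask, with no table access. */
static unsigned long page_to_nid_model(const struct page_model *page)
{
	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
```

Two adjacent entries of a `struct page_model` array can be given node 0 and node 3 respectively, and each lookup returns its own value. The cost is the bits taken from every flags word, which is the scarcity mentioned earlier in the thread.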
On Wed 19-06-19 11:03:49, David Hildenbrand wrote:
> On 19.06.19 11:01, Michal Hocko wrote:
[...]
> > And if they do need a smaller granularity to describe their
> > memory topology then we need a different user API rather than fiddle with
> > implementation details I would argue.
> >
>
> It is not about supporting it, it is about properly blocking it.

We already do that in test_pages_in_a_zone, right? Albeit in
MAX_ORDER_NR_PAGES granularity.
On 19.06.19 11:08, Michal Hocko wrote:
> On Wed 19-06-19 11:03:49, David Hildenbrand wrote:
>> On 19.06.19 11:01, Michal Hocko wrote:
> [...]
>>> And if they do need a smaller granularity to describe their
>>> memory topology then we need a different user API rather than fiddle with
>>> implementation details I would argue.
>>>
>>
>> It is not about supporting it, it is about properly blocking it.
>
> We already do that in test_pages_in_a_zone, right? Albeit in
> MAX_ORDER_NR_PAGES granularity.
>

Indeed, thanks for pointing that out. I knew that we were checking zones
but had in my head that it was working on zone idx.
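The kind of check being referred to can be sketched as a walk over a pfn range in fixed-size steps that rejects the range as soon as two sampled pages belong to different zones. This is a simplified userspace model and not the kernel's test_pages_in_a_zone(): the callback type, the helper names and the chunk size are hypothetical, and holes in the range are ignored. Zones are modelled as plain integers; in the kernel a zone belongs to exactly one node, so a range spanning nodes also spans zones.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for MAX_ORDER_NR_PAGES; the value is illustrative. */
#define CHUNK_NR_PAGES 1024UL

/* Hypothetical lookup supplied by the caller: maps a pfn to a zone id >= 0. */
typedef int (*zone_of_pfn_fn)(unsigned long pfn);

/*
 * Returns true when the first page of every chunk in [start_pfn, end_pfn)
 * lies in the same zone. Pages inside a chunk are not inspected, which is
 * the "chunk granularity" limitation mentioned in the thread.
 */
static bool range_in_one_zone(unsigned long start_pfn, unsigned long end_pfn,
			      zone_of_pfn_fn zone_of)
{
	int zone = -1;
	unsigned long pfn;

	assert(start_pfn <= end_pfn);
	for (pfn = start_pfn; pfn < end_pfn; pfn += CHUNK_NR_PAGES) {
		int z = zone_of(pfn);

		if (zone != -1 && z != zone)
			return false;
		zone = z;
	}
	return true;
}

/* Example layout: the first four chunks are zone 0, everything after is zone 1. */
static int two_zone_layout(unsigned long pfn)
{
	return pfn < 4 * CHUNK_NR_PAGES ? 0 : 1;
}
```

A range covering chunks 0 to 3 of `two_zone_layout` passes, while a range covering chunks 2 to 5 is rejected because it crosses the zone boundary.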
On Wed 19-06-19 11:07:30, David Hildenbrand wrote:
> On 19.06.19 11:04, Michal Hocko wrote:
> > On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
> >> On 19.06.19 09:53, Oscar Salvador wrote:
> >>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
> >>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
> >>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
> >>>>> section_to_node_table[]. While for hot-add memory, this is missed.
> >>>>> Without this information, page_to_nid() may not give the right node id.
> >>>>
> >>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
> >>>> the hotplugged memory, right? Any idea why nobody has noticed this
> >>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
> >>>> unused with the hotplug? page_to_nid providing an incorrect result
> >>>> sounds quite serious to me.
> >>>
> >>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
> >>> space in page->flags to store zone, nid and section.
> >>> Currently, even with the largest values (with pagetable level 5), that is not
> >>> possible on x86_64.
> >>> It is possible though, that somewhere in the future, when the values get larger
> >>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
> >>> the section) we finally run out of room for the flags though.
> >>>
> >>> I am not sure about the other arches though, we probably should audit them
> >>> and see which ones can fall in there.
> >>>
> >>
> >> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
> >
> > NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
> > information is stored.
>
> Yes and no. Storing it per section clearly doesn't allow storing node
> information on smaller granularity, like storing in page->flags does.
>
> So no, it is not only an implementation detail.

Let me try to put it differently. NODE_NOT_IN_PAGE_FLAGS is not about
storing the mapping per section. You can do whatever other data
structure. NODE_NOT_IN_PAGE_FLAGS is in fact about telling that it is
not in page->flags.

> > I cannot say how much it is really needed now but
> > I can see there will be a demand for it in a longer term because
> > page->flags space is scarce and very interesting storage. So I do not
> > see it go away I am afraid.
>
> Depends on how performance-critical pfn_to_nid() is. I can't tell.

page_to_node is used in many important code paths. Not in the hottest ones
I believe but many of them are quite hot I would say.
On 19.06.19 11:16, Michal Hocko wrote:
> On Wed 19-06-19 11:07:30, David Hildenbrand wrote:
>> On 19.06.19 11:04, Michal Hocko wrote:
>>> On Wed 19-06-19 10:51:47, David Hildenbrand wrote:
>>>> On 19.06.19 09:53, Oscar Salvador wrote:
>>>>> On Wed, Jun 19, 2019 at 08:23:30AM +0200, Michal Hocko wrote:
>>>>>> On Tue 18-06-19 08:55:37, Wei Yang wrote:
>>>>>>> In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
>>>>>>> section_to_node_table[]. While for hot-add memory, this is missed.
>>>>>>> Without this information, page_to_nid() may not give the right node id.
>>>>>>
>>>>>> Which would mean that NODE_NOT_IN_PAGE_FLAGS doesn't really work with
>>>>>> the hotplugged memory, right? Any idea why nobody has noticed this
>>>>>> so far? Is it because NODE_NOT_IN_PAGE_FLAGS is rare and essentially
>>>>>> unused with the hotplug? page_to_nid providing an incorrect result
>>>>>> sounds quite serious to me.
>>>>>
>>>>> The thing is that for NODE_NOT_IN_PAGE_FLAGS to be enabled we need to run out of
>>>>> space in page->flags to store zone, nid and section.
>>>>> Currently, even with the largest values (with pagetable level 5), that is not
>>>>> possible on x86_64.
>>>>> It is possible though, that somewhere in the future, when the values get larger
>>>>> (e.g: we add more zones, NODE_SHIFT grows, or we need more space to store
>>>>> the section) we finally run out of room for the flags though.
>>>>>
>>>>> I am not sure about the other arches though, we probably should audit them
>>>>> and see which ones can fall in there.
>>>>>
>>>>
>>>> I'd love to see NODE_NOT_IN_PAGE_FLAGS go.
>>>
>>> NODE_NOT_IN_PAGE_FLAGS is an implementation detail on where the
>>> information is stored.
>>
>> Yes and no. Storing it per section clearly doesn't allow storing node
>> information on smaller granularity, like storing in page->flags does.
>>
>> So no, it is not only an implementation detail.
>
> Let me try to put it differently. NODE_NOT_IN_PAGE_FLAGS is not about
> storing the mapping per section. You can do whatever other data
> structure. NODE_NOT_IN_PAGE_FLAGS is in fact about telling that it is
> not in page->flags.

Okay, I get what you are saying. Storing it differently is problematic,
though, if we want to minimize memory consumption and have a fast lookup.

I was also looking into avoiding to store the section number in
page-flags with CONFIG_SPARSEMEM. Especially, because the
CONFIG_HAVE_ARCH_PFN_VALID hack is really ugly. But it's tricky :(
diff --git a/mm/sparse.c b/mm/sparse.c
index 4012d7f50010..48fa16038cf5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -733,6 +733,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	 */
 	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
 
+	set_section_nid(section_nr, nid);
 	section_mark_present(ms);
 	sparse_init_one_section(ms, section_nr, memmap, usemap);
 
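The effect of the one-line change can be modelled in userspace C. The sketch below assumes a per-section table indexed by section number, as the changelog describes for the NODE_NOT_IN_PAGE_FLAGS case. The table sizes, `pfn_to_nid_model()`, `hot_add_section_model()` and its `record_nid` switch are hypothetical stand-ins chosen for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizes only; the kernel derives these per architecture. */
#define PAGES_PER_SECTION 32768UL
#define NR_MEM_SECTIONS   64UL

/* Userspace stand-in for the per-section node table. Zero-initialised. */
static uint8_t section_to_node_table[NR_MEM_SECTIONS];

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn / PAGES_PER_SECTION;
}

/* Record which node a section belongs to. */
static void set_section_nid(unsigned long section_nr, int nid)
{
	assert(section_nr < NR_MEM_SECTIONS);
	section_to_node_table[section_nr] = (uint8_t)nid;
}

/* Model of a node lookup that has to go through the section table. */
static int pfn_to_nid_model(unsigned long pfn)
{
	unsigned long section_nr = pfn_to_section_nr(pfn);

	assert(section_nr < NR_MEM_SECTIONS);
	return section_to_node_table[section_nr];
}

/*
 * Hypothetical hot-add path. record_nid == 0 models the code before the
 * patch, record_nid != 0 models it with the added set_section_nid() call.
 */
static void hot_add_section_model(int nid, unsigned long start_pfn,
				  int record_nid)
{
	unsigned long section_nr = pfn_to_section_nr(start_pfn);

	if (record_nid)
		set_section_nid(section_nr, nid);
	/* The real code goes on to mark the section present and set up its memmap. */
}
```

In this model, hot-adding a section for node 1 without recording the nid leaves the table entry at its initial value, so a lookup for a pfn in that section reports node 0. With the entry recorded, the same lookup reports node 1.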