Message ID | 20190527111152.16324-11-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/memory_hotplug: Factor out memory block device handling |

On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote:
>We really don't want anything during memory hotunplug to fail.
>We always pass a valid memory block device, that check can go. Avoid
>allocating memory and eventually failing. As we are always called under
>lock, we can use a static piece of memory. This avoids having to put
>the structure onto the stack, having to guess about the stack size
>of callers.
>
>Patch inspired by a patch from Oscar Salvador.
>
>In the future, there might be no need to iterate over nodes at all.
>mem->nid should tell us exactly what to remove. Memory block devices
>with mixed nodes (added during boot) should be properly fenced off and never
>removed.
>
>Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>Cc: Alex Deucher <alexander.deucher@amd.com>
>Cc: "David S. Miller" <davem@davemloft.net>
>Cc: Mark Brown <broonie@kernel.org>
>Cc: Chris Wilson <chris@chris-wilson.co.uk>
>Cc: David Hildenbrand <david@redhat.com>
>Cc: Oscar Salvador <osalvador@suse.de>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>

>---
> drivers/base/node.c  | 18 +++++-------------
> include/linux/node.h |  5 ++---
> 2 files changed, 7 insertions(+), 16 deletions(-)
>
>diff --git a/drivers/base/node.c b/drivers/base/node.c
>index 04fdfa99b8bc..9be88fd05147 100644
>--- a/drivers/base/node.c
>+++ b/drivers/base/node.c
>@@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
> 
> /*
>  * Unregister memory block device under all nodes that it spans.
>+ * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
>  */
>-int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>+void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> {
>-	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
> 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
>+	static nodemask_t unlinked_nodes;
> 
>-	if (!mem_blk) {
>-		NODEMASK_FREE(unlinked_nodes);
>-		return -EFAULT;
>-	}
>-	if (!unlinked_nodes)
>-		return -ENOMEM;
>-	nodes_clear(*unlinked_nodes);
>-
>+	nodes_clear(unlinked_nodes);
> 	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
> 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>@@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> 			continue;
> 		if (!node_online(nid))
> 			continue;
>-		if (node_test_and_set(nid, *unlinked_nodes))
>+		if (node_test_and_set(nid, unlinked_nodes))
> 			continue;
> 		sysfs_remove_link(&node_devices[nid]->dev.kobj,
> 			 kobject_name(&mem_blk->dev.kobj));
> 		sysfs_remove_link(&mem_blk->dev.kobj,
> 			 kobject_name(&node_devices[nid]->dev.kobj));
> 	}
>-	NODEMASK_FREE(unlinked_nodes);
>-	return 0;
> }
> 
> int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
>diff --git a/include/linux/node.h b/include/linux/node.h
>index 02a29e71b175..548c226966a2 100644
>--- a/include/linux/node.h
>+++ b/include/linux/node.h
>@@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
> extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
> extern int register_mem_sect_under_node(struct memory_block *mem_blk,
> 						void *arg);
>-extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
>+extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
> 
> extern int register_memory_node_under_compute_node(unsigned int mem_nid,
> 						   unsigned int cpu_nid,
>@@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
> {
> 	return 0;
> }
>-static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>+static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> {
>-	return 0;
> }
> 
> static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>-- 
>2.20.1

On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote:
> We really don't want anything during memory hotunplug to fail.
> We always pass a valid memory block device, that check can go. Avoid
> allocating memory and eventually failing. As we are always called under
> lock, we can use a static piece of memory. This avoids having to put
> the structure onto the stack, having to guess about the stack size
> of callers.
>
> Patch inspired by a patch from Oscar Salvador.
>
> In the future, there might be no need to iterate over nodes at all.
> mem->nid should tell us exactly what to remove. Memory block devices
> with mixed nodes (added during boot) should be properly fenced off and never
> removed.
>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

On Mon 27-05-19 13:11:51, David Hildenbrand wrote:
> We really don't want anything during memory hotunplug to fail.
> We always pass a valid memory block device, that check can go. Avoid
> allocating memory and eventually failing. As we are always called under
> lock, we can use a static piece of memory. This avoids having to put
> the structure onto the stack, having to guess about the stack size
> of callers.
>
> Patch inspired by a patch from Oscar Salvador.
>
> In the future, there might be no need to iterate over nodes at all.
> mem->nid should tell us exactly what to remove. Memory block devices
> with mixed nodes (added during boot) should be properly fenced off and never
> removed.

Yeah, we do not allow to offline multi zone (node) ranges so the current
code seems to be over engineered.

Anyway, I am wondering why do we have to strictly check for already
removed nodes links. Is the sysfs code going to complain if we try to
remove again?

> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  drivers/base/node.c  | 18 +++++-------------
>  include/linux/node.h |  5 ++---
>  2 files changed, 7 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 04fdfa99b8bc..9be88fd05147 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
>  
>  /*
>   * Unregister memory block device under all nodes that it spans.
> + * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
>   */
> -int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> +void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>  {
> -	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
>  	unsigned long pfn, sect_start_pfn, sect_end_pfn;
> +	static nodemask_t unlinked_nodes;
>  
> -	if (!mem_blk) {
> -		NODEMASK_FREE(unlinked_nodes);
> -		return -EFAULT;
> -	}
> -	if (!unlinked_nodes)
> -		return -ENOMEM;
> -	nodes_clear(*unlinked_nodes);
> -
> +	nodes_clear(unlinked_nodes);
>  	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
>  	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> @@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>  			continue;
>  		if (!node_online(nid))
>  			continue;
> -		if (node_test_and_set(nid, *unlinked_nodes))
> +		if (node_test_and_set(nid, unlinked_nodes))
>  			continue;
>  		sysfs_remove_link(&node_devices[nid]->dev.kobj,
>  			 kobject_name(&mem_blk->dev.kobj));
>  		sysfs_remove_link(&mem_blk->dev.kobj,
>  			 kobject_name(&node_devices[nid]->dev.kobj));
>  	}
> -	NODEMASK_FREE(unlinked_nodes);
> -	return 0;
>  }
>  
>  int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 02a29e71b175..548c226966a2 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  						void *arg);
> -extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
> +extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
>  
>  extern int register_memory_node_under_compute_node(unsigned int mem_nid,
>  						   unsigned int cpu_nid,
> @@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
>  {
>  	return 0;
>  }
> -static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> +static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>  {
> -	return 0;
>  }
>  
>  static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
> -- 
> 2.20.1

On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> Yeah, we do not allow to offline multi zone (node) ranges so the current
> code seems to be over engineered.
>
> Anyway, I am wondering why do we have to strictly check for already
> removed nodes links. Is the sysfs code going to complain if we try to
> remove again?

No, sysfs will silently "fail" if the symlink has already been removed.
At least that is what I saw last time I played with it.

I guess the question is what if sysfs handling changes in the future
and starts dropping warnings when trying to remove a symlink that is not there.
Maybe that is unlikely to happen?

On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> > Yeah, we do not allow to offline multi zone (node) ranges so the current
> > code seems to be over engineered.
> >
> > Anyway, I am wondering why do we have to strictly check for already
> > removed nodes links. Is the sysfs code going to complain if we try to
> > remove again?
>
> No, sysfs will silently "fail" if the symlink has already been removed.
> At least that is what I saw last time I played with it.
>
> I guess the question is what if sysfs handling changes in the future
> and starts dropping warnings when trying to remove a symlink that is not there.
> Maybe that is unlikely to happen?

And maybe we handle it then rather than have a static allocation that
everybody with hotremove configured has to pay for.

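To put a rough number on the cost of the static allocation mentioned above, here is a small user-space sketch. It assumes that nodemask_t is a bitmap of 1 << CONFIG_NODES_SHIFT bits rounded up to whole unsigned longs; the helper name model_nodemask_bytes() is hypothetical and not a kernel function.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Bytes needed for a bitmap of (1 << nodes_shift) bits, rounded up to
 * whole unsigned longs. Assumption: this mirrors how nodemask_t is laid
 * out; the function itself is illustrative only.
 */
static size_t model_nodemask_bytes(unsigned int nodes_shift)
{
	size_t bits = (size_t)1 << nodes_shift;
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	size_t longs = (bits + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}
```

Under that assumption a shift of 10 (1024 possible nodes) would cost 128 bytes of static memory, and a shift of 6 would cost 8 bytes.
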
On 01.07.19 12:27, Michal Hocko wrote:
> On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
>> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
>>> Yeah, we do not allow to offline multi zone (node) ranges so the current
>>> code seems to be over engineered.
>>>
>>> Anyway, I am wondering why do we have to strictly check for already
>>> removed nodes links. Is the sysfs code going to complain if we try to
>>> remove again?
>>
>> No, sysfs will silently "fail" if the symlink has already been removed.
>> At least that is what I saw last time I played with it.
>>
>> I guess the question is what if sysfs handling changes in the future
>> and starts dropping warnings when trying to remove a symlink that is not there.
>> Maybe that is unlikely to happen?
>
> And maybe we handle it then rather than have a static allocation that
> everybody with hotremove configured has to pay for.
>

So what's the suggestion? Dropping the nodemask_t completely and calling
sysfs_remove_link() on already potentially removed links?

Of course, we can also just use mem_blk->nid and rest assured that it
will never be called for memory blocks belonging to multiple nodes.

On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote:
> On 01.07.19 12:27, Michal Hocko wrote:
> > On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
> >> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> >>> Yeah, we do not allow to offline multi zone (node) ranges so the current
> >>> code seems to be over engineered.
> >>>
> >>> Anyway, I am wondering why do we have to strictly check for already
> >>> removed nodes links. Is the sysfs code going to complain if we try to
> >>> remove again?
> >>
> >> No, sysfs will silently "fail" if the symlink has already been removed.
> >> At least that is what I saw last time I played with it.
> >>
> >> I guess the question is what if sysfs handling changes in the future
> >> and starts dropping warnings when trying to remove a symlink that is not there.
> >> Maybe that is unlikely to happen?
> >
> > And maybe we handle it then rather than have a static allocation that
> > everybody with hotremove configured has to pay for.
> >
>
> So what's the suggestion? Dropping the nodemask_t completely and calling
> sysfs_remove_link() on already potentially removed links?
>
> Of course, we can also just use mem_blk->nid and rest assured that it
> will never be called for memory blocks belonging to multiple nodes.

Hi David,

While it is easy to construct a scenario where a memblock belongs to multiple
nodes, I have to confess that I have not yet seen that in a real-world scenario.

That said, I think that the less risky way is to just drop the nodemask_t
and not care about calling sysfs_remove_link() for already removed links.
As I said, sysfs_remove_link() will silently fail when it fails to find the
symlink, so I do not think it is a big deal.

On 16.07.19 10:46, Oscar Salvador wrote:
> On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote:
>> On 01.07.19 12:27, Michal Hocko wrote:
>>> On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
>>>> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
>>>>> Yeah, we do not allow to offline multi zone (node) ranges so the current
>>>>> code seems to be over engineered.
>>>>>
>>>>> Anyway, I am wondering why do we have to strictly check for already
>>>>> removed nodes links. Is the sysfs code going to complain if we try to
>>>>> remove again?
>>>>
>>>> No, sysfs will silently "fail" if the symlink has already been removed.
>>>> At least that is what I saw last time I played with it.
>>>>
>>>> I guess the question is what if sysfs handling changes in the future
>>>> and starts dropping warnings when trying to remove a symlink that is not there.
>>>> Maybe that is unlikely to happen?
>>>
>>> And maybe we handle it then rather than have a static allocation that
>>> everybody with hotremove configured has to pay for.
>>>
>>
>> So what's the suggestion? Dropping the nodemask_t completely and calling
>> sysfs_remove_link() on already potentially removed links?
>>
>> Of course, we can also just use mem_blk->nid and rest assured that it
>> will never be called for memory blocks belonging to multiple nodes.
>
> Hi David,
>
> While it is easy to construct a scenario where a memblock belongs to multiple
> nodes, I have to confess that I yet have not seen that in a real-world scenario.
>
> Given said that, I think that the less risky way is to just drop the nodemask_t
> and do not care about calling sysfs_remove_link() for already removed links.
> As I said, sysfs_remove_link() will silently fail when it fails to find the
> symlink, so I do not think it is a big deal.
>

As far as I can tell we

a) don't allow offlining of memory that belongs to multiple nodes
   already (as pointed out by Michal recently)

b) users cannot add memory blocks that belong to multiple nodes via
   add_memory()

So I don't see a way how remove_memory() (and even offline_pages())
could ever succeed on such memory blocks.

I think it should be fine to limit it to one node here. (if not, I guess
we would have a different BUG that would actually allow to remove such
memory blocks)

On Mon 15-07-19 13:10:33, David Hildenbrand wrote:
> On 01.07.19 12:27, Michal Hocko wrote:
> > On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
> >> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> >>> Yeah, we do not allow to offline multi zone (node) ranges so the current
> >>> code seems to be over engineered.
> >>>
> >>> Anyway, I am wondering why do we have to strictly check for already
> >>> removed nodes links. Is the sysfs code going to complain if we try to
> >>> remove again?
> >>
> >> No, sysfs will silently "fail" if the symlink has already been removed.
> >> At least that is what I saw last time I played with it.
> >>
> >> I guess the question is what if sysfs handling changes in the future
> >> and starts dropping warnings when trying to remove a symlink that is not there.
> >> Maybe that is unlikely to happen?
> >
> > And maybe we handle it then rather than have a static allocation that
> > everybody with hotremove configured has to pay for.
> >
>
> So what's the suggestion? Dropping the nodemask_t completely and calling
> sysfs_remove_link() on already potentially removed links?

Yes. In a follow up patch.

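The follow-up direction agreed on here (no nodemask_t, repeated removal of the same link tolerated) can be modelled in user space roughly as follows. This is only a sketch: link_present, remove_calls, model_remove_link() and model_unlink_block_nomask() are hypothetical stand-ins, and the idempotent behaviour of the removal is taken from the observation reported in this thread, not from a guarantee of the sysfs API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NODES 8	/* hypothetical limit for this model only */

static bool link_present[MODEL_MAX_NODES];	/* models the sysfs symlinks */
static int remove_calls;			/* counts removal attempts */

/* Idempotent removal: removing a link that is already gone is a no-op. */
static void model_remove_link(int nid)
{
	remove_calls++;
	link_present[nid] = false;
}

/*
 * Walk the per-page node ids of one memory block and remove the link for
 * every valid node id seen, without remembering which nodes were already
 * handled. No extra state is needed, at the price of redundant calls.
 */
static void model_unlink_block_nomask(const int *page_nids, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		int nid = page_nids[i];

		if (nid < 0 || nid >= MODEL_MAX_NODES)
			continue;	/* skip invalid node ids */
		model_remove_link(nid);
	}
}
```

In this model every page triggers a removal attempt, so the trade-off is more calls in exchange for no allocation, no static state and no locking requirement for that state.
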
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 04fdfa99b8bc..9be88fd05147 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 
 /*
  * Unregister memory block device under all nodes that it spans.
+ * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
  */
-int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
+	static nodemask_t unlinked_nodes;
 
-	if (!mem_blk) {
-		NODEMASK_FREE(unlinked_nodes);
-		return -EFAULT;
-	}
-	if (!unlinked_nodes)
-		return -ENOMEM;
-	nodes_clear(*unlinked_nodes);
-
+	nodes_clear(unlinked_nodes);
 	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
@@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 			continue;
 		if (!node_online(nid))
 			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
+		if (node_test_and_set(nid, unlinked_nodes))
 			continue;
 		sysfs_remove_link(&node_devices[nid]->dev.kobj,
 			 kobject_name(&mem_blk->dev.kobj));
 		sysfs_remove_link(&mem_blk->dev.kobj,
 			 kobject_name(&node_devices[nid]->dev.kobj));
 	}
-	NODEMASK_FREE(unlinked_nodes);
-	return 0;
 }
 
 int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/node.h b/include/linux/node.h
index 02a29e71b175..548c226966a2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
+extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
@@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	return 0;
 }
 
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,

We really don't want anything during memory hotunplug to fail.
We always pass a valid memory block device, that check can go. Avoid
allocating memory and eventually failing. As we are always called under
lock, we can use a static piece of memory. This avoids having to put
the structure onto the stack, having to guess about the stack size
of callers.

Patch inspired by a patch from Oscar Salvador.

In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove. Memory block devices
with mixed nodes (added during boot) should be properly fenced off and never
removed.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/node.c  | 18 +++++-------------
 include/linux/node.h |  5 ++---
 2 files changed, 7 insertions(+), 16 deletions(-)
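
The approach described in this commit message, a static mask that is cleared on entry and used to visit each node only once, can be illustrated with a user-space model. This is a sketch under stated assumptions, not the kernel code: struct model_nodemask, model_test_and_set() and model_unlink_block() are hypothetical stand-ins for nodemask_t, node_test_and_set() and the patched function, and the lock that the real code relies on is only noted in a comment.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_NODES 64	/* hypothetical; the kernel uses MAX_NUMNODES */
#define MODEL_BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))
#define MODEL_WORDS \
	((MODEL_MAX_NODES + MODEL_BITS_PER_WORD - 1) / MODEL_BITS_PER_WORD)

/* Stand-in for nodemask_t: one bit per node id. */
struct model_nodemask {
	unsigned long bits[MODEL_WORDS];
};

/* Set the bit for nid and report whether it was already set. */
static bool model_test_and_set(int nid, struct model_nodemask *mask)
{
	unsigned long *word = &mask->bits[nid / MODEL_BITS_PER_WORD];
	unsigned long bit = 1UL << (nid % MODEL_BITS_PER_WORD);
	bool was_set = (*word & bit) != 0;

	*word |= bit;
	return was_set;
}

/*
 * Walk the per-page node ids of one block and return how many distinct
 * nodes would be unlinked. Not reentrant: the static mask is shared, so
 * callers must serialize (the patch relies on mem_sysfs_mutex for that).
 */
static int model_unlink_block(const int *page_nids, size_t npages)
{
	static struct model_nodemask unlinked;	/* static storage, not stack */
	int unlinked_count = 0;
	size_t i;

	memset(&unlinked, 0, sizeof(unlinked));	/* models nodes_clear() */
	for (i = 0; i < npages; i++) {
		int nid = page_nids[i];

		if (nid < 0 || nid >= MODEL_MAX_NODES)
			continue;	/* skip invalid node ids */
		if (model_test_and_set(nid, &unlinked))
			continue;	/* node already handled */
		unlinked_count++;	/* the two sysfs_remove_link() calls go here */
	}
	return unlinked_count;
}
```

Because the mask has static storage, the function can neither fail on allocation nor grow the caller's stack, which is the point the commit message makes; the price is that correctness depends on the caller holding the lock.
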