
[-next,v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

Message ID 20190512054829.11899-1-cai@lca.pw (mailing list archive)
State New, archived
Series [-next,v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

Commit Message

Qian Cai May 12, 2019, 5:48 a.m. UTC
The linux-next commit ("x86, numa: always initialize all possible
nodes") introduced a crash below during boot for systems with a
memory-less node. This is due to CPUs that get onlined during SMP boot,
but that onlining triggers a page fault in bus_add_device() during
device registration:

	error = sysfs_create_link(&bus->p->devices_kset->kobj,

bus->p is NULL. That "p" is the subsys_private struct, and it should
have been set in,

	postcore_initcall(register_node_type);

but that happens in do_basic_setup() after smp_init().
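
The ordering in init/main.c is roughly the following (a simplified sketch,
not verbatim kernel code):

	static noinline void __init kernel_init_freeable(void)
	{
		...
		/* brings up secondary CPUs; ends up in do_cpu_up() -> try_online_node() */
		smp_init();
		...
		/* runs the initcall levels, including postcore_initcall(register_node_type) */
		do_basic_setup();
		...
	}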

The old code had set this node online via alloc_node_data(), so when it
came time to do_cpu_up() -> try_online_node(), the node was already up
and nothing happened.
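
That is, the node came up as a side effect of allocating its node_data;
roughly (a sketch of the old alloc_node_data() tail in arch/x86/mm/numa.c;
details may differ across trees):

	/* old behavior: allocating the per-node data also marked the node online */
	node_data[nid] = nd;	/* nd points at the freshly allocated pg_data_t */
	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
	node_set_online(nid);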

Now, it attempts to online the node, which registers the node with
sysfs, but that can't happen before the 'node' subsystem is registered.

Since kernel_init() runs in a kernel thread while the system is still in
the SYSTEM_SCHEDULING state, fix this by skipping the sysfs registration
during early boot in __try_online_node().

Call Trace:
 device_add+0x43e/0x690
 device_register+0x107/0x110
 __register_one_node+0x72/0x150
 __try_online_node+0x8f/0xd0
 try_online_node+0x2b/0x50
 do_cpu_up+0x46/0xf0
 cpu_up+0x13/0x20
 smp_init+0x6e/0xd0
 kernel_init_freeable+0xe5/0x21f
 kernel_init+0xf/0x180
 ret_from_fork+0x1f/0x30

Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: Set the node online as it has CPUs. Otherwise, those memory-less nodes will
    end up missing from sysfs, i.e., /sys/devices/system/node/.

 mm/memory_hotplug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Michal Hocko May 13, 2019, 12:41 p.m. UTC | #1
On Sun 12-05-19 01:48:29, Qian Cai wrote:
> The linux-next commit ("x86, numa: always initialize all possible
> nodes") introduced a crash below during boot for systems with a
> memory-less node. This is due to CPUs that get onlined during SMP boot,
> but that onlining triggers a page fault in bus_add_device() during
> device registration:
> 
> 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> 
> bus->p is NULL. That "p" is the subsys_private struct, and it should
> have been set in,
> 
> 	postcore_initcall(register_node_type);
> 
> but that happens in do_basic_setup() after smp_init().
> 
> The old code had set this node online via alloc_node_data(), so when it
> came time to do_cpu_up() -> try_online_node(), the node was already up
> and nothing happened.
> 
> Now, it attempts to online the node, which registers the node with
> sysfs, but that can't happen before the 'node' subsystem is registered.
> 
> Since kernel_init() is running by a kernel thread that is in
> SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> during the early boot in __try_online_node().

Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
drop try_online_node from do_cpu_up? Your v2 remark below suggests that
we need to call node_set_online because something later on depends on
that. Btw. why do we even allocate a pgdat from this path? This looks
really messy.

> Call Trace:
>  device_add+0x43e/0x690
>  device_register+0x107/0x110
>  __register_one_node+0x72/0x150
>  __try_online_node+0x8f/0xd0
>  try_online_node+0x2b/0x50
>  do_cpu_up+0x46/0xf0
>  cpu_up+0x13/0x20
>  smp_init+0x6e/0xd0
>  kernel_init_freeable+0xe5/0x21f
>  kernel_init+0xf/0x180
>  ret_from_fork+0x1f/0x30
> 
> Reported-by: Barret Rhoden <brho@google.com>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> 
> v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes will
>     end up being not in sysfs i.e., /sys/devices/system/node/.
> 
>  mm/memory_hotplug.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b236069ff0d8..6eb2331fa826 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool set_node_online)
>  	if (node_online(nid))
>  		return 0;
>  
> +	/*
> +	 * Here is called by cpu_up() to online a node without memory from
> +	 * kernel_init() which guarantees that "set_node_online" is true which
> +	 * will set the node online as it have CPUs but not ready to call
> +	 * register_one_node() as "node_subsys" has not been initialized
> +	 * properly yet.
> +	 */
> +	if (system_state == SYSTEM_SCHEDULING) {
> +		node_set_online(nid);
> +		return 0;
> +	}
> +
>  	pgdat = hotadd_new_pgdat(nid, start);
>  	if (!pgdat) {
>  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
> -- 
> 2.20.1 (Apple Git-117)
Qian Cai May 13, 2019, 1:43 p.m. UTC | #2
On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > The linux-next commit ("x86, numa: always initialize all possible
> > nodes") introduced a crash below during boot for systems with a
> > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > but that onlining triggers a page fault in bus_add_device() during
> > device registration:
> > 
> > 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > 
> > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > have been set in,
> > 
> > 	postcore_initcall(register_node_type);
> > 
> > but that happens in do_basic_setup() after smp_init().
> > 
> > The old code had set this node online via alloc_node_data(), so when it
> > came time to do_cpu_up() -> try_online_node(), the node was already up
> > and nothing happened.
> > 
> > Now, it attempts to online the node, which registers the node with
> > sysfs, but that can't happen before the 'node' subsystem is registered.
> > 
> > Since kernel_init() is running by a kernel thread that is in
> > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > during the early boot in __try_online_node().
> 
> Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> we need to call node_set_online because something later on depends on
> that. Btw. why do we even allocate a pgdat from this path? This looks
> really messy.

See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local
memory online")

It looks like try_online_node() in do_cpu_up() is needed for memory hotplug,
which puts a node online if it is offline, and then hotadd_new_pgdat() calls
build_all_zonelists() to initialize the zonelists.
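
For reference, a condensed sketch of that path (based on mm/memory_hotplug.c
of this era; error handling and printouts trimmed):

	static int __try_online_node(int nid, u64 start, bool set_node_online)
	{
		pg_data_t *pgdat;
		int ret = 1;

		if (node_online(nid))
			return 0;

		/* allocates the pgdat and ends up calling build_all_zonelists() */
		pgdat = hotadd_new_pgdat(nid, start);
		if (!pgdat)
			return -ENOMEM;

		if (set_node_online) {
			node_set_online(nid);
			/* the sysfs registration that crashes before node_subsys is ready */
			ret = register_one_node(nid);
		}
		return ret;
	}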

> 
> > Call Trace:
> >  device_add+0x43e/0x690
> >  device_register+0x107/0x110
> >  __register_one_node+0x72/0x150
> >  __try_online_node+0x8f/0xd0
> >  try_online_node+0x2b/0x50
> >  do_cpu_up+0x46/0xf0
> >  cpu_up+0x13/0x20
> >  smp_init+0x6e/0xd0
> >  kernel_init_freeable+0xe5/0x21f
> >  kernel_init+0xf/0x180
> >  ret_from_fork+0x1f/0x30
> > 
> > Reported-by: Barret Rhoden <brho@google.com>
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> > 
> > v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes
> > will
> >     end up being not in sysfs i.e., /sys/devices/system/node/.
> > 
> >  mm/memory_hotplug.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index b236069ff0d8..6eb2331fa826 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool
> > set_node_online)
> >  	if (node_online(nid))
> >  		return 0;
> >  
> > +	/*
> > +	 * Here is called by cpu_up() to online a node without memory from
> > +	 * kernel_init() which guarantees that "set_node_online" is true
> > which
> > +	 * will set the node online as it have CPUs but not ready to call
> > +	 * register_one_node() as "node_subsys" has not been initialized
> > +	 * properly yet.
> > +	 */
> > +	if (system_state == SYSTEM_SCHEDULING) {
> > +		node_set_online(nid);
> > +		return 0;
> > +	}
> > +
> >  	pgdat = hotadd_new_pgdat(nid, start);
> >  	if (!pgdat) {
> >  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
> > -- 
> > 2.20.1 (Apple Git-117)
> 
>
Michal Hocko May 13, 2019, 2:04 p.m. UTC | #3
On Mon 13-05-19 09:43:59, Qian Cai wrote:
> On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > The linux-next commit ("x86, numa: always initialize all possible
> > > nodes") introduced a crash below during boot for systems with a
> > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > but that onlining triggers a page fault in bus_add_device() during
> > > device registration:
> > > 
> > > 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > 
> > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > have been set in,
> > > 
> > > 	postcore_initcall(register_node_type);
> > > 
> > > but that happens in do_basic_setup() after smp_init().
> > > 
> > > The old code had set this node online via alloc_node_data(), so when it
> > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > and nothing happened.
> > > 
> > > Now, it attempts to online the node, which registers the node with
> > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > 
> > > Since kernel_init() is running by a kernel thread that is in
> > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > during the early boot in __try_online_node().
> > 
> > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > we need to call node_set_online because something later on depends on
> > that. Btw. why do we even allocate a pgdat from this path? This looks
> > really messy.
> 
> See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local
> memory online")
> 
> It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> which is to put its node online if offlined and then hotadd_new_pgdat() calls
> build_all_zonelists() to initialize the zone list.

Well, do we still have to follow the logic that the above (unreviewed)
commit has established? The hotplug code in general made a lot of ad-hoc
design decisions which had to be revisited over time. If we are not
allocating pgdats for newly added memory then we should really make sure
to do so at a proper time and hook. I am not sure about CPU vs. memory
init ordering but even then I would really prefer if we could make the
init less obscure and _documented_.
Qian Cai May 13, 2019, 3:20 p.m. UTC | #4
On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > nodes") introduced a crash below during boot for systems with a
> > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > but that onlining triggers a page fault in bus_add_device() during
> > > > device registration:
> > > > 
> > > > 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > 
> > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > have been set in,
> > > > 
> > > > 	postcore_initcall(register_node_type);
> > > > 
> > > > but that happens in do_basic_setup() after smp_init().
> > > > 
> > > > The old code had set this node online via alloc_node_data(), so when it
> > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > and nothing happened.
> > > > 
> > > > Now, it attempts to online the node, which registers the node with
> > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > 
> > > > Since kernel_init() is running by a kernel thread that is in
> > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > during the early boot in __try_online_node().
> > > 
> > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > we need to call node_set_online because something later on depends on
> > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > really messy.
> > 
> > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > local
> > memory online")
> > 
> > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > which is to put its node online if offlined and then hotadd_new_pgdat()
> > calls
> > build_all_zonelists() to initialize the zone list.
> 
> Well, do we still have to follow the logic that the above (unreviewed)
> commit has established? The hotplug code in general made a lot of ad-hoc
> design decisions which had to be revisited over time. If we are not
> allocating pgdats for newly added memory then we should really make sure
> to do so at a proper time and hook. I am not sure about CPU vs. memory
> init ordering but even then I would really prefer if we could make the
> init less obscure and _documented_.

I don't know, but I think it is a good idea to keep the existing logic rather
than do big surgery, unless someone is able to confirm it does not break NUMA
node physical hotplug.
Michal Hocko May 13, 2019, 3:31 p.m. UTC | #5
On Mon 13-05-19 11:20:46, Qian Cai wrote:
> On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > nodes") introduced a crash below during boot for systems with a
> > > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > device registration:
> > > > > 
> > > > > 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > 
> > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > have been set in,
> > > > > 
> > > > > 	postcore_initcall(register_node_type);
> > > > > 
> > > > > but that happens in do_basic_setup() after smp_init().
> > > > > 
> > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > and nothing happened.
> > > > > 
> > > > > Now, it attempts to online the node, which registers the node with
> > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > 
> > > > > Since kernel_init() is running by a kernel thread that is in
> > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > > during the early boot in __try_online_node().
> > > > 
> > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > we need to call node_set_online because something later on depends on
> > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > really messy.
> > > 
> > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > local
> > > memory online")
> > > 
> > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > calls
> > > build_all_zonelists() to initialize the zone list.
> > 
> > Well, do we still have to follow the logic that the above (unreviewed)
> > commit has established? The hotplug code in general made a lot of ad-hoc
> > design decisions which had to be revisited over time. If we are not
> > allocating pgdats for newly added memory then we should really make sure
> > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > init ordering but even then I would really prefer if we could make the
> > init less obscure and _documented_.
> 
> I don't know, but I think it is a good idea to keep the existing logic rather
> than do a big surgery

Adding more hacks just doesn't make the situation any better.

> unless someone is able to confirm it is not breaking NUMA
> node physical hotplug.

I have a machine to test whole node offline. I am just too busy to prepare a
patch myself. I can have it tested though.
Pingfan Liu May 22, 2019, 7:12 a.m. UTC | #6
On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > nodes") introduced a crash below during boot for systems with a
> > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > > device registration:
> > > > > >
> > > > > >       error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > >
> > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > have been set in,
> > > > > >
> > > > > >       postcore_initcall(register_node_type);
> > > > > >
> > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > >
> > > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > > and nothing happened.
> > > > > >
> > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > >
> > > > > > Since kernel_init() is running by a kernel thread that is in
> > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > > > during the early boot in __try_online_node().
> > > > >
> > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > > we need to call node_set_online because something later on depends on
> > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > really messy.
> > > >
> > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > > local
> > > > memory online")
> > > >
> > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > > calls
> > > > build_all_zonelists() to initialize the zone list.
> > >
> > > Well, do we still have to follow the logic that the above (unreviewed)
> > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > design decisions which had to be revisited over time. If we are not
> > > allocating pgdats for newly added memory then we should really make sure
> > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > init ordering but even then I would really prefer if we could make the
> > > init less obscure and _documented_.
> >
> > I don't know, but I think it is a good idea to keep the existing logic rather
> > than do a big surgery
>
> Adding more hacks just doesn't make the situation any better.
>
> > unless someone is able to confirm it is not breaking NUMA
> > node physical hotplug.
>
> I have a machine to test whole node offline. I am just busy to prepare a
> patch myself. I can have it tested though.
>
I think the definition of "node online" is worth of rethinking. Before
patch "x86, numa: always initialize all possible nodes", online means
either cpu or memory present. After this patch, only node owing memory
as present.

In the commit log, I think the change's motivation should be "Not to
mention that it doesn't really make much sense to consider an empty
node as online because we just consider this node whenever we want to
iterate nodes to use and empty node is obviously not the best
candidate."

But in fact, we already have for_each_node_state(nid, N_MEMORY) to
cover this purpose. Furthermore, changing the definition of online may
break something in the scheduler, e.g. in task_numa_migrate(), where
it calls for_each_online_node.

By keeping the node owning CPUs as online, Michal's patch can avoid
such corner cases and keep things easy. Furthermore, if needed, the
other patch can use for_each_node_state(nid, N_MEMORY) to replace
for_each_online_node() in some places, as sketched below.
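
To illustrate the difference (a minimal sketch; both iterators come from
include/linux/nodemask.h, and do_work() stands in for a hypothetical
per-node action):

	int nid;

	/* walks every node marked online, including memoryless ones */
	for_each_online_node(nid)
		do_work(nid);

	/* walks only nodes that actually own memory */
	for_each_node_state(nid, N_MEMORY)
		do_work(nid);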

Regards,
Pingfan

> --
> Michal Hocko
> SUSE Labs
Michal Hocko May 22, 2019, 11:16 a.m. UTC | #7
On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > > nodes") introduced a crash below during boot for systems with a
> > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > > > device registration:
> > > > > > >
> > > > > > >       error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > > >
> > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > > have been set in,
> > > > > > >
> > > > > > >       postcore_initcall(register_node_type);
> > > > > > >
> > > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > > >
> > > > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > > > and nothing happened.
> > > > > > >
> > > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > > >
> > > > > > > Since kernel_init() is running by a kernel thread that is in
> > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > > > > during the early boot in __try_online_node().
> > > > > >
> > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > > > we need to call node_set_online because something later on depends on
> > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > > really messy.
> > > > >
> > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > > > local
> > > > > memory online")
> > > > >
> > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > > > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > > > calls
> > > > > build_all_zonelists() to initialize the zone list.
> > > >
> > > > Well, do we still have to follow the logic that the above (unreviewed)
> > > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > > design decisions which had to be revisited over time. If we are not
> > > > allocating pgdats for newly added memory then we should really make sure
> > > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > > init ordering but even then I would really prefer if we could make the
> > > > init less obscure and _documented_.
> > >
> > > I don't know, but I think it is a good idea to keep the existing logic rather
> > > than do a big surgery
> >
> > Adding more hacks just doesn't make the situation any better.
> >
> > > unless someone is able to confirm it is not breaking NUMA
> > > node physical hotplug.
> >
> > I have a machine to test whole node offline. I am just busy to prepare a
> > patch myself. I can have it tested though.
> >
> I think the definition of "node online" is worth of rethinking. Before
> patch "x86, numa: always initialize all possible nodes", online means
> either cpu or memory present. After this patch, only node owing memory
> as present.
> 
> In the commit log, I think the change's motivation should be "Not to
> mention that it doesn't really make much sense to consider an empty
> node as online because we just consider this node whenever we want to
> iterate nodes to use and empty node is obviously not the best
> candidate."
> 
> But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> cover this purpose.

I do not really think we want to spread N_MEMORY outside of the core MM.
It is quite confusing IMHO.
> Furthermore, changing the definition of online may
> break something in the scheduler, e.g. in task_numa_migrate(), where
> it calls for_each_online_node.

Could you be more specific please? Why should numa balancing consider
nodes without any memory?

> By keeping the node owning cpu as online, Michal's patch can avoid
> such corner case and keep things easy. Furthermore, if needed, the
> other patch can use for_each_node_state(nid, N_MEMORY) to replace
> for_each_online_node in some places.

Ideally no code outside of the core MM should care about what kind of
memory the node really owns. The external code should only care
whether the node is online and thus usable, or offline and of no
interest.
Pingfan Liu May 23, 2019, 3:58 a.m. UTC | #8
On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > > > nodes") introduced a crash below during boot for systems with a
> > > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > > > > device registration:
> > > > > > > >
> > > > > > > >       error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > > > >
> > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > > > have been set in,
> > > > > > > >
> > > > > > > >       postcore_initcall(register_node_type);
> > > > > > > >
> > > > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > > > >
> > > > > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > > > > and nothing happened.
> > > > > > > >
> > > > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > > > >
> > > > > > > > Since kernel_init() is running by a kernel thread that is in
> > > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > > > > > during the early boot in __try_online_node().
> > > > > > >
> > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > > > > we need to call node_set_online because something later on depends on
> > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > > > really messy.
> > > > > >
> > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > > > > local
> > > > > > memory online")
> > > > > >
> > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > > > > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > > > > calls
> > > > > > build_all_zonelists() to initialize the zone list.
> > > > >
> > > > > Well, do we still have to follow the logic that the above (unreviewed)
> > > > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > > > design decisions which had to be revisited over time. If we are not
> > > > > allocating pgdats for newly added memory then we should really make sure
> > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > > > init ordering but even then I would really prefer if we could make the
> > > > > init less obscure and _documented_.
> > > >
> > > > I don't know, but I think it is a good idea to keep the existing logic rather
> > > > than do a big surgery
> > >
> > > Adding more hacks just doesn't make the situation any better.
> > >
> > > > unless someone is able to confirm it is not breaking NUMA
> > > > node physical hotplug.
> > >
> > > I have a machine to test whole node offline. I am just busy to prepare a
> > > patch myself. I can have it tested though.
> > >
> > I think the definition of "node online" is worth of rethinking. Before
> > patch "x86, numa: always initialize all possible nodes", online means
> > either cpu or memory present. After this patch, only node owing memory
> > as present.
> >
> > In the commit log, I think the change's motivation should be "Not to
> > mention that it doesn't really make much sense to consider an empty
> > node as online because we just consider this node whenever we want to
> > iterate nodes to use and empty node is obviously not the best
> > candidate."
> >
> > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > cover this purpose.
>
> I do not really think we want to spread N_MEMORY outside of the core MM.
> It is quite confusing IMHO.
> .
But it is already like this. Just git grep N_MEMORY.

> > Furthermore, changing the definition of online may
> > break something in the scheduler, e.g. in task_numa_migrate(), where
> > it calls for_each_online_node.
>
> Could you be more specific please? Why should numa balancing consider
> nodes without any memory?
>
As I understand it, the destination CPU can be on a memoryless node.
BTW, several functions in the scheduler face the same scenario;
task_numa_migrate() is an example.

> > By keeping the node owning cpu as online, Michal's patch can avoid
> > such corner case and keep things easy. Furthermore, if needed, the
> > other patch can use for_each_node_state(nid, N_MEMORY) to replace
> > for_each_online_node in some places.
>
> Ideally no code outside of the core MM should care about what kind of
> memory does the node really own. The external code should only care
> whether the node is online and thus usable or offline and of no
> interest.
Yes, but it may take great effort.

Regards,
Pingfan
> --
> Michal Hocko
> SUSE Labs
Pingfan Liu May 23, 2019, 4 a.m. UTC | #9
On Thu, May 23, 2019 at 11:58 AM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > > > > nodes") introduced a crash below during boot for systems with a
> > > > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > > > > > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > > > > > device registration:
> > > > > > > > >
> > > > > > > > >       error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > > > > >
> > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > > > > have been set in,
> > > > > > > > >
> > > > > > > > >       postcore_initcall(register_node_type);
> > > > > > > > >
> > > > > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > > > > >
> > > > > > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > > > > > and nothing happened.
> > > > > > > > >
> > > > > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > > > > >
> > > > > > > > > Since kernel_init() is running by a kernel thread that is in
> > > > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > > > > > > > > during the early boot in __try_online_node().
> > > > > > > >
> > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > > > > > we need to call node_set_online because something later on depends on
> > > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > > > > really messy.
> > > > > > >
> > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > > > > > local
> > > > > > > memory online")
> > > > > > >
> > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug
> > > > > > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > > > > > calls
> > > > > > > build_all_zonelists() to initialize the zone list.
> > > > > >
> > > > > > Well, do we still have to follow the logic that the above (unreviewed)
> > > > > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > > > > design decisions which had to be revisited over time. If we are not
> > > > > > allocating pgdats for newly added memory then we should really make sure
> > > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > > > > init ordering but even then I would really prefer if we could make the
> > > > > > init less obscure and _documented_.
> > > > >
> > > > > I don't know, but I think it is a good idea to keep the existing logic rather
> > > > > than do a big surgery
> > > >
> > > > Adding more hacks just doesn't make the situation any better.
> > > >
> > > > > unless someone is able to confirm it is not breaking NUMA
> > > > > node physical hotplug.
> > > >
> > > > I have a machine to test whole node offline. I am just busy to prepare a
> > > > patch myself. I can have it tested though.
> > > >
> > > I think the definition of "node online" is worth of rethinking. Before
> > > patch "x86, numa: always initialize all possible nodes", online means
> > > either cpu or memory present. After this patch, only node owing memory
> > > as present.
> > >
> > > In the commit log, I think the change's motivation should be "Not to
> > > mention that it doesn't really make much sense to consider an empty
> > > node as online because we just consider this node whenever we want to
> > > iterate nodes to use and empty node is obviously not the best
> > > candidate."
> > >
> > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > cover this purpose.
> >
> > I do not really think we want to spread N_MEMORY outside of the core MM.
> > It is quite confusing IMHO.
> But it is already like this. Just git grep N_MEMORY.
>
> > > Furthermore, changing the definition of online may
> > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > it calls for_each_online_node.
> >
> > Could you be more specific please? Why should numa balancing consider
> > nodes without any memory?
> >
> As my understanding, the destination cpu can be on a memory less node.
> BTW, there are several functions in the scheduler facing the same
> scenario, task_numa_migrate() is an example.
>
> > > By keeping the node owning cpu as online, Michal's patch can avoid
> > > such corner case and keep things easy. Furthermore, if needed, the
> > > other patch can use for_each_node_state(nid, N_MEMORY) to replace
> > > for_each_online_node in some places.
> >
> > Ideally no code outside of the core MM should care about what kind of
> > memory does the node really own. The external code should only care
> > whether the node is online and thus usable or offline and of no
> > interest.
> Yes, but it may take great effort.
>
And as a first step, we can find a way to fix the bug reported by me
and the one reported by Barret.

> Regards,
> Pingfan
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko May 28, 2019, 6:20 p.m. UTC | #10
[Sorry for a late reply]

On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
[...]
> > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > cover this purpose.
> >
> > I do not really think we want to spread N_MEMORY outside of the core MM.
> > It is quite confusing IMHO.
> But it is already like this. Just git grep N_MEMORY.

I might be wrong, but I suspect a closer review would reveal that the use
will be inconsistent or dubious, so following the existing users is not
the best approach.

> > > Furthermore, changing the definition of online may
> > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > it calls for_each_online_node.
> >
> > Could you be more specific please? Why should numa balancing consider
> > nodes without any memory?
> >
> As my understanding, the destination cpu can be on a memory less node.
> BTW, there are several functions in the scheduler facing the same
> scenario, task_numa_migrate() is an example.

Even if the destination node is memoryless, any migration would fail
because there is no memory. Anyway, I still do not see how using online
nodes would break anything.

> > > By keeping the node owning cpu as online, Michal's patch can avoid
> > > such corner case and keep things easy. Furthermore, if needed, the
> > > other patch can use for_each_node_state(nid, N_MEMORY) to replace
> > > for_each_online_node in some places.
> >
> > Ideally no code outside of the core MM should care about what kind of
> > memory does the node really own. The external code should only care
> > whether the node is online and thus usable or offline and of no
> > interest.
>
> Yes, but it may take great effort.

Even if that is the case, it would be preferable because the current
situation is just not sustainable wrt maintenance cost. It is just too
easy to break the existing logic, as this particular report outlines.
Michal Hocko May 28, 2019, 6:21 p.m. UTC | #11
On Thu 23-05-19 12:00:46, Pingfan Liu wrote:
[...]
> > Yes, but it may take great effort.
> >
> And as a first step, we can find a way to fix the bug reported by me
> and the one reported by Barret

Can we try http://lkml.kernel.org/r/20190513140448.GJ24036@dhcp22.suse.cz
for starters?
Pingfan Liu May 30, 2019, 12:55 p.m. UTC | #12
On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> [Sorry for a late reply]
>
> On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> [...]
> > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > cover this purpose.
> > >
> > > I do not really think we want to spread N_MEMORY outside of the core MM.
> > > It is quite confusing IMHO.
> > But it is already like this. Just git grep N_MEMORY.
>
> I might be wrong but I suspect a closer review would reveal that the use
> will be inconsistent or dubious so following the existing users is not
> the best approach.
>
> > > > Furthermore, changing the definition of online may
> > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > it calls for_each_online_node.
> > >
> > > Could you be more specific please? Why should numa balancing consider
> > > nodes without any memory?
> > >
> > As my understanding, the destination cpu can be on a memory less node.
> > BTW, there are several functions in the scheduler facing the same
> > scenario, task_numa_migrate() is an example.
>
> Even if the destination node is memoryless then any migration would fail
> because there is no memory. Anyway I still do not see how using online
> node would break anything.
>
Suppose we have nodes A, B, and C, where C is memoryless but has a small
distance to B compared with the distance from A to B. Then if a task is
running on A but prefers to run on B due to its memory footprint,
task_numa_migrate() allows us to migrate the task to node C. Changing
for_each_online_node() will break this.

Regards,
  Pingfan

> > > > By keeping the node owning cpu as online, Michal's patch can avoid
> > > > such corner case and keep things easy. Furthermore, if needed, the
> > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace
> > > > for_each_online_node in some places.
> > >
> > > Ideally no code outside of the core MM should care about what kind of
> > > memory does the node really own. The external code should only care
> > > whether the node is online and thus usable or offline and of no
> > > interest.
> >
> > Yes, but it may take great effort.
>
> Even if that is the case it would be preferable because the current
> situation is just not sustainable wrt maintenance cost. It is just too
> simple to break the existing logic as this particular report outlines.
> --
> Michal Hocko
> SUSE Labs
Pingfan Liu May 30, 2019, 1:01 p.m. UTC | #13
On Wed, May 29, 2019 at 2:21 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 23-05-19 12:00:46, Pingfan Liu wrote:
> [...]
> > > Yes, but it may take great effort.
> > >
> > And as a first step, we can find a way to fix the bug reported by me
> > and the one reported by Barret
>
> Can we try http://lkml.kernel.org/r/20190513140448.GJ24036@dhcp22.suse.cz
> for starter?
If it turns out that changing for_each_online_node() will not
break functions in the scheduler like task_numa_migrate(), I think it will
be a good starting point. On the other hand, if it does, then I think it only
requires a slight adjustment of your patch to meet the aim.

Thanks,
  Pingfan
> --
> Michal Hocko
> SUSE Labs
Michal Hocko May 31, 2019, 9:03 a.m. UTC | #14
On Thu 30-05-19 20:55:32, Pingfan Liu wrote:
> On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > [Sorry for a late reply]
> >
> > On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > [...]
> > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > > cover this purpose.
> > > >
> > > > I do not really think we want to spread N_MEMORY outside of the core MM.
> > > > It is quite confusing IMHO.
> > > But it is already like this. Just git grep N_MEMORY.
> >
> > I might be wrong but I suspect a closer review would reveal that the use
> > will be inconsistent or dubious so following the existing users is not
> > the best approach.
> >
> > > > > Furthermore, changing the definition of online may
> > > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > > it calls for_each_online_node.
> > > >
> > > > Could you be more specific please? Why should numa balancing consider
> > > > nodes without any memory?
> > > >
> > > As my understanding, the destination cpu can be on a memory less node.
> > > BTW, there are several functions in the scheduler facing the same
> > > scenario, task_numa_migrate() is an example.
> >
> > Even if the destination node is memoryless then any migration would fail
> > because there is no memory. Anyway I still do not see how using online
> > node would break anything.
> >
> Suppose we have nodes A, B,C, where C is memory less but has little
> distance to B, comparing with the one from A to B. Then if a task is
> running on A, but prefer to run on B due to memory footprint.
> task_numa_migrate() allows us to migrate the task to node C. Changing
> for_each_online_node will break this.

That would require the task's preferred node to be C, no? Or do I
misunderstand the task migration logic?
Pingfan Liu June 3, 2019, 4:17 a.m. UTC | #15
On Fri, May 31, 2019 at 5:03 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 30-05-19 20:55:32, Pingfan Liu wrote:
> > On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > [Sorry for a late reply]
> > >
> > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > > [...]
> > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > > > cover this purpose.
> > > > >
> > > > > I do not really think we want to spread N_MEMORY outside of the core MM.
> > > > > It is quite confusing IMHO.
> > > > But it is already like this. Just git grep N_MEMORY.
> > >
> > > I might be wrong but I suspect a closer review would reveal that the use
> > > will be inconsistent or dubious so following the existing users is not
> > > the best approach.
> > >
> > > > > > Furthermore, changing the definition of online may
> > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > > > it calls for_each_online_node.
> > > > >
> > > > > Could you be more specific please? Why should numa balancing consider
> > > > > nodes without any memory?
> > > > >
> > > > As my understanding, the destination cpu can be on a memory less node.
> > > > BTW, there are several functions in the scheduler facing the same
> > > > scenario, task_numa_migrate() is an example.
> > >
> > > Even if the destination node is memoryless then any migration would fail
> > > because there is no memory. Anyway I still do not see how using online
> > > node would break anything.
> > >
> > Suppose we have nodes A, B,C, where C is memory less but has little
> > distance to B, comparing with the one from A to B. Then if a task is
> > running on A, but prefer to run on B due to memory footprint.
> > task_numa_migrate() allows us to migrate the task to node C. Changing
> > for_each_online_node will break this.
>
> That would require the task's preferred node to be C, no? Or do I
> misunderstand the task migration logic?
I think the migration logic in task_numa_migrate() looks like this:

	env.dst_nid = p->numa_preferred_nid;	// here dst_nid is B

But later in

	if (env.best_cpu == -1 || (p->numa_group &&
	    p->numa_group->active_nodes > 1)) {
		for_each_online_node(nid) {
			[...]
			// here is a chance to change p->numa_preferred_nid
			task_numa_find_cpu(&env, taskimp, groupimp);

There are several other places broken by changing for_each_online_node():
-1. show_numa_stats()
-2. init_numa_topology_type(), where sched_numa_topology_type may be
    mistakenly evaluated.
-3. ... one can check the calls to for_each_online_node() in the scheduler
    one by one.

That is my understanding of the code.

Thanks,
  Pingfan
Qian Cai June 21, 2019, 1:17 p.m. UTC | #16
Sigh...

I don't see any benefit to keeping the broken commit,

"x86, numa: always initialize all possible nodes"

for so long in linux-next; it just prevents x86 NUMA machines with any
memory-less node from booting.

Andrew, maybe it is time to drop this patch until Michal finds some time to
fix it properly.

On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > The linux-next commit ("x86, numa: always initialize all possible
> > nodes") introduced a crash below during boot for systems with a
> > memory-less node. This is due to CPUs that get onlined during SMP boot,
> > but that onlining triggers a page fault in bus_add_device() during
> > device registration:
> > 
> > 	error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > 
> > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > have been set in,
> > 
> > 	postcore_initcall(register_node_type);
> > 
> > but that happens in do_basic_setup() after smp_init().
> > 
> > The old code had set this node online via alloc_node_data(), so when it
> > came time to do_cpu_up() -> try_online_node(), the node was already up
> > and nothing happened.
> > 
> > Now, it attempts to online the node, which registers the node with
> > sysfs, but that can't happen before the 'node' subsystem is registered.
> > 
> > Since kernel_init() is running by a kernel thread that is in
> > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
> > during the early boot in __try_online_node().
> 
> Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> we need to call node_set_online because something later on depends on
> that. Btw. why do we even allocate a pgdat from this path? This looks
> really messy.
> 
> > Call Trace:
> >  device_add+0x43e/0x690
> >  device_register+0x107/0x110
> >  __register_one_node+0x72/0x150
> >  __try_online_node+0x8f/0xd0
> >  try_online_node+0x2b/0x50
> >  do_cpu_up+0x46/0xf0
> >  cpu_up+0x13/0x20
> >  smp_init+0x6e/0xd0
> >  kernel_init_freeable+0xe5/0x21f
> >  kernel_init+0xf/0x180
> >  ret_from_fork+0x1f/0x30
> > 
> > Reported-by: Barret Rhoden <brho@google.com>
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> > 
> > v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes
> > will
> >     end up being not in sysfs i.e., /sys/devices/system/node/.
> > 
> >  mm/memory_hotplug.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index b236069ff0d8..6eb2331fa826 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool
> > set_node_online)
> >  	if (node_online(nid))
> >  		return 0;
> >  
> > +	/*
> > +	 * Here is called by cpu_up() to online a node without memory from
> > +	 * kernel_init() which guarantees that "set_node_online" is true
> > which
> > +	 * will set the node online as it have CPUs but not ready to call
> > +	 * register_one_node() as "node_subsys" has not been initialized
> > +	 * properly yet.
> > +	 */
> > +	if (system_state == SYSTEM_SCHEDULING) {
> > +		node_set_online(nid);
> > +		return 0;
> > +	}
> > +
> >  	pgdat = hotadd_new_pgdat(nid, start);
> >  	if (!pgdat) {
> >  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
> > -- 
> > 2.20.1 (Apple Git-117)
> 
>
Michal Hocko June 21, 2019, 1:55 p.m. UTC | #17
On Fri 21-06-19 09:17:58, Qian Cai wrote:
> Sigh...
> 
> I don't see any benefit to keep the broken commit,
> 
> "x86, numa: always initialize all possible nodes"
> 
> for so long in linux-next that just prevent x86 NUMA machines with any memory-
> less node from booting.
> 
> Andrew, maybe it is time to drop this patch until Michal found some time to fix
> it properly.

Yes, please drop the patch for now, Andrew. I thought I could get to
this but time is just scarce.
Pingfan Liu June 24, 2019, 8:42 a.m. UTC | #18
Hi Michal,

What about dropping your patch's change to the online definition and
just doing the following?
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..9c087c3 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -749,13 +749,12 @@ static void __init init_memory_less_node(int nid)
  */
 void __init init_cpu_to_node(void)
 {
-       int cpu;
+       int cpu, node;
        u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);

        BUG_ON(cpu_to_apicid == NULL);

-       for_each_possible_cpu(cpu) {
-               int node = numa_cpu_node(cpu);
+       for_each_node_mask(node, numa_nodes_parsed) {

                if (node == NUMA_NO_NODE)
                        continue;
@@ -765,6 +764,10 @@ void __init init_cpu_to_node(void)

                numa_set_node(cpu, node);
        }
+       for_each_possible_cpu(cpu) {
+               int node = numa_cpu_node(cpu);
+               numa_set_node(cpu, node);
+       }
 }

Thanks,
  Pingfan

On Fri, Jun 21, 2019 at 9:55 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 21-06-19 09:17:58, Qian Cai wrote:
> > Sigh...
> >
> > I don't see any benefit to keep the broken commit,
> >
> > "x86, numa: always initialize all possible nodes"
> >
> > for so long in linux-next that just prevent x86 NUMA machines with any memory-
> > less node from booting.
> >
> > Andrew, maybe it is time to drop this patch until Michal found some time to fix
> > it properly.
>
> Yes, please drop the patch for now, Andrew. I thought I could get to
> this but time is just scarce.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 26, 2019, 1:57 p.m. UTC | #19
On Mon 24-06-19 16:42:20, Pingfan Liu wrote:
> Hi Michal,
> 
> What about dropping the change of the online definition of your patch,
> just do the following?

I am sorry but I am unlikely to find some more time to look into this. I
am willing to help reviewing but I will not find enough time to focus on
this to fix up the patch. Are you willing to work on this and finish the
patch? It is a very tricky area with side effects really hard to see in
advance but going with a robust fix is definitely worth the effort.

Thanks!
Pingfan Liu June 27, 2019, 3:11 a.m. UTC | #20
On Wed, Jun 26, 2019 at 9:57 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 24-06-19 16:42:20, Pingfan Liu wrote:
> > Hi Michal,
> >
> > What about dropping the change of the online definition of your patch,
> > just do the following?
>
> I am sorry but I am unlikely to find some more time to look into this. I
> am willing to help reviewing but I will not find enough time to focus on
> this to fix up the patch. Are you willing to work on this and finish the
> patch? It is a very tricky area with side effects really hard to see in
> advance but going with a robust fix is definitely worth the effort.
Yeah, the bug looks trivial but is hard to fix, and will take a lot
of time. It is hard to meet your original design target given the
current situation. I will give it a try, limited to this bug.

Thanks,
  Pingfan
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b236069ff0d8..6eb2331fa826 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1037,6 +1037,18 @@  static int __try_online_node(int nid, u64 start, bool set_node_online)
 	if (node_online(nid))
 		return 0;
 
+	/*
+	 * This is called via cpu_up() from kernel_init() to online a node
+	 * without memory, which guarantees that "set_node_online" is true.
+	 * Set the node online since it has CPUs, but do not call
+	 * register_one_node() yet, because "node_subsys" has not been
+	 * initialized properly at this point.
+	 */
+	if (system_state == SYSTEM_SCHEDULING) {
+		node_set_online(nid);
+		return 0;
+	}
+
 	pgdat = hotadd_new_pgdat(nid, start);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);