mm/alloc: fallback to first node if the wanted node offline

Message ID 1543892757-4323-1-git-send-email-kernelfans@gmail.com (mailing list archive)
State New, archived

Commit Message

Pingfan Liu Dec. 4, 2018, 3:05 a.m. UTC
While testing on an AMD machine with the kexec -l nr_cpus=x option, the
kernel failed to boot because the data structures of some nodes were never
allocated; on x86, for example, they are initialized by
init_cpu_to_node()->init_memory_less_node(). However, device->numa_node is
still used as the preferred_nid parameter of __alloc_pages_nodemask(),
which causes a NULL pointer dereference at
  ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
This patch tries to fix the issue by falling back to the first online node
when it encounters this corner case.

Notes about the crashing info:
-1. kexec -l with nr_cpus=4
-2. system info
  NUMA node0 CPU(s):     0,8,16,24
  NUMA node1 CPU(s):     2,10,18,26
  NUMA node2 CPU(s):     4,12,20,28
  NUMA node3 CPU(s):     6,14,22,30
  NUMA node4 CPU(s):     1,9,17,25
  NUMA node5 CPU(s):     3,11,19,27
  NUMA node6 CPU(s):     5,13,21,29
  NUMA node7 CPU(s):     7,15,23,31
-3. panic stack
[...]
[    5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[    5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[    5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[    5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[    5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[    5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[    5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[    5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[    5.773618] PGD 0 P4D 0
[    5.773618] Oops: 0000 [#1] SMP NOPTI
[    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Call Trace:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  __slab_alloc+0x1c/0x38
[    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  devm_kmalloc+0x28/0x60
[    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  really_probe+0x73/0x420
[    5.773618]  driver_probe_device+0x115/0x130
[    5.773618]  __driver_attach+0x103/0x110
[    5.773618]  ? driver_probe_device+0x130/0x130
[    5.773618]  bus_for_each_dev+0x67/0xc0
[    5.773618]  ? klist_add_tail+0x3b/0x70
[    5.773618]  bus_add_driver+0x41/0x260
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  driver_register+0x5b/0xe0
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  do_one_initcall+0x4e/0x1d4
[    5.773618]  ? init_setup+0x25/0x28
[    5.773618]  kernel_init_freeable+0x1c1/0x26e
[    5.773618]  ? loglevel+0x5b/0x5b
[    5.773618]  ? rest_init+0xb0/0xb0
[    5.773618]  kernel_init+0xa/0x110
[    5.773618]  ret_from_fork+0x22/0x40
[    5.773618] Modules linked in:
[    5.773618] CR2: 0000000000002088
[    5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]
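
A note on reading the oops above: the bytes marked <3b> 77 08 in the Code
line decode to cmp 0x8(%rdi),%esi, and RDI holds 0x2080, so the faulting
access lands at 0x2088, which matches CR2. That is consistent with a NULL
NODE_DATA(nid) pointer plus a field offset, although the exact structure
layout is an assumption here and is not stated in the report. A minimal
sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Register values copied from the oops above. */
#define OOPS_RDI 0x2080ULL
#define OOPS_CR2 0x2088ULL

/* Effective address of a base + disp8 memory operand (hypothetical helper). */
uint64_t effective_address(uint64_t base, int8_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}

/* Returns 1 when the decoded operand address equals the reported CR2. */
int fault_matches_cr2(void)
{
    assert(OOPS_RDI < OOPS_CR2);
    return effective_address(OOPS_RDI, 8) == OOPS_CR2;
}
```

A small base address such as 0x2080 is what one would expect when a member
offset is added to a NULL structure pointer.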

Other notes about reproducing this bug:
After applying the following patch:
commit 0d76bcc960e6057750fcf556b65da13f8bbdfd2b
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Tue Nov 13 08:38:17 2018 -0600

    Revert "ACPI/PCI: Pay attention to device-specific _PXM node values"

With that revert applied, the bug is masked and no longer triggers on my
AMD test machine. It should still exist, however, since dev->numa_node can
be set by other means on other architectures when the nr_cpus parameter is
used.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 include/linux/gfp.h | 2 ++
 1 file changed, 2 insertions(+)

Comments

David Rientjes Dec. 4, 2018, 3:53 a.m. UTC | #1
On Tue, 4 Dec 2018, Pingfan Liu wrote:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 76f8db0..8324953 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> +	if (unlikely(!node_online(nid)))
> +		nid = first_online_node;
>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
>  

So we're passing the node id from dev_to_node() to kmalloc which 
interprets that as the preferred node and then does node_zonelist() to 
find the zonelist at allocation time.

What happens if we fix this in alloc_dr()?  Does anything else cause 
problems?

And rather than using first_online_node, would next_online_node() work?

I'm thinking about this:

diff --git a/drivers/base/devres.c b/drivers/base/devres.c
--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -100,6 +100,8 @@ static __always_inline struct devres * alloc_dr(dr_release_t release,
 					&tot_size)))
 		return NULL;
 
+	if (unlikely(!node_online(nid)))
+		nid = next_online_node(nid);
 	dr = kmalloc_node_track_caller(tot_size, gfp, nid);
 	if (unlikely(!dr))
 		return NULL;
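
To compare the two fallbacks discussed so far, here is a small userspace
model. It is only a sketch: the mask type, MAX_NODES and all model_* names
are hypothetical stand-ins, not kernel APIs. To my understanding the
kernel's next_online_node() does not wrap around and yields MAX_NUMNODES
when no higher-numbered node is online, so the model mirrors that and adds
an explicit wrap to the first online node.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8 /* stand-in for MAX_NUMNODES; an assumption */

/* Bit n set means node n is online (toy model of the online node mask). */
typedef unsigned int node_mask;

bool model_node_online(node_mask online, int nid)
{
    return nid >= 0 && nid < MAX_NODES && ((online >> nid) & 1u);
}

/* Lowest-numbered online node, or MAX_NODES if none is online. */
int model_first_online(node_mask online)
{
    for (int n = 0; n < MAX_NODES; n++)
        if (model_node_online(online, n))
            return n;
    return MAX_NODES;
}

/* First online node strictly after nid, or MAX_NODES if none (no wrap). */
int model_next_online(node_mask online, int nid)
{
    assert(nid >= -1 && nid < MAX_NODES);
    for (int n = nid + 1; n < MAX_NODES; n++)
        if (model_node_online(online, n))
            return n;
    return MAX_NODES;
}

/* Keep nid if online; otherwise take the next online node, wrapping around. */
int model_fallback(node_mask online, int nid)
{
    if (model_node_online(online, nid))
        return nid;
    int n = model_next_online(online, nid);
    return n < MAX_NODES ? n : model_first_online(online);
}
```

With only nodes 1 and 5 online (an assumed example), a request for node 6
finds no next online node, so a fallback without the wrap would hand an
out-of-range id to the allocator.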
Wei Yang Dec. 4, 2018, 6:54 a.m. UTC | #2
On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>During my test on some AMD machine, with kexec -l nr_cpus=x option, the
>kernel failed to bootup, because some node's data struct can not be allocated,
>e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
>device->numa_node info is used as preferred_nid param for

could we fix the preferred_nid before passed to
__alloc_pages_nodemask()?

BTW, I don't catch the function call flow to this point. Would you mind
giving me some hint?
Pingfan Liu Dec. 4, 2018, 7:16 a.m. UTC | #3
On Tue, Dec 4, 2018 at 11:53 AM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 4 Dec 2018, Pingfan Liu wrote:
>
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 76f8db0..8324953 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
> >   */
> >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >  {
> > +     if (unlikely(!node_online(nid)))
> > +             nid = first_online_node;
> >       return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >  }
> >
>
> So we're passing the node id from dev_to_node() to kmalloc which
> interprets that as the preferred node and then does node_zonelist() to
> find the zonelist at allocation time.
>
> What happens if we fix this in alloc_dr()?  Does anything else cause
> problems?
>
I think it is better to fix it in mm, since that protects against any
similar bug in the future, while a fix in alloc_dr() only works for the
present case.

> And rather than using first_online_node, would next_online_node() work?
>
What is the gain? Is it for memory pressure on node0?

Thanks,
Pingfan

> I'm thinking about this:
>
> diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> --- a/drivers/base/devres.c
> +++ b/drivers/base/devres.c
> @@ -100,6 +100,8 @@ static __always_inline struct devres * alloc_dr(dr_release_t release,
>                                         &tot_size)))
>                 return NULL;
>
> +       if (unlikely(!node_online(nid)))
> +               nid = next_online_node(nid);
>         dr = kmalloc_node_track_caller(tot_size, gfp, nid);
>         if (unlikely(!dr))
>                 return NULL;
Pingfan Liu Dec. 4, 2018, 7:20 a.m. UTC | #4
On Tue, Dec 4, 2018 at 2:54 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> >kernel failed to bootup, because some node's data struct can not be allocated,
> >e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> >device->numa_node info is used as preferred_nid param for
>
> could we fix the preferred_nid before passed to
> __alloc_pages_nodemask()?
>
Yes, we can do it too, but what is the gain?

> BTW, I don't catch the function call flow to this point. Would you mind
> giving me some hint?
>
You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()

Thanks,
Pingfan
Michal Hocko Dec. 4, 2018, 7:22 a.m. UTC | #5
On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> kernel failed to bootup, because some node's data struct can not be allocated,
> e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> device->numa_node info is used as preferred_nid param for
> __alloc_pages_nodemask(), which causes NULL reference
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> This patch tries to fix the issue by falling back to the first online node,
> when encountering such corner case.

We have seen similar issues already and the bug was usually that the
zonelists were not initialized yet or the node is completely bogus.
Zonelists should be initialized by build_all_zonelists quite early so I
am wondering whether the latter is the case. What is the actual node
number the device is associated with?

Your patch is not correct btw, because we want to fall back to a node in
distance order rather than to the first online node.
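
The difference between "first online node" and "nearest online node" can
be shown with a toy model. This is a sketch under stated assumptions: the
distance table below is made up for illustration, and nearest_online_node()
is a hypothetical helper, not the kernel's find_next_best_node().

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4 /* toy size; an assumption */

/* Made-up distance table: nodes {0,1} are close, nodes {2,3} are close. */
const int toy_dist[NR_NODES][NR_NODES] = {
    {10, 16, 32, 32},
    {16, 10, 32, 32},
    {32, 32, 10, 16},
    {32, 32, 16, 10},
};

/*
 * Return the online node with the smallest distance from `from`,
 * or -1 if no node is online. Bit n of `online` marks node n online.
 * Ties go to the lowest-numbered node.
 */
int nearest_online_node(int from, const int dist[NR_NODES][NR_NODES],
                        unsigned int online)
{
    assert(from >= 0 && from < NR_NODES);

    int best = -1;
    int best_dist = INT_MAX;

    for (int n = 0; n < NR_NODES; n++) {
        if (!((online >> n) & 1u))
            continue;
        if (dist[from][n] < best_dist) {
            best_dist = dist[from][n];
            best = n;
        }
    }
    return best;
}
```

With nodes 0 and 3 online and a request for offline node 2, the first
online node is 0 (distance 32), while the distance-ordered choice is node 3
(distance 16).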
Pingfan Liu Dec. 4, 2018, 8:20 a.m. UTC | #6
On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > kernel failed to bootup, because some node's data struct can not be allocated,
> > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > device->numa_node info is used as preferred_nid param for
> > __alloc_pages_nodemask(), which causes NULL reference
> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > This patch tries to fix the issue by falling back to the first online node,
> > when encountering such corner case.
>
> We have seen similar issues already and the bug was usually that the
> zonelists were not initialized yet or the node is completely bogus.
> Zonelists should be initialized by build_all_zonelists quite early so I
> am wondering whether the later is the case. What is the actual node
> number the device is associated with?
>
The device's node number is 2. In my case, I used the nr_cpus parameter.
Since init_cpu_to_node() initializes all the possible nodes, it is hard
for me to figure out how, without this parameter, the zonelists could be
accessed before the page allocator works.

> Your patch is not correct btw, because we want to fallback into the node in
> the distance order rather into the first online node.
> --
What about this:
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
+
 /*
  * We get the zone list from the current node and the gfp_mask.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
@@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+       if (unlikely(!node_online(nid))) {
+               nodemask_t used_mask;
+               nodes_complement(used_mask, node_online_map);
+               nid = find_next_best_node(nid, &used_mask);
+       }
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }

I have only compiled it and not tested it yet, since I do not have the
machine at hand. It will take some time to get it again.

Thanks,
Pingfan
Wei Yang Dec. 4, 2018, 8:34 a.m. UTC | #7
On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 2:54 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
>> >kernel failed to bootup, because some node's data struct can not be allocated,
>> >e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
>> >device->numa_node info is used as preferred_nid param for
>>
>> could we fix the preferred_nid before passed to
>> __alloc_pages_nodemask()?
>>
>Yes, we can doit too, but what is the gain?

node_zonelist() is used in several places. If we are sure where the
problem is, it is not necessary to spread the fix to other places.

>
>> BTW, I don't catch the function call flow to this point. Would you mind
>> giving me some hint?
>>
>You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()

slab_alloc() passes NUMA_NO_NODE down, so I am lost as to where
preferred_nid is assigned.

>
>Thanks,
>Pingfan
Wei Yang Dec. 4, 2018, 8:40 a.m. UTC | #8
On Tue, Dec 04, 2018 at 04:20:32PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
>>
>> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
>> > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
>> > kernel failed to bootup, because some node's data struct can not be allocated,
>> > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
>> > device->numa_node info is used as preferred_nid param for
>> > __alloc_pages_nodemask(), which causes NULL reference
>> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>> > This patch tries to fix the issue by falling back to the first online node,
>> > when encountering such corner case.
>>
>> We have seen similar issues already and the bug was usually that the
>> zonelists were not initialized yet or the node is completely bogus.
>> Zonelists should be initialized by build_all_zonelists quite early so I
>> am wondering whether the later is the case. What is the actual node
>> number the device is associated with?
>>
>The device's node num is 2. And in my case, I used nr_cpus param. Due
>to init_cpu_to_node() initialize all the possible node.  It is hard
>for me to figure out without this param, how zonelists is accessed
>before page allocator works.

If my understanding is correct, we can't do page alloc before zonelist
is initialized.

I guess Michal's point is to figure out the reason for this.

>
>> Your patch is not correct btw, because we want to fallback into the node in
>> the distance order rather into the first online node.
>> --
>What about this:
>+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>+
> /*
>  * We get the zone list from the current node and the gfp_mask.
>  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
>@@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
>  */
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
>+       if (unlikely(!node_online(nid))) {
>+               nodemask_t used_mask;
>+               nodes_complement(used_mask, node_online_map);
>+               nid = find_next_best_node(nid, &used_mask);
>+       }
>        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
>
>I just finished the compiling, not test it yet, since the machine is
>not on hand yet. It needs some time to get it again.
>
>Thanks,
>Pingfan
Pingfan Liu Dec. 4, 2018, 8:52 a.m. UTC | #9
On Tue, Dec 4, 2018 at 4:34 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> >> >kernel failed to bootup, because some node's data struct can not be allocated,
> >> >e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> >> >device->numa_node info is used as preferred_nid param for
> >>
> >> could we fix the preferred_nid before passed to
> >> __alloc_pages_nodemask()?
> >>
> >Yes, we can doit too, but what is the gain?
>
> node_zonelist() is used some places. If we are sure where the problem
> is, it is not necessary to spread to other places.
>
> >
> >> BTW, I don't catch the function call flow to this point. Would you mind
> >> giving me some hint?
> >>
> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
>
> slab_alloc() pass NUMA_NO_NODE down, so I am lost in where the
> preferred_nid is assigned.
>
You can follow:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)

Thanks,
Pingfan
Michal Hocko Dec. 4, 2018, 8:56 a.m. UTC | #10
On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > device->numa_node info is used as preferred_nid param for
> > > __alloc_pages_nodemask(), which causes NULL reference
> > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > This patch tries to fix the issue by falling back to the first online node,
> > > when encountering such corner case.
> >
> > We have seen similar issues already and the bug was usually that the
> > zonelists were not initialized yet or the node is completely bogus.
> > Zonelists should be initialized by build_all_zonelists quite early so I
> > am wondering whether the later is the case. What is the actual node
> > number the device is associated with?
> >
> The device's node num is 2. And in my case, I used nr_cpus param. Due
> to init_cpu_to_node() initialize all the possible node.  It is hard
> for me to figure out without this param, how zonelists is accessed
> before page allocator works.

I believe we should focus on this. Why does the node have no zonelist
even though all zonelists should be initialized already? Maybe this is an
nr_cpus peculiarity and we do not initialize all the existing numa nodes.
Or maybe the device is associated with a non-existing node in that
setup. A full dmesg might help us here.

> > Your patch is not correct btw, because we want to fallback into the node in
> > the distance order rather into the first online node.
> > --
> What about this:
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> +
>  /*
>   * We get the zone list from the current node and the gfp_mask.
>   * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
> @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> +       if (unlikely(!node_online(nid))) {
> +               nodemask_t used_mask;
> +               nodes_complement(used_mask, node_online_map);
> +               nid = find_next_best_node(nid, &used_mask);
> +       }
>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
> 
> I just finished the compiling, not test it yet, since the machine is
> not on hand yet. It needs some time to get it again.

This is clearly a no-go. nodemask_t can be giant and you cannot have it
on the stack for allocation paths which might be called from a deep
stack already. Also this is called from the allocator hot paths and each
branch counts.
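
To put a number on "giant": a nodemask is a bitmap with one bit per
possible node, so its size grows with the configured maximum node count.
The sketch below only does that arithmetic in userspace; nodemask_bytes()
is a hypothetical helper, and the 1024-node figure is an assumed
configuration, not something taken from this report.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits in one unsigned long on the build machine. */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap holding one bit per possible node. */
size_t nodemask_bytes(size_t max_nodes)
{
    assert(max_nodes > 0);

    size_t longs = (max_nodes + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG;
    return longs * sizeof(unsigned long);
}
```

For an assumed maximum of 1024 nodes this gives 128 bytes per mask, which
is a noticeable amount of stack in a path that may already be deeply
nested.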
Pingfan Liu Dec. 4, 2018, 8:56 a.m. UTC | #11
On Tue, Dec 4, 2018 at 4:40 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Dec 04, 2018 at 04:20:32PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> >>
> >> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> >> > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> >> > kernel failed to bootup, because some node's data struct can not be allocated,
> >> > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> >> > device->numa_node info is used as preferred_nid param for
> >> > __alloc_pages_nodemask(), which causes NULL reference
> >> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> >> > This patch tries to fix the issue by falling back to the first online node,
> >> > when encountering such corner case.
> >>
> >> We have seen similar issues already and the bug was usually that the
> >> zonelists were not initialized yet or the node is completely bogus.
> >> Zonelists should be initialized by build_all_zonelists quite early so I
> >> am wondering whether the later is the case. What is the actual node
> >> number the device is associated with?
> >>
> >The device's node num is 2. And in my case, I used nr_cpus param. Due
> >to init_cpu_to_node() initialize all the possible node.  It is hard
> >for me to figure out without this param, how zonelists is accessed
> >before page allocator works.
>
> If my understanding is correct, we can't do page alloc before zonelist
> is initialized.
>
> I guess Michal's point is to figure out this reason.
>
Yeah, I know. I just want to emphasize that I hit this bug using
nr_cpus, which people may rarely use. Hence it may be a different bug
from the one Michal has seen. Sorry if my English caused any
confusion.

Thanks

[...]
Wei Yang Dec. 4, 2018, 9:09 a.m. UTC | #12
On Tue, Dec 04, 2018 at 04:52:52PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 4:34 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
>> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>> >>
>> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>> >> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
>> >> >kernel failed to bootup, because some node's data struct can not be allocated,
>> >> >e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
>> >> >device->numa_node info is used as preferred_nid param for
>> >>
>> >> could we fix the preferred_nid before passed to
>> >> __alloc_pages_nodemask()?
>> >>
>> >Yes, we can doit too, but what is the gain?
>>
>> node_zonelist() is used some places. If we are sure where the problem
>> is, it is not necessary to spread to other places.
>>
>> >
>> >> BTW, I don't catch the function call flow to this point. Would you mind
>> >> giving me some hint?
>> >>
>> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
>>
>> slab_alloc() pass NUMA_NO_NODE down, so I am lost in where the
>> preferred_nid is assigned.
>>
>You can follow:
>[    5.773618]  new_slab+0xa9/0x570
>[    5.773618]  ___slab_alloc+0x375/0x540
>[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
>where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
>

Well, thanks for your patience, but I still don't get it.

new_slab(node)
    allocate_slab(node)
        alloc_slab_page(node)
            if (node == NUMA_NO_NODE)
                alloc_pages()
            else
                __alloc_pages_node(node)

As you mentioned, this starts from slab_alloc(), which passes NUMA_NO_NODE.
This means it goes to alloc_pages() and then alloc_pages_current() ->
__alloc_pages_nodemask(). Here we use policy_node() to get the
preferred_nid.

I don't see the relationship between policy_node() and
device->numa_node. Maybe I went wrong somewhere. Would you mind
sharing more?

>Thanks,
>Pingfan
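
One way to read the trace in the original report: it enters the slab code
through devm_kmalloc() -> __kmalloc_node_track_caller(), that is, through a
node-aware entry point rather than a NUMA_NO_NODE one, and David's reply
notes that the node id comes from dev_to_node(). The userspace model below
only illustrates that propagation; every model_* name is hypothetical and
the real functions do much more.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NUMA_NO_NODE (-1)

/* Which page-allocation branch the outline above would take. */
enum page_path { PATH_ALLOC_PAGES, PATH_ALLOC_PAGES_NODE };

/* Toy device carrying only the field of interest. */
struct model_device { int numa_node; };

/* Mirrors the node check at the bottom of the outline above. */
enum page_path model_alloc_slab_page(int node)
{
    return node == MODEL_NUMA_NO_NODE ? PATH_ALLOC_PAGES
                                      : PATH_ALLOC_PAGES_NODE;
}

/* kmalloc()-style entry: the caller expresses no node preference. */
enum page_path model_kmalloc(void)
{
    return model_alloc_slab_page(MODEL_NUMA_NO_NODE);
}

/* kmalloc_node()-style entry: the caller's node is forwarded unchanged. */
enum page_path model_kmalloc_node(int node)
{
    return model_alloc_slab_page(node);
}

/* devm_kmalloc()-style entry: the node is taken from the device. */
enum page_path model_devm_kmalloc(const struct model_device *dev)
{
    assert(dev != NULL);
    return model_kmalloc_node(dev->numa_node);
}
```

In this model a device on node 2 always reaches the node-specific branch,
so the device's node id ends up as the preferred node without any memory
policy lookup.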
Vlastimil Babka Dec. 4, 2018, 2:42 p.m. UTC | #13
On 12/4/18 9:56 AM, Michal Hocko wrote:
>> The device's node num is 2. And in my case, I used nr_cpus param. Due
>> to init_cpu_to_node() initialize all the possible node.  It is hard
>> for me to figure out without this param, how zonelists is accessed
>> before page allocator works.
> I believe we should focus on this. Why does the node have no zonelist
> even though all zonelists should be initialized already? Maybe this is
> nr_cpus pecularity and we do not initialize all the existing numa nodes.
> Or maybe the device is associated to a non-existing node with that
> setup. A full dmesg might help us here.

Yes, a full dmesg should contain line such as this one:

[    0.137407] Built 1 zonelists, mobility grouping on.  Total pages: 6181664

That should at least tell us if nr_cpus=X resulted in some node's
zonelists not being built.
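
When comparing several boot logs, the count can be pulled out of that line
mechanically. This is a minimal sketch that matches only the sample line
quoted above; parse_built_zonelists() is a hypothetical helper, and the
exact message format may differ between kernel versions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for "Built <N> zonelists" anywhere in a dmesg line.
 * Returns 1 and stores N in *count on a match, 0 otherwise.
 */
int parse_built_zonelists(const char *line, int *count)
{
    assert(line != NULL && count != NULL);

    const char *p = strstr(line, "Built ");
    if (p == NULL)
        return 0;

    int value = 0;
    int consumed = 0;

    /* %n is only reached if the trailing " zonelists" literal also matched. */
    if (sscanf(p, "Built %d zonelists%n", &value, &consumed) != 1 ||
        consumed == 0)
        return 0;

    *count = value;
    return 1;
}
```

Running it over the logs from boots with and without nr_cpus=X would show
whether the reported count changes between the two.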
Pingfan Liu Dec. 5, 2018, 5:38 a.m. UTC | #14
On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > > device->numa_node info is used as preferred_nid param for
> > > > __alloc_pages_nodemask(), which causes NULL reference
> > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > This patch tries to fix the issue by falling back to the first online node,
> > > > when encountering such corner case.
> > >
> > > We have seen similar issues already and the bug was usually that the
> > > zonelists were not initialized yet or the node is completely bogus.
> > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > am wondering whether the later is the case. What is the actual node
> > > number the device is associated with?
> > >
> > The device's node num is 2. And in my case, I used nr_cpus param. Due
> > to init_cpu_to_node() initialize all the possible node.  It is hard
> > for me to figure out without this param, how zonelists is accessed
> > before page allocator works.
>
> I believe we should focus on this. Why does the node have no zonelist
> even though all zonelists should be initialized already? Maybe this is
> nr_cpus pecularity and we do not initialize all the existing numa nodes.
> Or maybe the device is associated to a non-existing node with that
> setup. A full dmesg might help us here.
>
I got hold of the machine again, and without the nr_cpus option I see the following:
[root@dell-per7425-03 ~]# cd /sys/devices/system/node/
[root@dell-per7425-03 node]# ls
has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
node4  node5  node6  node7  online  possible  power  uevent
[root@dell-per7425-03 node]# cat has_cpu
0-7
[root@dell-per7425-03 node]# cat has_memory
1,5
[root@dell-per7425-03 node]# cat online
0-7
[root@dell-per7425-03 node]# cat possible
0-7
And lscpu shows the following numa-cpu info:
NUMA node0 CPU(s):     0,8,16,24
NUMA node1 CPU(s):     2,10,18,26
NUMA node2 CPU(s):     4,12,20,28
NUMA node3 CPU(s):     6,14,22,30
NUMA node4 CPU(s):     1,9,17,25
NUMA node5 CPU(s):     3,11,19,27
NUMA node6 CPU(s):     5,13,21,29
NUMA node7 CPU(s):     7,15,23,31

For the full panic message (I masked some hostname info with xx),
please see the attachment.
In short, it seems to be a problem with nr_cpus; without this
option, the kernel boots correctly.

> > > Your patch is not correct btw, because we want to fallback into the node in
> > > the distance order rather into the first online node.
> > > --
> > What about this:
> > +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> > +
> >  /*
> >   * We get the zone list from the current node and the gfp_mask.
> >   * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
> > @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
> >   */
> >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >  {
> > +       if (unlikely(!node_online(nid))) {
> > +               nodemask_t used_mask;
> > +               nodes_complement(used_mask, node_online_map);
> > +               nid = find_next_best_node(nid, &used_mask);
> > +       }
> >         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >  }
> >
> > I just finished the compiling, not test it yet, since the machine is
> > not on hand yet. It needs some time to get it again.
>
> This is clearly a no-go. nodemask_t can be giant and you cannot have it
> on the stack for allocation paths which might be called from a deep

What about using a static variable instead:
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+       if (unlikely(!node_online(nid))) {
+                   WARN_ONCE(1, "Try to alloc mem from not online node\n");
+                   nid = find_best_node(nid);
+       }
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }

+int find_best_node(int node)
+{
+       static nodemask_t used_mask;
+       static DEFINE_SPINLOCK(lock);
+       static int best_neigh[MAX_NUMNODES] = {0};
+       int nid;
+
+       spin_lock(&lock);
+       if (likely( best_neigh[node] != 0)) {
+               nid = best_neigh[node];
+               goto unlock;
+       }
+       nodes_complement(used_mask, node_online_map);
+       nid = find_next_best_node(node, &used_mask);
+       best_neigh[node] = nid;
+
+unlock:
+       spin_unlock(&lock);
+       return nid;
+}

> stack already. Also this is called from the allocator hot paths and each
> branch counts.
>
I am puzzled by this. Does unlikely() not help here?

Thanks,
Pingfan
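
One detail worth checking in the cached variant above: best_neigh[] uses 0
as the "not cached yet" marker, but node 0 is itself a valid answer, so a
cached result of node 0 would be searched for again on every call. The
userspace sketch below uses -1 as the sentinel instead. It is
single-threaded and leaves locking out entirely; all names are
hypothetical, and the index gap is only a stand-in for NUMA distance.

```c
#include <assert.h>

#define MODEL_MAX_NODES 8
#define MODEL_NO_NODE   (-1)

static int best_neigh[MODEL_MAX_NODES];
static int cache_ready;
static int slow_searches; /* counts slow-path searches, for illustration */

/* Stand-in for the distance search: closest online node by index gap. */
static int slow_find(int node, unsigned int online)
{
    int best = MODEL_NO_NODE;
    int best_gap = MODEL_MAX_NODES;

    for (int n = 0; n < MODEL_MAX_NODES; n++) {
        int gap = n > node ? n - node : node - n;
        if (((online >> n) & 1u) && gap < best_gap) {
            best = n;
            best_gap = gap;
        }
    }
    slow_searches++;
    return best;
}

/* Cached lookup; -1 marks an empty slot so that node 0 can be cached too. */
int cached_best_node(int node, unsigned int online)
{
    assert(node >= 0 && node < MODEL_MAX_NODES);

    if (!cache_ready) {
        for (int n = 0; n < MODEL_MAX_NODES; n++)
            best_neigh[n] = MODEL_NO_NODE;
        cache_ready = 1;
    }
    if (best_neigh[node] == MODEL_NO_NODE)
        best_neigh[node] = slow_find(node, online);
    return best_neigh[node];
}

/* Number of slow-path searches performed so far. */
int slow_search_count(void)
{
    return slow_searches;
}
```

With only node 0 online, two lookups for node 2 both return 0 but trigger a
single search, which is the behaviour the cache is meant to provide.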
[ 1048.902358] kvm: exiting hardware virtualization
[ 1048.910116] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1049.721572] kexec: Starting new kernel
[    0.000000] Linux version 4.20.0-rc1+
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/xxx_dell--per7425--03-root ro crashkernel=auto rd.lvm.lv=xxx_dell-per7425-03/root rd.lvm.lv=xxx_dell-per7425-03/swap console=ttyS0,115200n81 nr_cpus=4
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000005c3d6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] reserve setup_data: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100070-0x000000005c3d6fff] usable
[    0.000000] reserve setup_data: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] reserve setup_data: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] reserve setup_data: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] reserve setup_data: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] efi: EFI v2.50 by Dell Inc.
[    0.000000] efi:  ACPI=0x6c527000  ACPI 2.0=0x6c527014  SMBIOS=0x6afde000  SMBIOS 3.0=0x6afdc000 
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2095.974 MHz processor
[    0.000064] last_pfn = 0x87f000 max_arch_pfn = 0x400000000
[    0.000952] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.001463] last_pfn = 0x70000 max_arch_pfn = 0x400000000
[    0.006776] Using GB pages for direct mapping
[    0.007062] Secure boot could not be determined
[    0.007064] RAMDISK: [mem 0x87a118000-0x87cdfffff]
[    0.007074] ACPI: Early table checksum verification disabled
[    0.007081] ACPI: RSDP 0x000000006C527014 000024 (v02 DELL  )
[    0.007086] ACPI: XSDT 0x000000006C5260E8 0000C4 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007094] ACPI: FACP 0x000000006C516000 000114 (v06 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007101] ACPI: DSDT 0x000000006C505000 00D302 (v02 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007104] ACPI: FACS 0x000000006C2F1000 000040
[    0.007107] ACPI: SSDT 0x000000006C525000 0000D2 (v02 DELL   PE_SC3   00000002 MSFT 04000000)
[    0.007110] ACPI: BERT 0x000000006C524000 000030 (v01 DELL   BERT     00000001 DELL 00000001)
[    0.007112] ACPI: HEST 0x000000006C523000 0006DC (v01 DELL   HEST     00000001 DELL 00000001)
[    0.007115] ACPI: SSDT 0x000000006C522000 0001C4 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007118] ACPI: SRAT 0x000000006C521000 0002D0 (v03 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007121] ACPI: MSCT 0x000000006C520000 0000A6 (v01 DELL   PE_SC3   00000000 AMD  00000001)
[    0.007123] ACPI: SLIT 0x000000006C51F000 00006C (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007126] ACPI: CRAT 0x000000006C51C000 002210 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007129] ACPI: CDIT 0x000000006C51B000 000068 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007132] ACPI: SSDT 0x000000006C51A000 0003C6 (v02 DELL   Tpm2Tabl 00001000 INTL 20170119)
[    0.007134] ACPI: TPM2 0x000000006C519000 000038 (v04 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007137] ACPI: EINJ 0x000000006C518000 000150 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007140] ACPI: SLIC 0x000000006C517000 000024 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007142] ACPI: HPET 0x000000006C515000 000038 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007145] ACPI: APIC 0x000000006C514000 0004B2 (v03 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007148] ACPI: MCFG 0x000000006C513000 00003C (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007150] ACPI: SSDT 0x000000006C504000 0005CA (v02 DELL   xhc_port 00000001 INTL 20170119)
[    0.007153] ACPI: IVRS 0x000000006C503000 000390 (v02 DELL   PE_SC3   00000001 AMD  00000000)
[    0.007156] ACPI: SSDT 0x000000006C501000 001658 (v01 AMD    CPMCMN   00000001 INTL 20170119)
[    0.007226] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.007227] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.007228] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.007229] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.007230] SRAT: PXM 1 -> APIC 0x10 -> Node 1
[    0.007231] SRAT: PXM 1 -> APIC 0x11 -> Node 1
[    0.007232] SRAT: PXM 1 -> APIC 0x18 -> Node 1
[    0.007233] SRAT: PXM 1 -> APIC 0x19 -> Node 1
[    0.007234] SRAT: PXM 2 -> APIC 0x20 -> Node 2
[    0.007235] SRAT: PXM 2 -> APIC 0x21 -> Node 2
[    0.007236] SRAT: PXM 2 -> APIC 0x28 -> Node 2
[    0.007237] SRAT: PXM 2 -> APIC 0x29 -> Node 2
[    0.007238] SRAT: PXM 3 -> APIC 0x30 -> Node 3
[    0.007239] SRAT: PXM 3 -> APIC 0x31 -> Node 3
[    0.007239] SRAT: PXM 3 -> APIC 0x38 -> Node 3
[    0.007240] SRAT: PXM 3 -> APIC 0x39 -> Node 3
[    0.007241] SRAT: PXM 4 -> APIC 0x40 -> Node 4
[    0.007242] SRAT: PXM 4 -> APIC 0x41 -> Node 4
[    0.007243] SRAT: PXM 4 -> APIC 0x48 -> Node 4
[    0.007244] SRAT: PXM 4 -> APIC 0x49 -> Node 4
[    0.007245] SRAT: PXM 5 -> APIC 0x50 -> Node 5
[    0.007246] SRAT: PXM 5 -> APIC 0x51 -> Node 5
[    0.007247] SRAT: PXM 5 -> APIC 0x58 -> Node 5
[    0.007247] SRAT: PXM 5 -> APIC 0x59 -> Node 5
[    0.007248] SRAT: PXM 6 -> APIC 0x60 -> Node 6
[    0.007249] SRAT: PXM 6 -> APIC 0x61 -> Node 6
[    0.007250] SRAT: PXM 6 -> APIC 0x68 -> Node 6
[    0.007251] SRAT: PXM 6 -> APIC 0x69 -> Node 6
[    0.007252] SRAT: PXM 7 -> APIC 0x70 -> Node 7
[    0.007253] SRAT: PXM 7 -> APIC 0x71 -> Node 7
[    0.007254] SRAT: PXM 7 -> APIC 0x78 -> Node 7
[    0.007254] SRAT: PXM 7 -> APIC 0x79 -> Node 7
[    0.007257] ACPI: SRAT: Node 1 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.007259] ACPI: SRAT: Node 1 PXM 1 [mem 0x00100000-0x7fffffff]
[    0.007260] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x47fffffff]
[    0.007261] ACPI: SRAT: Node 5 PXM 5 [mem 0x480000000-0x87fffffff]
[    0.007271] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.007273] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
[    0.007285] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
[    0.007315] NODE_DATA(5) allocated [mem 0x87efd4000-0x87effefff]
[    0.007357] crashkernel: memory value expected
[    0.007407] Zone ranges:
[    0.007409]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.007410]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.007412]   Normal   [mem 0x0000000100000000-0x000000087effffff]
[    0.007414]   Device   empty
[    0.007415] Movable zone start for each node
[    0.007418] Early memory node ranges
[    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
[    0.008195] Zeroed struct page in unavailable ranges: 46490 pages
[    0.008197] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
[    0.024042] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[    0.025522] ACPI: PM-Timer IO Port: 0x408
[    0.025534] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 4/0x20 ignored.
[    0.025535] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 5/0x60 ignored.
[    0.025536] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 6/0x30 ignored.
[    0.025537] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 7/0x70 ignored.
[    0.025538] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 8/0x8 ignored.
[    0.025539] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 9/0x48 ignored.
[    0.025540] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 10/0x18 ignored.
[    0.025541] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 11/0x58 ignored.
[    0.025542] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 12/0x28 ignored.
[    0.025543] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 13/0x68 ignored.
[    0.025544] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 14/0x38 ignored.
[    0.025545] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 15/0x78 ignored.
[    0.025546] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 16/0x1 ignored.
[    0.025547] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 17/0x41 ignored.
[    0.025548] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 18/0x11 ignored.
[    0.025549] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 19/0x51 ignored.
[    0.025550] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 20/0x21 ignored.
[    0.025551] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 21/0x61 ignored.
[    0.025552] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 22/0x31 ignored.
[    0.025553] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 23/0x71 ignored.
[    0.025554] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 24/0x9 ignored.
[    0.025555] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 25/0x49 ignored.
[    0.025556] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 26/0x19 ignored.
[    0.025557] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 27/0x59 ignored.
[    0.025558] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 28/0x29 ignored.
[    0.025559] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 29/0x69 ignored.
[    0.025560] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 30/0x39 ignored.
[    0.025560] APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 31/0x79 ignored.
[    0.025567] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.025596] IOAPIC[0]: apic_id 128, version 33, address 0xfec00000, GSI 0-23
[    0.025601] IOAPIC[1]: apic_id 129, version 33, address 0xfd880000, GSI 24-55
[    0.025607] IOAPIC[2]: apic_id 130, version 33, address 0xea900000, GSI 56-87
[    0.025612] IOAPIC[3]: apic_id 131, version 33, address 0xdd900000, GSI 88-119
[    0.025618] IOAPIC[4]: apic_id 132, version 33, address 0xd0900000, GSI 120-151
[    0.025624] IOAPIC[5]: apic_id 133, version 33, address 0xc3900000, GSI 152-183
[    0.025631] IOAPIC[6]: apic_id 134, version 33, address 0xb6900000, GSI 184-215
[    0.025637] IOAPIC[7]: apic_id 135, version 33, address 0xa9900000, GSI 216-247
[    0.025644] IOAPIC[8]: apic_id 136, version 33, address 0x9c900000, GSI 248-279
[    0.025647] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.025650] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.025656] Using ACPI (MADT) for SMP configuration information
[    0.025659] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.025668] smpboot: 128 Processors exceeds NR_CPUS limit of 4
[    0.025670] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.025673] NODE_DATA(0) allocated [mem 0x87efa3000-0x87efcdfff]
[    0.025674]     NODE_DATA(0) on node 5
[    0.025722] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[    0.025726] NODE_DATA(4) allocated [mem 0x87ef78000-0x87efa2fff]
[    0.025726]     NODE_DATA(4) on node 5
[    0.025761] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
[    0.025785] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.025787] PM: Registered nosave memory: [mem 0x0008f000-0x0008ffff]
[    0.025788] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.025789] PM: Registered nosave memory: [mem 0x00100000-0x00100fff]
[    0.025791] PM: Registered nosave memory: [mem 0x5c3d7000-0x643defff]
[    0.025793] PM: Registered nosave memory: [mem 0x68ff8000-0x6b4f7fff]
[    0.025794] PM: Registered nosave memory: [mem 0x6b4f8000-0x6c327fff]
[    0.025795] PM: Registered nosave memory: [mem 0x6c328000-0x6c527fff]
[    0.025796] PM: Registered nosave memory: [mem 0x70000000-0x8fffffff]
[    0.025797] PM: Registered nosave memory: [mem 0x90000000-0xfec0ffff]
[    0.025798] PM: Registered nosave memory: [mem 0xfec10000-0xfec10fff]
[    0.025799] PM: Registered nosave memory: [mem 0xfec11000-0xfed7ffff]
[    0.025800] PM: Registered nosave memory: [mem 0xfed80000-0xfed80fff]
[    0.025801] PM: Registered nosave memory: [mem 0xfed81000-0xffffffff]
[    0.025804] [mem 0x90000000-0xfec0ffff] available for PCI devices
[    0.025806] Booting paravirtualized kernel on bare hardware
[    0.025812] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.141589] random: get_random_bytes called from start_kernel+0x9b/0x528 with crng_init=0
[    0.141600] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:8
[    0.143537] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u2097152
[    0.143598] Built 4 zonelists, mobility grouping on.  Total pages: 8143202
[    0.143600] Policy zone: Normal
[    0.143605] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/xxx_dell--per7425--03-root ro crashkernel=auto rd.lvm.lv=xxx_dell-per7425-03/root rd.lvm.lv=xxx_dell-per7425-03/swap console=ttyS0,115200n81 nr_cpus=4
[    0.166247] Memory: 1845560K/33089944K available (12293K kernel code, 2066K rwdata, 3752K rodata, 2348K init, 6532K bss, 667612K reserved, 0K cma-reserved)
[    0.166392] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=8
[    0.166403] ftrace: allocating 35639 entries in 140 pages
[    0.184002] rcu: Hierarchical RCU implementation.
[    0.184006] rcu: 	RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=4.
[    0.184008] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    0.184009] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.186336] NR_IRQS: 524544, nr_irqs: 1024, preallocated irqs: 16
[    0.187054] spurious APIC interrupt through vector ff on CPU#0, should never happen.
[    0.187685] Console: colour dummy device 80x25
[    1.667868] printk: console [ttyS0] enabled
[    1.672071] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    1.683299] ACPI: Core revision 20181003
[    1.687482] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    1.696650] APIC: Switch to symmetric I/O mode setup
[    2.468804] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    2.479670] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e365567211, max_idle_ns: 440795264558 ns
[    2.490203] Calibrating delay loop (skipped), value calculated using timer frequency.. 4191.94 BogoMIPS (lpj=2095974)
[    2.491191] pid_max: default: 32768 minimum: 301
[    2.493219] LSM: Security Framework initializing
[    2.494193] Yama: becoming mindful.
[    2.495197] SELinux:  Initializing.
[    2.502765] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[    2.506886] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[    2.507310] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    2.508273] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    2.509485] mce: CPU supports 23 MCE banks
[    2.510239] LVT offset 2 assigned for vector 0xf4
[    2.511201] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[    2.512190] Last level dTLB entries: 4KB 1536, 2MB 1536, 4MB 768, 1GB 0
[    2.513193] Spectre V2 : Mitigation: Full AMD retpoline
[    2.514190] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    2.515190] Spectre V2 : Spectre v2 mitigation: Enabling Indirect Branch Prediction Barrier
[    2.516192] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    2.517507] Freeing SMP alternatives memory: 32K
[    2.522189] smpboot: CPU0: AMD EPYC 7251 8-Core Processor (family: 0x17, model: 0x1, stepping: 0x2)
[    2.522353] Performance Events: Fam17h core perfctr, AMD PMU driver.
[    2.523194] ... version:                0
[    2.524190] ... bit width:              48
[    2.525190] ... generic registers:      6
[    2.526190] ... value mask:             0000ffffffffffff
[    2.527190] ... max period:             00007fffffffffff
[    2.528190] ... fixed-purpose events:   0
[    2.529190] ... event mask:             000000000000003f
[    2.530238] rcu: Hierarchical SRCU implementation.
[    2.532406] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    2.533270] smp: Bringing up secondary CPUs ...
[    2.534386] x86: Booting SMP configuration:
[    2.535192] .... node  #4, CPUs:      #1
[    2.547192] .... node  #1, CPUs:   #2
[    2.549192] .... node  #5, CPUs:   #3
[    2.550282] smp: Brought up 4 nodes, 4 CPUs
[    2.552191] smpboot: Max logical packages: 2
[    2.553191] smpboot: Total of 4 processors activated (16739.73 BogoMIPS)
[    2.694217] node 1 initialised, 3572598 pages in 138ms
[    2.708208] node 5 initialised, 4071595 pages in 152ms
[    2.715067] devtmpfs: initialized
[    2.715263] x86/mm: Memory block size: 128MB
[    2.718668] PM: Registering ACPI NVS region [mem 0x0008f000-0x0008ffff] (4096 bytes)
[    2.719194] PM: Registering ACPI NVS region [mem 0x6b4f8000-0x6c327fff] (14876672 bytes)
[    2.720380] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    2.721207] futex hash table entries: 1024 (order: 4, 65536 bytes)
[    2.722293] pinctrl core: initialized pinctrl subsystem
[    2.723264] RTC time:  1:48:54, date: 11/07/18
[    2.724397] NET: Registered protocol family 16
[    2.726222] audit: initializing netlink subsys (disabled)
[    2.727220] audit: type=2000 audit(1541555331.250:1): state=initialized audit_enabled=0 res=1
[    2.736198] cpuidle: using governor menu
[    2.738263] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    2.739192] ACPI: bus type PCI registered
[    2.740193] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    2.741254] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    2.742208] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
[    2.743202] PCI: Using configuration type 1 for base access
[    2.744200] PCI: Dell System detected, enabling pci=bfsort.
[    2.746944] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    2.747194] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    2.748278] ACPI: Added _OSI(Module Device)
[    2.749195] ACPI: Added _OSI(Processor Device)
[    2.750192] ACPI: Added _OSI(3.0 _SCP Extensions)
[    2.751192] ACPI: Added _OSI(Processor Aggregator Device)
[    2.752191] ACPI: Added _OSI(Linux-Dell-Video)
[    2.753192] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    2.759056] ACPI: 6 ACPI AML tables successfully acquired and loaded
[    2.761950] ACPI: Interpreter enabled
[    2.762203] ACPI: (supports S0 S5)
[    2.763192] ACPI: Using IOAPIC for interrupt routing
[    2.765276] HEST: Table parsing has been initialized.
[    2.766194] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    2.767805] ACPI: Enabled 2 GPEs in block 00 to 1F
[    2.778161] ACPI: PCI Interrupt Link [LNKA] (IRQs 4 5 7 10 11 14 15) *0
[    2.778229] ACPI: PCI Interrupt Link [LNKB] (IRQs 4 5 7 10 11 14 15) *0
[    2.779226] ACPI: PCI Interrupt Link [LNKC] (IRQs 4 5 7 10 11 14 15) *0
[    2.780225] ACPI: PCI Interrupt Link [LNKD] (IRQs 4 5 7 10 11 14 15) *0
[    2.781225] ACPI: PCI Interrupt Link [LNKE] (IRQs 4 5 7 10 11 14 15) *0
[    2.782225] ACPI: PCI Interrupt Link [LNKF] (IRQs 4 5 7 10 11 14 15) *0
[    2.783226] ACPI: PCI Interrupt Link [LNKG] (IRQs 4 5 7 10 11 14 15) *0
[    2.784225] ACPI: PCI Interrupt Link [LNKH] (IRQs 4 5 7 10 11 14 15) *0
[    2.785369] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-1f])
[    2.786198] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.787231] acpi PNP0A08:00: PCIe AER handled by firmware
[    2.788229] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug LTR]
[    2.789258] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.790191] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.791417] PCI host bridge to bus 0000:00
[    2.792193] pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
[    2.793191] pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
[    2.794192] pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
[    2.795192] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    2.796191] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000c3fff window]
[    2.797191] pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff window]
[    2.798191] pci_bus 0000:00: root bus resource [mem 0x000c8000-0x000cbfff window]
[    2.799192] pci_bus 0000:00: root bus resource [mem 0x000cc000-0x000cffff window]
[    2.800191] pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000d3fff window]
[    2.801191] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff window]
[    2.802192] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff window]
[    2.803192] pci_bus 0000:00: root bus resource [mem 0x000dc000-0x000dffff window]
[    2.804191] pci_bus 0000:00: root bus resource [mem 0x000e0000-0x000e3fff window]
[    2.805191] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff window]
[    2.806191] pci_bus 0000:00: root bus resource [mem 0x000e8000-0x000ebfff window]
[    2.807192] pci_bus 0000:00: root bus resource [mem 0x000ec000-0x000effff window]
[    2.808191] pci_bus 0000:00: root bus resource [mem 0x000f0000-0x000fffff window]
[    2.809191] pci_bus 0000:00: root bus resource [io  0x0d00-0x1fff window]
[    2.810192] pci_bus 0000:00: root bus resource [mem 0xeb000000-0xfebfffff window]
[    2.811192] pci_bus 0000:00: root bus resource [mem 0x10000000000-0x1df9fffffff window]
[    2.812191] pci_bus 0000:00: root bus resource [bus 00-1f]
[    2.817111] pci 0000:00:07.1: enabling Extended Tags
[    2.818393] pci 0000:00:08.1: enabling Extended Tags
[    2.825592] pci 0000:02:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:01.1 (capable of 15.752 Gb/s with 8 GT/s x2 link)
[    2.826549] pci 0000:00:01.1: PCI bridge to [bus 02]
[    2.828441] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:01.2 (capable of 15.752 Gb/s with 8 GT/s x2 link)
[    2.830504] pci 0000:00:01.2: PCI bridge to [bus 01]
[    2.832414] pci 0000:00:01.3: PCI bridge to [bus 03-04]
[    2.833226] pci_bus 0000:04: extended config space not accessible
[    2.834398] pci 0000:03:00.0: PCI bridge to [bus 04]
[    2.836278] pci 0000:05:00.0: enabling Extended Tags
[    2.837331] pci 0000:05:00.2: enabling Extended Tags
[    2.838335] pci 0000:05:00.3: enabling Extended Tags
[    2.839299] pci 0000:00:07.1: PCI bridge to [bus 05]
[    2.841293] pci 0000:06:00.0: enabling Extended Tags
[    2.842339] pci 0000:06:00.1: enabling Extended Tags
[    2.843352] pci 0000:06:00.2: enabling Extended Tags
[    2.844306] pci 0000:00:08.1: PCI bridge to [bus 06]
[    2.845601] ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 20-3f])
[    2.846193] acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.847227] acpi PNP0A08:01: PCIe AER handled by firmware
[    2.848228] acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug LTR]
[    2.850231] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.851199] acpi PNP0A08:01: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.853278] PCI host bridge to bus 0000:20
[    2.854192] pci_bus 0000:20: root bus resource [io  0x2000-0x3fff window]
[    2.855192] pci_bus 0000:20: root bus resource [mem 0xde000000-0xeaffffff window]
[    2.856191] pci_bus 0000:20: root bus resource [mem 0x1dfa0000000-0x2bf3fffffff window]
[    2.857191] pci_bus 0000:20: root bus resource [bus 20-3f]
[    2.860141] pci 0000:20:07.1: enabling Extended Tags
[    2.861427] pci 0000:20:08.1: enabling Extended Tags
[    2.862560] pci 0000:21:00.0: enabling Extended Tags
[    2.863340] pci 0000:21:00.2: enabling Extended Tags
[    2.864354] pci 0000:21:00.3: enabling Extended Tags
[    2.865306] pci 0000:20:07.1: PCI bridge to [bus 21]
[    2.867301] pci 0000:22:00.0: enabling Extended Tags
[    2.868351] pci 0000:22:00.1: enabling Extended Tags
[    2.869307] pci 0000:20:08.1: PCI bridge to [bus 22]
[    2.870406] ACPI: PCI Root Bridge [PC02] (domain 0000 [bus 40-5f])
[    2.871193] acpi PNP0A08:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.872227] acpi PNP0A08:02: PCIe AER handled by firmware
[    2.873228] acpi PNP0A08:02: _OSC: platform does not support [SHPCHotplug LTR]
[    2.874256] acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.875191] acpi PNP0A08:02: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.876387] PCI host bridge to bus 0000:40
[    2.877192] pci_bus 0000:40: root bus resource [io  0x4000-0x5fff window]
[    2.878191] pci_bus 0000:40: root bus resource [mem 0xd1000000-0xddffffff window]
[    2.879191] pci_bus 0000:40: root bus resource [mem 0x2bf40000000-0x39edfffffff window]
[    2.880192] pci_bus 0000:40: root bus resource [bus 40-5f]
[    2.882669] pci 0000:40:07.1: enabling Extended Tags
[    2.883502] pci 0000:40:08.1: enabling Extended Tags
[    2.885297] pci 0000:41:00.0: enabling Extended Tags
[    2.886340] pci 0000:41:00.2: enabling Extended Tags
[    2.887301] pci 0000:40:07.1: PCI bridge to [bus 41]
[    2.888574] pci 0000:42:00.0: enabling Extended Tags
[    2.889348] pci 0000:42:00.1: enabling Extended Tags
[    2.890307] pci 0000:40:08.1: PCI bridge to [bus 42]
[    2.892202] ACPI: PCI Root Bridge [PC03] (domain 0000 [bus 60-7f])
[    2.893195] acpi PNP0A08:03: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.894225] acpi PNP0A08:03: PCIe AER handled by firmware
[    2.895227] acpi PNP0A08:03: _OSC: platform does not support [SHPCHotplug LTR]
[    2.896256] acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.897192] acpi PNP0A08:03: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.899351] PCI host bridge to bus 0000:60
[    2.900192] pci_bus 0000:60: root bus resource [io  0x6000-0x7fff window]
[    2.901192] pci_bus 0000:60: root bus resource [mem 0xc4000000-0xd0ffffff window]
[    2.902191] pci_bus 0000:60: root bus resource [mem 0x39ee0000000-0x47e7fffffff window]
[    2.903191] pci_bus 0000:60: root bus resource [bus 60-7f]
[    2.905195] pci 0000:60:07.1: enabling Extended Tags
[    2.906503] pci 0000:60:08.1: enabling Extended Tags
[    2.908431] pci 0000:60:03.1: PCI bridge to [bus 61]
[    2.909317] pci 0000:62:00.0: enabling Extended Tags
[    2.910342] pci 0000:62:00.2: enabling Extended Tags
[    2.911303] pci 0000:60:07.1: PCI bridge to [bus 62]
[    2.912329] pci 0000:63:00.0: enabling Extended Tags
[    2.914210] pci 0000:63:00.1: enabling Extended Tags
[    2.916298] pci 0000:60:08.1: PCI bridge to [bus 63]
[    2.917380] ACPI: PCI Root Bridge [PC04] (domain 0000 [bus 80-9f])
[    2.918193] acpi PNP0A08:04: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.919226] acpi PNP0A08:04: PCIe AER handled by firmware
[    2.920227] acpi PNP0A08:04: _OSC: platform does not support [SHPCHotplug LTR]
[    2.921258] acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.922191] acpi PNP0A08:04: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.923377] PCI host bridge to bus 0000:80
[    2.924192] pci_bus 0000:80: root bus resource [io  0x8000-0x9fff window]
[    2.925192] pci_bus 0000:80: root bus resource [mem 0xb7000000-0xc3ffffff window]
[    2.926191] pci_bus 0000:80: root bus resource [mem 0x47e80000000-0x55e1fffffff window]
[    2.927191] pci_bus 0000:80: root bus resource [bus 80-9f]
[    2.928924] pci 0000:80:07.1: enabling Extended Tags
[    2.931096] pci 0000:80:08.1: enabling Extended Tags
[    2.932468] pci 0000:81:00.0: enabling Extended Tags
[    2.933364] pci 0000:81:00.2: enabling Extended Tags
[    2.934316] pci 0000:80:07.1: PCI bridge to [bus 81]
[    2.936260] pci 0000:82:00.0: enabling Extended Tags
[    2.937373] pci 0000:82:00.1: enabling Extended Tags
[    2.938322] pci 0000:80:08.1: PCI bridge to [bus 82]
[    2.939374] ACPI: PCI Root Bridge [PC05] (domain 0000 [bus a0-bf])
[    2.940193] acpi PNP0A08:05: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.941227] acpi PNP0A08:05: PCIe AER handled by firmware
[    2.942227] acpi PNP0A08:05: _OSC: platform does not support [SHPCHotplug LTR]
[    2.943256] acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.944191] acpi PNP0A08:05: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.945377] PCI host bridge to bus 0000:a0
[    2.946192] pci_bus 0000:a0: root bus resource [io  0xa000-0xbfff window]
[    2.947191] pci_bus 0000:a0: root bus resource [mem 0xaa000000-0xb6ffffff window]
[    2.948191] pci_bus 0000:a0: root bus resource [mem 0x55e20000000-0x63dbfffffff window]
[    2.949192] pci_bus 0000:a0: root bus resource [bus a0-bf]
[    2.951398] pci 0000:a0:07.1: enabling Extended Tags
[    2.952959] pci 0000:a0:08.1: enabling Extended Tags
[    2.954959] pci 0000:a1:00.0: enabling Extended Tags
[    2.955366] pci 0000:a1:00.2: enabling Extended Tags
[    2.957216] pci 0000:a0:07.1: PCI bridge to [bus a1]
[    2.959358] pci 0000:a2:00.0: enabling Extended Tags
[    2.960372] pci 0000:a2:00.1: enabling Extended Tags
[    2.961324] pci 0000:a0:08.1: PCI bridge to [bus a2]
[    2.962407] ACPI: PCI Root Bridge [PC06] (domain 0000 [bus c0-df])
[    2.963193] acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.965194] acpi PNP0A08:06: PCIe AER handled by firmware
[    2.966228] acpi PNP0A08:06: _OSC: platform does not support [SHPCHotplug LTR]
[    2.967256] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.968191] acpi PNP0A08:06: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.969376] PCI host bridge to bus 0000:c0
[    2.970193] pci_bus 0000:c0: root bus resource [io  0xc000-0xdfff window]
[    2.971191] pci_bus 0000:c0: root bus resource [mem 0x9d000000-0xa9ffffff window]
[    2.972191] pci_bus 0000:c0: root bus resource [mem 0x63dc0000000-0x71d5fffffff window]
[    2.973191] pci_bus 0000:c0: root bus resource [bus c0-df]
[    2.975736] pci 0000:c0:07.1: enabling Extended Tags
[    2.976530] pci 0000:c0:08.1: enabling Extended Tags
[    2.978855] pci 0000:c1:00.0: enabling Extended Tags
[    2.979353] pci 0000:c1:00.2: enabling Extended Tags
[    2.980308] pci 0000:c0:07.1: PCI bridge to [bus c1]
[    2.982191] pci 0000:c2:00.0: enabling Extended Tags
[    2.984334] pci 0000:c2:00.1: enabling Extended Tags
[    2.985317] pci 0000:c0:08.1: PCI bridge to [bus c2]
[    2.986380] ACPI: PCI Root Bridge [PC07] (domain 0000 [bus e0-ff])
[    2.987194] acpi PNP0A08:07: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    2.988226] acpi PNP0A08:07: PCIe AER handled by firmware
[    2.989228] acpi PNP0A08:07: _OSC: platform does not support [SHPCHotplug LTR]
[    2.990256] acpi PNP0A08:07: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    2.991192] acpi PNP0A08:07: FADT indicates ASPM is unsupported, using BIOS configuration
[    2.992328] acpi PNP0A08:07: host bridge window [mem 0x71d60000000-0xffffffffffff window] ([0x80000000000-0xffffffffffff] ignored, not CPU addressable)
[    2.993246] PCI host bridge to bus 0000:e0
[    2.994191] pci_bus 0000:e0: root bus resource [io  0xe000-0xffff window]
[    2.995192] pci_bus 0000:e0: root bus resource [mem 0x90000000-0x9cffffff window]
[    2.996191] pci_bus 0000:e0: root bus resource [mem 0x71d60000000-0x7ffffffffff window]
[    2.997196] pci_bus 0000:e0: root bus resource [bus e0-ff]
[    3.000196] pci 0000:e0:07.1: enabling Extended Tags
[    3.001531] pci 0000:e0:08.1: enabling Extended Tags
[    3.003296] pci 0000:e1:00.0: enabling Extended Tags
[    3.004362] pci 0000:e1:00.2: enabling Extended Tags
[    3.005314] pci 0000:e0:07.1: PCI bridge to [bus e1]
[    3.006364] pci 0000:e2:00.0: enabling Extended Tags
[    3.007369] pci 0000:e2:00.1: enabling Extended Tags
[    3.008324] pci 0000:e0:08.1: PCI bridge to [bus e2]
[    3.013216] pci 0000:04:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    3.014205] pci 0000:04:00.0: vgaarb: bridge control possible
[    3.015193] pci 0000:04:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[    3.016192] vgaarb: loaded
[    3.017299] SCSI subsystem initialized
[    3.018214] ACPI: bus type USB registered
[    3.019207] usbcore: registered new interface driver usbfs
[    3.020197] usbcore: registered new interface driver hub
[    3.021210] usbcore: registered new device driver usb
[    3.022212] pps_core: LinuxPPS API ver. 1 registered
[    3.023190] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    3.024193] PTP clock support registered
[    3.025230] EDAC MC: Ver: 3.0.0
[    3.027299] Registered efivars operations
[    3.217225] PCI: Using ACPI for IRQ routing
[    3.238515] NetLabel: Initializing
[    3.239191] NetLabel:  domain hash size = 128
[    3.240190] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    3.241207] NetLabel:  unlabeled traffic allowed by default
[    3.242286] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    3.243193] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    3.246273] clocksource: Switched to clocksource tsc-early
[    3.261402] VFS: Disk quotas dquot_6.6.0
[    3.265359] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    3.272339] pnp: PnP ACPI init
[    3.275723] system 00:00: [mem 0x80000000-0x8fffffff] has been reserved
[    3.283023] pnp: PnP ACPI: found 4 devices
[    3.292620] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    3.301495] pci 0000:02:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    3.311405] pci 0000:02:00.1: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    3.321314] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    3.331228] pci 0000:01:00.1: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    3.341155] pci 0000:61:00.0: can't claim BAR 6 [mem 0xfff00000-0xffffffff pref]: no compatible bridge window
[    3.351117] pci 0000:02:00.0: BAR 6: assigned [mem 0xec200000-0xec23ffff pref]
[    3.358339] pci 0000:02:00.1: BAR 6: assigned [mem 0xec240000-0xec27ffff pref]
[    3.365565] pci 0000:00:01.1: PCI bridge to [bus 02]
[    3.370532] pci 0000:00:01.1:   bridge window [mem 0xec200000-0xec2fffff]
[    3.377326] pci 0000:00:01.1:   bridge window [mem 0xec100000-0xec1fffff 64bit pref]
[    3.385069] pci 0000:01:00.0: BAR 6: assigned [mem 0xec300000-0xec33ffff pref]
[    3.392294] pci 0000:01:00.1: BAR 6: assigned [mem 0xec340000-0xec37ffff pref]
[    3.399512] pci 0000:00:01.2: PCI bridge to [bus 01]
[    3.404482] pci 0000:00:01.2:   bridge window [mem 0xec300000-0xec3fffff]
[    3.411276] pci 0000:00:01.2:   bridge window [mem 0xec000000-0xec0fffff 64bit pref]
[    3.419026] pci 0000:04:00.0: BAR 6: assigned [mem 0xf9810000-0xf981ffff pref]
[    3.426250] pci 0000:03:00.0: PCI bridge to [bus 04]
[    3.431220] pci 0000:03:00.0:   bridge window [mem 0xf9000000-0xf98fffff]
[    3.438016] pci 0000:03:00.0:   bridge window [mem 0xeb000000-0xebffffff 64bit pref]
[    3.445763] pci 0000:00:01.3: PCI bridge to [bus 03-04]
[    3.450996] pci 0000:00:01.3:   bridge window [mem 0xf9000000-0xf98fffff]
[    3.457784] pci 0000:00:01.3:   bridge window [mem 0xeb000000-0xebffffff 64bit pref]
[    3.465535] pci 0000:00:07.1: PCI bridge to [bus 05]
[    3.470506] pci 0000:00:07.1:   bridge window [mem 0xf9b00000-0xf9dfffff]
[    3.477294] pci 0000:00:08.1: PCI bridge to [bus 06]
[    3.482268] pci 0000:00:08.1:   bridge window [mem 0xf9900000-0xf9afffff]
[    3.489148] pci 0000:20:07.1: PCI bridge to [bus 21]
[    3.494116] pci 0000:20:07.1:   bridge window [mem 0xe8200000-0xe84fffff]
[    3.500912] pci 0000:20:08.1: PCI bridge to [bus 22]
[    3.505885] pci 0000:20:08.1:   bridge window [mem 0xe8000000-0xe81fffff]
[    3.512717] pci 0000:40:07.1: PCI bridge to [bus 41]
[    3.517690] pci 0000:40:07.1:   bridge window [mem 0xdb200000-0xdb3fffff]
[    3.524485] pci 0000:40:08.1: PCI bridge to [bus 42]
[    3.529460] pci 0000:40:08.1:   bridge window [mem 0xdb000000-0xdb1fffff]
[    3.536284] pci 0000:61:00.0: BAR 6: no space for [mem size 0x00100000 pref]
[    3.543335] pci 0000:61:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
[    3.550726] pci 0000:60:03.1: PCI bridge to [bus 61]
[    3.555696] pci 0000:60:03.1:   bridge window [io  0x6000-0x6fff]
[    3.561798] pci 0000:60:03.1:   bridge window [mem 0xce400000-0xce5fffff]
[    3.568593] pci 0000:60:07.1: PCI bridge to [bus 62]
[    3.573566] pci 0000:60:07.1:   bridge window [mem 0xce200000-0xce3fffff]
[    3.580363] pci 0000:60:08.1: PCI bridge to [bus 63]
[    3.585338] pci 0000:60:08.1:   bridge window [mem 0xce000000-0xce1fffff]
[    3.592157] pci 0000:80:07.1: PCI bridge to [bus 81]
[    3.597134] pci 0000:80:07.1:   bridge window [mem 0xc1200000-0xc13fffff]
[    3.603930] pci 0000:80:08.1: PCI bridge to [bus 82]
[    3.608904] pci 0000:80:08.1:   bridge window [mem 0xc1000000-0xc11fffff]
[    3.615722] pci 0000:a0:07.1: PCI bridge to [bus a1]
[    3.620690] pci 0000:a0:07.1:   bridge window [mem 0xb4200000-0xb43fffff]
[    3.627484] pci 0000:a0:08.1: PCI bridge to [bus a2]
[    3.632460] pci 0000:a0:08.1:   bridge window [mem 0xb4000000-0xb41fffff]
[    3.639286] pci 0000:c0:07.1: PCI bridge to [bus c1]
[    3.644256] pci 0000:c0:07.1:   bridge window [mem 0xa7200000-0xa73fffff]
[    3.651052] pci 0000:c0:08.1: PCI bridge to [bus c2]
[    3.656024] pci 0000:c0:08.1:   bridge window [mem 0xa7000000-0xa71fffff]
[    3.662846] pci 0000:e0:07.1: PCI bridge to [bus e1]
[    3.667820] pci 0000:e0:07.1:   bridge window [mem 0x9a200000-0x9a3fffff]
[    3.674618] pci 0000:e0:08.1: PCI bridge to [bus e2]
[    3.679590] pci 0000:e0:08.1:   bridge window [mem 0x9a000000-0x9a1fffff]
[    3.686444] NET: Registered protocol family 2
[    3.691063] tcp_listen_portaddr_hash hash table entries: 16384 (order: 6, 262144 bytes)
[    3.699181] TCP established hash table entries: 262144 (order: 9, 2097152 bytes)
[    3.707061] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    3.713995] TCP: Hash tables configured (established 262144 bind 65536)
[    3.720671] UDP hash table entries: 16384 (order: 7, 524288 bytes)
[    3.727018] UDP-Lite hash table entries: 16384 (order: 7, 524288 bytes)
[    3.733814] NET: Registered protocol family 1
[    3.738943] Unpacking initramfs...
[    4.384141] Freeing initrd memory: 45984K
[    4.388246] AMD-Vi: IOMMU performance counters supported
[    4.393652] AMD-Vi: IOMMU performance counters supported
[    4.399003] AMD-Vi: IOMMU performance counters supported
[    4.404345] AMD-Vi: IOMMU performance counters supported
[    4.409696] AMD-Vi: IOMMU performance counters supported
[    4.415047] AMD-Vi: IOMMU performance counters supported
[    4.420399] AMD-Vi: IOMMU performance counters supported
[    4.425752] AMD-Vi: IOMMU performance counters supported
[    4.431301] iommu: Adding device 0000:00:01.0 to group 0
[    4.436684] iommu: Adding device 0000:00:01.1 to group 1
[    4.442068] iommu: Adding device 0000:00:01.2 to group 2
[    4.447457] iommu: Adding device 0000:00:01.3 to group 3
[    4.452856] iommu: Adding device 0000:00:02.0 to group 4
[    4.458260] iommu: Adding device 0000:00:03.0 to group 5
[    4.463661] iommu: Adding device 0000:00:04.0 to group 6
[    4.469070] iommu: Adding device 0000:00:07.0 to group 7
[    4.474449] iommu: Adding device 0000:00:07.1 to group 8
[    4.479853] iommu: Adding device 0000:00:08.0 to group 9
[    4.485237] iommu: Adding device 0000:00:08.1 to group 10
[    4.490746] iommu: Adding device 0000:00:14.0 to group 11
[    4.496183] iommu: Adding device 0000:00:14.3 to group 11
[    4.501786] iommu: Adding device 0000:00:18.0 to group 12
[    4.507221] iommu: Adding device 0000:00:18.1 to group 12
[    4.512653] iommu: Adding device 0000:00:18.2 to group 12
[    4.518077] iommu: Adding device 0000:00:18.3 to group 12
[    4.523505] iommu: Adding device 0000:00:18.4 to group 12
[    4.528930] iommu: Adding device 0000:00:18.5 to group 12
[    4.534355] iommu: Adding device 0000:00:18.6 to group 12
[    4.539781] iommu: Adding device 0000:00:18.7 to group 12
[    4.545393] iommu: Adding device 0000:00:19.0 to group 13
[    4.550824] iommu: Adding device 0000:00:19.1 to group 13
[    4.556258] iommu: Adding device 0000:00:19.2 to group 13
[    4.561684] iommu: Adding device 0000:00:19.3 to group 13
[    4.567119] iommu: Adding device 0000:00:19.4 to group 13
[    4.572552] iommu: Adding device 0000:00:19.5 to group 13
[    4.577988] iommu: Adding device 0000:00:19.6 to group 13
[    4.583424] iommu: Adding device 0000:00:19.7 to group 13
[    4.589026] iommu: Adding device 0000:00:1a.0 to group 14
[    4.594455] iommu: Adding device 0000:00:1a.1 to group 14
[    4.599888] iommu: Adding device 0000:00:1a.2 to group 14
[    4.605322] iommu: Adding device 0000:00:1a.3 to group 14
[    4.610754] iommu: Adding device 0000:00:1a.4 to group 14
[    4.616180] iommu: Adding device 0000:00:1a.5 to group 14
[    4.621605] iommu: Adding device 0000:00:1a.6 to group 14
[    4.627031] iommu: Adding device 0000:00:1a.7 to group 14
[    4.632638] iommu: Adding device 0000:00:1b.0 to group 15
[    4.638070] iommu: Adding device 0000:00:1b.1 to group 15
[    4.643499] iommu: Adding device 0000:00:1b.2 to group 15
[    4.648925] iommu: Adding device 0000:00:1b.3 to group 15
[    4.654353] iommu: Adding device 0000:00:1b.4 to group 15
[    4.659786] iommu: Adding device 0000:00:1b.5 to group 15
[    4.665221] iommu: Adding device 0000:00:1b.6 to group 15
[    4.670656] iommu: Adding device 0000:00:1b.7 to group 15
[    4.676261] iommu: Adding device 0000:00:1c.0 to group 16
[    4.681688] iommu: Adding device 0000:00:1c.1 to group 16
[    4.687121] iommu: Adding device 0000:00:1c.2 to group 16
[    4.692547] iommu: Adding device 0000:00:1c.3 to group 16
[    4.697980] iommu: Adding device 0000:00:1c.4 to group 16
[    4.703408] iommu: Adding device 0000:00:1c.5 to group 16
[    4.708843] iommu: Adding device 0000:00:1c.6 to group 16
[    4.714277] iommu: Adding device 0000:00:1c.7 to group 16
[    4.719890] iommu: Adding device 0000:00:1d.0 to group 17
[    4.725320] iommu: Adding device 0000:00:1d.1 to group 17
[    4.730751] iommu: Adding device 0000:00:1d.2 to group 17
[    4.736177] iommu: Adding device 0000:00:1d.3 to group 17
[    4.741611] iommu: Adding device 0000:00:1d.4 to group 17
[    4.747037] iommu: Adding device 0000:00:1d.5 to group 17
[    4.752472] iommu: Adding device 0000:00:1d.6 to group 17
[    4.757905] iommu: Adding device 0000:00:1d.7 to group 17
[    4.763506] iommu: Adding device 0000:00:1e.0 to group 18
[    4.768940] iommu: Adding device 0000:00:1e.1 to group 18
[    4.774372] iommu: Adding device 0000:00:1e.2 to group 18
[    4.779812] iommu: Adding device 0000:00:1e.3 to group 18
[    4.785244] iommu: Adding device 0000:00:1e.4 to group 18
[    4.790679] iommu: Adding device 0000:00:1e.5 to group 18
[    4.796113] iommu: Adding device 0000:00:1e.6 to group 18
[    4.801548] iommu: Adding device 0000:00:1e.7 to group 18
[    4.807161] iommu: Adding device 0000:00:1f.0 to group 19
[    4.812598] iommu: Adding device 0000:00:1f.1 to group 19
[    4.818032] iommu: Adding device 0000:00:1f.2 to group 19
[    4.823466] iommu: Adding device 0000:00:1f.3 to group 19
[    4.828899] iommu: Adding device 0000:00:1f.4 to group 19
[    4.834334] iommu: Adding device 0000:00:1f.5 to group 19
[    4.839771] iommu: Adding device 0000:00:1f.6 to group 19
[    4.845207] iommu: Adding device 0000:00:1f.7 to group 19
[    4.850721] iommu: Adding device 0000:01:00.0 to group 20
[    4.856176] iommu: Adding device 0000:01:00.1 to group 20
[    4.861702] iommu: Adding device 0000:02:00.0 to group 21
[    4.867160] iommu: Adding device 0000:02:00.1 to group 21
[    4.872627] iommu: Adding device 0000:03:00.0 to group 22
[    4.878038] iommu: Adding device 0000:04:00.0 to group 22
[    4.883504] iommu: Adding device 0000:05:00.0 to group 23
[    4.888985] iommu: Adding device 0000:05:00.2 to group 24
[    4.894469] iommu: Adding device 0000:05:00.3 to group 25
[    4.899939] iommu: Adding device 0000:06:00.0 to group 26
[    4.905413] iommu: Adding device 0000:06:00.1 to group 27
[    4.910887] iommu: Adding device 0000:06:00.2 to group 28
[    4.916385] iommu: Adding device 0000:20:01.0 to group 29
[    4.921873] iommu: Adding device 0000:20:02.0 to group 30
[    4.927361] iommu: Adding device 0000:20:03.0 to group 31
[    4.932847] iommu: Adding device 0000:20:04.0 to group 32
[    4.938337] iommu: Adding device 0000:20:07.0 to group 33
[    4.943811] iommu: Adding device 0000:20:07.1 to group 34
[    4.949304] iommu: Adding device 0000:20:08.0 to group 35
[    4.954775] iommu: Adding device 0000:20:08.1 to group 36
[    4.960256] iommu: Adding device 0000:21:00.0 to group 37
[    4.965727] iommu: Adding device 0000:21:00.2 to group 38
[    4.971211] iommu: Adding device 0000:21:00.3 to group 39
[    4.976688] iommu: Adding device 0000:22:00.0 to group 40
[    4.982171] iommu: Adding device 0000:22:00.1 to group 41
[    4.987655] iommu: Adding device 0000:40:01.0 to group 42
[    4.993153] iommu: Adding device 0000:40:02.0 to group 43
[    4.998645] iommu: Adding device 0000:40:03.0 to group 44
[    5.004136] iommu: Adding device 0000:40:04.0 to group 45
[    5.009630] iommu: Adding device 0000:40:07.0 to group 46
[    5.015111] iommu: Adding device 0000:40:07.1 to group 47
[    5.020606] iommu: Adding device 0000:40:08.0 to group 48
[    5.026084] iommu: Adding device 0000:40:08.1 to group 49
[    5.031563] iommu: Adding device 0000:41:00.0 to group 50
[    5.037042] iommu: Adding device 0000:41:00.2 to group 51
[    5.042520] iommu: Adding device 0000:42:00.0 to group 52
[    5.047998] iommu: Adding device 0000:42:00.1 to group 53
[    5.053486] iommu: Adding device 0000:60:01.0 to group 54
[    5.058985] iommu: Adding device 0000:60:02.0 to group 55
[    5.064473] iommu: Adding device 0000:60:03.0 to group 56
[    5.069956] iommu: Adding device 0000:60:03.1 to group 57
[    5.075441] iommu: Adding device 0000:60:04.0 to group 58
[    5.080939] iommu: Adding device 0000:60:07.0 to group 59
[    5.086408] iommu: Adding device 0000:60:07.1 to group 60
[    5.091908] iommu: Adding device 0000:60:08.0 to group 61
[    5.097383] iommu: Adding device 0000:60:08.1 to group 62
[    5.102873] iommu: Adding device 0000:61:00.0 to group 63
[    5.108357] iommu: Adding device 0000:62:00.0 to group 64
[    5.113846] iommu: Adding device 0000:62:00.2 to group 65
[    5.119329] iommu: Adding device 0000:63:00.0 to group 66
[    5.124817] iommu: Adding device 0000:63:00.1 to group 67
[    5.130312] iommu: Adding device 0000:80:01.0 to group 68
[    5.135804] iommu: Adding device 0000:80:02.0 to group 69
[    5.141292] iommu: Adding device 0000:80:03.0 to group 70
[    5.146790] iommu: Adding device 0000:80:04.0 to group 71
[    5.152287] iommu: Adding device 0000:80:07.0 to group 72
[    5.157767] iommu: Adding device 0000:80:07.1 to group 73
[    5.163267] iommu: Adding device 0000:80:08.0 to group 74
[    5.168752] iommu: Adding device 0000:80:08.1 to group 75
[    5.174238] iommu: Adding device 0000:81:00.0 to group 76
[    5.179728] iommu: Adding device 0000:81:00.2 to group 77
[    5.185206] iommu: Adding device 0000:82:00.0 to group 78
[    5.190700] iommu: Adding device 0000:82:00.1 to group 79
[    5.196193] iommu: Adding device 0000:a0:01.0 to group 80
[    5.201705] iommu: Adding device 0000:a0:02.0 to group 81
[    5.207195] iommu: Adding device 0000:a0:03.0 to group 82
[    5.212697] iommu: Adding device 0000:a0:04.0 to group 83
[    5.218186] iommu: Adding device 0000:a0:07.0 to group 84
[    5.223677] iommu: Adding device 0000:a0:07.1 to group 85
[    5.229169] iommu: Adding device 0000:a0:08.0 to group 86
[    5.234661] iommu: Adding device 0000:a0:08.1 to group 87
[    5.240158] iommu: Adding device 0000:a1:00.0 to group 88
[    5.245646] iommu: Adding device 0000:a1:00.2 to group 89
[    5.251131] iommu: Adding device 0000:a2:00.0 to group 90
[    5.256621] iommu: Adding device 0000:a2:00.1 to group 91
[    5.262118] iommu: Adding device 0000:c0:01.0 to group 92
[    5.267619] iommu: Adding device 0000:c0:02.0 to group 93
[    5.273112] iommu: Adding device 0000:c0:03.0 to group 94
[    5.278608] iommu: Adding device 0000:c0:04.0 to group 95
[    5.284112] iommu: Adding device 0000:c0:07.0 to group 96
[    5.289596] iommu: Adding device 0000:c0:07.1 to group 97
[    5.295094] iommu: Adding device 0000:c0:08.0 to group 98
[    5.300587] iommu: Adding device 0000:c0:08.1 to group 99
[    5.306073] iommu: Adding device 0000:c1:00.0 to group 100
[    5.311656] iommu: Adding device 0000:c1:00.2 to group 101
[    5.317227] iommu: Adding device 0000:c2:00.0 to group 102
[    5.322809] iommu: Adding device 0000:c2:00.1 to group 103
[    5.328394] iommu: Adding device 0000:e0:01.0 to group 104
[    5.333978] iommu: Adding device 0000:e0:02.0 to group 105
[    5.339559] iommu: Adding device 0000:e0:03.0 to group 106
[    5.345150] iommu: Adding device 0000:e0:04.0 to group 107
[    5.350739] iommu: Adding device 0000:e0:07.0 to group 108
[    5.356306] iommu: Adding device 0000:e0:07.1 to group 109
[    5.361900] iommu: Adding device 0000:e0:08.0 to group 110
[    5.367469] iommu: Adding device 0000:e0:08.1 to group 111
[    5.373052] iommu: Adding device 0000:e1:00.0 to group 112
[    5.378619] iommu: Adding device 0000:e1:00.2 to group 113
[    5.384204] iommu: Adding device 0000:e2:00.0 to group 114
[    5.389779] iommu: Adding device 0000:e2:00.1 to group 115
[    5.395500] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[    5.400813] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.406124]  PPR NX GT IA GA PC GA_vAPIC
[    5.410054] AMD-Vi: Found IOMMU at 0000:20:00.2 cap 0x40
[    5.415364] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.420677]  PPR NX GT IA GA PC GA_vAPIC
[    5.424603] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
[    5.429916] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.435228]  PPR NX GT IA GA PC GA_vAPIC
[    5.439154] AMD-Vi: Found IOMMU at 0000:60:00.2 cap 0x40
[    5.444468] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.449780]  PPR NX GT IA GA PC GA_vAPIC
[    5.453708] AMD-Vi: Found IOMMU at 0000:80:00.2 cap 0x40
[    5.459019] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.464331]  PPR NX GT IA GA PC GA_vAPIC
[    5.468258] AMD-Vi: Found IOMMU at 0000:a0:00.2 cap 0x40
[    5.473572] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.478885]  PPR NX GT IA GA PC GA_vAPIC
[    5.482812] AMD-Vi: Found IOMMU at 0000:c0:00.2 cap 0x40
[    5.488122] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.493439]  PPR NX GT IA GA PC GA_vAPIC
[    5.497365] AMD-Vi: Found IOMMU at 0000:e0:00.2 cap 0x40
[    5.502675] AMD-Vi: Extended features (0xf77ef22294ada):
[    5.507989]  PPR NX GT IA GA PC GA_vAPIC
[    5.511913] AMD-Vi: Interrupt remapping enabled
[    5.516447] AMD-Vi: virtual APIC enabled
[    5.521109] AMD-Vi: Lazy IO/TLB flushing enabled
[    5.527025] amd_uncore: AMD NB counters detected
[    5.531646] amd_uncore: AMD LLC counters detected
[    5.536489] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    5.543629] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    5.550771] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    5.557910] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[    5.565049] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[    5.572192] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[    5.579331] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[    5.586472] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).
[    5.603124] Initialise system trusted keyrings
[    5.607587] Key type blacklist registered
[    5.611704] workingset: timestamp_bits=36 max_order=23 bucket_order=0
[    5.619189] zbud: loaded
[    5.677994] NET: Registered protocol family 38
[    5.682506] Key type asymmetric registered
[    5.686639] Asymmetric key parser 'x509' registered
[    5.691573] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 246)
[    5.699044] io scheduler noop registered
[    5.702979] io scheduler deadline registered (default)
[    5.708173] io scheduler cfq registered
[    5.712015] io scheduler mq-deadline registered (default)
[    5.717425] io scheduler kyber registered
[    5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[    5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[    5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[    5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[    5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[    5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[    5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[    5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[    5.773618] PGD 0 P4D 0 
[    5.773618] Oops: 0000 [#1] SMP NOPTI
[    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Call Trace:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  __slab_alloc+0x1c/0x38
[    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  devm_kmalloc+0x28/0x60
[    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  really_probe+0x73/0x420
[    5.773618]  driver_probe_device+0x115/0x130
[    5.773618]  __driver_attach+0x103/0x110
[    5.773618]  ? driver_probe_device+0x130/0x130
[    5.773618]  bus_for_each_dev+0x67/0xc0
[    5.773618]  ? klist_add_tail+0x3b/0x70
[    5.773618]  bus_add_driver+0x41/0x260
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  driver_register+0x5b/0xe0
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  do_one_initcall+0x4e/0x1d4
[    5.773618]  ? init_setup+0x25/0x28
[    5.773618]  kernel_init_freeable+0x1c1/0x26e
[    5.773618]  ? loglevel+0x5b/0x5b
[    5.773618]  ? rest_init+0xb0/0xb0
[    5.773618]  kernel_init+0xa/0x110
[    5.773618]  ret_from_fork+0x22/0x40
[    5.773618] Modules linked in:
[    5.773618] CR2: 0000000000002088
[    5.773618] ---[ end trace 1030c9120a03d081 ]---
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Kernel panic - not syncing: Fatal exception
[    5.773618] Kernel Offset: 0x23a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    5.773618] ---[ end Kernel panic - not syncing: Fatal exception ]---
[    6.103338] ------------[ cut here ]------------
[    6.104337] sched: Unexpected reschedule of offline CPU#0!
[    6.104337] WARNING: CPU: 2 PID: 1 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x39/0x40
[    6.104337] Modules linked in:
[    6.104337] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G      D           4.20.0-rc1+ #3
[    6.104337] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    6.104337] RIP: 0010:native_smp_send_reschedule+0x39/0x40
[    6.104337] Code: 0f 92 c0 84 c0 74 15 48 8b 05 e3 a6 0e 01 be fd 00 00 00 48 8b 40 30 e9 e5 5a ba 00 89 fe 48 c7 c7 58 84 a7 a5 e8 67 e9 03 00 <0f> 0b c3 0f 1f 40 00 0f 1f 44 00 00 53 48 83 ec 20 65 48 8b 04 25
[    6.104337] RSP: 0018:ffff8c69afe03ee0 EFLAGS: 00010082
[    6.104337] RAX: 0000000000000000 RBX: ffff8c66462a0000 RCX: ffffffffa5c644c8
[    6.104337] RDX: 0000000000000001 RSI: 0000000000000092 RDI: 0000000000000046
[    6.104337] RBP: 0000000000000000 R08: 0000022b3208548f R09: 0000000000000531
[    6.104337] R10: 0000000000000000 R11: ffff8c69afe03c50 R12: ffffaa600005f8a8
[    6.104337] R13: ffffffffa4b2f720 R14: 0000000000000002 R15: ffff8c69afe1ceb8
[    6.104337] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    6.104337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.104337] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    6.104337] Call Trace:
[    6.104337]  <IRQ>
[    6.104337]  update_process_times+0x40/0x50
[    6.104337]  tick_sched_handle+0x25/0x60
[    6.104337]  tick_sched_timer+0x37/0x70
[    6.104337]  __hrtimer_run_queues+0xfb/0x270
[    6.104337]  hrtimer_interrupt+0x122/0x270
[    6.104337]  smp_apic_timer_interrupt+0x6a/0x140
[    6.104337]  apic_timer_interrupt+0xf/0x20
[    6.104337]  </IRQ>
[    6.104337] RIP: 0010:panic+0x220/0x26c
[    6.104337] Code: 83 3d 9f d9 96 01 00 74 05 e8 78 5b 02 00 48 c7 c6 80 82 40 a6 48 c7 c7 d8 36 a8 a5 31 c0 e8 d7 60 06 00 fb 66 0f 1f 44 00 00 <45> 31 e4 e8 14 5a 0d 00 4d 39 ec 7c 1e 41 83 f6 01 48 8b 05 44 d9
[    6.104337] RSP: 0018:ffffaa600005f958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[    6.104337] RAX: 0000000000000039 RBX: 0000000000000000 RCX: ffffffffa5c644c8
[    6.104337] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000046
[    6.104337] RBP: ffffaa600005f9c8 R08: 0000022b30c5221e R09: 000000000000052f
[    6.104337] R10: 0000000000000000 R11: ffffaa600005f6d0 R12: ffffffffa5a6fa51
[    6.104337] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[    6.104337]  ? panic+0x219/0x26c
[    6.104337]  oops_end+0xc3/0xe0
[    6.104337]  no_context+0x1b2/0x3e0
[    6.104337]  ? __switch_to_asm+0x40/0x70
[    6.104337]  ? __switch_to_asm+0x34/0x70
[    6.104337]  do_page_fault+0x32/0x140
[    6.104337]  ? __switch_to_asm+0x40/0x70
[    6.104337]  page_fault+0x1e/0x30
[    6.104337] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    6.104337] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    6.104337] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    6.104337] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    6.104337] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    6.104337] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    6.104337] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    6.104337] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    6.104337]  new_slab+0xa9/0x570
[    6.104337]  ___slab_alloc+0x375/0x540
[    6.104337]  ? pinctrl_bind_pins+0x2b/0x2a0
[    6.104337]  __slab_alloc+0x1c/0x38
[    6.104337]  __kmalloc_node_track_caller+0xc8/0x270
[    6.104337]  ? pinctrl_bind_pins+0x2b/0x2a0
[    6.104337]  devm_kmalloc+0x28/0x60
[    6.104337]  pinctrl_bind_pins+0x2b/0x2a0
[    6.104337]  really_probe+0x73/0x420
[    6.104337]  driver_probe_device+0x115/0x130
[    6.104337]  __driver_attach+0x103/0x110
[    6.104337]  ? driver_probe_device+0x130/0x130
[    6.104337]  bus_for_each_dev+0x67/0xc0
[    6.104337]  ? klist_add_tail+0x3b/0x70
[    6.104337]  bus_add_driver+0x41/0x260
[    6.104337]  ? pcie_port_setup+0x4d/0x4d
[    6.104337]  driver_register+0x5b/0xe0
[    6.104337]  ? pcie_port_setup+0x4d/0x4d
[    6.104337]  do_one_initcall+0x4e/0x1d4
[    6.104337]  ? init_setup+0x25/0x28
[    6.104337]  kernel_init_freeable+0x1c1/0x26e
[    6.104337]  ? loglevel+0x5b/0x5b
[    6.104337]  ? rest_init+0xb0/0xb0
[    6.104337]  kernel_init+0xa/0x110
[    6.104337]  ret_from_fork+0x22/0x40
[    6.104337] ---[ end trace 1030c9120a03d082 ]---
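
The register dump above can be cross-checked by hand. The bytes marked
<3b> 77 08 in the Code: line decode to "cmp 0x8(%rdi),%esi", so the
faulting access is at RDI + 8. With RDI = 0x2080 this gives 0x2088, which
matches CR2. A minimal sketch of that arithmetic follows, assuming that RDI
holds the zonelist pointer computed from a NULL NODE_DATA(nid) and that
node_zonelists sits at offset 0x2080 in this particular build (inferred
from the dump, not confirmed). The struct and function names are
hypothetical userspace stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct zoneref (illustrative only). */
struct zoneref_model {
	uint64_t zone;   /* models the 8-byte zone pointer on x86-64 */
	int zone_idx;    /* at offset 8: the field the faulting cmp reads */
};

/*
 * Zonelist pointer derived from a node base address. node_zonelists_off is
 * the ASSUMED offset of node_zonelists inside pg_data_t for this build.
 */
static uintptr_t zonelist_from_node(uintptr_t node_data, size_t node_zonelists_off)
{
	return node_data + node_zonelists_off;
}

/* Address touched by "cmp 0x8(%rdi),%esi" for a given zonelist pointer. */
static uintptr_t fault_address(uintptr_t zonelist_ptr)
{
	return zonelist_ptr + offsetof(struct zoneref_model, zone_idx);
}

/* Abort if the reported RDI and CR2 do not fit the model above. */
static void check_oops(uintptr_t rdi, uintptr_t cr2)
{
	assert(fault_address(rdi) == cr2);
}
```

Calling check_oops(zonelist_from_node(0, 0x2080), 0x2088) passes for the
values in this report, which is consistent with the node's pg_data_t
pointer being NULL at the time of the allocation.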
Pingfan Liu Dec. 5, 2018, 5:49 a.m. UTC | #15
On Tue, Dec 4, 2018 at 3:16 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Tue, Dec 4, 2018 at 11:53 AM David Rientjes <rientjes@google.com> wrote:
> >
> > On Tue, 4 Dec 2018, Pingfan Liu wrote:
> >
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 76f8db0..8324953 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
> > >   */
> > >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > >  {
> > > +     if (unlikely(!node_online(nid)))
> > > +             nid = first_online_node;
> > >       return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > >  }
> > >
> >
> > So we're passing the node id from dev_to_node() to kmalloc which
> > interprets that as the preferred node and then does node_zonelist() to
> > find the zonelist at allocation time.
> >
> > What happens if we fix this in alloc_dr()?  Does anything else cause
> > problems?
> >
> I think it is better to fix it in mm, since that protects against any new
> similar bug in the future, while fixing it in alloc_dr() only works for now.
>
> > And rather than using first_online_node, would next_online_node() work?
> >
> What is the gain? Is it for memory pressure on node0?
>
Maybe I get your point now. Are you trying to make a cheap guess at the
nearest neighbour of this node?

Thanks,
Pingfan
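
A minimal userspace sketch of the two fallback policies discussed above; it is
not kernel code. The toy_* helpers and the toy_online map are hypothetical
stand-ins for node_online(), first_online_node and next_online_node(), and
which nodes are online is an assumption supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 8 /* assumption: the 8-node machine from this thread */

/* Hypothetical stand-in for the kernel's online-node bitmap. */
static bool toy_online[MAX_NUMNODES];

static bool toy_node_online(int nid)
{
    return nid >= 0 && nid < MAX_NUMNODES && toy_online[nid];
}

/* Lowest online node at or after 'from'; MAX_NUMNODES if there is none. */
static int toy_find_online(int from)
{
    for (int n = from < 0 ? 0 : from; n < MAX_NUMNODES; n++)
        if (toy_online[n])
            return n;
    return MAX_NUMNODES;
}

/* Policy of the posted patch: an offline nid falls back to the first online node. */
static int fallback_first(int nid)
{
    int first = toy_find_online(0);

    assert(first < MAX_NUMNODES); /* at least one node must be online */
    return toy_node_online(nid) ? nid : first;
}

/* Alternative raised in review: take the next online node, wrapping to the first. */
static int fallback_next(int nid)
{
    int n;

    if (toy_node_online(nid))
        return nid;
    n = toy_find_online(nid + 1);
    return n < MAX_NUMNODES ? n : fallback_first(nid);
}
```

With only nodes 1 and 5 marked online (an assumption for illustration), a
request for node 2 resolves to node 1 under the first policy and to node 5
under the second. Neither policy consults NUMA distance, so neither is
guaranteed to pick the nearest node.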
Pingfan Liu Dec. 5, 2018, 5:50 a.m. UTC | #16
On Tue, Dec 4, 2018 at 5:09 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Dec 04, 2018 at 04:52:52PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 4:34 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
> >> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >> >>
> >> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >> >> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> >> >> >kernel failed to bootup, because some node's data struct can not be allocated,
> >> >> >e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> >> >> >device->numa_node info is used as preferred_nid param for
> >> >>
> >> >> could we fix the preferred_nid before passed to
> >> >> __alloc_pages_nodemask()?
> >> >>
> >> >Yes, we can do it too, but what is the gain?
> >>
> >> node_zonelist() is used some places. If we are sure where the problem
> >> is, it is not necessary to spread to other places.
> >>
> >> >
> >> >> BTW, I don't catch the function call flow to this point. Would you mind
> >> >> giving me some hint?
> >> >>
> >> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
> >>
> >> slab_alloc() pass NUMA_NO_NODE down, so I am lost in where the
> >> preferred_nid is assigned.
> >>
> >You can follow:
> >[    5.773618]  new_slab+0xa9/0x570
> >[    5.773618]  ___slab_alloc+0x375/0x540
> >[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> >where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> >
>
> Well, thanks for your patience, but I still don't get it.
>
> new_slab(node)
>     allocate_slab(node)
>        alloc_slab_page(node)
>            if (node == NUMA_NO_NODE)
>                alloc_pages()
>            else
>                __alloc_pages_node(node)
>
> As you mentioned, this starts from slab_alloc(), which passes NUMA_NO_NODE.
> This means it goes to alloc_pages() and then alloc_pages_current() ->
> __alloc_pages_nodemask(). Here we use policy_node() to get the
> preferred_nid.
>
> I didn't catch the relationship between policy_node() and
> device->numa_node. Maybe I got something wrong somewhere. Would you mind
> sharing more?
>
I have uploaded the full panic log. Enjoy it.

Regards,
Pingfan
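
The branch in the call flow quoted above can be sketched as follows. In the
posted trace the allocation enters through __kmalloc_node_track_caller() from
devm_kmalloc(), which suggests that an explicit node, not NUMA_NO_NODE,
reaches new_slab(). This is a toy model of that one decision only; the enum
and the function name alloc_slab_page_path() are hypothetical.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1) /* same value the kernel uses */

enum alloc_path {
    PATH_ALLOC_PAGES,      /* alloc_pages(): the memory policy picks the node */
    PATH_ALLOC_PAGES_NODE, /* __alloc_pages_node(node): caller's node is the preferred nid */
};

/* Models only the NUMA_NO_NODE test in alloc_slab_page() quoted above. */
static enum alloc_path alloc_slab_page_path(int node)
{
    assert(node >= NUMA_NO_NODE); /* no other negative node ids are valid */
    return node == NUMA_NO_NODE ? PATH_ALLOC_PAGES : PATH_ALLOC_PAGES_NODE;
}
```

Under this model a device node of 2 takes the __alloc_pages_node() path, so
the node id is used as the preferred nid without passing through
policy_node().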
Michal Hocko Dec. 5, 2018, 9:21 a.m. UTC | #17
On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > > > device->numa_node info is used as preferred_nid param for
> > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > This patch tries to fix the issue by falling back to the first online node,
> > > > > when encountering such corner case.
> > > >
> > > > We have seen similar issues already and the bug was usually that the
> > > > zonelists were not initialized yet or the node is completely bogus.
> > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > am wondering whether the latter is the case. What is the actual node
> > > > number the device is associated with?
> > > >
> > > The device's node num is 2. And in my case, I used nr_cpus param. Due
> > > to init_cpu_to_node() initialize all the possible node.  It is hard
> > > for me to figure out without this param, how zonelists is accessed
> > > before page allocator works.
> >
> > I believe we should focus on this. Why does the node have no zonelist
> > even though all zonelists should be initialized already? Maybe this is
> > nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > Or maybe the device is associated to a non-existing node with that
> > setup. A full dmesg might help us here.
> >
> I got the machine again, and got the following without the nr_cpus option:
> [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> [root@dell-per7425-03 node]# ls
> has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> node4  node5  node6  node7  online  possible  power  uevent
> [root@dell-per7425-03 node]# cat has_cpu
> 0-7
> [root@dell-per7425-03 node]# cat has_memory
> 1,5
> [root@dell-per7425-03 node]# cat online
> 0-7
> [root@dell-per7425-03 node]# cat possible
> 0-7
> And lscpu shows the following numa-cpu info:
> NUMA node0 CPU(s):     0,8,16,24
> NUMA node1 CPU(s):     2,10,18,26
> NUMA node2 CPU(s):     4,12,20,28
> NUMA node3 CPU(s):     6,14,22,30
> NUMA node4 CPU(s):     1,9,17,25
> NUMA node5 CPU(s):     3,11,19,27
> NUMA node6 CPU(s):     5,13,21,29
> NUMA node7 CPU(s):     7,15,23,31
> 
> For the full panic message (I masked some hostname info with xx),
> please see the attachment.
> In short, it seems to be a problem with nr_cpus; without this
> option, the kernel boots up correctly.

Yep.
[    0.007418] Early memory node ranges
[    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]

There is clearly no node2. Where did the driver get the node2 from?
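
The observation above can be reproduced mechanically from a dmesg capture. A
minimal sketch, assuming the log lines keep the "node N: [mem start-end]"
layout shown in this thread; the helper names early_range_node() and
nodes_with_memory() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the node id if 'line' is an entry of the "Early memory node
 * ranges" table, e.g. "  node   1: [mem 0x...-0x...]"; otherwise -1.
 */
static int early_range_node(const char *line)
{
    const char *p = strstr(line, "node ");
    unsigned long long start, end;
    int nid;

    /* Only table entries have a ':' directly after the node number. */
    if (!p || sscanf(p, "node %d: [mem %llx-%llx", &nid, &start, &end) != 3)
        return -1;
    return nid;
}

/* Bitmask of nodes that own at least one early memory range. */
static unsigned int nodes_with_memory(const char *const *lines, size_t count)
{
    unsigned int mask = 0;

    assert(lines != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int nid = early_range_node(lines[i]);

        if (nid >= 0 && nid < 32)
            mask |= 1u << nid;
    }
    return mask;
}
```

Fed the seven range lines quoted above, the mask has only bits 1 and 5 set,
which matches the has_memory output of "1,5".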
Pingfan Liu Dec. 5, 2018, 9:29 a.m. UTC | #18
On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > This patch tries to fix the issue by falling back to the first online node,
> > > > > > when encountering such corner case.
> > > > >
> > > > > We have seen similar issues already and the bug was usually that the
> > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > > am wondering whether the latter is the case. What is the actual node
> > > > > number the device is associated with?
> > > > >
> > > > The device's node num is 2. And in my case, I used nr_cpus param. Due
> > > > to init_cpu_to_node() initialize all the possible node.  It is hard
> > > > for me to figure out without this param, how zonelists is accessed
> > > > before page allocator works.
> > >
> > > I believe we should focus on this. Why does the node have no zonelist
> > > even though all zonelists should be initialized already? Maybe this is
> > > nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > Or maybe the device is associated to a non-existing node with that
> > > setup. A full dmesg might help us here.
> > >
> > I got the machine again, and got the following without the nr_cpus option:
> > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > [root@dell-per7425-03 node]# ls
> > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > node4  node5  node6  node7  online  possible  power  uevent
> > [root@dell-per7425-03 node]# cat has_cpu
> > 0-7
> > [root@dell-per7425-03 node]# cat has_memory
> > 1,5
> > [root@dell-per7425-03 node]# cat online
> > 0-7
> > [root@dell-per7425-03 node]# cat possible
> > 0-7
> > And lscpu shows the following numa-cpu info:
> > NUMA node0 CPU(s):     0,8,16,24
> > NUMA node1 CPU(s):     2,10,18,26
> > NUMA node2 CPU(s):     4,12,20,28
> > NUMA node3 CPU(s):     6,14,22,30
> > NUMA node4 CPU(s):     1,9,17,25
> > NUMA node5 CPU(s):     3,11,19,27
> > NUMA node6 CPU(s):     5,13,21,29
> > NUMA node7 CPU(s):     7,15,23,31
> >
> > For the full panic message (I masked some hostname info with xx),
> > please see the attachment.
> > In short, it seems to be a problem with nr_cpus; without this
> > option, the kernel boots up correctly.
>
> Yep.
> [    0.007418] Early memory node ranges
> [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
>
> There is clearly no node2. Where did the driver get the node2 from?
Because nr_cpus=4 is used, node2 is not instantiated by the x86 initialization code.
A normal bootup, by contrast, shows the following:
[    0.007704] Movable zone start for each node
[    0.007707] Early memory node ranges
[    0.007708]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007709]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007711]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007712]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007712]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007713]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007714]   node   5: [mem 0x0000000480000000-0x000000087effffff]
[    0.008434] Zeroed struct page in unavailable ranges: 46490 pages
[    0.008435] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
[    0.022826] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[    0.024303] ACPI: PM-Timer IO Port: 0x408
[    0.024320] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.024349] IOAPIC[0]: apic_id 128, version 33, address 0xfec00000, GSI 0-23
[    0.024354] IOAPIC[1]: apic_id 129, version 33, address 0xfd880000, GSI 24-55
[    0.024360] IOAPIC[2]: apic_id 130, version 33, address 0xea900000, GSI 56-87
[    0.024365] IOAPIC[3]: apic_id 131, version 33, address 0xdd900000, GSI 88-119
[    0.024371] IOAPIC[4]: apic_id 132, version 33, address 0xd0900000, GSI 120-151
[    0.024378] IOAPIC[5]: apic_id 133, version 33, address 0xc3900000, GSI 152-183
[    0.024385] IOAPIC[6]: apic_id 134, version 33, address 0xb6900000, GSI 184-215
[    0.024391] IOAPIC[7]: apic_id 135, version 33, address 0xa9900000, GSI 216-247
[    0.024397] IOAPIC[8]: apic_id 136, version 33, address 0x9c900000, GSI 248-279
[    0.024400] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.024402] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.024408] Using ACPI (MADT) for SMP configuration information
[    0.024410] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.024418] smpboot: Allowing 128 CPUs, 96 hotplug CPUs
[    0.024422] NODE_DATA(0) allocated [mem 0x87efa1000-0x87efcbfff]
[    0.024424]     NODE_DATA(0) on node 5
[    0.024457] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[    0.024460] NODE_DATA(4) allocated [mem 0x87ef76000-0x87efa0fff]
[    0.024461]     NODE_DATA(4) on node 5
[    0.024494] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
[    0.024497] NODE_DATA(2) allocated [mem 0x87ef4b000-0x87ef75fff]
[    0.024498]     NODE_DATA(2) on node 5
---------------------------------------------------------------------------------->nid=2
[    0.024530] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.024533] NODE_DATA(6) allocated [mem 0x87ef20000-0x87ef4afff]
[    0.024534]     NODE_DATA(6) on node 5
[    0.024566] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
[    0.024568] NODE_DATA(3) allocated [mem 0x87eef5000-0x87ef1ffff]

Hence, this should be an issue specific to nr_cpus. The attachment is
the full message log of a normal bootup.

Thanks,
Pingfan
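
The nr_cpus=4 explanation can be checked against the lscpu table quoted
earlier in this thread. A minimal sketch, assuming the kexec'd kernel keeps
the same CPU numbering and brings up only CPUs 0-3; the table is copied from
the lscpu output and the function name nodes_with_booted_cpu() is
hypothetical.

```c
#include <assert.h>

#define NR_NODES      8
#define CPUS_PER_NODE 4

/* CPU ids per NUMA node, copied from the lscpu output in this thread. */
static const int node_cpus[NR_NODES][CPUS_PER_NODE] = {
    {0, 8, 16, 24}, {2, 10, 18, 26}, {4, 12, 20, 28}, {6, 14, 22, 30},
    {1, 9, 17, 25}, {3, 11, 19, 27}, {5, 13, 21, 29}, {7, 15, 23, 31},
};

/* Bitmask of nodes that own at least one CPU with id below nr_cpus. */
static unsigned int nodes_with_booted_cpu(int nr_cpus)
{
    unsigned int mask = 0;

    assert(nr_cpus > 0);
    for (int n = 0; n < NR_NODES; n++)
        for (int i = 0; i < CPUS_PER_NODE; i++)
            if (node_cpus[n][i] < nr_cpus)
                mask |= 1u << n;
    return mask;
}
```

Under these assumptions, nr_cpus=4 leaves only nodes 0, 1, 4 and 5 with a
booted CPU. Node 2, whose lowest CPU id is 4, has none, which is consistent
with it not being set up via init_cpu_to_node().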
[    0.000000] Linux version 4.20.0-rc5+
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.20.0-rc5+ root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=auto rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 LANG=en_US.UTF-8 nokaslr
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000005c3d6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000000-0x000000000008efff] usable
[    0.000000] reserve setup_data: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000004425301f] usable
[    0.000000] reserve setup_data: [mem 0x0000000044253020-0x000000004426b65f] usable
[    0.000000] reserve setup_data: [mem 0x000000004426b660-0x000000004426c01f] usable
[    0.000000] reserve setup_data: [mem 0x000000004426c020-0x000000004429cc5f] usable
[    0.000000] reserve setup_data: [mem 0x000000004429cc60-0x000000004429d01f] usable
[    0.000000] reserve setup_data: [mem 0x000000004429d020-0x00000000442cdc5f] usable
[    0.000000] reserve setup_data: [mem 0x00000000442cdc60-0x00000000442ce01f] usable
[    0.000000] reserve setup_data: [mem 0x00000000442ce020-0x00000000442fec5f] usable
[    0.000000] reserve setup_data: [mem 0x00000000442fec60-0x00000000442ff01f] usable
[    0.000000] reserve setup_data: [mem 0x00000000442ff020-0x000000004432fc5f] usable
[    0.000000] reserve setup_data: [mem 0x000000004432fc60-0x000000005af8c01f] usable
[    0.000000] reserve setup_data: [mem 0x000000005af8c020-0x000000005af9405f] usable
[    0.000000] reserve setup_data: [mem 0x000000005af94060-0x000000005c3d6fff] usable
[    0.000000] reserve setup_data: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] reserve setup_data: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] reserve setup_data: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] reserve setup_data: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] efi: EFI v2.50 by Dell Inc.
[    0.000000] efi:  ACPI=0x6c527000  ACPI 2.0=0x6c527014  SMBIOS=0x6afde000  SMBIOS 3.0=0x6afdc000 
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2095.923 MHz processor
[    0.000062] last_pfn = 0x87f000 max_arch_pfn = 0x400000000
[    0.000498] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.000814] last_pfn = 0x70000 max_arch_pfn = 0x400000000
[    0.006332] Using GB pages for direct mapping
[    0.007245] Secure boot disabled
[    0.007247] RAMDISK: [mem 0x3d313000-0x3fffdfff]
[    0.007256] ACPI: Early table checksum verification disabled
[    0.007261] ACPI: RSDP 0x000000006C527014 000024 (v02 DELL  )
[    0.007265] ACPI: XSDT 0x000000006C5260E8 0000C4 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007271] ACPI: FACP 0x000000006C516000 000114 (v06 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007277] ACPI: DSDT 0x000000006C505000 00D302 (v02 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007280] ACPI: FACS 0x000000006C2F1000 000040
[    0.007283] ACPI: SSDT 0x000000006C525000 0000D2 (v02 DELL   PE_SC3   00000002 MSFT 04000000)
[    0.007286] ACPI: BERT 0x000000006C524000 000030 (v01 DELL   BERT     00000001 DELL 00000001)
[    0.007289] ACPI: HEST 0x000000006C523000 0006DC (v01 DELL   HEST     00000001 DELL 00000001)
[    0.007292] ACPI: SSDT 0x000000006C522000 0001C4 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007295] ACPI: SRAT 0x000000006C521000 0002D0 (v03 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007297] ACPI: MSCT 0x000000006C520000 0000A6 (v01 DELL   PE_SC3   00000000 AMD  00000001)
[    0.007300] ACPI: SLIT 0x000000006C51F000 00006C (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007303] ACPI: CRAT 0x000000006C51C000 002210 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007306] ACPI: CDIT 0x000000006C51B000 000068 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007308] ACPI: SSDT 0x000000006C51A000 0003C6 (v02 DELL   Tpm2Tabl 00001000 INTL 20170119)
[    0.007311] ACPI: TPM2 0x000000006C519000 000038 (v04 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007314] ACPI: EINJ 0x000000006C518000 000150 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.007317] ACPI: SLIC 0x000000006C517000 000024 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007320] ACPI: HPET 0x000000006C515000 000038 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007323] ACPI: APIC 0x000000006C514000 0004B2 (v03 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007325] ACPI: MCFG 0x000000006C513000 00003C (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.007328] ACPI: SSDT 0x000000006C504000 0005CA (v02 DELL   xhc_port 00000001 INTL 20170119)
[    0.007331] ACPI: IVRS 0x000000006C503000 000390 (v02 DELL   PE_SC3   00000001 AMD  00000000)
[    0.007334] ACPI: SSDT 0x000000006C501000 001658 (v01 AMD    CPMCMN   00000001 INTL 20170119)
[    0.007385] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.007386] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.007387] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.007388] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.007389] SRAT: PXM 1 -> APIC 0x10 -> Node 1
[    0.007390] SRAT: PXM 1 -> APIC 0x11 -> Node 1
[    0.007391] SRAT: PXM 1 -> APIC 0x18 -> Node 1
[    0.007392] SRAT: PXM 1 -> APIC 0x19 -> Node 1
[    0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
[    0.007394] SRAT: PXM 2 -> APIC 0x21 -> Node 2
[    0.007395] SRAT: PXM 2 -> APIC 0x28 -> Node 2
[    0.007396] SRAT: PXM 2 -> APIC 0x29 -> Node 2
[    0.007397] SRAT: PXM 3 -> APIC 0x30 -> Node 3
[    0.007398] SRAT: PXM 3 -> APIC 0x31 -> Node 3
[    0.007399] SRAT: PXM 3 -> APIC 0x38 -> Node 3
[    0.007400] SRAT: PXM 3 -> APIC 0x39 -> Node 3
[    0.007401] SRAT: PXM 4 -> APIC 0x40 -> Node 4
[    0.007402] SRAT: PXM 4 -> APIC 0x41 -> Node 4
[    0.007403] SRAT: PXM 4 -> APIC 0x48 -> Node 4
[    0.007403] SRAT: PXM 4 -> APIC 0x49 -> Node 4
[    0.007404] SRAT: PXM 5 -> APIC 0x50 -> Node 5
[    0.007405] SRAT: PXM 5 -> APIC 0x51 -> Node 5
[    0.007406] SRAT: PXM 5 -> APIC 0x58 -> Node 5
[    0.007407] SRAT: PXM 5 -> APIC 0x59 -> Node 5
[    0.007408] SRAT: PXM 6 -> APIC 0x60 -> Node 6
[    0.007409] SRAT: PXM 6 -> APIC 0x61 -> Node 6
[    0.007410] SRAT: PXM 6 -> APIC 0x68 -> Node 6
[    0.007411] SRAT: PXM 6 -> APIC 0x69 -> Node 6
[    0.007412] SRAT: PXM 7 -> APIC 0x70 -> Node 7
[    0.007413] SRAT: PXM 7 -> APIC 0x71 -> Node 7
[    0.007414] SRAT: PXM 7 -> APIC 0x78 -> Node 7
[    0.007414] SRAT: PXM 7 -> APIC 0x79 -> Node 7
[    0.007417] ACPI: SRAT: Node 1 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.007419] ACPI: SRAT: Node 1 PXM 1 [mem 0x00100000-0x7fffffff]
[    0.007420] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x47fffffff]
[    0.007421] ACPI: SRAT: Node 5 PXM 5 [mem 0x480000000-0x87fffffff]
[    0.007430] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.007432] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
[    0.007444] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
[    0.007465] NODE_DATA(5) allocated [mem 0x87efd2000-0x87effcfff]
[    0.007644] crashkernel: memory value expected
[    0.007697] Zone ranges:
[    0.007698]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.007700]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.007701]   Normal   [mem 0x0000000100000000-0x000000087effffff]
[    0.007702]   Device   empty
[    0.007704] Movable zone start for each node
[    0.007707] Early memory node ranges
[    0.007708]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007709]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007711]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007712]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007712]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007713]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007714]   node   5: [mem 0x0000000480000000-0x000000087effffff]
[    0.008434] Zeroed struct page in unavailable ranges: 46490 pages
[    0.008435] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
[    0.022826] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[    0.024303] ACPI: PM-Timer IO Port: 0x408
[    0.024320] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.024349] IOAPIC[0]: apic_id 128, version 33, address 0xfec00000, GSI 0-23
[    0.024354] IOAPIC[1]: apic_id 129, version 33, address 0xfd880000, GSI 24-55
[    0.024360] IOAPIC[2]: apic_id 130, version 33, address 0xea900000, GSI 56-87
[    0.024365] IOAPIC[3]: apic_id 131, version 33, address 0xdd900000, GSI 88-119
[    0.024371] IOAPIC[4]: apic_id 132, version 33, address 0xd0900000, GSI 120-151
[    0.024378] IOAPIC[5]: apic_id 133, version 33, address 0xc3900000, GSI 152-183
[    0.024385] IOAPIC[6]: apic_id 134, version 33, address 0xb6900000, GSI 184-215
[    0.024391] IOAPIC[7]: apic_id 135, version 33, address 0xa9900000, GSI 216-247
[    0.024397] IOAPIC[8]: apic_id 136, version 33, address 0x9c900000, GSI 248-279
[    0.024400] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.024402] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.024408] Using ACPI (MADT) for SMP configuration information
[    0.024410] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.024418] smpboot: Allowing 128 CPUs, 96 hotplug CPUs
[    0.024422] NODE_DATA(0) allocated [mem 0x87efa1000-0x87efcbfff]
[    0.024424]     NODE_DATA(0) on node 5
[    0.024457] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[    0.024460] NODE_DATA(4) allocated [mem 0x87ef76000-0x87efa0fff]
[    0.024461]     NODE_DATA(4) on node 5
[    0.024494] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
[    0.024497] NODE_DATA(2) allocated [mem 0x87ef4b000-0x87ef75fff]
[    0.024498]     NODE_DATA(2) on node 5
[    0.024530] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.024533] NODE_DATA(6) allocated [mem 0x87ef20000-0x87ef4afff]
[    0.024534]     NODE_DATA(6) on node 5
[    0.024566] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
[    0.024568] NODE_DATA(3) allocated [mem 0x87eef5000-0x87ef1ffff]
[    0.024569]     NODE_DATA(3) on node 5
[    0.024601] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
[    0.024604] NODE_DATA(7) allocated [mem 0x87eeca000-0x87eef4fff]
[    0.024605]     NODE_DATA(7) on node 5
[    0.024637] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]
[    0.024678] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.024680] PM: Registered nosave memory: [mem 0x0008f000-0x0008ffff]
[    0.024683] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.024685] PM: Registered nosave memory: [mem 0x44253000-0x44253fff]
[    0.024688] PM: Registered nosave memory: [mem 0x4426b000-0x4426bfff]
[    0.024688] PM: Registered nosave memory: [mem 0x4426c000-0x4426cfff]
[    0.024691] PM: Registered nosave memory: [mem 0x4429c000-0x4429cfff]
[    0.024692] PM: Registered nosave memory: [mem 0x4429d000-0x4429dfff]
[    0.024694] PM: Registered nosave memory: [mem 0x442cd000-0x442cdfff]
[    0.024695] PM: Registered nosave memory: [mem 0x442ce000-0x442cefff]
[    0.024698] PM: Registered nosave memory: [mem 0x442fe000-0x442fefff]
[    0.024699] PM: Registered nosave memory: [mem 0x442ff000-0x442fffff]
[    0.024701] PM: Registered nosave memory: [mem 0x4432f000-0x4432ffff]
[    0.024703] PM: Registered nosave memory: [mem 0x5af8c000-0x5af8cfff]
[    0.024706] PM: Registered nosave memory: [mem 0x5af94000-0x5af94fff]
[    0.024708] PM: Registered nosave memory: [mem 0x5c3d7000-0x643defff]
[    0.024711] PM: Registered nosave memory: [mem 0x68ff8000-0x6b4f7fff]
[    0.024712] PM: Registered nosave memory: [mem 0x6b4f8000-0x6c327fff]
[    0.024713] PM: Registered nosave memory: [mem 0x6c328000-0x6c527fff]
[    0.024715] PM: Registered nosave memory: [mem 0x70000000-0x8fffffff]
[    0.024716] PM: Registered nosave memory: [mem 0x90000000-0xfec0ffff]
[    0.024717] PM: Registered nosave memory: [mem 0xfec10000-0xfec10fff]
[    0.024718] PM: Registered nosave memory: [mem 0xfec11000-0xfed7ffff]
[    0.024719] PM: Registered nosave memory: [mem 0xfed80000-0xfed80fff]
[    0.024720] PM: Registered nosave memory: [mem 0xfed81000-0xffffffff]
[    0.024723] [mem 0x90000000-0xfec0ffff] available for PCI devices
[    0.024727] Booting paravirtualized kernel on bare hardware
[    0.024731] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.138543] random: get_random_bytes called from start_kernel+0x9b/0x52e with crng_init=0
[    0.138555] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:8
[    0.149797] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u262144
[    0.149978] Built 8 zonelists, mobility grouping on.  Total pages: 8142100
[    0.149980] Policy zone: Normal
[    0.149982] Kernel command line: BOOT_IMAGE=/vmlinuz-4.20.0-rc5+ root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=auto rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 LANG=en_US.UTF-8 nokaslr
[    0.174052] Memory: 1604004K/33089944K available (12292K kernel code, 2066K rwdata, 3752K rodata, 2348K init, 6532K bss, 853208K reserved, 0K cma-reserved)
[    0.174867] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=128, Nodes=8
[    0.174892] ftrace: allocating 35651 entries in 140 pages
[    0.191608] rcu: Hierarchical RCU implementation.
[    0.191612] rcu: 	RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=128.
[    0.191614] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    0.191615] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=128
[    0.194327] NR_IRQS: 524544, nr_irqs: 5800, preallocated irqs: 16
[    0.195498] Console: colour dummy device 80x25
[    1.663077] printk: console [ttyS0] enabled
[    1.667605] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    1.678833] ACPI: Core revision 20181003
[    1.683092] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    1.692262] APIC: Switch to symmetric I/O mode setup
[    2.468813] Switched APIC routing to physical flat.
[    2.476620] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    2.487315] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e36250bdc1, max_idle_ns: 440795203241 ns
[    2.497836] Calibrating delay loop (skipped), value calculated using timer frequency.. 4191.84 BogoMIPS (lpj=2095923)
[    2.498829] pid_max: default: 131072 minimum: 1024
[    2.501660] LSM: Security Framework initializing
[    2.501832] Yama: becoming mindful.
[    2.502835] SELinux:  Initializing.
[    2.510641] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[    2.514088] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[    2.514996] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    2.515924] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    2.517772] mce: CPU supports 23 MCE banks
[    2.517876] LVT offset 2 assigned for vector 0xf4
[    2.518838] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[    2.519829] Last level dTLB entries: 4KB 1536, 2MB 1536, 4MB 768, 1GB 0
[    2.520831] Spectre V2 : Mitigation: Full AMD retpoline
[    2.521828] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    2.522838] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    2.523828] Spectre V2 : User space: Vulnerable
[    2.524830] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    2.526123] Freeing SMP alternatives memory: 32K
[    2.533827] smpboot: CPU0: AMD EPYC 7251 8-Core Processor (family: 0x17, model: 0x1, stepping: 0x2)
[    2.534197] Performance Events: Fam17h core perfctr, AMD PMU driver.
[    2.534832] ... version:                0
[    2.535828] ... bit width:              48
[    2.536828] ... generic registers:      6
[    2.537829] ... value mask:             0000ffffffffffff
[    2.538829] ... max period:             00007fffffffffff
[    2.539828] ... fixed-purpose events:   0
[    2.540828] ... event mask:             000000000000003f
[    2.542924] rcu: Hierarchical SRCU implementation.
[    2.545340] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    2.548090] smp: Bringing up secondary CPUs ...
[    2.549873] x86: Booting SMP configuration:
[    2.550830] .... node  #4, CPUs:          #1
[    2.562830] .... node  #1, CPUs:     #2
[    2.565831] .... node  #5, CPUs:     #3
[    2.567831] .... node  #2, CPUs:     #4
[    2.570831] .... node  #6, CPUs:     #5
[    2.572831] .... node  #3, CPUs:     #6
[    2.575831] .... node  #7, CPUs:     #7
[    2.577831] .... node  #0, CPUs:     #8
[    2.580830] .... node  #4, CPUs:     #9
[    2.582831] .... node  #1, CPUs:    #10
[    2.585831] .... node  #5, CPUs:    #11
[    2.588832] .... node  #2, CPUs:    #12
[    2.590830] .... node  #6, CPUs:    #13
[    2.593831] .... node  #3, CPUs:    #14
[    2.595831] .... node  #7, CPUs:    #15
[    2.598830] .... node  #0, CPUs:    #16
[    2.601830] .... node  #4, CPUs:    #17
[    2.603831] .... node  #1, CPUs:    #18
[    2.605831] .... node  #5, CPUs:    #19
[    2.608830] .... node  #2, CPUs:    #20
[    2.610831] .... node  #6, CPUs:    #21
[    2.613830] .... node  #3, CPUs:    #22
[    2.616831] .... node  #7, CPUs:    #23
[    2.618831] .... node  #0, CPUs:    #24
[    2.621830] .... node  #4, CPUs:    #25
[    2.623831] .... node  #1, CPUs:    #26
[    2.625831] .... node  #5, CPUs:    #27
[    2.628830] .... node  #2, CPUs:    #28
[    2.630831] .... node  #6, CPUs:    #29
[    2.633830] .... node  #3, CPUs:    #30
[    2.636831] .... node  #7, CPUs:    #31
[    2.638000] smp: Brought up 8 nodes, 32 CPUs
[    2.639831] smpboot: Max logical packages: 8
[    2.640832] smpboot: Total of 32 processors activated (133916.25 BogoMIPS)
[    2.779838] node 1 initialised, 3570252 pages in 132ms
[    2.798853] node 5 initialised, 4087931 pages in 151ms
[    2.805940] devtmpfs: initialized
[    2.809887] x86/mm: Memory block size: 128MB
[    2.815869] PM: Registering ACPI NVS region [mem 0x0008f000-0x0008ffff] (4096 bytes)
[    2.823833] PM: Registering ACPI NVS region [mem 0x6b4f8000-0x6c327fff] (14876672 bytes)
[    2.832078] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    2.841843] futex hash table entries: 32768 (order: 9, 2097152 bytes)
[    2.849375] pinctrl core: initialized pinctrl subsystem
[    2.854937] RTC time:  4:14:15, date: 12/05/18
[    2.859282] NET: Registered protocol family 16
[    2.863935] audit: initializing netlink subsys (disabled)
[    2.869896] audit: type=2000 audit(1543983252.387:1): state=initialized audit_enabled=0 res=1
[    2.870838] cpuidle: using governor menu
[    2.875932] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    2.883833] ACPI: bus type PCI registered
[    2.887831] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    2.893925] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    2.903869] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
[    2.909842] PCI: Using configuration type 1 for base access
[    2.915839] PCI: Dell System detected, enabling pci=bfsort.
[    2.924067] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    2.924833] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    2.927289] ACPI: Added _OSI(Module Device)
[    2.930838] ACPI: Added _OSI(Processor Device)
[    2.935830] ACPI: Added _OSI(3.0 _SCP Extensions)
[    2.939829] ACPI: Added _OSI(Processor Aggregator Device)
[    2.945831] ACPI: Added _OSI(Linux-Dell-Video)
[    2.949829] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    2.959776] ACPI: 6 ACPI AML tables successfully acquired and loaded
[    2.968763] ACPI: Interpreter enabled
[    2.972839] ACPI: (supports S0 S5)
[    2.975830] ACPI: Using IOAPIC for interrupt routing
[    2.981082] HEST: Table parsing has been initialized.
[    2.985832] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    2.995952] ACPI: Enabled 2 GPEs in block 00 to 1F
[    3.010265] ACPI: PCI Interrupt Link [LNKA] (IRQs 4 5 7 10 11 14 15) *0
[    3.016867] ACPI: PCI Interrupt Link [LNKB] (IRQs 4 5 7 10 11 14 15) *0
[    3.023865] ACPI: PCI Interrupt Link [LNKC] (IRQs 4 5 7 10 11 14 15) *0
[    3.029868] ACPI: PCI Interrupt Link [LNKD] (IRQs 4 5 7 10 11 14 15) *0
[    3.036867] ACPI: PCI Interrupt Link [LNKE] (IRQs 4 5 7 10 11 14 15) *0
[    3.043869] ACPI: PCI Interrupt Link [LNKF] (IRQs 4 5 7 10 11 14 15) *0
[    3.049867] ACPI: PCI Interrupt Link [LNKG] (IRQs 4 5 7 10 11 14 15) *0
[    3.056856] ACPI: PCI Interrupt Link [LNKH] (IRQs 4 5 7 10 11 14 15) *0
[    3.064010] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-1f])
[    3.069835] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.077869] acpi PNP0A08:00: PCIe AER handled by firmware
[    3.083869] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug LTR]
[    3.090898] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.098831] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.106836] PCI host bridge to bus 0000:00
[    3.110832] pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
[    3.117830] pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
[    3.124831] pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
[    3.131831] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    3.138831] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000c3fff window]
[    3.146831] pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff window]
[    3.153831] pci_bus 0000:00: root bus resource [mem 0x000c8000-0x000cbfff window]
[    3.161831] pci_bus 0000:00: root bus resource [mem 0x000cc000-0x000cffff window]
[    3.168830] pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000d3fff window]
[    3.175831] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff window]
[    3.183831] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff window]
[    3.190832] pci_bus 0000:00: root bus resource [mem 0x000dc000-0x000dffff window]
[    3.198830] pci_bus 0000:00: root bus resource [mem 0x000e0000-0x000e3fff window]
[    3.205830] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff window]
[    3.213830] pci_bus 0000:00: root bus resource [mem 0x000e8000-0x000ebfff window]
[    3.220831] pci_bus 0000:00: root bus resource [mem 0x000ec000-0x000effff window]
[    3.228831] pci_bus 0000:00: root bus resource [mem 0x000f0000-0x000fffff window]
[    3.235830] pci_bus 0000:00: root bus resource [io  0x0d00-0x1fff window]
[    3.242830] pci_bus 0000:00: root bus resource [mem 0xeb000000-0xfebfffff window]
[    3.250831] pci_bus 0000:00: root bus resource [mem 0x10000000000-0x1df9fffffff window]
[    3.258831] pci_bus 0000:00: root bus resource [bus 00-1f]
[    3.267588] pci 0000:00:07.1: enabling Extended Tags
[    3.273049] pci 0000:00:08.1: enabling Extended Tags
[    3.284141] pci 0000:02:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:01.1 (capable of 8.000 Gb/s with 5 GT/s x2 link)
[    3.298142] pci 0000:00:01.1: PCI bridge to [bus 02]
[    3.303925] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:01.2 (capable of 8.000 Gb/s with 5 GT/s x2 link)
[    3.318084] pci 0000:00:01.2: PCI bridge to [bus 01]
[    3.323089] pci 0000:00:01.3: PCI bridge to [bus 03-04]
[    3.328865] pci_bus 0000:04: extended config space not accessible
[    3.334951] pci 0000:04:00.0: BAR 0: assigned to efifb
[    3.339842] pci 0000:03:00.0: PCI bridge to [bus 04]
[    3.344946] pci 0000:05:00.0: enabling Extended Tags
[    3.349973] pci 0000:05:00.2: enabling Extended Tags
[    3.354979] pci 0000:05:00.3: enabling Extended Tags
[    3.359943] pci 0000:00:07.1: PCI bridge to [bus 05]
[    3.365873] pci 0000:06:00.0: enabling Extended Tags
[    3.370982] pci 0000:06:00.1: enabling Extended Tags
[    3.375989] pci 0000:06:00.2: enabling Extended Tags
[    3.380947] pci 0000:00:08.1: PCI bridge to [bus 06]
[    3.386263] ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 20-3f])
[    3.392833] acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.400866] acpi PNP0A08:01: PCIe AER handled by firmware
[    3.406867] acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug LTR]
[    3.413896] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.421830] acpi PNP0A08:01: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.430028] PCI host bridge to bus 0000:20
[    3.433831] pci_bus 0000:20: root bus resource [io  0x2000-0x3fff window]
[    3.440832] pci_bus 0000:20: root bus resource [mem 0xde000000-0xeaffffff window]
[    3.447831] pci_bus 0000:20: root bus resource [mem 0x1dfa0000000-0x2bf3fffffff window]
[    3.455832] pci_bus 0000:20: root bus resource [bus 20-3f]
[    3.462520] pci 0000:20:07.1: enabling Extended Tags
[    3.468153] pci 0000:20:08.1: enabling Extended Tags
[    3.473149] pci 0000:21:00.0: enabling Extended Tags
[    3.477984] pci 0000:21:00.2: enabling Extended Tags
[    3.483991] pci 0000:21:00.3: enabling Extended Tags
[    3.488948] pci 0000:20:07.1: PCI bridge to [bus 21]
[    3.494068] pci 0000:22:00.0: enabling Extended Tags
[    3.498984] pci 0000:22:00.1: enabling Extended Tags
[    3.503917] pci 0000:20:08.1: PCI bridge to [bus 22]
[    3.509055] ACPI: PCI Root Bridge [PC02] (domain 0000 [bus 40-5f])
[    3.515833] acpi PNP0A08:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.523866] acpi PNP0A08:02: PCIe AER handled by firmware
[    3.528868] acpi PNP0A08:02: _OSC: platform does not support [SHPCHotplug LTR]
[    3.535896] acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.543830] acpi PNP0A08:02: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.552028] PCI host bridge to bus 0000:40
[    3.556831] pci_bus 0000:40: root bus resource [io  0x4000-0x5fff window]
[    3.562830] pci_bus 0000:40: root bus resource [mem 0xd1000000-0xddffffff window]
[    3.570830] pci_bus 0000:40: root bus resource [mem 0x2bf40000000-0x39edfffffff window]
[    3.578831] pci_bus 0000:40: root bus resource [bus 40-5f]
[    3.584508] pci 0000:40:07.1: enabling Extended Tags
[    3.591070] pci 0000:40:08.1: enabling Extended Tags
[    3.595947] pci 0000:41:00.0: enabling Extended Tags
[    3.600981] pci 0000:41:00.2: enabling Extended Tags
[    3.605943] pci 0000:40:07.1: PCI bridge to [bus 41]
[    3.611221] pci 0000:42:00.0: enabling Extended Tags
[    3.616991] pci 0000:42:00.1: enabling Extended Tags
[    3.621950] pci 0000:40:08.1: PCI bridge to [bus 42]
[    3.627014] ACPI: PCI Root Bridge [PC03] (domain 0000 [bus 60-7f])
[    3.632832] acpi PNP0A08:03: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.640866] acpi PNP0A08:03: PCIe AER handled by firmware
[    3.646866] acpi PNP0A08:03: _OSC: platform does not support [SHPCHotplug LTR]
[    3.653896] acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.661830] acpi PNP0A08:03: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.669952] PCI host bridge to bus 0000:60
[    3.673831] pci_bus 0000:60: root bus resource [io  0x6000-0x7fff window]
[    3.680830] pci_bus 0000:60: root bus resource [mem 0xc4000000-0xd0ffffff window]
[    3.688831] pci_bus 0000:60: root bus resource [mem 0x39ee0000000-0x47e7fffffff window]
[    3.696832] pci_bus 0000:60: root bus resource [bus 60-7f]
[    3.703082] pci 0000:60:07.1: enabling Extended Tags
[    3.708832] pci 0000:60:08.1: enabling Extended Tags
[    3.714318] pci 0000:60:03.1: PCI bridge to [bus 61]
[    3.720241] pci 0000:62:00.0: enabling Extended Tags
[    3.724982] pci 0000:62:00.2: enabling Extended Tags
[    3.729907] pci 0000:60:07.1: PCI bridge to [bus 62]
[    3.734970] pci 0000:63:00.0: enabling Extended Tags
[    3.739994] pci 0000:63:00.1: enabling Extended Tags
[    3.744958] pci 0000:60:08.1: PCI bridge to [bus 63]
[    3.750028] ACPI: PCI Root Bridge [PC04] (domain 0000 [bus 80-9f])
[    3.756833] acpi PNP0A08:04: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.764867] acpi PNP0A08:04: PCIe AER handled by firmware
[    3.769870] acpi PNP0A08:04: _OSC: platform does not support [SHPCHotplug LTR]
[    3.777896] acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.784830] acpi PNP0A08:04: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.794020] PCI host bridge to bus 0000:80
[    3.797830] pci_bus 0000:80: root bus resource [io  0x8000-0x9fff window]
[    3.804831] pci_bus 0000:80: root bus resource [mem 0xb7000000-0xc3ffffff window]
[    3.811830] pci_bus 0000:80: root bus resource [mem 0x47e80000000-0x55e1fffffff window]
[    3.819831] pci_bus 0000:80: root bus resource [bus 80-9f]
[    3.826468] pci 0000:80:07.1: enabling Extended Tags
[    3.832178] pci 0000:80:08.1: enabling Extended Tags
[    3.838127] pci 0000:81:00.0: enabling Extended Tags
[    3.842975] pci 0000:81:00.2: enabling Extended Tags
[    3.847959] pci 0000:80:07.1: PCI bridge to [bus 81]
[    3.853417] pci 0000:82:00.0: enabling Extended Tags
[    3.859014] pci 0000:82:00.1: enabling Extended Tags
[    3.863887] pci 0000:80:08.1: PCI bridge to [bus 82]
[    3.869025] ACPI: PCI Root Bridge [PC05] (domain 0000 [bus a0-bf])
[    3.875833] acpi PNP0A08:05: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.883866] acpi PNP0A08:05: PCIe AER handled by firmware
[    3.888867] acpi PNP0A08:05: _OSC: platform does not support [SHPCHotplug LTR]
[    3.895896] acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    3.903830] acpi PNP0A08:05: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.912019] PCI host bridge to bus 0000:a0
[    3.916831] pci_bus 0000:a0: root bus resource [io  0xa000-0xbfff window]
[    3.922830] pci_bus 0000:a0: root bus resource [mem 0xaa000000-0xb6ffffff window]
[    3.930830] pci_bus 0000:a0: root bus resource [mem 0x55e20000000-0x63dbfffffff window]
[    3.938830] pci_bus 0000:a0: root bus resource [bus a0-bf]
[    3.944580] pci 0000:a0:07.1: enabling Extended Tags
[    3.950551] pci 0000:a0:08.1: enabling Extended Tags
[    3.956384] pci 0000:a1:00.0: enabling Extended Tags
[    3.962005] pci 0000:a1:00.2: enabling Extended Tags
[    3.966956] pci 0000:a0:07.1: PCI bridge to [bus a1]
[    3.971990] pci 0000:a2:00.0: enabling Extended Tags
[    3.976901] pci 0000:a2:00.1: enabling Extended Tags
[    3.981963] pci 0000:a0:08.1: PCI bridge to [bus a2]
[    3.987049] ACPI: PCI Root Bridge [PC06] (domain 0000 [bus c0-df])
[    3.993833] acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    4.001866] acpi PNP0A08:06: PCIe AER handled by firmware
[    4.006867] acpi PNP0A08:06: _OSC: platform does not support [SHPCHotplug LTR]
[    4.014896] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    4.021830] acpi PNP0A08:06: FADT indicates ASPM is unsupported, using BIOS configuration
[    4.030021] PCI host bridge to bus 0000:c0
[    4.034831] pci_bus 0000:c0: root bus resource [io  0xc000-0xdfff window]
[    4.041831] pci_bus 0000:c0: root bus resource [mem 0x9d000000-0xa9ffffff window]
[    4.048830] pci_bus 0000:c0: root bus resource [mem 0x63dc0000000-0x71d5fffffff window]
[    4.056832] pci_bus 0000:c0: root bus resource [bus c0-df]
[    4.062909] pci 0000:c0:07.1: enabling Extended Tags
[    4.068161] pci 0000:c0:08.1: enabling Extended Tags
[    4.074014] pci 0000:c1:00.0: enabling Extended Tags
[    4.078994] pci 0000:c1:00.2: enabling Extended Tags
[    4.083956] pci 0000:c0:07.1: PCI bridge to [bus c1]
[    4.088987] pci 0000:c2:00.0: enabling Extended Tags
[    4.095004] pci 0000:c2:00.1: enabling Extended Tags
[    4.099960] pci 0000:c0:08.1: PCI bridge to [bus c2]
[    4.105021] ACPI: PCI Root Bridge [PC07] (domain 0000 [bus e0-ff])
[    4.110831] acpi PNP0A08:07: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    4.118866] acpi PNP0A08:07: PCIe AER handled by firmware
[    4.124868] acpi PNP0A08:07: _OSC: platform does not support [SHPCHotplug LTR]
[    4.131878] acpi PNP0A08:07: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    4.139830] acpi PNP0A08:07: FADT indicates ASPM is unsupported, using BIOS configuration
[    4.147842] acpi PNP0A08:07: host bridge window [mem 0x71d60000000-0xffffffffffff window] ([0x80000000000-0xffffffffffff] ignored, not CPU addressable)
[    4.161887] PCI host bridge to bus 0000:e0
[    4.165830] pci_bus 0000:e0: root bus resource [io  0xe000-0xffff window]
[    4.172830] pci_bus 0000:e0: root bus resource [mem 0x90000000-0x9cffffff window]
[    4.179830] pci_bus 0000:e0: root bus resource [mem 0x71d60000000-0x7ffffffffff window]
[    4.187831] pci_bus 0000:e0: root bus resource [bus e0-ff]
[    4.194340] pci 0000:e0:07.1: enabling Extended Tags
[    4.199177] pci 0000:e0:08.1: enabling Extended Tags
[    4.204959] pci 0000:e1:00.0: enabling Extended Tags
[    4.210005] pci 0000:e1:00.2: enabling Extended Tags
[    4.215959] pci 0000:e0:07.1: PCI bridge to [bus e1]
[    4.220985] pci 0000:e2:00.0: enabling Extended Tags
[    4.226015] pci 0000:e2:00.1: enabling Extended Tags
[    4.230966] pci 0000:e0:08.1: PCI bridge to [bus e2]
[    4.238917] pci 0000:04:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    4.247850] pci 0000:04:00.0: vgaarb: bridge control possible
[    4.252831] pci 0000:04:00.0: vgaarb: setting as boot device
[    4.258831] vgaarb: loaded
[    4.261974] SCSI subsystem initialized
[    4.262849] ACPI: bus type USB registered
[    4.263845] usbcore: registered new interface driver usbfs
[    4.264836] usbcore: registered new interface driver hub
[    4.266052] usbcore: registered new device driver usb
[    4.266847] pps_core: LinuxPPS API ver. 1 registered
[    4.267829] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    4.268832] PTP clock support registered
[    4.270027] EDAC MC: Ver: 3.0.0
[    4.272108] Registered efivars operations
[    4.449116] PCI: Using ACPI for IRQ routing
[    4.470579] NetLabel: Initializing
[    4.470829] NetLabel:  domain hash size = 128
[    4.471829] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    4.472846] NetLabel:  unlabeled traffic allowed by default
[    4.473897] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    4.478831] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    4.487005] clocksource: Switched to clocksource tsc-early
[    4.502894] VFS: Disk quotas dquot_6.6.0
[    4.506920] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    4.513957] pnp: PnP ACPI init
[    4.517315] system 00:00: [mem 0x80000000-0x8fffffff] has been reserved
[    4.524597] pnp: PnP ACPI: found 4 devices
[    4.534499] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    4.543378] pci 0000:02:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    4.553287] pci 0000:02:00.1: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    4.563201] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    4.573105] pci 0000:01:00.1: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
[    4.583029] pci 0000:61:00.0: can't claim BAR 6 [mem 0xfff00000-0xffffffff pref]: no compatible bridge window
[    4.592988] pci 0000:00:01.1: BAR 14: assigned [mem 0xec200000-0xec2fffff]
[    4.599869] pci 0000:00:01.2: BAR 14: assigned [mem 0xec300000-0xec3fffff]
[    4.606745] pci 0000:02:00.0: BAR 6: assigned [mem 0xec200000-0xec23ffff pref]
[    4.613965] pci 0000:02:00.1: BAR 6: assigned [mem 0xec240000-0xec27ffff pref]
[    4.621187] pci 0000:00:01.1: PCI bridge to [bus 02]
[    4.626159] pci 0000:00:01.1:   bridge window [mem 0xec200000-0xec2fffff]
[    4.632945] pci 0000:00:01.1:   bridge window [mem 0xec100000-0xec1fffff 64bit pref]
[    4.640687] pci 0000:01:00.0: BAR 6: assigned [mem 0xec300000-0xec33ffff pref]
[    4.647912] pci 0000:01:00.1: BAR 6: assigned [mem 0xec340000-0xec37ffff pref]
[    4.655133] pci 0000:00:01.2: PCI bridge to [bus 01]
[    4.660101] pci 0000:00:01.2:   bridge window [mem 0xec300000-0xec3fffff]
[    4.666885] pci 0000:00:01.2:   bridge window [mem 0xec000000-0xec0fffff 64bit pref]
[    4.674626] pci 0000:03:00.0: PCI bridge to [bus 04]
[    4.679593] pci 0000:03:00.0:   bridge window [mem 0xf9000000-0xf98fffff]
[    4.686379] pci 0000:03:00.0:   bridge window [mem 0xeb000000-0xebffffff 64bit pref]
[    4.694124] pci 0000:00:01.3: PCI bridge to [bus 03-04]
[    4.699355] pci 0000:00:01.3:   bridge window [mem 0xf9000000-0xf98fffff]
[    4.706140] pci 0000:00:01.3:   bridge window [mem 0xeb000000-0xebffffff 64bit pref]
[    4.713882] pci 0000:00:07.1: PCI bridge to [bus 05]
[    4.718856] pci 0000:00:07.1:   bridge window [mem 0xf9b00000-0xf9dfffff]
[    4.725643] pci 0000:00:08.1: PCI bridge to [bus 06]
[    4.730613] pci 0000:00:08.1:   bridge window [mem 0xf9900000-0xf9afffff]
[    4.737484] pci 0000:20:07.1: PCI bridge to [bus 21]
[    4.742458] pci 0000:20:07.1:   bridge window [mem 0xe8200000-0xe84fffff]
[    4.749244] pci 0000:20:08.1: PCI bridge to [bus 22]
[    4.754208] pci 0000:20:08.1:   bridge window [mem 0xe8000000-0xe81fffff]
[    4.761029] pci 0000:40:07.1: PCI bridge to [bus 41]
[    4.765998] pci 0000:40:07.1:   bridge window [mem 0xdb200000-0xdb3fffff]
[    4.772794] pci 0000:40:08.1: PCI bridge to [bus 42]
[    4.777765] pci 0000:40:08.1:   bridge window [mem 0xdb000000-0xdb1fffff]
[    4.784584] pci 0000:61:00.0: BAR 6: no space for [mem size 0x00100000 pref]
[    4.791634] pci 0000:61:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
[    4.799027] pci 0000:60:03.1: PCI bridge to [bus 61]
[    4.803993] pci 0000:60:03.1:   bridge window [io  0x6000-0x6fff]
[    4.810086] pci 0000:60:03.1:   bridge window [mem 0xce400000-0xce5fffff]
[    4.816876] pci 0000:60:07.1: PCI bridge to [bus 62]
[    4.821847] pci 0000:60:07.1:   bridge window [mem 0xce200000-0xce3fffff]
[    4.828636] pci 0000:60:08.1: PCI bridge to [bus 63]
[    4.833603] pci 0000:60:08.1:   bridge window [mem 0xce000000-0xce1fffff]
[    4.840426] pci 0000:80:07.1: PCI bridge to [bus 81]
[    4.845399] pci 0000:80:07.1:   bridge window [mem 0xc1200000-0xc13fffff]
[    4.852194] pci 0000:80:08.1: PCI bridge to [bus 82]
[    4.857167] pci 0000:80:08.1:   bridge window [mem 0xc1000000-0xc11fffff]
[    4.863980] pci 0000:a0:07.1: PCI bridge to [bus a1]
[    4.868948] pci 0000:a0:07.1:   bridge window [mem 0xb4200000-0xb43fffff]
[    4.875743] pci 0000:a0:08.1: PCI bridge to [bus a2]
[    4.880715] pci 0000:a0:08.1:   bridge window [mem 0xb4000000-0xb41fffff]
[    4.887535] pci 0000:c0:07.1: PCI bridge to [bus c1]
[    4.892503] pci 0000:c0:07.1:   bridge window [mem 0xa7200000-0xa73fffff]
[    4.899293] pci 0000:c0:08.1: PCI bridge to [bus c2]
[    4.904267] pci 0000:c0:08.1:   bridge window [mem 0xa7000000-0xa71fffff]
[    4.911087] pci 0000:e0:07.1: PCI bridge to [bus e1]
[    4.916061] pci 0000:e0:07.1:   bridge window [mem 0x9a200000-0x9a3fffff]
[    4.922849] pci 0000:e0:08.1: PCI bridge to [bus e2]
[    4.927822] pci 0000:e0:08.1:   bridge window [mem 0x9a000000-0x9a1fffff]
[    4.934759] NET: Registered protocol family 2
[    4.939811] tcp_listen_portaddr_hash hash table entries: 16384 (order: 6, 262144 bytes)
[    4.947938] TCP established hash table entries: 262144 (order: 9, 2097152 bytes)
[    4.955716] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    4.962585] TCP: Hash tables configured (established 262144 bind 65536)
[    4.969386] UDP hash table entries: 16384 (order: 7, 524288 bytes)
[    4.975673] UDP-Lite hash table entries: 16384 (order: 7, 524288 bytes)
[    4.982735] NET: Registered protocol family 1
[    4.987888] Unpacking initramfs...
[    5.637866] Freeing initrd memory: 45996K
[    5.641983] AMD-Vi: IOMMU performance counters supported
[    5.647388] AMD-Vi: IOMMU performance counters supported
[    5.652747] AMD-Vi: IOMMU performance counters supported
[    5.658099] AMD-Vi: IOMMU performance counters supported
[    5.663455] AMD-Vi: IOMMU performance counters supported
[    5.668811] AMD-Vi: IOMMU performance counters supported
[    5.674163] AMD-Vi: IOMMU performance counters supported
[    5.679518] AMD-Vi: IOMMU performance counters supported
[    5.686810] iommu: Adding device 0000:00:01.0 to group 0
[    5.693219] iommu: Adding device 0000:00:01.1 to group 1
[    5.699651] iommu: Adding device 0000:00:01.2 to group 2
[    5.705932] iommu: Adding device 0000:00:01.3 to group 3
[    5.712504] iommu: Adding device 0000:00:02.0 to group 4
[    5.718835] iommu: Adding device 0000:00:03.0 to group 5
[    5.725319] iommu: Adding device 0000:00:04.0 to group 6
[    5.731697] iommu: Adding device 0000:00:07.0 to group 7
[    5.738156] iommu: Adding device 0000:00:07.1 to group 8
[    5.744387] iommu: Adding device 0000:00:08.0 to group 9
[    5.750998] iommu: Adding device 0000:00:08.1 to group 10
[    5.757319] iommu: Adding device 0000:00:14.0 to group 11
[    5.762767] iommu: Adding device 0000:00:14.3 to group 11
[    5.769215] iommu: Adding device 0000:00:18.0 to group 12
[    5.774668] iommu: Adding device 0000:00:18.1 to group 12
[    5.780098] iommu: Adding device 0000:00:18.2 to group 12
[    5.785523] iommu: Adding device 0000:00:18.3 to group 12
[    5.790951] iommu: Adding device 0000:00:18.4 to group 12
[    5.796381] iommu: Adding device 0000:00:18.5 to group 12
[    5.801807] iommu: Adding device 0000:00:18.6 to group 12
[    5.807234] iommu: Adding device 0000:00:18.7 to group 12
[    5.813630] iommu: Adding device 0000:00:19.0 to group 13
[    5.819089] iommu: Adding device 0000:00:19.1 to group 13
[    5.824518] iommu: Adding device 0000:00:19.2 to group 13
[    5.829956] iommu: Adding device 0000:00:19.3 to group 13
[    5.835386] iommu: Adding device 0000:00:19.4 to group 13
[    5.840812] iommu: Adding device 0000:00:19.5 to group 13
[    5.846236] iommu: Adding device 0000:00:19.6 to group 13
[    5.851661] iommu: Adding device 0000:00:19.7 to group 13
[    5.858207] iommu: Adding device 0000:00:1a.0 to group 14
[    5.863664] iommu: Adding device 0000:00:1a.1 to group 14
[    5.869095] iommu: Adding device 0000:00:1a.2 to group 14
[    5.874527] iommu: Adding device 0000:00:1a.3 to group 14
[    5.879953] iommu: Adding device 0000:00:1a.4 to group 14
[    5.885378] iommu: Adding device 0000:00:1a.5 to group 14
[    5.890804] iommu: Adding device 0000:00:1a.6 to group 14
[    5.896232] iommu: Adding device 0000:00:1a.7 to group 14
[    5.902764] iommu: Adding device 0000:00:1b.0 to group 15
[    5.908224] iommu: Adding device 0000:00:1b.1 to group 15
[    5.913655] iommu: Adding device 0000:00:1b.2 to group 15
[    5.919086] iommu: Adding device 0000:00:1b.3 to group 15
[    5.924514] iommu: Adding device 0000:00:1b.4 to group 15
[    5.929947] iommu: Adding device 0000:00:1b.5 to group 15
[    5.935381] iommu: Adding device 0000:00:1b.6 to group 15
[    5.940815] iommu: Adding device 0000:00:1b.7 to group 15
[    5.947316] iommu: Adding device 0000:00:1c.0 to group 16
[    5.952774] iommu: Adding device 0000:00:1c.1 to group 16
[    5.958206] iommu: Adding device 0000:00:1c.2 to group 16
[    5.963638] iommu: Adding device 0000:00:1c.3 to group 16
[    5.969072] iommu: Adding device 0000:00:1c.4 to group 16
[    5.974506] iommu: Adding device 0000:00:1c.5 to group 16
[    5.979944] iommu: Adding device 0000:00:1c.6 to group 16
[    5.985373] iommu: Adding device 0000:00:1c.7 to group 16
[    5.991784] iommu: Adding device 0000:00:1d.0 to group 17
[    5.997238] iommu: Adding device 0000:00:1d.1 to group 17
[    6.002669] iommu: Adding device 0000:00:1d.2 to group 17
[    6.008104] iommu: Adding device 0000:00:1d.3 to group 17
[    6.013536] iommu: Adding device 0000:00:1d.4 to group 17
[    6.018970] iommu: Adding device 0000:00:1d.5 to group 17
[    6.024403] iommu: Adding device 0000:00:1d.6 to group 17
[    6.029838] iommu: Adding device 0000:00:1d.7 to group 17
[    6.036226] iommu: Adding device 0000:00:1e.0 to group 18
[    6.041682] iommu: Adding device 0000:00:1e.1 to group 18
[    6.047115] iommu: Adding device 0000:00:1e.2 to group 18
[    6.052552] iommu: Adding device 0000:00:1e.3 to group 18
[    6.057985] iommu: Adding device 0000:00:1e.4 to group 18
[    6.063416] iommu: Adding device 0000:00:1e.5 to group 18
[    6.068851] iommu: Adding device 0000:00:1e.6 to group 18
[    6.074283] iommu: Adding device 0000:00:1e.7 to group 18
[    6.080735] iommu: Adding device 0000:00:1f.0 to group 19
[    6.086195] iommu: Adding device 0000:00:1f.1 to group 19
[    6.091628] iommu: Adding device 0000:00:1f.2 to group 19
[    6.097064] iommu: Adding device 0000:00:1f.3 to group 19
[    6.102499] iommu: Adding device 0000:00:1f.4 to group 19
[    6.107932] iommu: Adding device 0000:00:1f.5 to group 19
[    6.113366] iommu: Adding device 0000:00:1f.6 to group 19
[    6.118801] iommu: Adding device 0000:00:1f.7 to group 19
[    6.125128] iommu: Adding device 0000:01:00.0 to group 20
[    6.130611] iommu: Adding device 0000:01:00.1 to group 20
[    6.137010] iommu: Adding device 0000:02:00.0 to group 21
[    6.142484] iommu: Adding device 0000:02:00.1 to group 21
[    6.148891] iommu: Adding device 0000:03:00.0 to group 22
[    6.154307] iommu: Adding device 0000:04:00.0 to group 22
[    6.160813] iommu: Adding device 0000:05:00.0 to group 23
[    6.166975] iommu: Adding device 0000:05:00.2 to group 24
[    6.173450] iommu: Adding device 0000:05:00.3 to group 25
[    6.179867] iommu: Adding device 0000:06:00.0 to group 26
[    6.186327] iommu: Adding device 0000:06:00.1 to group 27
[    6.192726] iommu: Adding device 0000:06:00.2 to group 28
[    6.199083] iommu: Adding device 0000:20:01.0 to group 29
[    6.205374] iommu: Adding device 0000:20:02.0 to group 30
[    6.211835] iommu: Adding device 0000:20:03.0 to group 31
[    6.218261] iommu: Adding device 0000:20:04.0 to group 32
[    6.224541] iommu: Adding device 0000:20:07.0 to group 33
[    6.230898] iommu: Adding device 0000:20:07.1 to group 34
[    6.237464] iommu: Adding device 0000:20:08.0 to group 35
[    6.243805] iommu: Adding device 0000:20:08.1 to group 36
[    6.250174] iommu: Adding device 0000:21:00.0 to group 37
[    6.256771] iommu: Adding device 0000:21:00.2 to group 38
[    6.263230] iommu: Adding device 0000:21:00.3 to group 39
[    6.269576] iommu: Adding device 0000:22:00.0 to group 40
[    6.275837] iommu: Adding device 0000:22:00.1 to group 41
[    6.282213] iommu: Adding device 0000:40:01.0 to group 42
[    6.288500] iommu: Adding device 0000:40:02.0 to group 43
[    6.294842] iommu: Adding device 0000:40:03.0 to group 44
[    6.301184] iommu: Adding device 0000:40:04.0 to group 45
[    6.307222] iommu: Adding device 0000:40:07.0 to group 46
[    6.313519] iommu: Adding device 0000:40:07.1 to group 47
[    6.319645] iommu: Adding device 0000:40:08.0 to group 48
[    6.325815] iommu: Adding device 0000:40:08.1 to group 49
[    6.332174] iommu: Adding device 0000:41:00.0 to group 50
[    6.338576] iommu: Adding device 0000:41:00.2 to group 51
[    6.344717] iommu: Adding device 0000:42:00.0 to group 52
[    6.351018] iommu: Adding device 0000:42:00.1 to group 53
[    6.357177] iommu: Adding device 0000:60:01.0 to group 54
[    6.363363] iommu: Adding device 0000:60:02.0 to group 55
[    6.369897] iommu: Adding device 0000:60:03.0 to group 56
[    6.376276] iommu: Adding device 0000:60:03.1 to group 57
[    6.382713] iommu: Adding device 0000:60:04.0 to group 58
[    6.388751] iommu: Adding device 0000:60:07.0 to group 59
[    6.395157] iommu: Adding device 0000:60:07.1 to group 60
[    6.401381] iommu: Adding device 0000:60:08.0 to group 61
[    6.407649] iommu: Adding device 0000:60:08.1 to group 62
[    6.414092] iommu: Adding device 0000:61:00.0 to group 63
[    6.420475] iommu: Adding device 0000:62:00.0 to group 64
[    6.426632] iommu: Adding device 0000:62:00.2 to group 65
[    6.432892] iommu: Adding device 0000:63:00.0 to group 66
[    6.439015] iommu: Adding device 0000:63:00.1 to group 67
[    6.445369] iommu: Adding device 0000:80:01.0 to group 68
[    6.451731] iommu: Adding device 0000:80:02.0 to group 69
[    6.458059] iommu: Adding device 0000:80:03.0 to group 70
[    6.464297] iommu: Adding device 0000:80:04.0 to group 71
[    6.470527] iommu: Adding device 0000:80:07.0 to group 72
[    6.476617] iommu: Adding device 0000:80:07.1 to group 73
[    6.482892] iommu: Adding device 0000:80:08.0 to group 74
[    6.489194] iommu: Adding device 0000:80:08.1 to group 75
[    6.495736] iommu: Adding device 0000:81:00.0 to group 76
[    6.502085] iommu: Adding device 0000:81:00.2 to group 77
[    6.508589] iommu: Adding device 0000:82:00.0 to group 78
[    6.514807] iommu: Adding device 0000:82:00.1 to group 79
[    6.521387] iommu: Adding device 0000:a0:01.0 to group 80
[    6.527774] iommu: Adding device 0000:a0:02.0 to group 81
[    6.534015] iommu: Adding device 0000:a0:03.0 to group 82
[    6.540315] iommu: Adding device 0000:a0:04.0 to group 83
[    6.546717] iommu: Adding device 0000:a0:07.0 to group 84
[    6.553099] iommu: Adding device 0000:a0:07.1 to group 85
[    6.559281] iommu: Adding device 0000:a0:08.0 to group 86
[    6.565801] iommu: Adding device 0000:a0:08.1 to group 87
[    6.572170] iommu: Adding device 0000:a1:00.0 to group 88
[    6.578557] iommu: Adding device 0000:a1:00.2 to group 89
[    6.584857] iommu: Adding device 0000:a2:00.0 to group 90
[    6.591376] iommu: Adding device 0000:a2:00.1 to group 91
[    6.597800] iommu: Adding device 0000:c0:01.0 to group 92
[    6.604010] iommu: Adding device 0000:c0:02.0 to group 93
[    6.610405] iommu: Adding device 0000:c0:03.0 to group 94
[    6.616747] iommu: Adding device 0000:c0:04.0 to group 95
[    6.623228] iommu: Adding device 0000:c0:07.0 to group 96
[    6.629465] iommu: Adding device 0000:c0:07.1 to group 97
[    6.635798] iommu: Adding device 0000:c0:08.0 to group 98
[    6.642019] iommu: Adding device 0000:c0:08.1 to group 99
[    6.648470] iommu: Adding device 0000:c1:00.0 to group 100
[    6.654819] iommu: Adding device 0000:c1:00.2 to group 101
[    6.661379] iommu: Adding device 0000:c2:00.0 to group 102
[    6.667797] iommu: Adding device 0000:c2:00.1 to group 103
[    6.674318] iommu: Adding device 0000:e0:01.0 to group 104
[    6.680784] iommu: Adding device 0000:e0:02.0 to group 105
[    6.687219] iommu: Adding device 0000:e0:03.0 to group 106
[    6.693767] iommu: Adding device 0000:e0:04.0 to group 107
[    6.700250] iommu: Adding device 0000:e0:07.0 to group 108
[    6.706790] iommu: Adding device 0000:e0:07.1 to group 109
[    6.713192] iommu: Adding device 0000:e0:08.0 to group 110
[    6.719518] iommu: Adding device 0000:e0:08.1 to group 111
[    6.725883] iommu: Adding device 0000:e1:00.0 to group 112
[    6.732294] iommu: Adding device 0000:e1:00.2 to group 113
[    6.738644] iommu: Adding device 0000:e2:00.0 to group 114
[    6.745053] iommu: Adding device 0000:e2:00.1 to group 115
[    6.750834] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[    6.756147] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.761460]  PPR NX GT IA GA PC GA_vAPIC
[    6.765388] AMD-Vi: Found IOMMU at 0000:20:00.2 cap 0x40
[    6.770697] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.776010]  PPR NX GT IA GA PC GA_vAPIC
[    6.779939] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
[    6.785251] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.790568]  PPR NX GT IA GA PC GA_vAPIC
[    6.794499] AMD-Vi: Found IOMMU at 0000:60:00.2 cap 0x40
[    6.799815] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.805125]  PPR NX GT IA GA PC GA_vAPIC
[    6.809053] AMD-Vi: Found IOMMU at 0000:80:00.2 cap 0x40
[    6.814363] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.819678]  PPR NX GT IA GA PC GA_vAPIC
[    6.823605] AMD-Vi: Found IOMMU at 0000:a0:00.2 cap 0x40
[    6.828915] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.834230]  PPR NX GT IA GA PC GA_vAPIC
[    6.838156] AMD-Vi: Found IOMMU at 0000:c0:00.2 cap 0x40
[    6.843468] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.848780]  PPR NX GT IA GA PC GA_vAPIC
[    6.852708] AMD-Vi: Found IOMMU at 0000:e0:00.2 cap 0x40
[    6.858021] AMD-Vi: Extended features (0xf77ef22294ada):
[    6.863334]  PPR NX GT IA GA PC GA_vAPIC
[    6.867259] AMD-Vi: Interrupt remapping enabled
[    6.871792] AMD-Vi: virtual APIC enabled
[    6.876729] AMD-Vi: Lazy IO/TLB flushing enabled
[    6.882697] amd_uncore: AMD NB counters detected
[    6.887347] amd_uncore: AMD LLC counters detected
[    6.893388] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    6.900553] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    6.907706] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    6.914865] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[    6.922022] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[    6.929176] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[    6.936336] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[    6.943493] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).
[    6.966284] Initialise system trusted keyrings
[    6.970745] Key type blacklist registered
[    6.974876] workingset: timestamp_bits=36 max_order=23 bucket_order=0
[    6.982322] zbud: loaded
[    7.054348] NET: Registered protocol family 38
[    7.058805] Key type asymmetric registered
[    7.062909] Asymmetric key parser 'x509' registered
[    7.067800] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 246)
[    7.075385] io scheduler noop registered
[    7.079317] io scheduler deadline registered (default)
[    7.084517] io scheduler cfq registered
[    7.088357] io scheduler mq-deadline registered (default)
[    7.093765] io scheduler kyber registered
[    7.098280] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[    7.106112] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[    7.112114] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[    7.118120] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[    7.124133] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[    7.130941] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[    7.137607] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[    7.143807] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[    7.150278] pcieport 0000:40:07.1: Signaling PME with IRQ 44
[    7.156378] pcieport 0000:40:08.1: Signaling PME with IRQ 46
[    7.162382] pcieport 0000:60:03.1: Signaling PME with IRQ 47
[    7.168517] pcieport 0000:60:07.1: Signaling PME with IRQ 49
[    7.174492] pcieport 0000:60:08.1: Signaling PME with IRQ 51
[    7.181111] pcieport 0000:80:07.1: Signaling PME with IRQ 53
[    7.187193] pcieport 0000:80:08.1: Signaling PME with IRQ 55
[    7.193678] pcieport 0000:a0:07.1: Signaling PME with IRQ 57
[    7.199738] pcieport 0000:a0:08.1: Signaling PME with IRQ 59
[    7.206015] pcieport 0000:c0:07.1: Signaling PME with IRQ 61
[    7.212080] pcieport 0000:c0:08.1: Signaling PME with IRQ 63
[    7.218247] pcieport 0000:e0:07.1: Signaling PME with IRQ 65
[    7.224266] pcieport 0000:e0:08.1: Signaling PME with IRQ 67
[    7.230064] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    7.236860] efifb: probing for efifb
[    7.240466] efifb: framebuffer at 0xeb000000, using 3072k, total 3072k
[    7.246999] efifb: mode is 1024x768x32, linelength=4096, pages=1
[    7.253006] efifb: scrolling: redraw
[    7.256585] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    7.277822] Console: switching to colour frame buffer device 128x48
[    7.299393] fb0: EFI VGA frame buffer device
[    7.303830] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
[    7.312182] ACPI: Power Button [PWRB]
[    7.315906] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input1
[    7.323370] ACPI: Power Button [PWRF]
[    7.328997] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
[    7.336519] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    7.363757] 00:02: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[    7.392058] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    7.400051] Non-volatile memory driver v1.3
[    7.404275] Linux agpgart interface v0.103
[    7.414431] rdac: device handler registered
[    7.418774] hp_sw: device handler registered
[    7.423055] emc: device handler registered
[    7.427339] alua: device handler registered
[    7.431564] libphy: Fixed MDIO Bus: probed
[    7.435731] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    7.442265] ehci-pci: EHCI PCI platform driver
[    7.446728] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    7.452906] ohci-pci: OHCI PCI platform driver
[    7.457370] uhci_hcd: USB Universal Host Controller Interface driver
[    7.463961] xhci_hcd 0000:05:00.3: xHCI Host Controller
[    7.469254] xhci_hcd 0000:05:00.3: new USB bus registered, assigned bus number 1
[    7.476805] xhci_hcd 0000:05:00.3: hcc params 0x0270f665 hci version 0x100 quirks 0x0000000000000410
[    7.486584] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 4.20
[    7.494847] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.502071] usb usb1: Product: xHCI Host Controller
[    7.506953] usb usb1: Manufacturer: Linux 4.20.0-rc5+ xhci-hcd
[    7.512787] usb usb1: SerialNumber: 0000:05:00.3
[    7.517502] hub 1-0:1.0: USB hub found
[    7.521266] hub 1-0:1.0: 2 ports detected
[    7.525547] xhci_hcd 0000:05:00.3: xHCI Host Controller
[    7.530824] xhci_hcd 0000:05:00.3: new USB bus registered, assigned bus number 2
[    7.538255] xhci_hcd 0000:05:00.3: Host supports USB 3.0  SuperSpeed
[    7.544631] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    7.552732] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 4.20
[    7.560996] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.568214] usb usb2: Product: xHCI Host Controller
[    7.573097] usb usb2: Manufacturer: Linux 4.20.0-rc5+ xhci-hcd
[    7.578931] usb usb2: SerialNumber: 0000:05:00.3
[    7.583637] hub 2-0:1.0: USB hub found
[    7.587405] hub 2-0:1.0: 2 ports detected
[    7.591749] xhci_hcd 0000:21:00.3: xHCI Host Controller
[    7.597055] xhci_hcd 0000:21:00.3: new USB bus registered, assigned bus number 3
[    7.604574] xhci_hcd 0000:21:00.3: hcc params 0x0270f665 hci version 0x100 quirks 0x0000000000000410
[    7.614145] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 4.20
[    7.622413] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.629635] usb usb3: Product: xHCI Host Controller
[    7.634520] usb usb3: Manufacturer: Linux 4.20.0-rc5+ xhci-hcd
[    7.640352] usb usb3: SerialNumber: 0000:21:00.3
[    7.645060] hub 3-0:1.0: USB hub found
[    7.648820] hub 3-0:1.0: 2 ports detected
[    7.653096] xhci_hcd 0000:21:00.3: xHCI Host Controller
[    7.658368] xhci_hcd 0000:21:00.3: new USB bus registered, assigned bus number 4
[    7.665769] xhci_hcd 0000:21:00.3: Host supports USB 3.0  SuperSpeed
[    7.672139] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
[    7.680249] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 4.20
[    7.688507] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.695728] usb usb4: Product: xHCI Host Controller
[    7.700609] usb usb4: Manufacturer: Linux 4.20.0-rc5+ xhci-hcd
[    7.706439] usb usb4: SerialNumber: 0000:21:00.3
[    7.711130] hub 4-0:1.0: USB hub found
[    7.714890] hub 4-0:1.0: 2 ports detected
[    7.719146] usbcore: registered new interface driver usbserial_generic
[    7.725686] usbserial: USB Serial support registered for generic
[    7.731735] i8042: PNP: No PS/2 controller found.
[    7.736490] mousedev: PS/2 mouse device common for all mice
[    7.742181] rtc_cmos 00:01: RTC can wake from S4
[    7.747100] rtc_cmos 00:01: registered as rtc0
[    7.751562] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[    7.759432] EFI Variables Facility v0.08 2004-May-17
[    7.843987] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[    7.953889] tsc: Refined TSC clocksource calibration: 2096.060 MHz
[    7.957946] hidraw: raw HID events driver (C) Jiri Kosina
[    7.962432] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e36a685b69, max_idle_ns: 440795255641 ns
[    7.967855] usbcore: registered new interface driver usbhid
[    7.983324] usbhid: USB HID core driver
[    7.987201] clocksource: Switched to clocksource tsc
[    7.987623] drop_monitor: Initializing network drop monitor service
[    7.998608] Initializing XFRM netlink socket
[    7.999778] usb 1-1: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.00
[    8.003134] NET: Registered protocol family 10
[    8.011066] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    8.023238] Segment Routing with IPv6
[    8.026931] NET: Registered protocol family 17
[    8.031382] mpls_gso: MPLS GSO support
[    8.036834] microcode: CPU0: patch_level=0x08001227
[    8.041730] microcode: CPU1: patch_level=0x08001227
[    8.046623] microcode: CPU2: patch_level=0x08001227
[    8.051511] microcode: CPU3: patch_level=0x08001227
[    8.056404] microcode: CPU4: patch_level=0x08001227
[    8.056799] hub 1-1:1.0: USB hub found
[    8.061297] microcode: CPU5: patch_level=0x08001227
[    8.065277] hub 1-1:1.0: 4 ports detected
[    8.069938] microcode: CPU6: patch_level=0x08001227
[    8.078841] microcode: CPU7: patch_level=0x08001227
[    8.083732] microcode: CPU8: patch_level=0x08001227
[    8.088624] microcode: CPU9: patch_level=0x08001227
[    8.093517] microcode: CPU10: patch_level=0x08001227
[    8.098494] microcode: CPU11: patch_level=0x08001227
[    8.103479] microcode: CPU12: patch_level=0x08001227
[    8.108464] microcode: CPU13: patch_level=0x08001227
[    8.113440] microcode: CPU14: patch_level=0x08001227
[    8.116907] usb 2-2: new SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[    8.118423] microcode: CPU15: patch_level=0x08001227
[    8.130438] microcode: CPU16: patch_level=0x08001227
[    8.135412] microcode: CPU17: patch_level=0x08001227
[    8.140028] usb 2-2: New USB device found, idVendor=0424, idProduct=5744, bcdDevice= 1.21
[    8.140385] microcode: CPU18: patch_level=0x08001227
[    8.148556] usb 2-2: New USB device strings: Mfr=2, Product=3, SerialNumber=0
[    8.153537] microcode: CPU19: patch_level=0x08001227
[    8.160665] usb 2-2: Product: USB5734
[    8.165645] microcode: CPU20: patch_level=0x08001227
[    8.169296] usb 2-2: Manufacturer: Microchip Tech
[    8.174263] microcode: CPU21: patch_level=0x08001227
[    8.183935] microcode: CPU22: patch_level=0x08001227
[    8.188916] microcode: CPU23: patch_level=0x08001227
[    8.193893] microcode: CPU24: patch_level=0x08001227
[    8.198872] microcode: CPU25: patch_level=0x08001227
[    8.200798] hub 2-2:1.0: USB hub found
[    8.203844] microcode: CPU26: patch_level=0x08001227
[    8.207773] hub 2-2:1.0: 4 ports detected
[    8.212583] microcode: CPU27: patch_level=0x08001227
[    8.221561] microcode: CPU28: patch_level=0x08001227
[    8.226533] microcode: CPU29: patch_level=0x08001227
[    8.231509] microcode: CPU30: patch_level=0x08001227
[    8.236481] microcode: CPU31: patch_level=0x08001227
[    8.241498] microcode: Microcode Update Driver: v2.2.
[    8.241529] sched_clock: Marking stable (5953779253, 2287707992)->(8729037616, -487550371)
[    8.251838] usb 1-2: new high-speed USB device number 3 using xhci_hcd
[    8.261931] registered taskstats version 1
[    8.266043] Loading compiled-in X.509 certificates
[    8.300573] Loaded X.509 cert 'Build time autogenerated kernel key: df9aa3155d8a57e9d39c7df5faa22ed567fed551'
[    8.310573] zswap: loaded using pool lzo/zbud
[    8.317622] Key type big_key registered
[    8.323785] Key type trusted registered
[    8.329670] Key type encrypted registered
[    8.333691] ima: No TPM chip found, activating TPM-bypass!
[    8.339181] ima: Allocated hash algorithm: sha1
[    8.343732] evm: Initialising EVM extended attributes:
[    8.348876] evm: security.selinux
[    8.352195] evm: security.ima
[    8.355166] evm: security.capability
[    8.358746] evm: HMAC attrs: 0x1
[    8.363377]   Magic number: 14:2:211
[    8.367211] acpi device:62: hash matches
[    8.371276] rtc_cmos 00:01: setting system clock to 2018-12-05 04:14:21 UTC (1543983261)
[    8.381080] Freeing unused decrypted memory: 2040K
[    8.386340] Freeing unused kernel image memory: 2348K
[    8.390653] usb 1-2: New USB device found, idVendor=0424, idProduct=2744, bcdDevice= 1.21
[    8.399569] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    8.406712] usb 1-2: Product: USB2734
[    8.410382] usb 1-2: Manufacturer: Microchip Tech
[    8.417844] Write protecting the kernel read-only data: 18432k
[    8.424346] Freeing unused kernel image memory: 2020K
[    8.429577] Freeing unused kernel image memory: 344K
[    8.434555] rodata_test: all tests were successful
[    8.439354] Run /init as init process
[    8.473281] hub 1-2:1.0: USB hub found
[    8.485108] hub 1-2:1.0: 4 ports detected
[    8.594913] random: fast init done
[    8.779840] usb 1-1.1: new high-speed USB device number 4 using xhci_hcd
[    8.803405] random: systemd: uninitialized urandom read (16 bytes read)
[    8.810958] random: systemd: uninitialized urandom read (16 bytes read)
[    8.826238] systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
[    8.856779] systemd[1]: Detected architecture x86-64.
[    8.861846] systemd[1]: Running in initial RAM disk.
Vlastimil Babka Dec. 5, 2018, 9:40 a.m. UTC | #19
On 12/5/18 10:29 AM, Pingfan Liu wrote:
>> [    0.007418] Early memory node ranges
>> [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
>> [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
>> [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
>> [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
>> [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
>> [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
>> [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
>>
>> There is clearly no node2. Where did the driver get the node2 from?

I don't understand these tables too much, but it seems the other nodes
exist without them:

[    0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2

Maybe the nodes are hotpluggable or something?
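
The sysfs values quoted elsewhere in this thread (online: 0-7, has_memory: 1,5) fit that reading: a node can be online without having any memory. The following is a minimal userspace sketch, not kernel code, that parses such node lists and computes which online nodes are memoryless. The helper names parse_nodelist() and memoryless_nodes() are hypothetical, and the 64-node limit is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style node list such as "0-7" or "1,5" into a bitmask.
 * Hypothetical helper for illustration only. Returns 0 on malformed
 * input or on node ids >= 64, so an error is indistinguishable from
 * an empty list in this sketch.
 */
static uint64_t parse_nodelist(const char *s)
{
	uint64_t mask = 0;

	assert(s != NULL);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi;

		if (end == s)
			return 0;		/* no digits where a node id was expected */
		hi = lo;
		if (*end == '-') {		/* range form "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return 0;
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return 0;
		for (long n = lo; n <= hi; n++)
			mask |= UINT64_C(1) << n;
		if (*end == ',')
			end++;
		s = end;
	}
	return mask;
}

/* Nodes that are online but report no memory. */
static uint64_t memoryless_nodes(const char *online, const char *has_memory)
{
	return parse_nodelist(online) & ~parse_nodelist(has_memory);
}
```

With the values from this machine, memoryless_nodes("0-7", "1,5") has bit 2 set, so node 2 is online but memoryless in the normal boot.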

> Since nr_cpus=4 is used, node2 is not instantiated by the x86 initialization code.

Indeed, nr_cpus seems to restrict what nodes we allocate and populate
zonelists for.
Michal Hocko Dec. 5, 2018, 9:43 a.m. UTC | #20
On Wed 05-12-18 17:29:31, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > > > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > > > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > > This patch tries to fix the issue by falling back to the first online node,
> > > > > > > when encountering such corner case.
> > > > > >
> > > > > > We have seen similar issues already and the bug was usually that the
> > > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > > > am wondering whether the latter is the case. What is the actual node
> > > > > > number the device is associated with?
> > > > > >
> > > > > The device's node number is 2, and in my case I used the nr_cpus
> > > > > parameter. Since init_cpu_to_node() initializes all the possible
> > > > > nodes, it is hard for me to see how, without this parameter, zonelists
> > > > > could be accessed before the page allocator works.
> > > >
> > > > I believe we should focus on this. Why does the node have no zonelist
> > > > even though all zonelists should be initialized already? Maybe this is
> > > > nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > > Or maybe the device is associated to a non-existing node with that
> > > > setup. A full dmesg might help us here.
> > > >
> > > After getting access to the machine again, I got the following without the nr_cpus option:
> > > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > > [root@dell-per7425-03 node]# ls
> > > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > > node4  node5  node6  node7  online  possible  power  uevent
> > > [root@dell-per7425-03 node]# cat has_cpu
> > > 0-7
> > > [root@dell-per7425-03 node]# cat has_memory
> > > 1,5
> > > [root@dell-per7425-03 node]# cat online
> > > 0-7
> > > [root@dell-per7425-03 node]# cat possible
> > > 0-7
> > > And lscpu shows the following numa-cpu info:
> > > NUMA node0 CPU(s):     0,8,16,24
> > > NUMA node1 CPU(s):     2,10,18,26
> > > NUMA node2 CPU(s):     4,12,20,28
> > > NUMA node3 CPU(s):     6,14,22,30
> > > NUMA node4 CPU(s):     1,9,17,25
> > > NUMA node5 CPU(s):     3,11,19,27
> > > NUMA node6 CPU(s):     5,13,21,29
> > > NUMA node7 CPU(s):     7,15,23,31
> > >
> > > For the full panic message (I masked some hostname info with xx),
> > > please see the attachment.
> > > In short, it seems to be a problem with nr_cpus: without this
> > > option, the kernel boots correctly.
> >
> > Yep.
> > [    0.007418] Early memory node ranges
> > [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> > [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> > [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> > [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> > [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> > [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> > [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> >
> > There is clearly no node2. Where did the driver get the node2 from?
> Since nr_cpus=4 is used, node2 is not instantiated by the x86 initialization code.
> For the normal boot, I get the following:
> [    0.007704] Movable zone start for each node
> [    0.007707] Early memory node ranges
> [    0.007708]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> [    0.007709]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> [    0.007711]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> [    0.007712]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> [    0.007712]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> [    0.007713]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> [    0.007714]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> [    0.008434] Zeroed struct page in unavailable ranges: 46490 pages

Hmm, this is even more interesting. So even a normal boot doesn't have
node 2. So where exactly does the device get its affinity from?

I suspect we are looking at two issues here. The first, and more
important, one is that the device has a NUMA affinity configured for a
non-existing node. The second is that nr_cpus affects the
initialization of possible nodes.
David Rientjes Dec. 5, 2018, 7 p.m. UTC | #21
On Wed, 5 Dec 2018, Pingfan Liu wrote:

> > > And rather than using first_online_node, would next_online_node() work?
> > >
> > What is the gain? Is it for memory pressure on node0?
> >
> Maybe I get your point now. Are you trying to make a cheap guess at the
> nearest neighbour of this node?
> 

It's likely better than first_online_node, but probably going to be the 
same based on the node ids that you have reported since the nodemask will 
simply wrap around back to the first node.
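
To illustrate the difference between the two choices, here is a small userspace sketch, not the kernel's nodemask API. first_online() and next_online_wrap() are hypothetical names, the online set is modelled as a 64-bit mask, and the wrap-around is coded explicitly in the sketch rather than taken from any kernel helper.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64	/* sketch-only limit, not the kernel's MAX_NUMNODES */

/* Lowest-numbered online node, or -1 if no node is online. */
static int first_online(uint64_t online)
{
	for (int n = 0; n < MAX_NODES; n++)
		if (online & (UINT64_C(1) << n))
			return n;
	return -1;
}

/*
 * First online node strictly after nid; if there is none, wrap
 * around and return the lowest-numbered online node instead.
 */
static int next_online_wrap(int nid, uint64_t online)
{
	assert(nid >= 0 && nid < MAX_NODES);
	for (int n = nid + 1; n < MAX_NODES; n++)
		if (online & (UINT64_C(1) << n))
			return n;
	return first_online(online);
}
```

With a hypothetical online mask of nodes 0, 1, 4 and 5 (0x33), a wanted node of 2 gives 4 from next_online_wrap() but 0 from first_online(). Once the wanted node is at or above the highest online node, both return the same node.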
Pingfan Liu Dec. 6, 2018, 3:07 a.m. UTC | #22
On Wed, Dec 5, 2018 at 5:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/5/18 10:29 AM, Pingfan Liu wrote:
> >> [    0.007418] Early memory node ranges
> >> [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> >> [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> >> [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> >> [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> >> [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> >> [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> >> [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> >>
> >> There is clearly no node2. Where did the driver get the node2 from?
>
> I don't understand these tables too much, but it seems the other nodes
> exist without them:
>
> [    0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
>
> Maybe the nodes are hotpluggable or something?
>
I am not sure about it either; I have only had a quick look at the ACPI
spec. I will reply in a separate email and Cc some ACPI people about it.

> > Since nr_cpus=4 is used, node2 is not instantiated by the x86 initialization code.
>
> Indeed, nr_cpus seems to restrict what nodes we allocate and populate
> zonelists for.

Yes. In init_cpu_to_node(), nr_cpus limits the set of possible CPUs,
so the for_each_possible_cpu(cpu) loop never visits a CPU on node2 and
the node is skipped in this case.
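
A minimal userspace sketch of that effect, not the kernel code itself: the CPU-to-node table is taken from the lscpu output quoted in this thread, and the sketch assumes that with nr_cpus=4 the possible CPUs are CPUs 0-3 with the same numbering as in the normal boot. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES	8
#define NR_CPUS_TOTAL	32

/*
 * From the quoted lscpu output: node0 has CPUs 0,8,16,24; node4 has
 * 1,9,17,25; node1 has 2,10,...; node5 has 3,11,...; node2 has 4,12,...;
 * node6 has 5,13,...; node3 has 6,14,...; node7 has 7,15,...
 * So the node depends only on (cpu % 8).
 */
static const int node_of_cpu_mod8[8] = { 0, 4, 1, 5, 2, 6, 3, 7 };

static int cpu_to_node_sketch(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_TOTAL);
	return node_of_cpu_mod8[cpu % 8];
}

/*
 * Mimic a loop over possible CPUs only: a node is marked initialized
 * when at least one possible CPU belongs to it.
 */
static void mark_initialized_nodes(int nr_cpus, bool initialized[NR_NODES])
{
	for (int n = 0; n < NR_NODES; n++)
		initialized[n] = false;
	for (int cpu = 0; cpu < nr_cpus && cpu < NR_CPUS_TOTAL; cpu++)
		initialized[cpu_to_node_sketch(cpu)] = true;
}

static bool node_is_initialized(int nr_cpus, int nid)
{
	bool initialized[NR_NODES];

	assert(nid >= 0 && nid < NR_NODES);
	mark_initialized_nodes(nr_cpus, initialized);
	return initialized[nid];
}
```

Under these assumptions, node_is_initialized(4, 2) is false because CPUs 0-3 sit on nodes 0, 4, 1 and 5, while node_is_initialized(32, 2) is true.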

Thanks,
Pingfan
Pingfan Liu Dec. 6, 2018, 3:34 a.m. UTC | #23
On Wed, Dec 5, 2018 at 5:43 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 05-12-18 17:29:31, Pingfan Liu wrote:
> > On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > > > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > >
> > > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > > > > > kernel failed to bootup, because some node's data struct can not be allocated,
> > > > > > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> > > > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > > > This patch tries to fix the issue by falling back to the first online node,
> > > > > > > > when encountering such corner case.
> > > > > > >
> > > > > > > We have seen similar issues already and the bug was usually that the
> > > > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > > > > am wondering whether the latter is the case. What is the actual node
> > > > > > > number the device is associated with?
> > > > > > >
> > > > > > The device's node number is 2, and in my case I used the nr_cpus
> > > > > > parameter. Since init_cpu_to_node() initializes all the possible
> > > > > > nodes, it is hard for me to see how, without this parameter, zonelists
> > > > > > could be accessed before the page allocator works.
> > > > >
> > > > > I believe we should focus on this. Why does the node have no zonelist
> > > > > even though all zonelists should be initialized already? Maybe this is
> > > > > nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > > > Or maybe the device is associated to a non-existing node with that
> > > > > setup. A full dmesg might help us here.
> > > > >
> > > > After getting access to the machine again, I got the following without the nr_cpus option:
> > > > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > > > [root@dell-per7425-03 node]# ls
> > > > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > > > node4  node5  node6  node7  online  possible  power  uevent
> > > > [root@dell-per7425-03 node]# cat has_cpu
> > > > 0-7
> > > > [root@dell-per7425-03 node]# cat has_memory
> > > > 1,5
> > > > [root@dell-per7425-03 node]# cat online
> > > > 0-7
> > > > [root@dell-per7425-03 node]# cat possible
> > > > 0-7
> > > > And lscpu shows the following numa-cpu info:
> > > > NUMA node0 CPU(s):     0,8,16,24
> > > > NUMA node1 CPU(s):     2,10,18,26
> > > > NUMA node2 CPU(s):     4,12,20,28
> > > > NUMA node3 CPU(s):     6,14,22,30
> > > > NUMA node4 CPU(s):     1,9,17,25
> > > > NUMA node5 CPU(s):     3,11,19,27
> > > > NUMA node6 CPU(s):     5,13,21,29
> > > > NUMA node7 CPU(s):     7,15,23,31
> > > >
> > > > For the full panic message (I masked some hostname info with xx),
> > > > please see the attachment.
> > > > In short, it seems to be a problem with nr_cpus: without this
> > > > option, the kernel boots correctly.
> > >
> > > Yep.
> > > [    0.007418] Early memory node ranges
> > > [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> > > [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> > > [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> > > [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> > > [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> > > [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> > > [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> > >
> > > There is clearly no node2. Where did the driver get the node2 from?
> > Since nr_cpus=4 is used, node2 is not instantiated by the x86 initialization code.
> > For the normal boot, I get the following:
> > [    0.007704] Movable zone start for each node
> > [    0.007707] Early memory node ranges
> > [    0.007708]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> > [    0.007709]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> > [    0.007711]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> > [    0.007712]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> > [    0.007712]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> > [    0.007713]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> > [    0.007714]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> > [    0.008434] Zeroed struct page in unavailable ranges: 46490 pages
>
> Hmm, this is even more interesting. So even a normal boot doesn't have
> node 2. So where exactly does the device get its affinity from?
>
I am afraid there may be some misunderstanding: for the normal
bootup, the full boot message does show that node 2 exists.
First, the following can be observed:
[    0.007385] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.007386] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.007387] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.007388] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.007389] SRAT: PXM 1 -> APIC 0x10 -> Node 1
[    0.007390] SRAT: PXM 1 -> APIC 0x11 -> Node 1
[    0.007391] SRAT: PXM 1 -> APIC 0x18 -> Node 1
[    0.007392] SRAT: PXM 1 -> APIC 0x19 -> Node 1
[    0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
[    0.007394] SRAT: PXM 2 -> APIC 0x21 -> Node 2
[    0.007395] SRAT: PXM 2 -> APIC 0x28 -> Node 2
[    0.007396] SRAT: PXM 2 -> APIC 0x29 -> Node 2
[    0.007397] SRAT: PXM 3 -> APIC 0x30 -> Node 3
[    0.007398] SRAT: PXM 3 -> APIC 0x31 -> Node 3
[    0.007399] SRAT: PXM 3 -> APIC 0x38 -> Node 3
[    0.007400] SRAT: PXM 3 -> APIC 0x39 -> Node 3
[    0.007401] SRAT: PXM 4 -> APIC 0x40 -> Node 4
[    0.007402] SRAT: PXM 4 -> APIC 0x41 -> Node 4
[    0.007403] SRAT: PXM 4 -> APIC 0x48 -> Node 4
[    0.007403] SRAT: PXM 4 -> APIC 0x49 -> Node 4
[    0.007404] SRAT: PXM 5 -> APIC 0x50 -> Node 5
[    0.007405] SRAT: PXM 5 -> APIC 0x51 -> Node 5
[    0.007406] SRAT: PXM 5 -> APIC 0x58 -> Node 5
[    0.007407] SRAT: PXM 5 -> APIC 0x59 -> Node 5
[    0.007408] SRAT: PXM 6 -> APIC 0x60 -> Node 6
[    0.007409] SRAT: PXM 6 -> APIC 0x61 -> Node 6
[    0.007410] SRAT: PXM 6 -> APIC 0x68 -> Node 6
[    0.007411] SRAT: PXM 6 -> APIC 0x69 -> Node 6
[    0.007412] SRAT: PXM 7 -> APIC 0x70 -> Node 7
[    0.007413] SRAT: PXM 7 -> APIC 0x71 -> Node 7
[    0.007414] SRAT: PXM 7 -> APIC 0x78 -> Node 7
[    0.007414] SRAT: PXM 7 -> APIC 0x79 -> Node 7

Second:
[    0.024497] NODE_DATA(2) allocated [mem 0x87ef4b000-0x87ef75fff]
[    0.024498]     NODE_DATA(2) on node 5
[    0.024530] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]

Besides these, as I pasted, on a normal bootup lscpu shows:
NUMA node2 CPU(s):     4,12,20,28
and
[root@dell-per7425-03 node]# cat has_cpu
0-7
[root@dell-per7425-03 node]# cat has_memory
1,5
So I think node 2 exists; it has CPUs but no memory.

I attach the SRAT, which shows that there are eight Proximity Domains. As I
understand it, each Proximity Domain corresponds to a NUMA node.
I have Cc'ed the ACPI folks in the hope that they can give some hints.
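
For reference, here is a minimal sketch of how one 16-byte Processor Local
APIC/SAPIC Affinity subtable from the attached dump maps to a proximity
domain. The struct and function names are made up for illustration and are
not the kernel's ACPI parser; the byte offsets follow the field layout shown
in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of a type-0 (Processor Local APIC/SAPIC Affinity) SRAT subtable. */
#define SRAT_CPU_ENTRY_LEN 16

/* Illustrative container, not a kernel structure. */
struct srat_cpu_affinity {
	uint32_t proximity_domain;	/* low 8 bits at offset 2, high 24 bits at offsets 9-11 */
	uint8_t apic_id;		/* offset 3 */
	int enabled;			/* bit 0 of the little-endian flags at offset 4 */
};

/*
 * Decode one type-0 subtable starting at p.
 * Returns 0 on success, -1 if the buffer is too short or is not a
 * type-0 entry of the expected length.
 */
int srat_decode_cpu_entry(const uint8_t *p, size_t len,
			  struct srat_cpu_affinity *out)
{
	assert(p != NULL && out != NULL);

	if (len < SRAT_CPU_ENTRY_LEN || p[0] != 0x00 || p[1] != SRAT_CPU_ENTRY_LEN)
		return -1;

	out->proximity_domain = (uint32_t)p[2] |
				((uint32_t)p[9] << 8) |
				((uint32_t)p[10] << 16) |
				((uint32_t)p[11] << 24);
	out->apic_id = p[3];
	out->enabled = p[4] & 1;
	return 0;
}
```

Fed the 16 bytes at offset 00B0 of the raw table data, this gives proximity
domain 2 and APIC ID 0x20, matching the boot line
"SRAT: PXM 2 -> APIC 0x20 -> Node 2".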

> I suspect we are looking at two issues here. The first one, and a more
> important one is that there is a NUMA affinity configured for the device
> to a non-existing node. The second one is that nr_cpus affects
> initialization of possible nodes.

The dev->numa_node info is extracted from the ACPI table; it does not
depend on whether the NUMA node has been instantiated, which may be
limited by nr_cpus. Hence the node exists, it is just not instantiated.

Thanks,
Pingfan
/*
 * Intel ACPI Component Architecture
 * AML/ASL+ Disassembler version 20160527-64
 * Copyright (c) 2000 - 2016 Intel Corporation
 * 
 * Disassembly of ./SRAT.dat, Wed Dec  5 21:56:44 2018
 *
 * ACPI Data Table [SRAT]
 *
 * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
 */

[000h 0000   4]                    Signature : "SRAT"    [System Resource Affinity Table]
[004h 0004   4]                 Table Length : 000002D0
[008h 0008   1]                     Revision : 03
[009h 0009   1]                     Checksum : 39
[00Ah 0010   6]                       Oem ID : "DELL  "
[010h 0016   8]                 Oem Table ID : "PE_SC3  "
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "AMD "
[020h 0032   4]        Asl Compiler Revision : 00000001

[024h 0036   4]               Table Revision : 00000001
[028h 0040   8]                     Reserved : 0000000000000000

[030h 0048   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[031h 0049   1]                       Length : 10

[032h 0050   1]      Proximity Domain Low(8) : 00
[033h 0051   1]                      Apic ID : 00
[034h 0052   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[038h 0056   1]              Local Sapic EID : 00
[039h 0057   3]    Proximity Domain High(24) : 000000
[03Ch 0060   4]                 Clock Domain : 00000000

[040h 0064   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[041h 0065   1]                       Length : 10

[042h 0066   1]      Proximity Domain Low(8) : 00
[043h 0067   1]                      Apic ID : 01
[044h 0068   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[048h 0072   1]              Local Sapic EID : 00
[049h 0073   3]    Proximity Domain High(24) : 000000
[04Ch 0076   4]                 Clock Domain : 00000000

[050h 0080   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[051h 0081   1]                       Length : 10

[052h 0082   1]      Proximity Domain Low(8) : 00
[053h 0083   1]                      Apic ID : 08
[054h 0084   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[058h 0088   1]              Local Sapic EID : 00
[059h 0089   3]    Proximity Domain High(24) : 000000
[05Ch 0092   4]                 Clock Domain : 00000000

[060h 0096   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[061h 0097   1]                       Length : 10

[062h 0098   1]      Proximity Domain Low(8) : 00
[063h 0099   1]                      Apic ID : 09
[064h 0100   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[068h 0104   1]              Local Sapic EID : 00
[069h 0105   3]    Proximity Domain High(24) : 000000
[06Ch 0108   4]                 Clock Domain : 00000000

[070h 0112   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[071h 0113   1]                       Length : 10

[072h 0114   1]      Proximity Domain Low(8) : 01
[073h 0115   1]                      Apic ID : 10
[074h 0116   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[078h 0120   1]              Local Sapic EID : 00
[079h 0121   3]    Proximity Domain High(24) : 000000
[07Ch 0124   4]                 Clock Domain : 00000000

[080h 0128   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[081h 0129   1]                       Length : 10

[082h 0130   1]      Proximity Domain Low(8) : 01
[083h 0131   1]                      Apic ID : 11
[084h 0132   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[088h 0136   1]              Local Sapic EID : 00
[089h 0137   3]    Proximity Domain High(24) : 000000
[08Ch 0140   4]                 Clock Domain : 00000000

[090h 0144   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[091h 0145   1]                       Length : 10

[092h 0146   1]      Proximity Domain Low(8) : 01
[093h 0147   1]                      Apic ID : 18
[094h 0148   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[098h 0152   1]              Local Sapic EID : 00
[099h 0153   3]    Proximity Domain High(24) : 000000
[09Ch 0156   4]                 Clock Domain : 00000000

[0A0h 0160   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0A1h 0161   1]                       Length : 10

[0A2h 0162   1]      Proximity Domain Low(8) : 01
[0A3h 0163   1]                      Apic ID : 19
[0A4h 0164   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0A8h 0168   1]              Local Sapic EID : 00
[0A9h 0169   3]    Proximity Domain High(24) : 000000
[0ACh 0172   4]                 Clock Domain : 00000000

[0B0h 0176   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0B1h 0177   1]                       Length : 10

[0B2h 0178   1]      Proximity Domain Low(8) : 02
[0B3h 0179   1]                      Apic ID : 20
[0B4h 0180   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0B8h 0184   1]              Local Sapic EID : 00
[0B9h 0185   3]    Proximity Domain High(24) : 000000
[0BCh 0188   4]                 Clock Domain : 00000000

[0C0h 0192   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0C1h 0193   1]                       Length : 10

[0C2h 0194   1]      Proximity Domain Low(8) : 02
[0C3h 0195   1]                      Apic ID : 21
[0C4h 0196   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0C8h 0200   1]              Local Sapic EID : 00
[0C9h 0201   3]    Proximity Domain High(24) : 000000
[0CCh 0204   4]                 Clock Domain : 00000000

[0D0h 0208   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0D1h 0209   1]                       Length : 10

[0D2h 0210   1]      Proximity Domain Low(8) : 02
[0D3h 0211   1]                      Apic ID : 28
[0D4h 0212   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0D8h 0216   1]              Local Sapic EID : 00
[0D9h 0217   3]    Proximity Domain High(24) : 000000
[0DCh 0220   4]                 Clock Domain : 00000000

[0E0h 0224   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0E1h 0225   1]                       Length : 10

[0E2h 0226   1]      Proximity Domain Low(8) : 02
[0E3h 0227   1]                      Apic ID : 29
[0E4h 0228   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0E8h 0232   1]              Local Sapic EID : 00
[0E9h 0233   3]    Proximity Domain High(24) : 000000
[0ECh 0236   4]                 Clock Domain : 00000000

[0F0h 0240   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[0F1h 0241   1]                       Length : 10

[0F2h 0242   1]      Proximity Domain Low(8) : 03
[0F3h 0243   1]                      Apic ID : 30
[0F4h 0244   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0F8h 0248   1]              Local Sapic EID : 00
[0F9h 0249   3]    Proximity Domain High(24) : 000000
[0FCh 0252   4]                 Clock Domain : 00000000

[100h 0256   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[101h 0257   1]                       Length : 10

[102h 0258   1]      Proximity Domain Low(8) : 03
[103h 0259   1]                      Apic ID : 31
[104h 0260   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[108h 0264   1]              Local Sapic EID : 00
[109h 0265   3]    Proximity Domain High(24) : 000000
[10Ch 0268   4]                 Clock Domain : 00000000

[110h 0272   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[111h 0273   1]                       Length : 10

[112h 0274   1]      Proximity Domain Low(8) : 03
[113h 0275   1]                      Apic ID : 38
[114h 0276   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[118h 0280   1]              Local Sapic EID : 00
[119h 0281   3]    Proximity Domain High(24) : 000000
[11Ch 0284   4]                 Clock Domain : 00000000

[120h 0288   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[121h 0289   1]                       Length : 10

[122h 0290   1]      Proximity Domain Low(8) : 03
[123h 0291   1]                      Apic ID : 39
[124h 0292   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[128h 0296   1]              Local Sapic EID : 00
[129h 0297   3]    Proximity Domain High(24) : 000000
[12Ch 0300   4]                 Clock Domain : 00000000

[130h 0304   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[131h 0305   1]                       Length : 10

[132h 0306   1]      Proximity Domain Low(8) : 04
[133h 0307   1]                      Apic ID : 40
[134h 0308   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[138h 0312   1]              Local Sapic EID : 00
[139h 0313   3]    Proximity Domain High(24) : 000000
[13Ch 0316   4]                 Clock Domain : 00000000

[140h 0320   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[141h 0321   1]                       Length : 10

[142h 0322   1]      Proximity Domain Low(8) : 04
[143h 0323   1]                      Apic ID : 41
[144h 0324   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[148h 0328   1]              Local Sapic EID : 00
[149h 0329   3]    Proximity Domain High(24) : 000000
[14Ch 0332   4]                 Clock Domain : 00000000

[150h 0336   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[151h 0337   1]                       Length : 10

[152h 0338   1]      Proximity Domain Low(8) : 04
[153h 0339   1]                      Apic ID : 48
[154h 0340   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[158h 0344   1]              Local Sapic EID : 00
[159h 0345   3]    Proximity Domain High(24) : 000000
[15Ch 0348   4]                 Clock Domain : 00000000

[160h 0352   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[161h 0353   1]                       Length : 10

[162h 0354   1]      Proximity Domain Low(8) : 04
[163h 0355   1]                      Apic ID : 49
[164h 0356   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[168h 0360   1]              Local Sapic EID : 00
[169h 0361   3]    Proximity Domain High(24) : 000000
[16Ch 0364   4]                 Clock Domain : 00000000

[170h 0368   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[171h 0369   1]                       Length : 10

[172h 0370   1]      Proximity Domain Low(8) : 05
[173h 0371   1]                      Apic ID : 50
[174h 0372   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[178h 0376   1]              Local Sapic EID : 00
[179h 0377   3]    Proximity Domain High(24) : 000000
[17Ch 0380   4]                 Clock Domain : 00000000

[180h 0384   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[181h 0385   1]                       Length : 10

[182h 0386   1]      Proximity Domain Low(8) : 05
[183h 0387   1]                      Apic ID : 51
[184h 0388   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[188h 0392   1]              Local Sapic EID : 00
[189h 0393   3]    Proximity Domain High(24) : 000000
[18Ch 0396   4]                 Clock Domain : 00000000

[190h 0400   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[191h 0401   1]                       Length : 10

[192h 0402   1]      Proximity Domain Low(8) : 05
[193h 0403   1]                      Apic ID : 58
[194h 0404   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[198h 0408   1]              Local Sapic EID : 00
[199h 0409   3]    Proximity Domain High(24) : 000000
[19Ch 0412   4]                 Clock Domain : 00000000

[1A0h 0416   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1A1h 0417   1]                       Length : 10

[1A2h 0418   1]      Proximity Domain Low(8) : 05
[1A3h 0419   1]                      Apic ID : 59
[1A4h 0420   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1A8h 0424   1]              Local Sapic EID : 00
[1A9h 0425   3]    Proximity Domain High(24) : 000000
[1ACh 0428   4]                 Clock Domain : 00000000

[1B0h 0432   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1B1h 0433   1]                       Length : 10

[1B2h 0434   1]      Proximity Domain Low(8) : 06
[1B3h 0435   1]                      Apic ID : 60
[1B4h 0436   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1B8h 0440   1]              Local Sapic EID : 00
[1B9h 0441   3]    Proximity Domain High(24) : 000000
[1BCh 0444   4]                 Clock Domain : 00000000

[1C0h 0448   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1C1h 0449   1]                       Length : 10

[1C2h 0450   1]      Proximity Domain Low(8) : 06
[1C3h 0451   1]                      Apic ID : 61
[1C4h 0452   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1C8h 0456   1]              Local Sapic EID : 00
[1C9h 0457   3]    Proximity Domain High(24) : 000000
[1CCh 0460   4]                 Clock Domain : 00000000

[1D0h 0464   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1D1h 0465   1]                       Length : 10

[1D2h 0466   1]      Proximity Domain Low(8) : 06
[1D3h 0467   1]                      Apic ID : 68
[1D4h 0468   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1D8h 0472   1]              Local Sapic EID : 00
[1D9h 0473   3]    Proximity Domain High(24) : 000000
[1DCh 0476   4]                 Clock Domain : 00000000

[1E0h 0480   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1E1h 0481   1]                       Length : 10

[1E2h 0482   1]      Proximity Domain Low(8) : 06
[1E3h 0483   1]                      Apic ID : 69
[1E4h 0484   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1E8h 0488   1]              Local Sapic EID : 00
[1E9h 0489   3]    Proximity Domain High(24) : 000000
[1ECh 0492   4]                 Clock Domain : 00000000

[1F0h 0496   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[1F1h 0497   1]                       Length : 10

[1F2h 0498   1]      Proximity Domain Low(8) : 07
[1F3h 0499   1]                      Apic ID : 70
[1F4h 0500   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[1F8h 0504   1]              Local Sapic EID : 00
[1F9h 0505   3]    Proximity Domain High(24) : 000000
[1FCh 0508   4]                 Clock Domain : 00000000

[200h 0512   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[201h 0513   1]                       Length : 10

[202h 0514   1]      Proximity Domain Low(8) : 07
[203h 0515   1]                      Apic ID : 71
[204h 0516   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[208h 0520   1]              Local Sapic EID : 00
[209h 0521   3]    Proximity Domain High(24) : 000000
[20Ch 0524   4]                 Clock Domain : 00000000

[210h 0528   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[211h 0529   1]                       Length : 10

[212h 0530   1]      Proximity Domain Low(8) : 07
[213h 0531   1]                      Apic ID : 78
[214h 0532   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[218h 0536   1]              Local Sapic EID : 00
[219h 0537   3]    Proximity Domain High(24) : 000000
[21Ch 0540   4]                 Clock Domain : 00000000

[220h 0544   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[221h 0545   1]                       Length : 10

[222h 0546   1]      Proximity Domain Low(8) : 07
[223h 0547   1]                      Apic ID : 79
[224h 0548   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[228h 0552   1]              Local Sapic EID : 00
[229h 0553   3]    Proximity Domain High(24) : 000000
[22Ch 0556   4]                 Clock Domain : 00000000

[230h 0560   1]                Subtable Type : 01 [Memory Affinity]
[231h 0561   1]                       Length : 28

[232h 0562   4]             Proximity Domain : 00000001
[236h 0566   2]                    Reserved1 : 0000
[238h 0568   8]                 Base Address : 0000000000000000
[240h 0576   8]               Address Length : 00000000000A0000
[248h 0584   4]                    Reserved2 : 00000000
[24Ch 0588   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
                               Hot Pluggable : 0
                                Non-Volatile : 0
[250h 0592   8]                    Reserved3 : 0000000000000000

[258h 0600   1]                Subtable Type : 01 [Memory Affinity]
[259h 0601   1]                       Length : 28

[25Ah 0602   4]             Proximity Domain : 00000001
[25Eh 0606   2]                    Reserved1 : 0000
[260h 0608   8]                 Base Address : 0000000000100000
[268h 0616   8]               Address Length : 000000007FF00000
[270h 0624   4]                    Reserved2 : 00000000
[274h 0628   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
                               Hot Pluggable : 0
                                Non-Volatile : 0
[278h 0632   8]                    Reserved3 : 0000000000000000

[280h 0640   1]                Subtable Type : 01 [Memory Affinity]
[281h 0641   1]                       Length : 28

[282h 0642   4]             Proximity Domain : 00000001
[286h 0646   2]                    Reserved1 : 0000
[288h 0648   8]                 Base Address : 0000000100000000
[290h 0656   8]               Address Length : 0000000380000000
[298h 0664   4]                    Reserved2 : 00000000
[29Ch 0668   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
                               Hot Pluggable : 0
                                Non-Volatile : 0
[2A0h 0672   8]                    Reserved3 : 0000000000000000

[2A8h 0680   1]                Subtable Type : 01 [Memory Affinity]
[2A9h 0681   1]                       Length : 28

[2AAh 0682   4]             Proximity Domain : 00000005
[2AEh 0686   2]                    Reserved1 : 0000
[2B0h 0688   8]                 Base Address : 0000000480000000
[2B8h 0696   8]               Address Length : 0000000400000000
[2C0h 0704   4]                    Reserved2 : 00000000
[2C4h 0708   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
                               Hot Pluggable : 0
                                Non-Volatile : 0
[2C8h 0712   8]                    Reserved3 : 0000000000000000

Raw Table Data: Length 720 (0x2D0)

  0000: 53 52 41 54 D0 02 00 00 03 39 44 45 4C 4C 20 20  // SRAT.....9DELL  
  0010: 50 45 5F 53 43 33 20 20 01 00 00 00 41 4D 44 20  // PE_SC3  ....AMD 
  0020: 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0030: 00 10 00 00 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0040: 00 10 00 01 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0050: 00 10 00 08 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0060: 00 10 00 09 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0070: 00 10 01 10 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0080: 00 10 01 11 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0090: 00 10 01 18 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  00A0: 00 10 01 19 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  00B0: 00 10 02 20 01 00 00 00 00 00 00 00 00 00 00 00  // ... ............
  00C0: 00 10 02 21 01 00 00 00 00 00 00 00 00 00 00 00  // ...!............
  00D0: 00 10 02 28 01 00 00 00 00 00 00 00 00 00 00 00  // ...(............
  00E0: 00 10 02 29 01 00 00 00 00 00 00 00 00 00 00 00  // ...)............
  00F0: 00 10 03 30 01 00 00 00 00 00 00 00 00 00 00 00  // ...0............
  0100: 00 10 03 31 01 00 00 00 00 00 00 00 00 00 00 00  // ...1............
  0110: 00 10 03 38 01 00 00 00 00 00 00 00 00 00 00 00  // ...8............
  0120: 00 10 03 39 01 00 00 00 00 00 00 00 00 00 00 00  // ...9............
  0130: 00 10 04 40 01 00 00 00 00 00 00 00 00 00 00 00  // ...@............
  0140: 00 10 04 41 01 00 00 00 00 00 00 00 00 00 00 00  // ...A............
  0150: 00 10 04 48 01 00 00 00 00 00 00 00 00 00 00 00  // ...H............
  0160: 00 10 04 49 01 00 00 00 00 00 00 00 00 00 00 00  // ...I............
  0170: 00 10 05 50 01 00 00 00 00 00 00 00 00 00 00 00  // ...P............
  0180: 00 10 05 51 01 00 00 00 00 00 00 00 00 00 00 00  // ...Q............
  0190: 00 10 05 58 01 00 00 00 00 00 00 00 00 00 00 00  // ...X............
  01A0: 00 10 05 59 01 00 00 00 00 00 00 00 00 00 00 00  // ...Y............
  01B0: 00 10 06 60 01 00 00 00 00 00 00 00 00 00 00 00  // ...`............
  01C0: 00 10 06 61 01 00 00 00 00 00 00 00 00 00 00 00  // ...a............
  01D0: 00 10 06 68 01 00 00 00 00 00 00 00 00 00 00 00  // ...h............
  01E0: 00 10 06 69 01 00 00 00 00 00 00 00 00 00 00 00  // ...i............
  01F0: 00 10 07 70 01 00 00 00 00 00 00 00 00 00 00 00  // ...p............
  0200: 00 10 07 71 01 00 00 00 00 00 00 00 00 00 00 00  // ...q............
  0210: 00 10 07 78 01 00 00 00 00 00 00 00 00 00 00 00  // ...x............
  0220: 00 10 07 79 01 00 00 00 00 00 00 00 00 00 00 00  // ...y............
  0230: 01 28 01 00 00 00 00 00 00 00 00 00 00 00 00 00  // .(..............
  0240: 00 00 0A 00 00 00 00 00 00 00 00 00 01 00 00 00  // ................
  0250: 00 00 00 00 00 00 00 00 01 28 01 00 00 00 00 00  // .........(......
  0260: 00 00 10 00 00 00 00 00 00 00 F0 7F 00 00 00 00  // ................
  0270: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  // ................
  0280: 01 28 01 00 00 00 00 00 00 00 00 00 01 00 00 00  // .(..............
  0290: 00 00 00 80 03 00 00 00 00 00 00 00 01 00 00 00  // ................
  02A0: 00 00 00 00 00 00 00 00 01 28 05 00 00 00 00 00  // .........(......
  02B0: 00 00 00 80 04 00 00 00 00 00 00 00 04 00 00 00  // ................
  02C0: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  // ................
Michal Hocko Dec. 6, 2018, 7:23 a.m. UTC | #24
On Thu 06-12-18 11:34:30, Pingfan Liu wrote:
[...]
> > I suspect we are looking at two issues here. The first one, and a more
> > important one is that there is a NUMA affinity configured for the device
> > to a non-existing node. The second one is that nr_cpus affects
> > initialization of possible nodes.
> 
> The dev->numa_node info is extracted from the ACPI table; it does not
> depend on whether the NUMA node has been instantiated, which may be
> limited by nr_cpus. Hence the node exists, it is just not instantiated.

Hmm, binding to a memoryless node is quite dubious. But OK. I am not sure
how much sanitization we can do. We need to fall back anyway, so we had
better make sure that all possible nodes are initialized regardless of
nr_cpus. I will look into that.
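
A userspace sketch of the fallback idea from the patch subject (use the
first online node when the wanted node is not online). pick_alloc_node() and
the bitmask representation are hypothetical stand-ins, not kernel API; the
real allocator works with node_zonelist() and nodemasks.

```c
#include <assert.h>

#define MAX_NODES 8

/*
 * Hypothetical helper: bit N of online_mask is set when node N is online.
 * Returns preferred_nid if that node is online, otherwise the
 * lowest-numbered online node, or -1 when no node is online at all.
 */
int pick_alloc_node(unsigned int online_mask, int preferred_nid)
{
	assert(preferred_nid >= 0 && preferred_nid < MAX_NODES);

	if (online_mask & (1u << preferred_nid))
		return preferred_nid;

	for (int nid = 0; nid < MAX_NODES; nid++)
		if (online_mask & (1u << nid))
			return nid;

	return -1;
}
```

With nodes 0, 1, 4 and 5 online (mask 0x33), a request for node 2 is
redirected to node 0 rather than touching node data that was never set up.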
Michal Hocko Dec. 6, 2018, 8:28 a.m. UTC | #25
On Thu 06-12-18 11:07:33, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/5/18 10:29 AM, Pingfan Liu wrote:
> > >> [    0.007418] Early memory node ranges
> > >> [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> > >> [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> > >> [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> > >> [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> > >> [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> > >> [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> > >> [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> > >>
> > >> There is clearly no node2. Where did the driver get the node2 from?
> >
> > I don't understand these tables too much, but it seems the other nodes
> > exist without them:
> >
> > [    0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
> >
> > Maybe the nodes are hotpluggable or something?
> >
> I am also not sure about it, and have only had a hurried look at the ACPI
> spec. I will reply in another email, and have Cc'ed some ACPI folks about it.
> 
> > > With nr_cpus=4, node 2 is not instantiated by the x86 initialization code.
> >
> > Indeed, nr_cpus seems to restrict what nodes we allocate and populate
> > zonelists for.
> 
> Yes, in init_cpu_to_node(): nr_cpus limits the possible CPUs, which
> affects the for_each_possible_cpu(cpu) loop, so node 2 is skipped
> in this case.

Thanks for pointing this out. It made my life easier. So I think the
bug is that we call init_memory_less_node from this path. I suspect
numa_register_memblks is the right place to do this. So I admit I
am not 100% sure but could you give this a try please?

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
@@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (!end)
+			init_memory_less_node(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
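
To illustrate why node 2 is never set up with nr_cpus=4 while the
initialization still hangs off the for_each_possible_cpu() loop, here is a
small userspace model of that loop. The CPU-to-node pattern is taken from the
lscpu output earlier in the thread; the function names are invented for the
sketch, and it assumes the kexec'ed kernel numbers its CPUs the same way and
keeps CPUs 0..nr_cpus-1 as the possible ones.

```c
#include <assert.h>

#define NR_NODES	8
#define NR_CPUS_TOTAL	32

/*
 * lscpu in this thread: node0 has CPUs 0,8,16,24; node4 has 1,9,17,25;
 * node1 has 2,10,18,26; and so on. Indexed by (cpu % 8).
 */
static const int cpu_node_pattern[NR_NODES] = { 0, 4, 1, 5, 2, 6, 3, 7 };

/* Model of the early cpu-to-node lookup for this particular machine. */
int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_TOTAL);
	return cpu_node_pattern[cpu % NR_NODES];
}

/*
 * Returns a bitmask of the nodes visited when only CPUs 0..nr_cpus-1 are
 * possible. Nodes outside the mask would never have reached
 * init_memory_less_node() from the CPU loop.
 */
unsigned int nodes_reached_by_cpu_loop(int nr_cpus)
{
	unsigned int mask = 0;

	for (int cpu = 0; cpu < nr_cpus && cpu < NR_CPUS_TOTAL; cpu++)
		mask |= 1u << model_cpu_to_node(cpu);

	return mask;
}
```

Under these assumptions, nodes_reached_by_cpu_loop(4) is 0x33 (nodes 0, 1, 4
and 5), so node 2 is left out, while a full boot with 32 CPUs covers all
eight nodes.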
Pingfan Liu Dec. 6, 2018, 10:03 a.m. UTC | #26
[...]
> Thanks for pointing this out. It made my life easier. So I think the
> bug is that we call init_memory_less_node from this path. I suspect
> numa_register_memblks is the right place to do this. So I admit I
> am not 100% sure but could you give this a try please?
>
Sure.

> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..4575ae4d5449 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> @@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         continue;
>
>                 alloc_node_data(nid);
> +               if (!end)
> +                       init_memory_less_node(nid);
>         }
>
>         /* Dump memblock with node info and return. */
> @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
>         numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       /* Allocate and initialize node data. Memory-less node is now online.*/
> -       alloc_node_data(nid);
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
>                 if (node == NUMA_NO_NODE)
>                         continue;
>
> -               if (!node_online(node))
> -                       init_memory_less_node(node);
> -
>                 numa_set_node(cpu, node);
>         }
>  }
> --
Which commit is this patch based on? I cannot apply it to the latest Linux tree.

Thanks,
Pingfan
Pingfan Liu Dec. 6, 2018, 10:44 a.m. UTC | #27
On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> [...]
> > THanks for pointing this out. It made my life easier. So It think the
> > bug is that we call init_memory_less_node from this path. I suspect
> > numa_register_memblks is the right place to do this. So I admit I
> > am not 100% sure but could you give this a try please?
> >
> Sure.
>
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f5408bf7..4575ae4d5449 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> >         }
> >  }
> >
> > +static void __init init_memory_less_node(int nid)
> > +{
> > +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > +
> > +       free_area_init_node(nid, zones_size, 0, zholes_size);
> > +
> > +       /*
> > +        * All zonelists will be built later in start_kernel() after per cpu
> > +        * areas are initialized.
> > +        */
> > +}
> > +
> >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> >  {
> >         unsigned long uninitialized_var(pfn_align);
> > @@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                         continue;
> >
> >                 alloc_node_data(nid);
> > +               if (!end)
> > +                       init_memory_less_node(nid);
> >         }
> >
> >         /* Dump memblock with node info and return. */
> > @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
> >         numa_init(dummy_numa_init);
> >  }
> >
> > -static void __init init_memory_less_node(int nid)
> > -{
> > -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > -
> > -       /* Allocate and initialize node data. Memory-less node is now online.*/
> > -       alloc_node_data(nid);
> > -       free_area_init_node(nid, zones_size, 0, zholes_size);
> > -
> > -       /*
> > -        * All zonelists will be built later in start_kernel() after per cpu
> > -        * areas are initialized.
> > -        */
> > -}
> > -
> >  /*
> >   * Setup early cpu_to_node.
> >   *
> > @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
> >                 if (node == NUMA_NO_NODE)
> >                         continue;
> >
> > -               if (!node_online(node))
> > -                       init_memory_less_node(node);
> > -
> >                 numa_set_node(cpu, node);
> >         }
> >  }
> > --
> Which commit is this patch applied on? I can not apply it on latest linux tree.
>
I applied it manually and will report the test result. I think it should
work, since you instantiate all the nodes.
But there are two things worth considering:
-1st. Why does x86 not bring up all nodes by default? That would apparently
be simpler.
-2nd. There are other archs; do they obey the same rules?

Thanks,
Pingfan
Michal Hocko Dec. 6, 2018, 12:11 p.m. UTC | #28
On Thu 06-12-18 18:44:03, Pingfan Liu wrote:
> On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu <kernelfans@gmail.com> wrote:
[...]
> > Which commit is this patch applied on? I can not apply it on latest linux tree.
> >
> I applied it by manual, will see the test result. I think it should
> work since you instance all the node.
> But there are two things worth to consider:
> -1st. why x86 do not bring up all nodes by default, apparently it will
> be more simple by that way

What do you mean? Why didn't it bring them up before? Or do you see some
nodes not being brought up after this patch?

> -2nd. there are other archs, do they obey the rules?

I am afraid that each arch does its own initialization.
Pingfan Liu Dec. 7, 2018, 2:56 a.m. UTC | #29
On Thu, Dec 6, 2018 at 8:11 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 06-12-18 18:44:03, Pingfan Liu wrote:
> > On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu <kernelfans@gmail.com> wrote:
> [...]
> > > Which commit is this patch applied on? I can not apply it on latest linux tree.
> > >
> > I applied it by manual, will see the test result. I think it should
> > work since you instance all the node.
> > But there are two things worth to consider:
> > -1st. why x86 do not bring up all nodes by default, apparently it will
> > be more simple by that way
>
> What do you mean? Why it didn't bring up before? Or do you see some

Yes, this is what I mean. But maybe the author did not take nr_cpus into
account; otherwise, using:
+       for_each_node(node)
+               if (!node_online(node))
+                       init_memory_less_node(node);
in init_cpu_to_node() would be simpler.

> nodes not being brought up after this patch?
>
> > -2nd. there are other archs, do they obey the rules?
>
> I am afraid that each arch does its own initialization.

Then it is arguable whether to fix this issue in the memory core or to let
each arch fix it. I checked the powerpc code, and it should also
need a fix; it may be the same on arm and mips.
BTW, your patch does not work for a normal bootup: the kernel hangs
without any kernel message.
I think it is due to this bug in the patch:
                alloc_node_data(nid);
+               if (!end)
+                       init_memory_less_node(nid); //which also calls alloc_node_data(nid)
How about the following:
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..4dc497d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
 {
        int cpu;
        u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+       int node, nr;

        BUG_ON(cpu_to_apicid == NULL);
+       nr = cpumask_weight(cpu_possible_mask);
+
+       /* bring up all possible node, since dev->numa_node */
+       //should check acpi works for node possible,
+       for_each_node(node)
+               if (!node_online(node))
+                       init_memory_less_node(node);

        for_each_possible_cpu(cpu) {
-               int node = numa_cpu_node(cpu);
+               node = numa_cpu_node(cpu);

                if (node == NUMA_NO_NODE)
                        continue;

-               if (!node_online(node))
-                       init_memory_less_node(node);
-
                numa_set_node(cpu, node);
        }
 }

Although it works, I hesitate about the idea because of the semantics of
an online node: does an online node require either a CPU or memory inside
the node to be online?
In short, the fix should take two factors into account:
the semantics of an online node and the effect on all archs.

Thanks,
Pingfan

> --
> Michal Hocko
> SUSE Labs
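
To make the nr_cpus gap discussed above concrete, here is a minimal userspace
sketch (not kernel code). It assumes the four possible CPUs are CPUs 0-3 of
the reported topology and that only nodes 1 and 5 have memory, as the early
memory ranges in this thread suggest. The helper names are hypothetical and
exist only for illustration.

```c
#include <stdio.h>

#define NR_NODES 8

/* Nodes that own memory according to the reported early memory ranges. */
#define MEM_NODES ((1u << 1) | (1u << 5))

/*
 * CPU -> node mapping taken from the "NUMA nodeN CPU(s)" listing:
 * even CPUs sit on nodes 0-3, odd CPUs on nodes 4-7, repeating every 8 CPUs.
 */
static int cpu_to_node_id(int cpu)
{
	int slot = cpu % 8;

	return (slot % 2 == 0) ? slot / 2 : 4 + slot / 2;
}

/* Bitmask of nodes reached by a loop that only walks the possible CPUs. */
static unsigned int nodes_reached(int nr_cpus)
{
	unsigned int mask = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++)
		mask |= 1u << cpu_to_node_id(cpu);
	return mask;
}

/* A node counts as initialized if it has memory or hosts a possible CPU. */
static int node_is_initialized(int nid, int nr_cpus)
{
	return ((MEM_NODES | nodes_reached(nr_cpus)) >> nid) & 1;
}

/* Print one line per node for a given nr_cpus limit. */
static void print_node_report(int nr_cpus)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d: %s\n", nid,
		       node_is_initialized(nid, nr_cpus) ?
		       "initialized" : "NOT initialized");
}
```

Under these assumptions, print_node_report(4) reports nodes 2, 3, 6 and 7 as
not initialized, which would match a device on node 2 hitting the crash, while
print_node_report(32) covers every node.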
Michal Hocko Dec. 7, 2018, 7:53 a.m. UTC | #30
On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
[...]
> In a short word, the fix method should consider about the two factors:
> semantic of online-node and the effect on all archs

I am pretty sure there is a lot of room for unification in this area.
Nevertheless I strongly believe the bug should be fixed first in the
simplest way and all the cleanup should be done on top.

Do I get it right that the diff worked for you and I can prepare a full
patch?
Pingfan Liu Dec. 7, 2018, 9:40 a.m. UTC | #31
On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> [...]
> > In a short word, the fix method should consider about the two factors:
> > semantic of online-node and the effect on all archs
>
> I am pretty sure there is a lot of room for unification in this area.
> Nevertheless I strongly believe the bug should be fixed firs with the
> simplest way and all the cleanup should be done on top.
>
> Do I get it right that the diff worked for you and I can prepare a full
> patch?
>
Sure, I am glad to test your new patch.

Thanks,
Pingfan
Michal Hocko Dec. 7, 2018, 11:30 a.m. UTC | #32
On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > [...]
> > > In a short word, the fix method should consider about the two factors:
> > > semantic of online-node and the effect on all archs
> >
> > I am pretty sure there is a lot of room for unification in this area.
> > Nevertheless I strongly believe the bug should be fixed firs with the
> > simplest way and all the cleanup should be done on top.
> >
> > Do I get it right that the diff worked for you and I can prepare a full
> > patch?
> >
> Sure, I am glad to test you new patch.

From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 7 Dec 2018 12:23:32 +0100
Subject: [PATCH] x86, numa: always initialize all possible nodes

Pingfan Liu has reported the following splat
[    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[    5.773618] PGD 0 P4D 0
[    5.773618] Oops: 0000 [#1] SMP NOPTI
[    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Call Trace:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  __slab_alloc+0x1c/0x38
[    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  devm_kmalloc+0x28/0x60
[    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  really_probe+0x73/0x420
[    5.773618]  driver_probe_device+0x115/0x130
[    5.773618]  __driver_attach+0x103/0x110
[    5.773618]  ? driver_probe_device+0x130/0x130
[    5.773618]  bus_for_each_dev+0x67/0xc0
[    5.773618]  ? klist_add_tail+0x3b/0x70
[    5.773618]  bus_add_driver+0x41/0x260
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  driver_register+0x5b/0xe0
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  do_one_initcall+0x4e/0x1d4
[    5.773618]  ? init_setup+0x25/0x28
[    5.773618]  kernel_init_freeable+0x1c1/0x26e
[    5.773618]  ? loglevel+0x5b/0x5b
[    5.773618]  ? rest_init+0xb0/0xb0
[    5.773618]  kernel_init+0xa/0x110
[    5.773618]  ret_from_fork+0x22/0x40
[    5.773618] Modules linked in:
[    5.773618] CR2: 0000000000002088
[    5.773618] ---[ end trace 1030c9120a03d081 ]---

on his AMD machine with the following topology
  NUMA node0 CPU(s):     0,8,16,24
  NUMA node1 CPU(s):     2,10,18,26
  NUMA node2 CPU(s):     4,12,20,28
  NUMA node3 CPU(s):     6,14,22,30
  NUMA node4 CPU(s):     1,9,17,25
  NUMA node5 CPU(s):     3,11,19,27
  NUMA node6 CPU(s):     5,13,21,29
  NUMA node7 CPU(s):     7,15,23,31

[    0.007418] Early memory node ranges
[    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]

and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2, which doesn't have any memory, and init_cpu_to_node only
initializes memory-less nodes for possible cpus, which nr_cpus restricts.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.

Fix the issue by moving init_memory_less_node into numa_register_memblks
and always initialize all possible nodes consistently at a single place.

Reported-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/mm/numa.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
@@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (!end)
+			init_memory_less_node(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
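
The fault address in the splat above can be read as plain pointer arithmetic.
RDI holds 0x2080, the faulting bytes `<3b> 77 08` are a 32-bit compare against
memory at 8(%rdi), and CR2 is 0x2088. That is consistent with a zonelist
address computed from a NULL per-node data pointer. The userspace sketch below
only illustrates the arithmetic: the mock structures and the 0x2080 size are
assumptions chosen to match the registers, not the real struct pglist_data
layout, which depends on the kernel configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a zoneref: a 64-bit zone pointer, then an index. */
struct mock_zoneref {
	uint64_t zone;
	int zone_idx;
};

/* Hypothetical stand-in for a zonelist: an array of zonerefs. */
struct mock_zonelist {
	struct mock_zoneref zonerefs[4];
};

/* Hypothetical per-node data: zones first, zonelists right after them. */
struct mock_pgdat {
	unsigned char node_zones[0x2080];	/* assumed size, matches RDI */
	struct mock_zonelist node_zonelists[2];
};

/* Address of the first zonelist, given the address of the per-node data. */
static uintptr_t zonelist_addr(uintptr_t pgdat_base)
{
	return pgdat_base + offsetof(struct mock_pgdat, node_zonelists);
}

/* Address of the first zoneref's zone_idx field inside that zonelist. */
static uintptr_t first_zone_idx_addr(uintptr_t pgdat_base)
{
	return zonelist_addr(pgdat_base) +
	       offsetof(struct mock_zonelist, zonerefs) +
	       offsetof(struct mock_zoneref, zone_idx);
}
```

With a NULL base, zonelist_addr(0) gives 0x2080 and first_zone_idx_addr(0)
gives 0x2088, the same pair as RDI and CR2 in the report. A node whose data
was never allocated would produce exactly this kind of small, non-zero fault
address rather than a plain NULL dereference.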
Pingfan Liu Dec. 7, 2018, 1:20 p.m. UTC | #33
On Fri, Dec 7, 2018 at 7:30 PM Michal Hocko <mhocko@kernel.org> wrote:
>
[...]
> On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> > On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > > [...]
> > > > In a short word, the fix method should consider about the two factors:
> > > > semantic of online-node and the effect on all archs
> > >
> > > I am pretty sure there is a lot of room for unification in this area.
> > > Nevertheless I strongly believe the bug should be fixed firs with the
> > > simplest way and all the cleanup should be done on top.
> > >
> > > Do I get it right that the diff worked for you and I can prepare a full
> > > patch?
> > >
> > Sure, I am glad to test you new patch.
>
> From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 7 Dec 2018 12:23:32 +0100
> Subject: [PATCH] x86, numa: always initialize all possible nodes
>
> Pingfan Liu has reported the following splat
> [    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
> [    5.773618] PGD 0 P4D 0
> [    5.773618] Oops: 0000 [#1] SMP NOPTI
> [    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> [    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
> [    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> [    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
> [    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
> [    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
> [    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
> [    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
> [    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
> [    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
> [    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
> [    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
> [    5.773618] Call Trace:
> [    5.773618]  new_slab+0xa9/0x570
> [    5.773618]  ___slab_alloc+0x375/0x540
> [    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [    5.773618]  __slab_alloc+0x1c/0x38
> [    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> [    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [    5.773618]  devm_kmalloc+0x28/0x60
> [    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> [    5.773618]  really_probe+0x73/0x420
> [    5.773618]  driver_probe_device+0x115/0x130
> [    5.773618]  __driver_attach+0x103/0x110
> [    5.773618]  ? driver_probe_device+0x130/0x130
> [    5.773618]  bus_for_each_dev+0x67/0xc0
> [    5.773618]  ? klist_add_tail+0x3b/0x70
> [    5.773618]  bus_add_driver+0x41/0x260
> [    5.773618]  ? pcie_port_setup+0x4d/0x4d
> [    5.773618]  driver_register+0x5b/0xe0
> [    5.773618]  ? pcie_port_setup+0x4d/0x4d
> [    5.773618]  do_one_initcall+0x4e/0x1d4
> [    5.773618]  ? init_setup+0x25/0x28
> [    5.773618]  kernel_init_freeable+0x1c1/0x26e
> [    5.773618]  ? loglevel+0x5b/0x5b
> [    5.773618]  ? rest_init+0xb0/0xb0
> [    5.773618]  kernel_init+0xa/0x110
> [    5.773618]  ret_from_fork+0x22/0x40
> [    5.773618] Modules linked in:
> [    5.773618] CR2: 0000000000002088
> [    5.773618] ---[ end trace 1030c9120a03d081 ]---
>
> with his AMD machine with the following topology
>   NUMA node0 CPU(s):     0,8,16,24
>   NUMA node1 CPU(s):     2,10,18,26
>   NUMA node2 CPU(s):     4,12,20,28
>   NUMA node3 CPU(s):     6,14,22,30
>   NUMA node4 CPU(s):     1,9,17,25
>   NUMA node5 CPU(s):     3,11,19,27
>   NUMA node6 CPU(s):     5,13,21,29
>   NUMA node7 CPU(s):     7,15,23,31
>
> [    0.007418] Early memory node ranges
> [    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> [    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> [    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> [    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> [    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> [    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> [    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]
>
> and nr_cpus set to 4. The underlying reason is tha the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restrics.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.
>
> Fix the issue by moving init_memory_less_node into numa_register_memblks
> and always initialize all possible nodes consistently at a single place.
>
> Reported-by: Pingfan Liu <kernelfans@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/x86/mm/numa.c | 33 +++++++++++++++------------------
>  1 file changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..4575ae4d5449 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> @@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         continue;
>
>                 alloc_node_data(nid);
> +               if (!end)
> +                       init_memory_less_node(nid);
>         }
>
>         /* Dump memblock with node info and return. */
> @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
>         numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       /* Allocate and initialize node data. Memory-less node is now online.*/
> -       alloc_node_data(nid);
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
>                 if (node == NUMA_NO_NODE)
>                         continue;
>
> -               if (!node_online(node))
> -                       init_memory_less_node(node);
> -
>                 numa_set_node(cpu, node);
>         }
>  }
> --
> 2.19.2
>
Hi Michal,

As I mentioned in my previous email, I manually applied the patch,
and it does not work for a normal bootup. Your new patch seems to
have no essential changes. I applied it and gave it a try; it still does
not work.

Thanks,
Pingfan
Michal Hocko Dec. 7, 2018, 2:22 p.m. UTC | #34
On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
[...]
> Hi Michal,
> 
> As I mentioned in my previous email, I have manually apply the patch,
> and the patch can not work for normal bootup.

I am sorry, I have misread your previous response. Is there anything
interesting on the serial console by any chance?
Pingfan Liu Dec. 7, 2018, 2:27 p.m. UTC | #35
On Fri, Dec 7, 2018 at 10:22 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
> [...]
> > Hi Michal,
> >
> > As I mentioned in my previous email, I have manually apply the patch,
> > and the patch can not work for normal bootup.
>
> I am sorry, I have misread your previous response. Is there anything
> interesting on the serial console by any chance?

Nothing. It needs more effort to debug. But as I mentioned, if I bring up
all of the remaining possible nodes, then it works. Maybe this can be of
some help to you.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..4dc497d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
 {
        int cpu;
        u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+       int node, nr;

        BUG_ON(cpu_to_apicid == NULL);
+       nr = cpumask_weight(cpu_possible_mask);
+
+       /* bring up all possible node, since dev->numa_node */
+       //should check acpi works for node possible,
+       for_each_node(node)
+               if (!node_online(node))
+                       init_memory_less_node(node);

        for_each_possible_cpu(cpu) {
-               int node = numa_cpu_node(cpu);
+               node = numa_cpu_node(cpu);

                if (node == NUMA_NO_NODE)
                        continue;

-               if (!node_online(node))
-                       init_memory_less_node(node);
-
                numa_set_node(cpu, node);
        }
 }

Thanks,
Pingfan
Michal Hocko Dec. 7, 2018, 2:50 p.m. UTC | #36
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 10:22 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
> > [...]
> > > Hi Michal,
> > >
> > > As I mentioned in my previous email, I have manually apply the patch,
> > > and the patch can not work for normal bootup.
> >
> > I am sorry, I have misread your previous response. Is there anything
> > interesting on the serial console by any chance?
> 
> Nothing. It need more effort to debug. But as I mentioned, enable all
> of the rest possible node, then it works. Maybe it can give some help
> for you.

I will have a look. Thanks!
Michal Hocko Dec. 7, 2018, 3:56 p.m. UTC | #37
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
[...]
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f54..4dc497d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
>  {
>         int cpu;
>         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> +       int node, nr;
> 
>         BUG_ON(cpu_to_apicid == NULL);
> +       nr = cpumask_weight(cpu_possible_mask);
> +
> +       /* bring up all possible node, since dev->numa_node */
> +       //should check acpi works for node possible,
> +       for_each_node(node)
> +               if (!node_online(node))
> +                       init_memory_less_node(node);

I suspect there is no change if you replace for_each_node by
	for_each_node_mask(nid, node_possible_map)

here. If that is the case then we are probably calling
free_area_init_node too early. I do not see it yet though.
Pingfan Liu Dec. 10, 2018, 4 a.m. UTC | #38
On Fri, Dec 7, 2018 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> [...]
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f54..4dc497d 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> >  {
> >         int cpu;
> >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > +       int node, nr;
> >
> >         BUG_ON(cpu_to_apicid == NULL);
> > +       nr = cpumask_weight(cpu_possible_mask);
> > +
> > +       /* bring up all possible node, since dev->numa_node */
> > +       //should check acpi works for node possible,
> > +       for_each_node(node)
> > +               if (!node_online(node))
> > +                       init_memory_less_node(node);
>
> I suspect there is no change if you replace for_each_node by
>         for_each_node_mask(nid, node_possible_map)
>
> here. If that is the case then we are probably calling
> free_area_init_node too early. I do not see it yet though.

Maybe I did not get your meaning clearly, so I am just guessing. But if you
are worried about node_possible_map, it is set dynamically by
alloc_node_data(). The map changes after the first call to
free_area_init_node() for a node with memory.  This logic is the
same as in the current x86 code.

Thanks,
Pingfan
Pingfan Liu Dec. 10, 2018, 7:57 a.m. UTC | #39
On Mon, Dec 10, 2018 at 12:00 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> > [...]
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 1308f54..4dc497d 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> > >  {
> > >         int cpu;
> > >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > > +       int node, nr;
> > >
> > >         BUG_ON(cpu_to_apicid == NULL);
> > > +       nr = cpumask_weight(cpu_possible_mask);
> > > +
> > > +       /* bring up all possible node, since dev->numa_node */
> > > +       //should check acpi works for node possible,
> > > +       for_each_node(node)
> > > +               if (!node_online(node))
> > > +                       init_memory_less_node(node);
> >
> > I suspect there is no change if you replace for_each_node by
> >         for_each_node_mask(nid, node_possible_map)
> >
> > here. If that is the case then we are probably calling
> > free_area_init_node too early. I do not see it yet though.
>
> Maybe I do not clearly get your meaning, just try to guess. But if you
> worry about node_possible_map, then it is dynamically set by
> alloc_node_data(). The map is changed after the first time to call

That was a mistake: it should be node_online_map. And in free_area_init_nodes():
for_each_online_node(nid) {
   free_area_init_node(nid, NULL,..

So at this time, we do not need to worry about the memory-less node.

> free_area_init_node() for the node with memory.  This logic is the
> same as the current x86 code.
>
> Thanks,
> Pingfan
Michal Hocko Dec. 10, 2018, 12:37 p.m. UTC | #40
On Fri 07-12-18 16:56:27, Michal Hocko wrote:
> On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> [...]
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f54..4dc497d 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> >  {
> >         int cpu;
> >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > +       int node, nr;
> > 
> >         BUG_ON(cpu_to_apicid == NULL);
> > +       nr = cpumask_weight(cpu_possible_mask);
> > +
> > +       /* bring up all possible node, since dev->numa_node */
> > +       //should check acpi works for node possible,
> > +       for_each_node(node)
> > +               if (!node_online(node))
> > +                       init_memory_less_node(node);
> 
> I suspect there is no change if you replace for_each_node by
> 	for_each_node_mask(nid, node_possible_map)
> 
> here. If that is the case then we are probably calling
> free_area_init_node too early. I do not see it yet though.

OK, so it is not about calling it too late or too soon. It is just that
node_possible_map is a misnomer and has different semantics than
I expected. numa_nodemask_from_meminfo simply considers only nodes
with some memory. So my patch didn't really make any difference and the
node stayed uninitialized.
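
The effect described here can be sketched in userspace C. This is only an
illustration in the spirit of numa_nodemask_from_meminfo, not the kernel
function: struct mem_block, the helper names and the plain bitmask are
hypothetical simplifications, and the two blocks are taken from the early
memory ranges reported in this thread.

```c
#include <stddef.h>

#define MAX_NODES 8
#define NO_NODE   (-1)

/* Hypothetical, simplified stand-in for one memory block of a node. */
struct mem_block {
	int nid;
	unsigned long long start;
	unsigned long long end;
};

/* Two of the reported ranges: memory exists only on nodes 1 and 5. */
static const struct mem_block reported_blocks[] = {
	{ 1, 0x0000000100000000ULL, 0x000000047fffffffULL },
	{ 5, 0x0000000480000000ULL, 0x000000087effffffULL },
};

/* Build a node bitmask from non-empty memory blocks only. */
static unsigned int nodemask_from_blocks(const struct mem_block *blk, size_t n)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (blk[i].start == blk[i].end || blk[i].nid == NO_NODE)
			continue;
		if (blk[i].nid >= 0 && blk[i].nid < MAX_NODES)
			mask |= 1u << blk[i].nid;
	}
	return mask;
}

/* Return 1 if a loop over the mask would visit nid, 0 otherwise. */
static int mask_visits_node(unsigned int mask, int nid)
{
	return (mask >> nid) & 1;
}

/* A loop over every node id visits any valid nid, memory or not. */
static int full_walk_visits_node(int nid)
{
	return nid >= 0 && nid < MAX_NODES;
}
```

With the reported blocks the mask is 0x22 (nodes 1 and 5), so a loop bounded
by that mask never reaches the memory-less node 2, while a walk over every
node id does.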

In other words, does the following work? I am sorry to guess wildly this
way, but I am not able to recreate your setup to play with this myself.

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..d51643e10d00 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
@@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 
 	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
+	for_each_node(nid) {
 		u64 start = PFN_PHYS(max_pfn);
 		u64 end = 0;
 
@@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (!end)
+			init_memory_less_node(nid);
+		else
+			node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
Pingfan Liu Dec. 11, 2018, 8:05 a.m. UTC | #41
On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 07-12-18 16:56:27, Michal Hocko wrote:
> > On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> > [...]
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 1308f54..4dc497d 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> > >  {
> > >         int cpu;
> > >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > > +       int node, nr;
> > >
> > >         BUG_ON(cpu_to_apicid == NULL);
> > > +       nr = cpumask_weight(cpu_possible_mask);
> > > +
> > > +       /* bring up all possible node, since dev->numa_node */
> > > +       //should check acpi works for node possible,
> > > +       for_each_node(node)
> > > +               if (!node_online(node))
> > > +                       init_memory_less_node(node);
> >
> > I suspect there is no change if you replace for_each_node by
> >       for_each_node_mask(nid, node_possible_map)
> >
> > here. If that is the case then we are probably calling
> > free_area_init_node too early. I do not see it yet though.
>
> OK, so it is not about calling it late or soon. It is just that
> node_possible_map is a misnomer and it has a different semantic than
> I've expected. numa_nodemask_from_meminfo simply considers only nodes
> with some memory. So my patch didn't really make any difference and the
> node stayed uninialized.
>
> In other words. Does the following work? I am sorry to wildguess this
> way but I am not able to recreate your setups to play with this myself.
>
No problem. Yeah, in order to debug the patch, you need a NUMA machine
with a memory-less node. Unfortunately, the patch does not work with
either a grub boot or a kexec -l boot. There is no output at all, just
silence. I will dig into numa_register_memblks() to figure out the problem.

Thanks,
Pingfan
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..d51643e10d00 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
>
>         node_data[nid] = nd;
>         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -       node_set_online(nid);
>  }
>
>  /**
> @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                 return -EINVAL;
>
>         /* Finally register nodes. */
> -       for_each_node_mask(nid, node_possible_map) {
> +       for_each_node(nid) {
>                 u64 start = PFN_PHYS(max_pfn);
>                 u64 end = 0;
>
> @@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         continue;
>
>                 alloc_node_data(nid);
> +               if (!end)
> +                       init_memory_less_node(nid);
> +               else
> +                       node_set_online(nid);
>         }
>
>         /* Dump memblock with node info and return. */
> @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
>         numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       /* Allocate and initialize node data. Memory-less node is now online.*/
> -       alloc_node_data(nid);
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
>                 if (node == NUMA_NO_NODE)
>                         continue;
>
> -               if (!node_online(node))
> -                       init_memory_less_node(node);
> -
>                 numa_set_node(cpu, node);
>         }
>  }
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Dec. 11, 2018, 9:44 a.m. UTC | #42
On Tue 11-12-18 16:05:58, Pingfan Liu wrote:
> On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 07-12-18 16:56:27, Michal Hocko wrote:
> > > On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> > > [...]
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 1308f54..4dc497d 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> > > >  {
> > > >         int cpu;
> > > >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > > > +       int node, nr;
> > > >
> > > >         BUG_ON(cpu_to_apicid == NULL);
> > > > +       nr = cpumask_weight(cpu_possible_mask);
> > > > +
> > > > +       /* bring up all possible node, since dev->numa_node */
> > > > +       //should check acpi works for node possible,
> > > > +       for_each_node(node)
> > > > +               if (!node_online(node))
> > > > +                       init_memory_less_node(node);
> > >
> > > I suspect there is no change if you replace for_each_node by
> > >       for_each_node_mask(nid, node_possible_map)
> > >
> > > here. If that is the case then we are probably calling
> > > free_area_init_node too early. I do not see it yet though.
> >
> > OK, so it is not about calling it late or soon. It is just that
> > node_possible_map is a misnomer and it has a different semantic than
> > I've expected. numa_nodemask_from_meminfo simply considers only nodes
> > with some memory. So my patch didn't really make any difference and the
> > node stayed uninialized.
> >
> > In other words. Does the following work? I am sorry to wildguess this
> > way but I am not able to recreate your setups to play with this myself.
> >
> No problem. Yeah, in order to debug the patch, you need a numa machine
> with a memory-less node. And unlucky, the patch can not work either by
> grub bootup or kexec -l boot. There is nothing, just silent.  I will
> dig into numa_register_memblks() to figure out the problem.

I do not have such a machine handy. Anyway, can you post the full serial
console log? Maybe I can infer something. It is quite weird that this
patch would make the existing situation any worse.
Pingfan Liu Dec. 12, 2018, 8:31 a.m. UTC | #43
On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
>
[...]
>
> In other words. Does the following work? I am sorry to wildguess this
> way but I am not able to recreate your setups to play with this myself.
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..d51643e10d00 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
>
>         node_data[nid] = nd;
>         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -       node_set_online(nid);
>  }
>
>  /**
> @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                 return -EINVAL;
>
>         /* Finally register nodes. */
> -       for_each_node_mask(nid, node_possible_map) {
> +       for_each_node(nid) {
>                 u64 start = PFN_PHYS(max_pfn);
>                 u64 end = 0;
>
> @@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         continue;
>
>                 alloc_node_data(nid);
> +               if (!end)

Here is the bug: a node with !end can never reach this point, because the
earlier "start >= end" check already skips it.
> +                       init_memory_less_node(nid);
> +               else
> +                       node_set_online(nid);
>         }
>
>         /* Dump memblock with node info and return. */
> @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
>         numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       /* Allocate and initialize node data. Memory-less node is now online.*/
> -       alloc_node_data(nid);
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
>                 if (node == NUMA_NO_NODE)
>                         continue;
>
> -               if (!node_online(node))
> -                       init_memory_less_node(node);
> -
>                 numa_set_node(cpu, node);
>         }
>  }
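
To spell out the "!end" remark, here is a small user-space model of the loop body as quoted above, with the original "start >= end" check still in place. It is a sketch only: the toy_* names are hypothetical and TOY_NODE_MIN_SIZE is an illustrative stand-in for NODE_MIN_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for NODE_MIN_SIZE. */
#define TOY_NODE_MIN_SIZE (4ULL * 1024 * 1024)

enum toy_outcome {
	TOY_SKIPPED,		/* "continue": node is left untouched */
	TOY_MEMORYLESS_INIT,	/* init_memory_less_node() would run */
	TOY_SET_ONLINE,		/* node_set_online() would run */
};

/*
 * start/end are the values the loop has computed for one node. For a
 * node without memory blocks they keep their initial values: start is
 * PFN_PHYS(max_pfn) and end is 0.
 */
static enum toy_outcome toy_register_with_early_check(uint64_t start,
							uint64_t end)
{
	if (start >= end)
		return TOY_SKIPPED;	/* end == 0 always exits here */
	if (end && (end - start) < TOY_NODE_MIN_SIZE)
		return TOY_SKIPPED;
	/* alloc_node_data(nid) would run here */
	if (!end)
		return TOY_MEMORYLESS_INIT;	/* dead branch */
	return TOY_SET_ONLINE;
}
```

Since start is unsigned, end == 0 implies start >= end, so the memory-less branch is dead code for every input.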

After passing an extra param to earlyprintk, I finally got the
following output. Please see the attachment.
BTW, based on your patch, I tried the following, and it works.
---
 arch/x86/mm/numa.c | 42 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..4874248 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -216,7 +216,6 @@ static void __init alloc_node_data(int nid)

        node_data[nid] = nd;
        memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
        node_set_online(nid);
 }

@@ -527,6 +526,21 @@ static void __init numa_clear_kernel_node_hotplug(void)
        }
 }

+static void __init init_memory_less_node(int nid)
+{
+       unsigned long zones_size[MAX_NR_ZONES] = {0};
+       unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+       alloc_node_data(nid);
+       free_area_init_node(nid, zones_size, 0, zholes_size);
+       node_set_online(nid);
+
+       /*
+        * All zonelists will be built later in start_kernel() after per cpu
+        * areas are initialized.
+        */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
        unsigned long uninitialized_var(pfn_align);
@@ -570,7 +584,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
                return -EINVAL;

        /* Finally register nodes. */
-       for_each_node_mask(nid, node_possible_map) {
+       for_each_node(nid) {
                u64 start = PFN_PHYS(max_pfn);
                u64 end = 0;

@@ -581,15 +595,15 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
                        end = max(mi->blk[i].end, end);
                }

-               if (start >= end)
-                       continue;

                /*
                 * Don't confuse VM with a node that doesn't have the
                 * minimum amount of memory:
                 */
-               if (end && (end - start) < NODE_MIN_SIZE)
+               if ( start >= end || (end && (end - start) < NODE_MIN_SIZE)) {
+                       init_memory_less_node(nid);
                        continue;
+               }

                alloc_node_data(nid);
        }
@@ -721,21 +735,6 @@ void __init x86_numa_init(void)
        numa_init(dummy_numa_init);
 }

-static void __init init_memory_less_node(int nid)
-{
-       unsigned long zones_size[MAX_NR_ZONES] = {0};
-       unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-       /* Allocate and initialize node data. Memory-less node is now online.*/
-       alloc_node_data(nid);
-       free_area_init_node(nid, zones_size, 0, zholes_size);
-
-       /*
-        * All zonelists will be built later in start_kernel() after per cpu
-        * areas are initialized.
-        */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +762,6 @@ void __init init_cpu_to_node(void)
                if (node == NUMA_NO_NODE)
                        continue;

-               if (!node_online(node))
-                       init_memory_less_node(node);
-
                numa_set_node(cpu, node);
        }
 }
Pingfan Liu Dec. 12, 2018, 8:33 a.m. UTC | #44
On Tue, Dec 11, 2018 at 5:44 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 11-12-18 16:05:58, Pingfan Liu wrote:
> > On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 07-12-18 16:56:27, Michal Hocko wrote:
> > > > On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> > > > [...]
> > > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > > index 1308f54..4dc497d 100644
> > > > > --- a/arch/x86/mm/numa.c
> > > > > +++ b/arch/x86/mm/numa.c
> > > > > @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
> > > > >  {
> > > > >         int cpu;
> > > > >         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> > > > > +       int node, nr;
> > > > >
> > > > >         BUG_ON(cpu_to_apicid == NULL);
> > > > > +       nr = cpumask_weight(cpu_possible_mask);
> > > > > +
> > > > > +       /* bring up all possible node, since dev->numa_node */
> > > > > +       //should check acpi works for node possible,
> > > > > +       for_each_node(node)
> > > > > +               if (!node_online(node))
> > > > > +                       init_memory_less_node(node);
> > > >
> > > > I suspect there is no change if you replace for_each_node by
> > > >       for_each_node_mask(nid, node_possible_map)
> > > >
> > > > here. If that is the case then we are probably calling
> > > > free_area_init_node too early. I do not see it yet though.
> > >
> > > OK, so it is not about calling it late or soon. It is just that
> > > node_possible_map is a misnomer and it has a different semantic than
> > > I've expected. numa_nodemask_from_meminfo simply considers only nodes
> > > with some memory. So my patch didn't really make any difference and the
> > > node stayed uninialized.
> > >
> > > In other words. Does the following work? I am sorry to wildguess this
> > > way but I am not able to recreate your setups to play with this myself.
> > >
> > No problem. Yeah, in order to debug the patch, you need a numa machine
> > with a memory-less node. And unlucky, the patch can not work either by
> > grub bootup or kexec -l boot. There is nothing, just silent.  I will
> > dig into numa_register_memblks() to figure out the problem.
>
> I do not have such a machine handy. Anyway, can you post the full serial
> console log. Maybe I can infer something. It is quite weird that this
> patch would make an existing situation any worse.

After passing an extra param to earlyprintk, I finally got some output. I
replied with it in another mail, along with some notes on your code.

Thanks,
Pingfan
Michal Hocko Dec. 12, 2018, 11:53 a.m. UTC | #45
On Wed 12-12-18 16:31:35, Pingfan Liu wrote:
> On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> [...]
> >
> > In other words. Does the following work? I am sorry to wildguess this
> > way but I am not able to recreate your setups to play with this myself.
> >
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f5408bf7..d51643e10d00 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
> >
> >         node_data[nid] = nd;
> >         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > -
> > -       node_set_online(nid);
> >  }
> >
> >  /**
> > @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> >         }
> >  }
> >
> > +static void __init init_memory_less_node(int nid)
> > +{
> > +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > +
> > +       free_area_init_node(nid, zones_size, 0, zholes_size);
> > +
> > +       /*
> > +        * All zonelists will be built later in start_kernel() after per cpu
> > +        * areas are initialized.
> > +        */
> > +}
> > +
> >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> >  {
> >         unsigned long uninitialized_var(pfn_align);
> > @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                 return -EINVAL;
> >
> >         /* Finally register nodes. */
> > -       for_each_node_mask(nid, node_possible_map) {
> > +       for_each_node(nid) {
> >                 u64 start = PFN_PHYS(max_pfn);
> >                 u64 end = 0;
> >
> > @@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                         continue;
> >
> >                 alloc_node_data(nid);
> > +               if (!end)
> 
> Here comes the bug, since !end can not reach here.

You are right. I am dumb. I've just completely missed that. Sigh.
Anyway, I think the code is more complicated than necessary and we can
simply drop the check. I do not think we really have to worry about
start exceeding end. So the final patch should look as follows.
Btw. I believe it is better to pull alloc_node_data out of init_memory_less_node
because a) there is no need to duplicate the call and b) we want
to pull node_set_online out as well. The code also seems cleaner this way.
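
For illustration, a user-space sketch of the per-node decisions once the "start >= end" check is dropped and node data is allocated in one place. It only models which steps would run for a given (start, end) pair and makes no claim about whether the resulting kernel boots. The toy_* names and TOY_NODE_MIN_SIZE are assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for NODE_MIN_SIZE. */
#define TOY_NODE_MIN_SIZE (4ULL * 1024 * 1024)

/* Records which of the three steps would run for one node. */
struct toy_node_state {
	bool has_node_data;		/* alloc_node_data() */
	bool empty_zones_initialized;	/* init_memory_less_node() */
	bool online;			/* node_set_online() */
};

static struct toy_node_state toy_register_node(uint64_t start, uint64_t end)
{
	struct toy_node_state s = { false, false, false };

	/* Too small to be useful: skip the node completely. */
	if (end && (end - start) < TOY_NODE_MIN_SIZE)
		return s;

	s.has_node_data = true;		/* single allocation point */
	if (!end)
		s.empty_zones_initialized = true;	/* memory-less node */
	else
		s.online = true;	/* node with memory */
	return s;
}
```

In this model a memory-less node (end == 0) gets node data and empty zones but is not marked online, while a node with enough memory is marked online and skips the memory-less path.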

Thanks for your testing and your patience with me here.

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..a5548fe668fb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
@@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 
 	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
+	for_each_node(nid) {
 		u64 start = PFN_PHYS(max_pfn);
 		u64 end = 0;
 
@@ -581,9 +592,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start >= end)
-			continue;
-
 		/*
 		 * Don't confuse VM with a node that doesn't have the
 		 * minimum amount of memory:
@@ -592,6 +600,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (!end)
+			init_memory_less_node(nid);
+		else
+			node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +733,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +760,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
Pingfan Liu Dec. 13, 2018, 8:37 a.m. UTC | #46
On Wed, Dec 12, 2018 at 7:53 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 12-12-18 16:31:35, Pingfan Liu wrote:
> > On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > [...]
> > >
> > > In other words. Does the following work? I am sorry to wildguess this
> > > way but I am not able to recreate your setups to play with this myself.
> > >
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 1308f5408bf7..d51643e10d00 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
> > >
> > >         node_data[nid] = nd;
> > >         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > > -
> > > -       node_set_online(nid);
> > >  }
> > >
> > >  /**
> > > @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> > >         }
> > >  }
> > >
> > > +static void __init init_memory_less_node(int nid)
> > > +{
> > > +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > > +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > > +
> > > +       free_area_init_node(nid, zones_size, 0, zholes_size);
> > > +
> > > +       /*
> > > +        * All zonelists will be built later in start_kernel() after per cpu
> > > +        * areas are initialized.
> > > +        */
> > > +}
> > > +
> > >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> > >  {
> > >         unsigned long uninitialized_var(pfn_align);
> > > @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > >                 return -EINVAL;
> > >
> > >         /* Finally register nodes. */
> > > -       for_each_node_mask(nid, node_possible_map) {
> > > +       for_each_node(nid) {
> > >                 u64 start = PFN_PHYS(max_pfn);
> > >                 u64 end = 0;
> > >
> > > @@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > >                         continue;
> > >
> > >                 alloc_node_data(nid);
> > > +               if (!end)
> >
> > Here comes the bug, since !end can not reach here.
>
> You are right. I am dumb. I've just completely missed that. Sigh.
> Anyway, I think the code is more complicated than necessary and we can
> simply drop the check. I do not think we really have to worry about
> the start overflowing end. So the end patch should look as follows.
> Btw. I believe it is better to pull alloc_node_data out of init_memory_less_node
> because a) there is no need to duplicate the call and moreover we want
> to pull node_set_online as well. The code also seems cleaner this way.
>
I have no strong opinion here.
> Thanks for your testing and your patience with me here.
Np.
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..a5548fe668fb 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
>
>         node_data[nid] = nd;
>         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -       node_set_online(nid);
>  }
>
>  /**
> @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                 return -EINVAL;
>
>         /* Finally register nodes. */
> -       for_each_node_mask(nid, node_possible_map) {
> +       for_each_node(nid) {
>                 u64 start = PFN_PHYS(max_pfn);
>                 u64 end = 0;
>
> @@ -581,9 +592,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         end = max(mi->blk[i].end, end);
>                 }
>
> -               if (start >= end)
> -                       continue;
> -
>                 /*
>                  * Don't confuse VM with a node that doesn't have the
>                  * minimum amount of memory:
> @@ -592,6 +600,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>                         continue;
>
>                 alloc_node_data(nid);
> +               if (!end)
> +                       init_memory_less_node(nid);
> +               else
> +                       node_set_online(nid);
>         }
>
>         /* Dump memblock with node info and return. */
> @@ -721,21 +733,6 @@ void __init x86_numa_init(void)
>         numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       /* Allocate and initialize node data. Memory-less node is now online.*/
> -       alloc_node_data(nid);
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +760,6 @@ void __init init_cpu_to_node(void)
>                 if (node == NUMA_NO_NODE)
>                         continue;
>
> -               if (!node_online(node))
> -                       init_memory_less_node(node);
> -
>                 numa_set_node(cpu, node);
>         }
>  }
> --
Unfortunately, it still has a bug, and I got a panic. The log is attached.

Thanks,
Pingfan
[    0.000000] Linux version 4.20.0-rc6+
[    0.000000] Command line: root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=500M rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 earlyprintk=ttyS0,115200n81
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000005c3d6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] printk: bootconsole [earlyser0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] reserve setup_data: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100070-0x000000005c3d6fff] usable
[    0.000000] reserve setup_data: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] reserve setup_data: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] reserve setup_data: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] reserve setup_data: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] efi: EFI v2.50 by Dell Inc.
[    0.000000] efi:  ACPI=0x6c527000  ACPI 2.0=0x6c527014  SMBIOS=0x6afde000  SMBIOS 3.0=0x6afdc000 
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2095.918 MHz processor
[    0.000066] last_pfn = 0x87f000 max_arch_pfn = 0x400000000
[    0.006409] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
Memory KASLR using RDRAND RDTSC...
[    0.016620] last_pfn = 0x70000 max_arch_pfn = 0x400000000
[    0.027379] Using GB pages for direct mapping
[    0.032009] Secure boot could not be determined
[    0.036361] RAMDISK: [mem 0x87a172000-0x87cdfffff]
[    0.041137] ACPI: Early table checksum verification disabled
[    0.046766] ACPI: RSDP 0x000000006C527014 000024 (v02 DELL  )
[    0.052480] ACPI: XSDT 0x000000006C5260E8 0000C4 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.060981] ACPI: FACP 0x000000006C516000 000114 (v06 DELL   PE_SC3   00000002 DELL 00000001)
[    0.069472] ACPI: DSDT 0x000000006C505000 00D302 (v02 DELL   PE_SC3   00000002 DELL 00000001)
[    0.077962] ACPI: FACS 0x000000006C2F1000 000040
[    0.082556] ACPI: SSDT 0x000000006C525000 0000D2 (v02 DELL   PE_SC3   00000002 MSFT 04000000)
[    0.091051] ACPI: BERT 0x000000006C524000 000030 (v01 DELL   BERT     00000001 DELL 00000001)
[    0.099543] ACPI: HEST 0x000000006C523000 0006DC (v01 DELL   HEST     00000001 DELL 00000001)
[    0.108038] ACPI: SSDT 0x000000006C522000 0001C4 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.116532] ACPI: SRAT 0x000000006C521000 0002D0 (v03 DELL   PE_SC3   00000001 AMD  00000001)
[    0.125027] ACPI: MSCT 0x000000006C520000 0000A6 (v01 DELL   PE_SC3   00000000 AMD  00000001)
[    0.133519] ACPI: SLIT 0x000000006C51F000 00006C (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.142014] ACPI: CRAT 0x000000006C51C000 002210 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.150508] ACPI: CDIT 0x000000006C51B000 000068 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.159001] ACPI: SSDT 0x000000006C51A000 0003C6 (v02 DELL   Tpm2Tabl 00001000 INTL 20170119)
[    0.167495] ACPI: TPM2 0x000000006C519000 000038 (v04 DELL   PE_SC3   00000002 DELL 00000001)
[    0.175990] ACPI: EINJ 0x000000006C518000 000150 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.184484] ACPI: SLIC 0x000000006C517000 000024 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.192976] ACPI: HPET 0x000000006C515000 000038 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.201471] ACPI: APIC 0x000000006C514000 0004B2 (v03 DELL   PE_SC3   00000002 DELL 00000001)
[    0.209965] ACPI: MCFG 0x000000006C513000 00003C (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.218458] ACPI: SSDT 0x000000006C504000 0005CA (v02 DELL   xhc_port 00000001 INTL 20170119)
[    0.226952] ACPI: IVRS 0x000000006C503000 000390 (v02 DELL   PE_SC3   00000001 AMD  00000000)
[    0.235447] ACPI: SSDT 0x000000006C501000 001658 (v01 AMD    CPMCMN   00000001 INTL 20170119)
[    0.244001] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.248360] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.252781] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.257202] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.261620] SRAT: PXM 1 -> APIC 0x10 -> Node 1
[    0.266041] SRAT: PXM 1 -> APIC 0x11 -> Node 1
[    0.270462] SRAT: PXM 1 -> APIC 0x18 -> Node 1
[    0.274883] SRAT: PXM 1 -> APIC 0x19 -> Node 1
[    0.279301] SRAT: PXM 2 -> APIC 0x20 -> Node 2
[    0.283722] SRAT: PXM 2 -> APIC 0x21 -> Node 2
[    0.288143] SRAT: PXM 2 -> APIC 0x28 -> Node 2
[    0.292564] SRAT: PXM 2 -> APIC 0x29 -> Node 2
[    0.296982] SRAT: PXM 3 -> APIC 0x30 -> Node 3
[    0.301403] SRAT: PXM 3 -> APIC 0x31 -> Node 3
[    0.305824] SRAT: PXM 3 -> APIC 0x38 -> Node 3
[    0.310245] SRAT: PXM 3 -> APIC 0x39 -> Node 3
[    0.314664] SRAT: PXM 4 -> APIC 0x40 -> Node 4
[    0.319085] SRAT: PXM 4 -> APIC 0x41 -> Node 4
[    0.323506] SRAT: PXM 4 -> APIC 0x48 -> Node 4
[    0.327926] SRAT: PXM 4 -> APIC 0x49 -> Node 4
[    0.332345] SRAT: PXM 5 -> APIC 0x50 -> Node 5
[    0.336766] SRAT: PXM 5 -> APIC 0x51 -> Node 5
[    0.341187] SRAT: PXM 5 -> APIC 0x58 -> Node 5
[    0.345607] SRAT: PXM 5 -> APIC 0x59 -> Node 5
[    0.350026] SRAT: PXM 6 -> APIC 0x60 -> Node 6
[    0.354447] SRAT: PXM 6 -> APIC 0x61 -> Node 6
[    0.358868] SRAT: PXM 6 -> APIC 0x68 -> Node 6
[    0.363289] SRAT: PXM 6 -> APIC 0x69 -> Node 6
[    0.367707] SRAT: PXM 7 -> APIC 0x70 -> Node 7
[    0.372128] SRAT: PXM 7 -> APIC 0x71 -> Node 7
[    0.376549] SRAT: PXM 7 -> APIC 0x78 -> Node 7
[    0.380970] SRAT: PXM 7 -> APIC 0x79 -> Node 7
[    0.385391] ACPI: SRAT: Node 1 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.391370] ACPI: SRAT: Node 1 PXM 1 [mem 0x00100000-0x7fffffff]
[    0.397349] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x47fffffff]
[    0.403503] ACPI: SRAT: Node 5 PXM 5 [mem 0x480000000-0x87fffffff]
[    0.409667] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.419885] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
[    0.430386] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
[    0.436352]     NODE_DATA(0) on node 5
[    0.440124] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[    0.447104] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
[    0.453110] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
[    0.459060]     NODE_DATA(2) on node 5
[    0.462855] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.469809] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
[    0.475788]     NODE_DATA(3) on node 5
[    0.479554] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
[    0.486536] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
[    0.492518]     NODE_DATA(4) on node 5
[    0.496280] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
[    0.503266] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
[    0.509281] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
[    0.515224]     NODE_DATA(6) on node 5
[    0.518987] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
[    0.525974] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
[    0.531953]     NODE_DATA(7) on node 5
[    0.535716] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]
[    0.542839] Reserving 500MB of memory at 384MB for crashkernel (System RAM: 32314MB)
[    0.550465] Zone ranges:
[    0.552927]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.559081]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.565235]   Normal   [mem 0x0000000100000000-0x000000087effffff]
[    0.571388]   Device   empty
[    0.574249] Movable zone start for each node
[    0.578498] Early memory node ranges
[    0.582049]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.588291]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.594530]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.600772]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.607011]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.613251]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.619493]   node   5: [mem 0x0000000480000000-0x000000087effffff]
[    0.626479] Zeroed struct page in unavailable ranges: 46490 pages
[    0.626480] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
[    0.655261] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[    0.663605] ACPI: PM-Timer IO Port: 0x408
[    0.667459] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.673362] IOAPIC[0]: apic_id 128, version 33, address 0xfec00000, GSI 0-23
[    0.680359] IOAPIC[1]: apic_id 129, version 33, address 0xfd880000, GSI 24-55
[    0.687465] IOAPIC[2]: apic_id 130, version 33, address 0xea900000, GSI 56-87
[    0.694571] IOAPIC[3]: apic_id 131, version 33, address 0xdd900000, GSI 88-119
[    0.701766] IOAPIC[4]: apic_id 132, version 33, address 0xd0900000, GSI 120-151
[    0.709048] IOAPIC[5]: apic_id 133, version 33, address 0xc3900000, GSI 152-183
[    0.716328] IOAPIC[6]: apic_id 134, version 33, address 0xb6900000, GSI 184-215
[    0.723609] IOAPIC[7]: apic_id 135, version 33, address 0xa9900000, GSI 216-247
[    0.730888] IOAPIC[8]: apic_id 136, version 33, address 0x9c900000, GSI 248-279
[    0.738167] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.744494] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.750998] Using ACPI (MADT) for SMP configuration information
[    0.756887] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.762007] smpboot: Allowing 128 CPUs, 96 hotplug CPUs
[    0.767228] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.773614] PM: Registered nosave memory: [mem 0x0008f000-0x0008ffff]
[    0.780029] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.786440] PM: Registered nosave memory: [mem 0x00100000-0x00100fff]
[    0.792855] PM: Registered nosave memory: [mem 0x5c3d7000-0x643defff]
[    0.799270] PM: Registered nosave memory: [mem 0x68ff8000-0x6b4f7fff]
[    0.805681] PM: Registered nosave memory: [mem 0x6b4f8000-0x6c327fff]
[    0.812095] PM: Registered nosave memory: [mem 0x6c328000-0x6c527fff]
[    0.818510] PM: Registered nosave memory: [mem 0x70000000-0x8fffffff]
[    0.824924] PM: Registered nosave memory: [mem 0x90000000-0xfec0ffff]
[    0.831337] PM: Registered nosave memory: [mem 0xfec10000-0xfec10fff]
[    0.837751] PM: Registered nosave memory: [mem 0xfec11000-0xfed7ffff]
[    0.844165] PM: Registered nosave memory: [mem 0xfed80000-0xfed80fff]
[    0.850579] PM: Registered nosave memory: [mem 0xfed81000-0xffffffff]
[    0.856996] [mem 0x90000000-0xfec0ffff] available for PCI devices
[    0.863060] Booting paravirtualized kernel on bare hardware
[    0.868612] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.994782] random: get_random_bytes called from start_kernel+0x9b/0x52e with crng_init=0
[    1.002785] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:8
[    1.011287] setup_percpu: cpu 0 has no node 0 or node-local memory
[    1.017780] setup_percpu: cpu 1 has no node 4 or node-local memory
[    1.029878] setup_percpu: cpu 4 has no node 2 or node-local memory
[    1.036268] setup_percpu: cpu 5 has no node 6 or node-local memory
[    1.042638] setup_percpu: cpu 6 has no node 3 or node-local memory
[    1.049022] setup_percpu: cpu 7 has no node 7 or node-local memory
[    1.058265] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u262144
[    1.066324] Built 2 zonelists, mobility grouping off.  Total pages: 0
[    1.072590] Policy zone: Normal
[    1.075716] Kernel command line: root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=500M rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 earlyprintk=ttyS0,115200n81
[    1.119510] Memory: 1333560K/33089944K available (12292K kernel code, 2066K rwdata, 3756K rodata, 2352K init, 6524K bss, 1202444K reserved, 0K cma-reserved)
[    1.133390] swapper: page allocation failure: order:0, mode:0x4000(__GFP_COMP), nodemask=(null)
[    1.141981] swapper cpuset=(null) mems_allowed=0-1023
[    1.147012] CPU: 0 PID: 0 Comm: swapper Not tainted 4.20.0-rc6+ #6
[    1.153161] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    1.160703] Call Trace:
[    1.163145]  dump_stack+0x5c/0x7b
[    1.166429]  warn_alloc+0xf5/0x180
[    1.169809]  ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
[    1.176305]  __alloc_pages_slowpath+0x84f/0xa0d
[    1.180815]  ? pcpu_block_refresh_hint+0x77/0xa0
[    1.185404]  __alloc_pages_nodemask+0x299/0x2e0
[    1.189916]  new_slab+0x425/0x570
[    1.193205]  ___slab_alloc+0x375/0x540
[    1.196935]  ? bootstrap+0x1b/0xcb
[    1.200313]  ? __kmem_cache_create+0x2b/0x150
[    1.204649]  ? printk+0x58/0x6f
[    1.207765]  ? bootstrap+0x1b/0xcb
[    1.211145]  __slab_alloc+0x1c/0x38
[    1.214613]  kmem_cache_alloc+0x192/0x1c0
[    1.218600]  bootstrap+0x1b/0xcb
[    1.221809]  kmem_cache_init+0x8d/0x109
[    1.225625]  start_kernel+0x26c/0x52e
[    1.229262]  ? set_init_arg+0x55/0x55
[    1.232904]  secondary_startup_64+0xa4/0xb0
[    1.237064] Mem-Info:
[    1.239317] active_anon:0 inactive_anon:0 isolated_anon:0
[    1.239317]  active_file:0 inactive_file:0 isolated_file:0
[    1.239317]  unevictable:0 dirty:0 writeback:0 unstable:0
[    1.239317]  slab_reclaimable:0 slab_unreclaimable:2
[    1.239317]  mapped:0 shmem:0 pagetables:0 bounce:0
[    1.239317]  free:333388 free_pcp:0 free_cma:0
[    1.269738] Node 1 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[    1.295132] Node 5 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[    1.320525] Node 1 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.345920] lowmem_reserve[]: 0 0 0 0 0
[    1.349735] Node 1 DMA32 free:1055520kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1633056kB managed:1055520kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.375823] lowmem_reserve[]: 0 0 0 0 0
[    1.379637] Node 1 Normal free:131068kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:14680064kB managed:131072kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.405724] lowmem_reserve[]: 0 0 0 0 0
[    1.409539] Node 5 Normal free:131068kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:16760832kB managed:131072kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.435626] lowmem_reserve[]: 0 0 0 0 0
[    1.439440] Node 1 DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 2*128kB (U) 2*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 15896kB
[    1.453482] Node 1 DMA32: 2*4kB (M) 1*8kB (M) 1*16kB (M) 2*32kB (M) 3*64kB (M) 2*128kB (M) 3*256kB (M) 3*512kB (M) 2*1024kB (M) 3*2048kB (M) 255*4096kB (M) = 1055520kB
[    1.468388] Node 1 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB
[    1.483211] Node 5 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB
[    1.498032] 0 total pagecache pages
[    1.501501] 0 pages in swap cache
[    1.504792] Swap cache stats: add 0, delete 0, find 0/0
[    1.509992] Free swap  = 0kB
[    1.512852] Total swap = 0kB
[    1.515713] 8272486 pages RAM
[    1.518659] 0 pages HighMem/MovableOnly
[    1.522473] 7939096 pages reserved
[    1.525852] 0 pages cma reserved
[    1.529059] 0 pages hwpoisoned
[    1.532096] SLUB: Unable to allocate memory on node -1, gfp=0x408000(GFP_NOWAIT|__GFP_ZERO)
[    1.540415]   cache: kmem_cache, object size: 392, buffer size: 448, default order: 2, min order: 0
[    1.549429]   node 1: slabs: 0, objs: 0, free: 0
[    1.554022]   node 5: slabs: 0, objs: 0, free: 0
[    1.558631] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[    1.566416] PGD 0 P4D 0 
[    1.568929] Oops: 0002 [#1] SMP NOPTI
[    1.572571] CPU: 0 PID: 0 Comm: swapper Not tainted 4.20.0-rc6+ #6
[    1.578724] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    1.586265] RIP: 0010:bootstrap+0x2e/0xcb
[    1.590251] Code: ff 55 48 89 fd 48 8b 3d 95 c0 42 00 be 00 80 40 00 53 e8 4a c8 aa fe 48 89 c3 48 8b 05 80 c0 42 00 48 89 ee 48 89 df 8b 48 1c <f3> a4 65 8b 35 5e b8 45 56 48 89 df e8 e6 d6 aa fe 44 8b 05 47 6b
[    1.608972] RSP: 0000:ffffffffa9603ed0 EFLAGS: 00010046
[    1.614173] RAX: ffffffffa9ce6600 RBX: 0000000000000000 RCX: 0000000000000188
[    1.621280] RDX: 00000000000001c0 RSI: ffffffffa9ce6600 RDI: 0000000000000000
[    1.628387] RBP: ffffffffa9ce6600 R08: 000000006f6e2020 R09: 0000000000000147
[    1.635494] R10: 202c30203a736261 R11: 6c73203a35206564 R12: ffffffffa9c2e900
[    1.642601] R13: ffffffffa9c492c0 R14: 0000000000000000 R15: 0000000000000000
[    1.649709] FS:  0000000000000000(0000) GS:ffff9b3c69c00000(0000) knlGS:0000000000000000
[    1.657770] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.663488] CR2: 0000000000000000 CR3: 000000087e00a000 CR4: 00000000000406b0
[    1.670598] Call Trace:
[    1.673025]  kmem_cache_init+0x8d/0x109
[    1.676837]  start_kernel+0x26c/0x52e
[    1.680478]  ? set_init_arg+0x55/0x55
[    1.684118]  secondary_startup_64+0xa4/0xb0
[    1.688277] Modules linked in:
[    1.691311] CR2: 0000000000000000
[    1.694647] ---[ end trace 321632dadec2749b ]---
[    1.699199] RIP: 0010:bootstrap+0x2e/0xcb
[    1.703185] Code: ff 55 48 89 fd 48 8b 3d 95 c0 42 00 be 00 80 40 00 53 e8 4a c8 aa fe 48 89 c3 48 8b 05 80 c0 42 00 48 89 ee 48 89 df 8b 48 1c <f3> a4 65 8b 35 5e b8 45 56 48 89 df e8 e6 d6 aa fe 44 8b 05 47 6b
[    1.721906] RSP: 0000:ffffffffa9603ed0 EFLAGS: 00010046
[    1.727106] RAX: ffffffffa9ce6600 RBX: 0000000000000000 RCX: 0000000000000188
[    1.734213] RDX: 00000000000001c0 RSI: ffffffffa9ce6600 RDI: 0000000000000000
[    1.741320] RBP: ffffffffa9ce6600 R08: 000000006f6e2020 R09: 0000000000000147
[    1.748427] R10: 202c30203a736261 R11: 6c73203a35206564 R12: ffffffffa9c2e900
[    1.755534] R13: ffffffffa9c492c0 R14: 0000000000000000 R15: 0000000000000000
[    1.762642] FS:  0000000000000000(0000) GS:ffff9b3c69c00000(0000) knlGS:0000000000000000
[    1.770702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.776423] CR2: 0000000000000000 CR3: 000000087e00a000 CR4: 00000000000406b0
[    1.783531] Kernel panic - not syncing: Fatal exception
[    1.788802] ---[ end Kernel panic - not syncing: Fatal exception ]---
Pingfan Liu Dec. 13, 2018, 9:04 a.m. UTC | #47
On Thu, Dec 13, 2018 at 4:37 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Wed, Dec 12, 2018 at 7:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 12-12-18 16:31:35, Pingfan Liu wrote:
> > > On Mon, Dec 10, 2018 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > [...]
> > > >
> > > > In other words. Does the following work? I am sorry to wildguess this
> > > > way but I am not able to recreate your setups to play with this myself.
> > > >
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 1308f5408bf7..d51643e10d00 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
> > > >
> > > >         node_data[nid] = nd;
> > > >         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > > > -
> > > > -       node_set_online(nid);
> > > >  }
> > > >
> > > >  /**
> > > > @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> > > >         }
> > > >  }
> > > >
> > > > +static void __init init_memory_less_node(int nid)
> > > > +{
> > > > +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > > > +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > > > +
> > > > +       free_area_init_node(nid, zones_size, 0, zholes_size);
> > > > +
> > > > +       /*
> > > > +        * All zonelists will be built later in start_kernel() after per cpu
> > > > +        * areas are initialized.
> > > > +        */
> > > > +}
> > > > +
> > > >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >  {
> > > >         unsigned long uninitialized_var(pfn_align);
> > > > @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >                 return -EINVAL;
> > > >
> > > >         /* Finally register nodes. */
> > > > -       for_each_node_mask(nid, node_possible_map) {
> > > > +       for_each_node(nid) {
> > > >                 u64 start = PFN_PHYS(max_pfn);
> > > >                 u64 end = 0;
> > > >
> > > > @@ -592,6 +603,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >                         continue;
> > > >
> > > >                 alloc_node_data(nid);
> > > > +               if (!end)
> > >
> > > Here comes the bug, since !end can not reach here.
> >
> > You are right. I am dumb. I've just completely missed that. Sigh.
> > Anyway, I think the code is more complicated than necessary and we can
> > simply drop the check. I do not think we really have to worry about
> > the start overflowing end. So the end patch should look as follows.
> > Btw. I believe it is better to pull alloc_node_data out of init_memory_less_node
> > because a) there is no need to duplicate the call and moreover we want
> > to pull node_set_online as well. The code also seems cleaner this way.
> >
> I have no strong opinion here.
> > Thanks for your testing and your patience with me here.
> Np.
> >
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f5408bf7..a5548fe668fb 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
> >
> >         node_data[nid] = nd;
> >         memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > -
> > -       node_set_online(nid);
> >  }
> >
> >  /**
> > @@ -527,6 +525,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> >         }
> >  }
> >
> > +static void __init init_memory_less_node(int nid)
> > +{
> > +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > +
> > +       free_area_init_node(nid, zones_size, 0, zholes_size);
> > +
> > +       /*
> > +        * All zonelists will be built later in start_kernel() after per cpu
> > +        * areas are initialized.
> > +        */
> > +}
> > +
> >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> >  {
> >         unsigned long uninitialized_var(pfn_align);
> > @@ -570,7 +581,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                 return -EINVAL;
> >
> >         /* Finally register nodes. */
> > -       for_each_node_mask(nid, node_possible_map) {
> > +       for_each_node(nid) {
> >                 u64 start = PFN_PHYS(max_pfn);
> >                 u64 end = 0;
> >
> > @@ -581,9 +592,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                         end = max(mi->blk[i].end, end);
> >                 }
> >
> > -               if (start >= end)
> > -                       continue;
> > -
> >                 /*
> >                  * Don't confuse VM with a node that doesn't have the
> >                  * minimum amount of memory:
> > @@ -592,6 +600,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >                         continue;
> >
> >                 alloc_node_data(nid);
> > +               if (!end)
> > +                       init_memory_less_node(nid);

I have some concerns about this. There are two issues. First, is this node
online? I do not see node_set_online() being called for it in this patch.
Second, if the node is online here, then free_area_init_node() ends up
being called twice: once via init_memory_less_node() and again from
free_area_init_nodes(). That looks like a critical design issue.

Thanks,
Pingfan
> > +               else
> > +                       node_set_online(nid);
> >         }
> >
> >         /* Dump memblock with node info and return. */
> > @@ -721,21 +733,6 @@ void __init x86_numa_init(void)
> >         numa_init(dummy_numa_init);
> >  }
> >
> > -static void __init init_memory_less_node(int nid)
> > -{
> > -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> > -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > -
> > -       /* Allocate and initialize node data. Memory-less node is now online.*/
> > -       alloc_node_data(nid);
> > -       free_area_init_node(nid, zones_size, 0, zholes_size);
> > -
> > -       /*
> > -        * All zonelists will be built later in start_kernel() after per cpu
> > -        * areas are initialized.
> > -        */
> > -}
> > -
> >  /*
> >   * Setup early cpu_to_node.
> >   *
> > @@ -763,9 +760,6 @@ void __init init_cpu_to_node(void)
> >                 if (node == NUMA_NO_NODE)
> >                         continue;
> >
> > -               if (!node_online(node))
> > -                       init_memory_less_node(node);
> > -
> >                 numa_set_node(cpu, node);
> >         }
> >  }
> > --
> Regret, it still has bug, and I got panic. Attached log.
>
> Thanks,
> Pingfan
Michal Hocko Dec. 17, 2018, 12:57 p.m. UTC | #48
On Thu 13-12-18 16:37:35, Pingfan Liu wrote:
[...]
> [    0.409667] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
> [    0.419885] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
> [    0.430386] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
> [    0.436352]     NODE_DATA(0) on node 5
> [    0.440124] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
> [    0.447104] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
> [    0.453110] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
> [    0.459060]     NODE_DATA(2) on node 5
> [    0.462855] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
> [    0.469809] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
> [    0.475788]     NODE_DATA(3) on node 5
> [    0.479554] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
> [    0.486536] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
> [    0.492518]     NODE_DATA(4) on node 5
> [    0.496280] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> [    0.503266] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
> [    0.509281] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
> [    0.515224]     NODE_DATA(6) on node 5
> [    0.518987] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
> [    0.525974] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
> [    0.531953]     NODE_DATA(7) on node 5
> [    0.535716] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]

OK, so we have allocated node_data for all NUMA nodes. Good!

> [    0.542839] Reserving 500MB of memory at 384MB for crashkernel (System RAM: 32314MB)
> [    0.550465] Zone ranges:
> [    0.552927]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.559081]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.565235]   Normal   [mem 0x0000000100000000-0x000000087effffff]
> [    0.571388]   Device   empty
> [    0.574249] Movable zone start for each node
> [    0.578498] Early memory node ranges
> [    0.582049]   node   1: [mem 0x0000000000001000-0x000000000008efff]
> [    0.588291]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
> [    0.594530]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
> [    0.600772]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
> [    0.607011]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
> [    0.613251]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
> [    0.619493]   node   5: [mem 0x0000000480000000-0x000000087effffff]
> [    0.626479] Zeroed struct page in unavailable ranges: 46490 pages
> [    0.626480] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
> [    0.655261] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[...]
> [    1.066324] Built 2 zonelists, mobility grouping off.  Total pages: 0

There are 2 zonelists built, but for some reason vm_total_pages is 0, and
that is clearly wrong because the allocation failure report (which later
leads to the NULL pointer dereference) shows there is quite a lot of free
memory. One reason might be that the zonelist for memory-less nodes is
initialized incorrectly. nr_free_zone_pages relies on the local node's
zonelist, so if the code happened to run on a CPU associated with Node 2
then we could indeed get vm_total_pages=0.
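
To make the vm_total_pages=0 hypothesis concrete, here is a minimal
user-space sketch in C. It is not kernel code: the model_* types and names
are hypothetical stand-ins, and the loop is only modeled on how
nr_free_zone_pages sums the managed pages above the high watermark over a
single zonelist.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ZONES 8

/* Hypothetical, simplified stand-in for a zone. */
struct model_zone {
	int nid;			/* node that owns the zone */
	unsigned long managed_pages;	/* pages handed to the page allocator */
	unsigned long high_wmark;	/* high watermark, excluded from the sum */
};

/* Hypothetical stand-in for one node's fallback zonelist. */
struct model_zonelist {
	const struct model_zone *zones[MODEL_MAX_ZONES];
	size_t nr;			/* number of valid entries in zones[] */
};

/*
 * Sum the pages above the high watermark for every zone reachable from
 * the given zonelist. An empty zonelist yields 0 no matter how much
 * memory the other nodes have.
 */
unsigned long model_nr_free_zone_pages(const struct model_zonelist *zl)
{
	unsigned long sum = 0;

	assert(zl->nr <= MODEL_MAX_ZONES);
	for (size_t i = 0; i < zl->nr; i++) {
		const struct model_zone *z = zl->zones[i];

		if (z->managed_pages > z->high_wmark)
			sum += z->managed_pages - z->high_wmark;
	}
	return sum;
}
```

Fed the four populated zones from the Mem-Info dump (managed: 15896kB,
1055520kB, 131072kB and 131072kB, i.e. 3974, 263880, 32768 and 32768 pages
of 4kB, with all watermarks at 0), the model returns 333390. Fed an empty
zonelist, which is what a wrongly initialized memory-less node would
contribute, it returns 0.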

> [    1.439440] Node 1 DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 2*128kB (U) 2*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 15896kB
> [    1.453482] Node 1 DMA32: 2*4kB (M) 1*8kB (M) 1*16kB (M) 2*32kB (M) 3*64kB (M) 2*128kB (M) 3*256kB (M) 3*512kB (M) 2*1024kB (M) 3*2048kB (M) 255*4096kB (M) = 1055520kB
> [    1.468388] Node 1 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB
> [    1.483211] Node 5 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB

I am investigating what the hell is going on here. Maybe the former hack
to re-initialize memory-less nodes is working around some ordering
issues.
Michal Hocko Dec. 17, 2018, 1:29 p.m. UTC | #49
On Thu 13-12-18 17:04:01, Pingfan Liu wrote:
[...]
> > > @@ -592,6 +600,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > >                         continue;
> > >
> > >                 alloc_node_data(nid);
> > > +               if (!end)
> > > +                       init_memory_less_node(nid);
> 
> Just have some opinion on this. Here is two issue. First, is this node
> online?


It shouldn't be as it doesn't have any memory.

> I do not see node_set_online() is called in this patch.

It is below for nodes with some memory.

> Second, if node is online here, then  init_memory_less_node->
> free_area_init_node is called duplicated when free_area_init_nodes().
> This should be a critical design issue.

I am still trying to wrap my head around the expected code flow here.
numa_init does the following for all CPUs within nr_cpu_ids (aka nr_cpus
aware).
		if (!node_online(nid))
			numa_clear_node(i);

I do not really understand why we do this. But it forces
init_cpu_to_node to call init_memory_less_node (with the current upstream
code), which marks the node online again so that zonelists are built
properly. My patch couldn't help in that respect because the node is
offline (as it should be IMHO).
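
The interaction described above can be sketched as a small user-space
model in C. This is an illustration under stated assumptions, not the
kernel implementation: the model_* names are hypothetical, the
firmware_node table stands in for whatever firmware-derived CPU-to-node
information init_cpu_to_node consults, and the online_memoryless flag
switches between the upstream behaviour (bring the memory-less node
online) and the patched behaviour (leave it offline).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NO_NODE	(-1)
#define MODEL_NR_NODES	8
#define MODEL_NR_CPUS	8

/* Hypothetical model of the state the two init steps operate on. */
struct model_numa {
	bool node_online[MODEL_NR_NODES];
	int firmware_node[MODEL_NR_CPUS];	/* assumed firmware CPU->node table */
	int cpu_to_node[MODEL_NR_CPUS];		/* early cpu_to_node mapping */
};

/* Modeled on the numa_init loop: drop mappings that point at offline nodes. */
void model_clear_offline_mappings(struct model_numa *m)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		int nid = m->cpu_to_node[cpu];

		if (nid == MODEL_NO_NODE)
			continue;
		assert(nid >= 0 && nid < MODEL_NR_NODES);
		if (!m->node_online[nid])
			m->cpu_to_node[cpu] = MODEL_NO_NODE;
	}
}

/*
 * Modeled on init_cpu_to_node: restore the mapping from the firmware
 * table. With online_memoryless set, an offline node is brought online
 * first (upstream behaviour); without it, the node stays offline.
 * Returns how many nodes were brought online.
 */
int model_init_cpu_to_node(struct model_numa *m, bool online_memoryless)
{
	int onlined = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		int nid = m->firmware_node[cpu];

		if (nid == MODEL_NO_NODE)
			continue;
		assert(nid >= 0 && nid < MODEL_NR_NODES);
		if (online_memoryless && !m->node_online[nid]) {
			m->node_online[nid] = true;
			onlined++;
		}
		m->cpu_to_node[cpu] = nid;
	}
	return onlined;
}
```

With only nodes 1 and 5 online and a CPU-to-node layout of
{0, 4, 1, 5, 2, 6, 3, 7} (taken from the setup_percpu lines and the lscpu
output earlier in the thread), the first step leaves only CPUs 2 and 3
mapped. The second step then brings six nodes online in the upstream
variant, while in the patched variant CPU 0 ends up mapped to node 0 even
though node 0 is still offline.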

So let's try another attempt with some larger surgery (on top of the
previous patch). It will also dump the zonelist after it is built for
each node. Let's see whether something more is lurking there.

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a5548fe668fb..eb7c905d5d86 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -525,19 +525,6 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..99252a0b6551 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2045,6 +2045,8 @@ extern void __init pagecache_init(void);
 extern void free_area_init(unsigned long * zones_size);
 extern void __init free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
+extern void init_memory_less_node(int nid);
+
 extern void free_initmem(void);
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..a5c035fd6307 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5234,6 +5234,8 @@ static void build_zonelists(pg_data_t *pgdat)
 	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
+	struct zone *zone;
+	struct zoneref *z;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -5259,6 +5261,11 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+
+	pr_info("node[%d] zonelist: ", pgdat->node_id);
+	for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
+		pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
+	pr_cont("\n");
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -5447,6 +5454,20 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 #endif
 }
 
+void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+	__build_all_zonelists(NODE_DATA(nid));
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 /* If zone is ZONE_MOVABLE but memory is mirrored, it is an overlapped init */
 static bool __meminit
 overlap_memmap_init(unsigned long zone, unsigned long *pfn)
Pingfan Liu Dec. 20, 2018, 7:19 a.m. UTC | #50
Hi Michal,

With this patch applied on top of the old one, I got the following
messages. Please see the attachment.

Thanks,
Pingfan

On Mon, Dec 17, 2018 at 9:29 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 13-12-18 17:04:01, Pingfan Liu wrote:
> [...]
> > > > @@ -592,6 +600,10 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >                         continue;
> > > >
> > > >                 alloc_node_data(nid);
> > > > +               if (!end)
> > > > +                       init_memory_less_node(nid);
> >
> > I have a comment on this. There are two issues. First, is this node
> > online?
>
>
> It shouldn't be as it doesn't have any memory.
>
> > I do not see node_set_online() is called in this patch.
>
> It is below for nodes with some memory.
>
> > Second, if the node is online here, then init_memory_less_node->
> > free_area_init_node is called again when free_area_init_nodes() runs.
> > This looks like a critical design issue.
>
> I am still trying to wrap my head around the expected code flow here.
> numa_init does the following for all CPUs within nr_cpu_ids (aka nr_cpus
> aware).
>                 if (!node_online(nid))
>                         numa_clear_node(i);
>
> I do not really understand why we do this. But it forces
> init_cpu_to_node to call init_memory_less_node (with the current upstream
> code), which marks the node online again so that zonelists are built
> properly. My patch couldn't help in that respect because the node is
> offline (as it should be IMHO).
>
> So let's try another attempt with some larger surgery (on top of the
> previous patch). It will also dump the zonelist after it is built for
> each node. Let's see whether something more is lurking there.
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index a5548fe668fb..eb7c905d5d86 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -525,19 +525,6 @@ static void __init numa_clear_kernel_node_hotplug(void)
>         }
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -       unsigned long zones_size[MAX_NR_ZONES] = {0};
> -       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -       free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -       /*
> -        * All zonelists will be built later in start_kernel() after per cpu
> -        * areas are initialized.
> -        */
> -}
> -
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
>         unsigned long uninitialized_var(pfn_align);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..99252a0b6551 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2045,6 +2045,8 @@ extern void __init pagecache_init(void);
>  extern void free_area_init(unsigned long * zones_size);
>  extern void __init free_area_init_node(int nid, unsigned long * zones_size,
>                 unsigned long zone_start_pfn, unsigned long *zholes_size);
> +extern void init_memory_less_node(int nid);
> +
>  extern void free_initmem(void);
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2ec9cc407216..a5c035fd6307 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5234,6 +5234,8 @@ static void build_zonelists(pg_data_t *pgdat)
>         int node, load, nr_nodes = 0;
>         nodemask_t used_mask;
>         int local_node, prev_node;
> +       struct zone *zone;
> +       struct zoneref *z;
>
>         /* NUMA-aware ordering of nodes */
>         local_node = pgdat->node_id;
> @@ -5259,6 +5261,11 @@ static void build_zonelists(pg_data_t *pgdat)
>
>         build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>         build_thisnode_zonelists(pgdat);
> +
> +       pr_info("node[%d] zonelist: ", pgdat->node_id);
> +       for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> +               pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> +       pr_cont("\n");
>  }
>
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> @@ -5447,6 +5454,20 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  #endif
>  }
>
> +void __init init_memory_less_node(int nid)
> +{
> +       unsigned long zones_size[MAX_NR_ZONES] = {0};
> +       unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +       free_area_init_node(nid, zones_size, 0, zholes_size);
> +       __build_all_zonelists(NODE_DATA(nid));
> +
> +       /*
> +        * All zonelists will be built later in start_kernel() after per cpu
> +        * areas are initialized.
> +        */
> +}
> +
>  /* If zone is ZONE_MOVABLE but memory is mirrored, it is an overlapped init */
>  static bool __meminit
>  overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> --
> Michal Hocko
> SUSE Labs
[    0.000000] Linux version 4.20.0-rc7+
[    0.000000] Command line: root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=500M rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 earlyprintk=ttyS0,115200n81
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000005c3d6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] printk: bootconsole [earlyser0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000008efff] usable
[    0.000000] reserve setup_data: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000000090000-0x000000000009ffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
[    0.000000] reserve setup_data: [mem 0x0000000000100070-0x000000005c3d6fff] usable
[    0.000000] reserve setup_data: [mem 0x000000005c3d7000-0x00000000643defff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000643df000-0x0000000068ff7fff] usable
[    0.000000] reserve setup_data: [mem 0x0000000068ff8000-0x000000006b4f7fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000006b4f8000-0x000000006c327fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000006c328000-0x000000006c527fff] ACPI data
[    0.000000] reserve setup_data: [mem 0x000000006c528000-0x000000006fffffff] usable
[    0.000000] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] reserve setup_data: [mem 0x000000087f000000-0x000000087fffffff] reserved
[    0.000000] efi: EFI v2.50 by Dell Inc.
[    0.000000] efi:  ACPI=0x6c527000  ACPI 2.0=0x6c527014  SMBIOS=0x6afde000  SMBIOS 3.0=0x6afdc000 
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2095.973 MHz processor
[    0.000066] last_pfn = 0x87f000 max_arch_pfn = 0x400000000
[    0.006389] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
Memory KASLR using RDRAND RDTSC...
[    0.016603] last_pfn = 0x70000 max_arch_pfn = 0x400000000
[    0.027381] Using GB pages for direct mapping
[    0.031993] Secure boot could not be determined
[    0.036341] RAMDISK: [mem 0x87a171000-0x87cdfffff]
[    0.041121] ACPI: Early table checksum verification disabled
[    0.046749] ACPI: RSDP 0x000000006C527014 000024 (v02 DELL  )
[    0.052463] ACPI: XSDT 0x000000006C5260E8 0000C4 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.060961] ACPI: FACP 0x000000006C516000 000114 (v06 DELL   PE_SC3   00000002 DELL 00000001)
[    0.069455] ACPI: DSDT 0x000000006C505000 00D302 (v02 DELL   PE_SC3   00000002 DELL 00000001)
[    0.077944] ACPI: FACS 0x000000006C2F1000 000040
[    0.082539] ACPI: SSDT 0x000000006C525000 0000D2 (v02 DELL   PE_SC3   00000002 MSFT 04000000)
[    0.091031] ACPI: BERT 0x000000006C524000 000030 (v01 DELL   BERT     00000001 DELL 00000001)
[    0.099525] ACPI: HEST 0x000000006C523000 0006DC (v01 DELL   HEST     00000001 DELL 00000001)
[    0.108019] ACPI: SSDT 0x000000006C522000 0001C4 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.116511] ACPI: SRAT 0x000000006C521000 0002D0 (v03 DELL   PE_SC3   00000001 AMD  00000001)
[    0.125005] ACPI: MSCT 0x000000006C520000 0000A6 (v01 DELL   PE_SC3   00000000 AMD  00000001)
[    0.133500] ACPI: SLIT 0x000000006C51F000 00006C (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.141994] ACPI: CRAT 0x000000006C51C000 002210 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.150486] ACPI: CDIT 0x000000006C51B000 000068 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.158980] ACPI: SSDT 0x000000006C51A000 0003C6 (v02 DELL   Tpm2Tabl 00001000 INTL 20170119)
[    0.167474] ACPI: TPM2 0x000000006C519000 000038 (v04 DELL   PE_SC3   00000002 DELL 00000001)
[    0.175969] ACPI: EINJ 0x000000006C518000 000150 (v01 DELL   PE_SC3   00000001 AMD  00000001)
[    0.184461] ACPI: SLIC 0x000000006C517000 000024 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.192955] ACPI: HPET 0x000000006C515000 000038 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.201449] ACPI: APIC 0x000000006C514000 0004B2 (v03 DELL   PE_SC3   00000002 DELL 00000001)
[    0.209943] ACPI: MCFG 0x000000006C513000 00003C (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.218435] ACPI: SSDT 0x000000006C504000 0005CA (v02 DELL   xhc_port 00000001 INTL 20170119)
[    0.226929] ACPI: IVRS 0x000000006C503000 000390 (v02 DELL   PE_SC3   00000001 AMD  00000000)
[    0.235424] ACPI: SSDT 0x000000006C501000 001658 (v01 AMD    CPMCMN   00000001 INTL 20170119)
[    0.243977] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.248338] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.252756] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.257177] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.261598] SRAT: PXM 1 -> APIC 0x10 -> Node 1
[    0.266018] SRAT: PXM 1 -> APIC 0x11 -> Node 1
[    0.270437] SRAT: PXM 1 -> APIC 0x18 -> Node 1
[    0.274857] SRAT: PXM 1 -> APIC 0x19 -> Node 1
[    0.279278] SRAT: PXM 2 -> APIC 0x20 -> Node 2
[    0.283699] SRAT: PXM 2 -> APIC 0x21 -> Node 2
[    0.288117] SRAT: PXM 2 -> APIC 0x28 -> Node 2
[    0.292538] SRAT: PXM 2 -> APIC 0x29 -> Node 2
[    0.296959] SRAT: PXM 3 -> APIC 0x30 -> Node 3
[    0.301380] SRAT: PXM 3 -> APIC 0x31 -> Node 3
[    0.305801] SRAT: PXM 3 -> APIC 0x38 -> Node 3
[    0.310219] SRAT: PXM 3 -> APIC 0x39 -> Node 3
[    0.314640] SRAT: PXM 4 -> APIC 0x40 -> Node 4
[    0.319060] SRAT: PXM 4 -> APIC 0x41 -> Node 4
[    0.323481] SRAT: PXM 4 -> APIC 0x48 -> Node 4
[    0.327899] SRAT: PXM 4 -> APIC 0x49 -> Node 4
[    0.332320] SRAT: PXM 5 -> APIC 0x50 -> Node 5
[    0.336741] SRAT: PXM 5 -> APIC 0x51 -> Node 5
[    0.341162] SRAT: PXM 5 -> APIC 0x58 -> Node 5
[    0.345580] SRAT: PXM 5 -> APIC 0x59 -> Node 5
[    0.350001] SRAT: PXM 6 -> APIC 0x60 -> Node 6
[    0.354422] SRAT: PXM 6 -> APIC 0x61 -> Node 6
[    0.358843] SRAT: PXM 6 -> APIC 0x68 -> Node 6
[    0.363261] SRAT: PXM 6 -> APIC 0x69 -> Node 6
[    0.367682] SRAT: PXM 7 -> APIC 0x70 -> Node 7
[    0.372103] SRAT: PXM 7 -> APIC 0x71 -> Node 7
[    0.376521] SRAT: PXM 7 -> APIC 0x78 -> Node 7
[    0.380942] SRAT: PXM 7 -> APIC 0x79 -> Node 7
[    0.385365] ACPI: SRAT: Node 1 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.391344] ACPI: SRAT: Node 1 PXM 1 [mem 0x00100000-0x7fffffff]
[    0.397323] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x47fffffff]
[    0.403476] ACPI: SRAT: Node 5 PXM 5 [mem 0x480000000-0x87fffffff]
[    0.409637] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.419858] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
[    0.430356] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
[    0.436325]     NODE_DATA(0) on node 5
[    0.440092] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[    0.447078] node[0] zonelist: 
[    0.450106] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
[    0.456114] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
[    0.462064]     NODE_DATA(2) on node 5
[    0.465852] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.472813] node[2] zonelist: 
[    0.475846] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
[    0.481827]     NODE_DATA(3) on node 5
[    0.485590] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
[    0.492575] node[3] zonelist: 
[    0.495608] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
[    0.501587]     NODE_DATA(4) on node 5
[    0.505349] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
[    0.512334] node[4] zonelist: 
[    0.515370] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
[    0.521384] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
[    0.527329]     NODE_DATA(6) on node 5
[    0.531091] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
[    0.538076] node[6] zonelist: 
[    0.541109] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
[    0.547090]     NODE_DATA(7) on node 5
[    0.550851] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]
[    0.557836] node[7] zonelist: 
[    0.561005] Reserving 500MB of memory at 384MB for crashkernel (System RAM: 32314MB)
[    0.568633] Zone ranges:
[    0.571098]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.577250]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.583403]   Normal   [mem 0x0000000100000000-0x000000087effffff]
[    0.589557]   Device   empty
[    0.592418] Movable zone start for each node
[    0.596666] Early memory node ranges
[    0.600217]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.606459]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.612698]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.618940]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.625179]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.631419]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.637660]   node   5: [mem 0x0000000480000000-0x000000087effffff]
[    0.644645] Zeroed struct page in unavailable ranges: 46490 pages
[    0.644646] Initmem setup node 1 [mem 0x0000000000001000-0x000000047fffffff]
[    0.672700] Initmem setup node 5 [mem 0x0000000480000000-0x000000087effffff]
[    0.681057] ACPI: PM-Timer IO Port: 0x408
[    0.684905] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.690808] IOAPIC[0]: apic_id 128, version 33, address 0xfec00000, GSI 0-23
[    0.697804] IOAPIC[1]: apic_id 129, version 33, address 0xfd880000, GSI 24-55
[    0.704913] IOAPIC[2]: apic_id 130, version 33, address 0xea900000, GSI 56-87
[    0.712019] IOAPIC[3]: apic_id 131, version 33, address 0xdd900000, GSI 88-119
[    0.719211] IOAPIC[4]: apic_id 132, version 33, address 0xd0900000, GSI 120-151
[    0.726493] IOAPIC[5]: apic_id 133, version 33, address 0xc3900000, GSI 152-183
[    0.733775] IOAPIC[6]: apic_id 134, version 33, address 0xb6900000, GSI 184-215
[    0.741053] IOAPIC[7]: apic_id 135, version 33, address 0xa9900000, GSI 216-247
[    0.748335] IOAPIC[8]: apic_id 136, version 33, address 0x9c900000, GSI 248-279
[    0.755611] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.761938] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.768442] Using ACPI (MADT) for SMP configuration information
[    0.774331] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.779454] smpboot: Allowing 128 CPUs, 96 hotplug CPUs
[    0.784670] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.791059] PM: Registered nosave memory: [mem 0x0008f000-0x0008ffff]
[    0.797471] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.803884] PM: Registered nosave memory: [mem 0x00100000-0x00100fff]
[    0.810299] PM: Registered nosave memory: [mem 0x5c3d7000-0x643defff]
[    0.816714] PM: Registered nosave memory: [mem 0x68ff8000-0x6b4f7fff]
[    0.823125] PM: Registered nosave memory: [mem 0x6b4f8000-0x6c327fff]
[    0.829538] PM: Registered nosave memory: [mem 0x6c328000-0x6c527fff]
[    0.835953] PM: Registered nosave memory: [mem 0x70000000-0x8fffffff]
[    0.842366] PM: Registered nosave memory: [mem 0x90000000-0xfec0ffff]
[    0.848780] PM: Registered nosave memory: [mem 0xfec10000-0xfec10fff]
[    0.855194] PM: Registered nosave memory: [mem 0xfec11000-0xfed7ffff]
[    0.861608] PM: Registered nosave memory: [mem 0xfed80000-0xfed80fff]
[    0.868021] PM: Registered nosave memory: [mem 0xfed81000-0xffffffff]
[    0.874437] [mem 0x90000000-0xfec0ffff] available for PCI devices
[    0.880504] Booting paravirtualized kernel on bare hardware
[    0.886053] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    1.004227] random: get_random_bytes called from start_kernel+0x9b/0x52e with crng_init=0
[    1.012234] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:8
[    1.020744] setup_percpu: cpu 0 has no node 0 or node-local memory
[    1.027245] setup_percpu: cpu 1 has no node 4 or node-local memory
[    1.039287] setup_percpu: cpu 4 has no node 2 or node-local memory
[    1.045667] setup_percpu: cpu 5 has no node 6 or node-local memory
[    1.052041] setup_percpu: cpu 6 has no node 3 or node-local memory
[    1.058421] setup_percpu: cpu 7 has no node 7 or node-local memory
[    1.067658] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u262144
[    1.075692] node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal 
[    1.081376] node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA 
[    1.087206] Built 2 zonelists, mobility grouping off.  Total pages: 0
[    1.093597] Policy zone: Normal
[    1.096722] Kernel command line: root=/dev/mapper/xx_dell--per7425--03-root ro crashkernel=500M rd.lvm.lv=xx_dell-per7425-03/root rd.lvm.lv=xx_dell-per7425-03/swap console=ttyS0,115200n81 earlyprintk=ttyS0,115200n81
[    1.140827] Memory: 1333560K/33089944K available (12292K kernel code, 2066K rwdata, 3756K rodata, 2352K init, 6524K bss, 1202452K reserved, 0K cma-reserved)
[    1.154706] swapper: page allocation failure: order:0, mode:0x4000(__GFP_COMP), nodemask=(null)
[    1.163300] swapper cpuset=(null) mems_allowed=0-1023
[    1.168327] CPU: 0 PID: 0 Comm: swapper Not tainted 4.20.0-rc7+ #16
[    1.174564] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    1.182107] Call Trace:
[    1.184550]  dump_stack+0x5c/0x7b
[    1.187832]  warn_alloc+0xf5/0x180
[    1.191212]  ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
[    1.197708]  __alloc_pages_slowpath+0x84f/0xa0d
[    1.202218]  ? pcpu_block_refresh_hint+0x77/0xa0
[    1.206807]  __alloc_pages_nodemask+0x299/0x2e0
[    1.211319]  new_slab+0x425/0x570
[    1.214608]  ___slab_alloc+0x375/0x540
[    1.218339]  ? bootstrap+0x1b/0xcb
[    1.221713]  ? __kmem_cache_create+0x2b/0x150
[    1.226050]  ? printk+0x58/0x6f
[    1.229166]  ? bootstrap+0x1b/0xcb
[    1.232548]  __slab_alloc+0x1c/0x38
[    1.236014]  kmem_cache_alloc+0x192/0x1c0
[    1.240001]  bootstrap+0x1b/0xcb
[    1.243213]  kmem_cache_init+0x8d/0x109
[    1.247026]  start_kernel+0x26c/0x52e
[    1.250662]  ? set_init_arg+0x55/0x55
[    1.254306]  secondary_startup_64+0xa4/0xb0
[    1.258464] Mem-Info:
[    1.260719] active_anon:0 inactive_anon:0 isolated_anon:0
[    1.260719]  active_file:0 inactive_file:0 isolated_file:0
[    1.260719]  unevictable:0 dirty:0 writeback:0 unstable:0
[    1.260719]  slab_reclaimable:0 slab_unreclaimable:2
[    1.260719]  mapped:0 shmem:0 pagetables:0 bounce:0
[    1.260719]  free:333388 free_pcp:0 free_cma:0
[    1.291140] Node 1 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[    1.316533] Node 5 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[    1.341925] Node 1 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.367319] lowmem_reserve[]: 0 0 0 0 0
[    1.371132] Node 1 DMA32 free:1055520kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1633056kB managed:1055520kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.397221] lowmem_reserve[]: 0 0 0 0 0
[    1.401033] Node 1 Normal free:131068kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:14680064kB managed:131072kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.427122] lowmem_reserve[]: 0 0 0 0 0
[    1.430934] Node 5 Normal free:131068kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:16760832kB managed:131072kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[    1.457023] lowmem_reserve[]: 0 0 0 0 0
[    1.460837] Node 1 DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 2*128kB (U) 2*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 15896kB
[    1.474876] Node 1 DMA32: 2*4kB (M) 1*8kB (M) 1*16kB (M) 2*32kB (M) 3*64kB (M) 2*128kB (M) 3*256kB (M) 3*512kB (M) 2*1024kB (M) 3*2048kB (M) 255*4096kB (M) = 1055520kB
[    1.489784] Node 1 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB
[    1.504603] Node 5 Normal: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 31*4096kB (M) = 131068kB
[    1.519425] 0 total pagecache pages
[    1.522894] 0 pages in swap cache
[    1.526185] Swap cache stats: add 0, delete 0, find 0/0
[    1.531385] Free swap  = 0kB
[    1.534245] Total swap = 0kB
[    1.537106] 8272486 pages RAM
[    1.540052] 0 pages HighMem/MovableOnly
[    1.543866] 7939096 pages reserved
[    1.547247] 0 pages cma reserved
[    1.550454] 0 pages hwpoisoned
[    1.553489] SLUB: Unable to allocate memory on node -1, gfp=0x408000(GFP_NOWAIT|__GFP_ZERO)
[    1.561808]   cache: kmem_cache, object size: 392, buffer size: 448, default order: 2, min order: 0
[    1.570822]   node 1: slabs: 0, objs: 0, free: 0
[    1.575415]   node 5: slabs: 0, objs: 0, free: 0
[    1.580023] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[    1.587810] PGD 0 P4D 0 
[    1.590323] Oops: 0002 [#1] SMP NOPTI
[    1.593962] CPU: 0 PID: 0 Comm: swapper Not tainted 4.20.0-rc7+ #16
[    1.600202] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    1.607744] RIP: 0010:bootstrap+0x2e/0xcb
[    1.611731] Code: ff 55 48 89 fd 48 8b 3d 59 c0 42 00 be 00 80 40 00 53 e8 de c8 aa fe 48 89 c3 48 8b 05 44 c0 42 00 48 89 ee 48 89 df 8b 48 1c <f3> a4 65 8b 35 22 b8 65 5d 48 89 df e8 7a d7 aa fe 44 8b 05 0b 6b
[    1.630452] RSP: 0000:ffffffffa2403ed0 EFLAGS: 00010046
[    1.635652] RAX: ffffffffa2ae6600 RBX: 0000000000000000 RCX: 0000000000000188
[    1.642756] RDX: 00000000000001c0 RSI: ffffffffa2ae6600 RDI: 0000000000000000
[    1.649866] RBP: ffffffffa2ae6600 R08: 0000000030203a65 R09: 000000000000014f
[    1.656972] R10: 736a626f202c3020 R11: 657266202c30203a R12: ffffffffa2a2e900
[    1.664079] R13: ffffffffa2a492c0 R14: 0000000000000000 R15: 0000000000000000
[    1.671187] FS:  0000000000000000(0000) GS:ffff8e4029c00000(0000) knlGS:0000000000000000
[    1.679245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.684966] CR2: 0000000000000000 CR3: 000000087e00a000 CR4: 00000000000406b0
[    1.692074] Call Trace:
[    1.694500]  kmem_cache_init+0x8d/0x109
[    1.698313]  start_kernel+0x26c/0x52e
[    1.701954]  ? set_init_arg+0x55/0x55
[    1.705594]  secondary_startup_64+0xa4/0xb0
[    1.709756] Modules linked in:
[    1.712787] CR2: 0000000000000000
[    1.716122] ---[ end trace 379bf944903e3d0a ]---
[    1.720674] RIP: 0010:bootstrap+0x2e/0xcb
[    1.724661] Code: ff 55 48 89 fd 48 8b 3d 59 c0 42 00 be 00 80 40 00 53 e8 de c8 aa fe 48 89 c3 48 8b 05 44 c0 42 00 48 89 ee 48 89 df 8b 48 1c <f3> a4 65 8b 35 22 b8 65 5d 48 89 df e8 7a d7 aa fe 44 8b 05 0b 6b
[    1.743381] RSP: 0000:ffffffffa2403ed0 EFLAGS: 00010046
[    1.748581] RAX: ffffffffa2ae6600 RBX: 0000000000000000 RCX: 0000000000000188
[    1.755688] RDX: 00000000000001c0 RSI: ffffffffa2ae6600 RDI: 0000000000000000
[    1.762794] RBP: ffffffffa2ae6600 R08: 0000000030203a65 R09: 000000000000014f
[    1.769904] R10: 736a626f202c3020 R11: 657266202c30203a R12: ffffffffa2a2e900
[    1.777010] R13: ffffffffa2a492c0 R14: 0000000000000000 R15: 0000000000000000
[    1.784118] FS:  0000000000000000(0000) GS:ffff8e4029c00000(0000) knlGS:0000000000000000
[    1.792178] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.797896] CR2: 0000000000000000 CR3: 000000087e00a000 CR4: 00000000000406b0
[    1.805006] Kernel panic - not syncing: Fatal exception
[    1.810277] ---[ end Kernel panic - not syncing: Fatal exception ]---
Michal Hocko Dec. 20, 2018, 9:19 a.m. UTC | #51
On Thu 20-12-18 15:19:39, Pingfan Liu wrote:
> Hi Michal,
> 
> With this patch applied on top of the old one, I got the following
> messages. Please see the attachment.
[...]
> [    0.409637] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
> [    0.419858] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
> [    0.430356] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
> [    0.436325]     NODE_DATA(0) on node 5
> [    0.440092] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
> [    0.447078] node[0] zonelist: 
> [    0.450106] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
> [    0.456114] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
> [    0.462064]     NODE_DATA(2) on node 5
> [    0.465852] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
> [    0.472813] node[2] zonelist: 
> [    0.475846] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
> [    0.481827]     NODE_DATA(3) on node 5
> [    0.485590] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
> [    0.492575] node[3] zonelist: 
> [    0.495608] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
> [    0.501587]     NODE_DATA(4) on node 5
> [    0.505349] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> [    0.512334] node[4] zonelist: 
> [    0.515370] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
> [    0.521384] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
> [    0.527329]     NODE_DATA(6) on node 5
> [    0.531091] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
> [    0.538076] node[6] zonelist: 
> [    0.541109] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
> [    0.547090]     NODE_DATA(7) on node 5
> [    0.550851] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]
> [    0.557836] node[7] zonelist: 

OK, so it is clear that building zonelists this early is not going to
fly. We do not have the complete information yet. I am not sure right
now at which point we get it, but I suspect that we either need to move
that initialization to an earlier stage or we have to reconsider whether
the phase when we build zonelists really needs to consider only online
numa nodes.
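
The empty "node[N] zonelist:" lines can be reproduced with a toy userspace model in C. This is a sketch under stated assumptions, not the kernel's build_zonelists(): the toy_* names and the has_memory flag are hypothetical, the builder is assumed to visit only nodes already flagged as having memory, and remote nodes are ordered by ascending node id instead of by NUMA distance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES };
enum { NR_NODES = 8, MAX_REFS = NR_NODES * NR_ZONES };

struct zoneref {
	int nid;	/* node the zone belongs to */
	int zone;	/* ZONE_DMA, ZONE_DMA32 or ZONE_NORMAL */
};

struct toy_node {
	bool has_memory;		/* set once memory is registered */
	bool populated[NR_ZONES];	/* which zones hold pages */
};

static struct toy_node nodes[NR_NODES];

/* Append the populated zones of one node, highest zone first. */
static size_t toy_append_node(int nid, struct zoneref *out, size_t n)
{
	for (int z = NR_ZONES - 1; z >= 0; z--) {
		if (!nodes[nid].populated[z])
			continue;
		assert(n < MAX_REFS);
		out[n].nid = nid;
		out[n].zone = z;
		n++;
	}
	return n;
}

/*
 * Build the fallback list for node 'local' into 'out' (MAX_REFS entries).
 * Only nodes flagged has_memory are considered: local node first, then the
 * remaining nodes in ascending id order. Returns the number of entries.
 */
size_t toy_build_zonelist(int local, struct zoneref *out)
{
	size_t n = 0;

	assert(local >= 0 && local < NR_NODES);
	if (nodes[local].has_memory)
		n = toy_append_node(local, out, n);
	for (int nid = 0; nid < NR_NODES; nid++)
		if (nid != local && nodes[nid].has_memory)
			n = toy_append_node(nid, out, n);
	return n;
}

/* Register memory the way the test box reports it: nodes 1 and 5 only. */
void toy_register_memory(void)
{
	nodes[1].has_memory = true;
	nodes[1].populated[ZONE_DMA] = true;
	nodes[1].populated[ZONE_DMA32] = true;
	nodes[1].populated[ZONE_NORMAL] = true;

	nodes[5].has_memory = true;
	nodes[5].populated[ZONE_NORMAL] = true;
}
```

In this model, toy_build_zonelist(0, ...) called before toy_register_memory() returns zero entries, like the early dumps for the memoryless nodes. Called for node 1 afterwards it yields 1:Normal 1:DMA32 1:DMA 5:Normal, the same order as the later dump in the log. A list built early stays empty unless it is rebuilt once the memory information is available.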

[...]
> [    1.067658] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u262144
> [    1.075692] node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal 
> [    1.081376] node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA 

I hope to get to this before I leave for Christmas vacation; if not, I
will look into it afterwards.

Thanks!
Michal Hocko Jan. 8, 2019, 2:34 p.m. UTC | #52
On Thu 20-12-18 10:19:34, Michal Hocko wrote:
> On Thu 20-12-18 15:19:39, Pingfan Liu wrote:
> > Hi Michal,
> > 
> > With this patch applied on top of the old one, I got the following message.
> > Please get it from the attachment.
> [...]
> > [    0.409637] NUMA: Node 1 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
> > [    0.419858] NUMA: Node 1 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x47fffffff] -> [mem 0x00000000-0x47fffffff]
> > [    0.430356] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
> > [    0.436325]     NODE_DATA(0) on node 5
> > [    0.440092] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.447078] node[0] zonelist: 
> > [    0.450106] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fffffff]
> > [    0.456114] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
> > [    0.462064]     NODE_DATA(2) on node 5
> > [    0.465852] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.472813] node[2] zonelist: 
> > [    0.475846] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
> > [    0.481827]     NODE_DATA(3) on node 5
> > [    0.485590] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.492575] node[3] zonelist: 
> > [    0.495608] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
> > [    0.501587]     NODE_DATA(4) on node 5
> > [    0.505349] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.512334] node[4] zonelist: 
> > [    0.515370] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
> > [    0.521384] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
> > [    0.527329]     NODE_DATA(6) on node 5
> > [    0.531091] Initmem setup node 6 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.538076] node[6] zonelist: 
> > [    0.541109] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
> > [    0.547090]     NODE_DATA(7) on node 5
> > [    0.550851] Initmem setup node 7 [mem 0x0000000000000000-0x0000000000000000]
> > [    0.557836] node[7] zonelist: 
> 
> OK, so it is clear that building zonelists this early is not going to
> fly. We do not have the complete information yet. I am not sure at this
> moment when we get that, but I suspect that we either need to move that
> initialization to an earlier stage or we have to reconsider whether the
> phase when we build zonelists really needs to consider only online NUMA
> nodes.
> 
> [...]
> > [    1.067658] percpu: Embedded 46 pages/cpu @(____ptrval____) s151552 r8192 d28672 u262144
> > [    1.075692] node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal 
> > [    1.081376] node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA 
> 
> I hope to get to this before I leave for Christmas vacation; if not, I
> will stare into it afterwards.

I am sorry that I didn't get to this sooner, but I've got another idea. I
concluded that the whole dance is simply bogus and we should treat
memoryless nodes, well, as nodes with no memory ranges rather than
special-casing them. Could you give the following a spin please?

---
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..0e79445cfd85 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -535,6 +533,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	/* Account for nodes with cpus and no memory */
 	node_possible_map = numa_nodes_parsed;
 	numa_nodemask_from_meminfo(&node_possible_map, mi);
+	pr_info("parsed=%*pbl, possible=%*pbl\n", nodemask_pr_args(&numa_nodes_parsed), nodemask_pr_args(&node_possible_map));
 	if (WARN_ON(nodes_empty(node_possible_map)))
 		return -EINVAL;
 
@@ -570,7 +569,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 
 	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
+	for_each_node_mask(nid, numa_nodes_parsed) {
 		u64 start = PFN_PHYS(max_pfn);
 		u64 end = 0;
 
@@ -581,9 +580,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start >= end)
-			continue;
-
 		/*
 		 * Don't confuse VM with a node that doesn't have the
 		 * minimum amount of memory:
@@ -592,6 +588,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (end)
+			node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +719,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +746,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..52e54d16662a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5234,6 +5234,8 @@ static void build_zonelists(pg_data_t *pgdat)
 	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
+	struct zone *zone;
+	struct zoneref *z;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -5259,6 +5261,11 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+
+	pr_info("node[%d] zonelist: ", pgdat->node_id);
+	for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
+		pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
+	pr_cont("\n");
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -5361,10 +5368,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
-			build_zonelists(pgdat);
+			if (pgdat)
+				build_zonelists(pgdat);
 		}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -6644,10 +6652,8 @@ static unsigned long __init find_min_pfn_for_node(int nid)
 	for_each_mem_pfn_range(i, nid, &start_pfn, NULL, NULL)
 		min_pfn = min(min_pfn, start_pfn);
 
-	if (min_pfn == ULONG_MAX) {
-		pr_warn("Could not find start_pfn for node %d\n", nid);
+	if (min_pfn == ULONG_MAX)
 		return 0;
-	}
 
 	return min_pfn;
 }
@@ -6991,8 +6997,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
 	zero_resv_unavail();
-	for_each_online_node(nid) {
+	for_each_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
+
+		if (!pgdat)
+			continue;
+
 		free_area_init_node(nid, NULL,
 				find_min_pfn_for_node(nid), NULL);
Pingfan Liu Jan. 9, 2019, 3:13 a.m. UTC | #53
On Tue, Jan 8, 2019 at 10:34 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> I am sorry that I didn't get to this sooner, but I've got another idea. I
> concluded that the whole dance is simply bogus and we should treat
> memoryless nodes, well, as nodes with no memory ranges rather than
> special-casing them. Could you give the following a spin please?
>

Sure, I have queued a request to borrow the remote machine. It will take some time.

Regards,
Pingfan
Pingfan Liu Jan. 11, 2019, 3:12 a.m. UTC | #54
On Tue, Jan 8, 2019 at 10:34 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> I am sorry that I didn't get to this sooner, but I've got another idea. I
> concluded that the whole dance is simply bogus and we should treat
> memoryless nodes, well, as nodes with no memory ranges rather than
> special-casing them. Could you give the following a spin please?
> [...]

Hi, this patch works! Feel free to add my Tested-by.

Best Regards
Pingfan
Michal Hocko Jan. 11, 2019, 9:23 a.m. UTC | #55
On Fri 11-01-19 11:12:45, Pingfan Liu wrote:
[...]
> Hi, this patch works! Feel free to use tested-by me

Thanks a lot for your testing! Now it is time to seriously think whether
this is the right thing to do and sync all other arches that might have
the same problem. I will take care of it. Thanks for your patience and
effort. I will post something, hopefully soon, in a separate thread as
this one has grown too large already.

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 76f8db0..8324953 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -453,6 +453,8 @@  static inline int gfp_zonelist(gfp_t flags)
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+	if (unlikely(!node_online(nid)))
+		nid = first_online_node;
 	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }