
[v2,1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes

Message ID 20211207224013.880775-2-npache@redhat.com (mailing list archive)
State New
Series Don't allocate pages on an offline node

Commit Message

Nico Pache Dec. 7, 2021, 10:40 p.m. UTC
We have run into a panic caused by a shrinker allocation being attempted
on an offlined node.

Our crash analysis has determined that the issue originates from trying
to allocate pages on an offlined node in expand_one_shrinker_info(). This
function makes the incorrect assumption that we can allocate on any node.
To correct this, we make sure the node is online before attempting an
allocation. If it is not online, choose the closest online node.

This assumption can lead to an incorrect address being assigned to ac->zonelist
in the following callchain:
	__alloc_pages
	-> prepare_alloc_pages
	 -> node_zonelist

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
If the node is not online, NODE_DATA(nid) is NULL, so the return value of
node_zonelist() evaluates to an invalid pointer:
NULL + offsetof(pg_data_t, node_zonelists) + gfp_zonelist(flags) * sizeof(struct zonelist)

This address is then dereferenced further down the callchain in:
	prepare_alloc_pages
	-> first_zones_zonelist
	 -> next_zones_zonelist
	  -> zonelist_zone_idx

static inline int zonelist_zone_idx(struct zoneref *zoneref)
{
        return zoneref->zone_idx;
}
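
A minimal userspace sketch of the bad pointer arithmetic may help; the
fake_pgdat layout and its 0x2080 offset below are assumptions read back from
RDI/CR2 in the oops, not the real pg_data_t layout:

#include <stddef.h>
#include <stdio.h>

/* Stand-in for pg_data_t; pad places node_zonelists at offset 0x2080. */
struct fake_pgdat {
        char pad[0x2080];
        void *node_zonelists[2];
};

int main(void)
{
        unsigned long pgdat = 0;   /* NODE_DATA(nid) of a never-onlined node */
        unsigned long zonelist = pgdat +
                offsetof(struct fake_pgdat, node_zonelists);

        /* Prints 0x2080 and 0x2088, matching RDI and CR2 in the oops. */
        printf("bogus zonelist: 0x%lx, zone_idx read faults at: 0x%lx\n",
               zonelist, zonelist + 8);
        return 0;
}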

This leads the system to panic:

[  362.132167] BUG: unable to handle page fault for address: 0000000000002088
[  362.139854] #PF: supervisor read access in kernel mode
[  362.145595] #PF: error_code(0x0000) - not-present page
[  362.151334] PGD 0 P4D 0
[  362.154166] Oops: 0000 [#1] SMP PTI
[  362.158064] CPU: 62 PID: 16612 Comm: stress-ng Kdump:
[  362.167876] Hardware name:
[  362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150
[  362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246
[  362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081
[  362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[  362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4
[  362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0
[  362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078
[  362.253285] FS:  00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000) knlGS:0000000000000000
[  362.262322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0
[  362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  362.292644] PKRU: 55555554
[  362.295669] Call Trace:
[  362.298402]  __alloc_pages+0x9d/0x210
[  362.302501]  kmalloc_large_node+0x40/0x90
[  362.306988]  __kmalloc_node+0x3ac/0x480
[  362.311279]  kvmalloc_node+0x46/0x80
[  362.315276]  expand_one_shrinker_info+0x84/0x190
[  362.320437]  prealloc_shrinker+0x166/0x1c0
[  362.325015]  alloc_super+0x2ba/0x330
[  362.329011]  ? __fput_sync+0x30/0x30
[  362.333003]  ? set_anon_super+0x40/0x40
[  362.337288]  sget_fc+0x6c/0x2f0
[  362.340798]  ? mqueue_create+0x20/0x20
[  362.344992]  get_tree_keyed+0x2f/0xc0
[  362.349086]  vfs_get_tree+0x25/0xb0
[  362.352982]  fc_mount+0xe/0x30
[  362.356397]  mq_init_ns+0x105/0x1a0
[  362.360291]  copy_ipcs+0x129/0x220
[  362.364093]  create_new_namespaces+0xa1/0x2e0
[  362.368960]  unshare_nsproxy_namespaces+0x55/0xa0
[  362.374216]  ksys_unshare+0x198/0x380
[  362.378310]  __x64_sys_unshare+0xe/0x20
[  362.382595]  do_syscall_64+0x3b/0x90
[  362.386597]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  362.392247] RIP: 0033:0x7fbc57d14d7b
[  362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX: 0000000000000110
[  362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b
[  362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000
[  362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000
[  362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90
[  362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10

Fixes: 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node")
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Dec. 7, 2021, 11:34 p.m. UTC | #1
On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
> +++ b/mm/vmscan.c
> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int size = map_size + defer_size;
>  
>  	for_each_node(nid) {
> +		int tmp = nid;
>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
>  
> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> +		if(!node_online(nid))
> +			tmp = numa_mem_id();
> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>  		if (!new)

Why should this be fixed here and not in, say, kvmalloc_node()?
Andrew Morton Dec. 7, 2021, 11:44 p.m. UTC | #2
On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:

> We have run into a panic caused by a shrinker allocation being attempted
> on an offlined node.
>
> Our crash analysis has determined that the issue originates from trying
> to allocate pages on an offlined node in expand_one_shrinker_info. This
> function makes the incorrect assumption that we can allocate on any node.
> To correct this we make sure the node is online before tempting an
> allocation. If it is not online choose the closest node.

This isn't fully accurate, is it?  We could allocate on a node which is
presently offline but which was previously onlined, by testing
NODE_DATA(nid).

It isn't entirely clear to me from the v1 discussion why this approach
isn't being taken?

AFAICT the proposed patch is *already* taking this approach, by having
no protection against a concurrent or subsequent node offlining?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int size = map_size + defer_size;
>  
>  	for_each_node(nid) {
> +		int tmp = nid;

Not `tmp', please.  Better to use an identifier which explains the
variable's use.  target_nid?

And a newline after defining locals, please.

>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
>  
> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> +		if(!node_online(nid))

s/if(/if (/

> +			tmp = numa_mem_id();
> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>  		if (!new)
>  			return -ENOMEM;
>  

And a code comment fully explaining what's going on here?
Nico Pache Dec. 8, 2021, 12:25 a.m. UTC | #3
On 12/7/21 18:34, Matthew Wilcox wrote:
> On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
>> +++ b/mm/vmscan.c
>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>  	int size = map_size + defer_size;
>>  
>>  	for_each_node(nid) {
>> +		int tmp = nid;
>>  		pn = memcg->nodeinfo[nid];
>>  		old = shrinker_info_protected(memcg, nid);
>>  		/* Not yet online memcg */
>>  		if (!old)
>>  			return 0;
>>  
>> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>> +		if(!node_online(nid))
>> +			tmp = numa_mem_id();
>> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>  		if (!new)
> 
> Why should this be fixed here and not in, say, kvmalloc_node()?

According to Michal, the caller should be responsible for making sure it is
allocating on a correct node. This avoids adding branches to hot paths and
wasting cycles. I'm not opposed to moving it to kvmalloc_node(), but it may
result in masking other issues from other callers.
>
Yang Shi Dec. 8, 2021, 12:26 a.m. UTC | #4
On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
>
> > We have run into a panic caused by a shrinker allocation being attempted
> > on an offlined node.
> >
> > Our crash analysis has determined that the issue originates from trying
> > to allocate pages on an offlined node in expand_one_shrinker_info. This
> > function makes the incorrect assumption that we can allocate on any node.
> > To correct this we make sure the node is online before tempting an
> > allocation. If it is not online choose the closest node.
>
> This isn't fully accurate, is it?  We could allocate on a node which is
> presently offline but which was previously onlined, by testing
> NODE_DATA(nid).
>
> It isn't entirely clear to me from the v1 discussion why this approach
> isn't being taken?
>
> AFAICT the proposed patch is *already* taking this approach, by having
> no protection against a concurrent or subsequent node offlining?

AFAICT, we have not reached agreement on how to fix it yet. I saw 3
proposals at least:

1. From Michal, allocate node data for all possible nodes.
https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u

2. What this patch does. Proposed originally from
https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u

3. From David, fix in node_zonelist().
https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u

>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >       int size = map_size + defer_size;
> >
> >       for_each_node(nid) {
> > +             int tmp = nid;
>
> Not `tmp', please.  Better to use an identifier which explains the
> variable's use.  target_nid?
>
> And a newline after defining locals, please.
>
> >               pn = memcg->nodeinfo[nid];
> >               old = shrinker_info_protected(memcg, nid);
> >               /* Not yet online memcg */
> >               if (!old)
> >                       return 0;
> >
> > -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > +             if(!node_online(nid))
>
> s/if(/if (/
>
> > +                     tmp = numa_mem_id();
> > +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >               if (!new)
> >                       return -ENOMEM;
> >
>
> And a code comment fully explaining what's going on here?
Nico Pache Dec. 8, 2021, 12:33 a.m. UTC | #5
On 12/7/21 19:26, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
>>
>>> We have run into a panic caused by a shrinker allocation being attempted
>>> on an offlined node.
>>>
>>> Our crash analysis has determined that the issue originates from trying
>>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>>> function makes the incorrect assumption that we can allocate on any node.
>>> To correct this we make sure the node is online before tempting an
>>> allocation. If it is not online choose the closest node.
>>
>> This isn't fully accurate, is it?  We could allocate on a node which is
>> presently offline but which was previously onlined, by testing
>> NODE_DATA(nid).
>>
>> It isn't entirely clear to me from the v1 discussion why this approach
>> isn't being taken?
>>
>> AFAICT the proposed patch is *already* taking this approach, by having
>> no protection against a concurrent or subsequent node offlining?
> 
> AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> proposals at least:
> 
> 1. From Michal, allocate node data for all possible nodes.
> https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> 
> 2. What this patch does. Proposed originally from
> https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u

Correct me if I'm wrong, but isn't that a different caller? This patch fixes
the issue in expand_one_shrinker_info().

> 3. From David, fix in node_zonelist().
> https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> 
>>
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>       int size = map_size + defer_size;
>>>
>>>       for_each_node(nid) {
>>> +             int tmp = nid;
>>
>> Not `tmp', please.  Better to use an identifier which explains the
>> variable's use.  target_nid?
>>
>> And a newline after defining locals, please.
>>
>>>               pn = memcg->nodeinfo[nid];
>>>               old = shrinker_info_protected(memcg, nid);
>>>               /* Not yet online memcg */
>>>               if (!old)
>>>                       return 0;
>>>
>>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>> +             if(!node_online(nid))
>>
>> s/if(/if (/
>>
>>> +                     tmp = numa_mem_id();
>>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>>               if (!new)
>>>                       return -ENOMEM;
>>>
>>
>> And a code comment fully explaining what's going on here?
>
Nico Pache Dec. 8, 2021, 12:40 a.m. UTC | #6
On 12/7/21 18:44, Andrew Morton wrote:
> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> 
>> We have run into a panic caused by a shrinker allocation being attempted
>> on an offlined node.
>>
>> Our crash analysis has determined that the issue originates from trying
>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>> function makes the incorrect assumption that we can allocate on any node.
>> To correct this we make sure the node is online before tempting an
>> allocation. If it is not online choose the closest node.
> 
> This isn't fully accurate, is it?  We could allocate on a node which is
> presently offline but which was previously onlined, by testing
> NODE_DATA(nid).

Thanks for the review! I took your changes below into consideration for my V3.

My knowledge of offlined/onlined nodes is quite limited, but after looking
into it, it doesn't seem like anything clears the state of NODE_DATA(nid)
after a try_offline_node() is attempted. So theoretically the panic we saw
would not happen. What is the expected behavior of trying to allocate a page
on an offline node?

> 
> It isn't entirely clear to me from the v1 discussion why this approach
> isn't being taken?
> 
> AFAICT the proposed patch is *already* taking this approach, by having
> no protection against a concurrent or subsequent node offlining?
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>  	int size = map_size + defer_size;
>>  
>>  	for_each_node(nid) {
>> +		int tmp = nid;
> 
> Not `tmp', please.  Better to use an identifier which explains the
> variable's use.  target_nid?
> 
> And a newline after defining locals, please.
> 
>>  		pn = memcg->nodeinfo[nid];
>>  		old = shrinker_info_protected(memcg, nid);
>>  		/* Not yet online memcg */
>>  		if (!old)
>>  			return 0;
>>  
>> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>> +		if(!node_online(nid))
> 
> s/if(/if (/
> 
>> +			tmp = numa_mem_id();
>> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>  		if (!new)
>>  			return -ENOMEM;
>>  
> 
> And a code comment fully explaining what's going on here?
>
Yang Shi Dec. 8, 2021, 1:23 a.m. UTC | #7
On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
>
>
>
> On 12/7/21 19:26, Yang Shi wrote:
> > On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> >>
> >>> We have run into a panic caused by a shrinker allocation being attempted
> >>> on an offlined node.
> >>>
> >>> Our crash analysis has determined that the issue originates from trying
> >>> to allocate pages on an offlined node in expand_one_shrinker_info. This
> >>> function makes the incorrect assumption that we can allocate on any node.
> >>> To correct this we make sure the node is online before tempting an
> >>> allocation. If it is not online choose the closest node.
> >>
> >> This isn't fully accurate, is it?  We could allocate on a node which is
> >> presently offline but which was previously onlined, by testing
> >> NODE_DATA(nid).
> >>
> >> It isn't entirely clear to me from the v1 discussion why this approach
> >> isn't being taken?
> >>
> >> AFAICT the proposed patch is *already* taking this approach, by having
> >> no protection against a concurrent or subsequent node offlining?
> >
> > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > proposals at least:
> >
> > 1. From Michal, allocate node data for all possible nodes.
> > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> >
> > 2. What this patch does. Proposed originally from
> > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
>
> Correct me if im wrong, but isn't that a different caller? This patch fixes the
> issue in expand_one_shrinker_info.

Yes, different caller, but same approach. The con of this approach is
that we have to fix all the places. It seems Michal and David are not
fans of this approach, IIRC.

>
> > 3. From David, fix in node_zonelist().
> > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> >
> >>
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >>>       int size = map_size + defer_size;
> >>>
> >>>       for_each_node(nid) {
> >>> +             int tmp = nid;
> >>
> >> Not `tmp', please.  Better to use an identifier which explains the
> >> variable's use.  target_nid?
> >>
> >> And a newline after defining locals, please.
> >>
> >>>               pn = memcg->nodeinfo[nid];
> >>>               old = shrinker_info_protected(memcg, nid);
> >>>               /* Not yet online memcg */
> >>>               if (!old)
> >>>                       return 0;
> >>>
> >>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> >>> +             if(!node_online(nid))
> >>
> >> s/if(/if (/
> >>
> >>> +                     tmp = numa_mem_id();
> >>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >>>               if (!new)
> >>>                       return -ENOMEM;
> >>>
> >>
> >> And a code comment fully explaining what's going on here?
> >
>
Yang Shi Dec. 8, 2021, 1:26 a.m. UTC | #8
On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> >
> >
> >
> > On 12/7/21 19:26, Yang Shi wrote:
> > > On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >>
> > >> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> > >>
> > >>> We have run into a panic caused by a shrinker allocation being attempted
> > >>> on an offlined node.
> > >>>
> > >>> Our crash analysis has determined that the issue originates from trying
> > >>> to allocate pages on an offlined node in expand_one_shrinker_info. This
> > >>> function makes the incorrect assumption that we can allocate on any node.
> > >>> To correct this we make sure the node is online before tempting an
> > >>> allocation. If it is not online choose the closest node.
> > >>
> > >> This isn't fully accurate, is it?  We could allocate on a node which is
> > >> presently offline but which was previously onlined, by testing
> > >> NODE_DATA(nid).
> > >>
> > >> It isn't entirely clear to me from the v1 discussion why this approach
> > >> isn't being taken?
> > >>
> > >> AFAICT the proposed patch is *already* taking this approach, by having
> > >> no protection against a concurrent or subsequent node offlining?
> > >
> > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > proposals at least:
> > >
> > > 1. From Michal, allocate node data for all possible nodes.
> > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > >
> > > 2. What this patch does. Proposed originally from
> > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> >
> > Correct me if im wrong, but isn't that a different caller? This patch fixes the
> > issue in expand_one_shrinker_info.
>
> Yes, different caller, but same approach. The cons with this approach

And the same underlying problem.

> is we have to fix all the places. It seems Michal and David are not
> fans for this approach IIRC.
>
> >
> > > 3. From David, fix in node_zonelist().
> > > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> > >
> > >>
> > >>> --- a/mm/vmscan.c
> > >>> +++ b/mm/vmscan.c
> > >>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > >>>       int size = map_size + defer_size;
> > >>>
> > >>>       for_each_node(nid) {
> > >>> +             int tmp = nid;
> > >>
> > >> Not `tmp', please.  Better to use an identifier which explains the
> > >> variable's use.  target_nid?
> > >>
> > >> And a newline after defining locals, please.
> > >>
> > >>>               pn = memcg->nodeinfo[nid];
> > >>>               old = shrinker_info_protected(memcg, nid);
> > >>>               /* Not yet online memcg */
> > >>>               if (!old)
> > >>>                       return 0;
> > >>>
> > >>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > >>> +             if(!node_online(nid))
> > >>
> > >> s/if(/if (/
> > >>
> > >>> +                     tmp = numa_mem_id();
> > >>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> > >>>               if (!new)
> > >>>                       return -ENOMEM;
> > >>>
> > >>
> > >> And a code comment fully explaining what's going on here?
> > >
> >
Andrew Morton Dec. 8, 2021, 1:53 a.m. UTC | #9
On Tue, 7 Dec 2021 19:25:25 -0500 Nico Pache <npache@redhat.com> wrote:

> 
> 
> On 12/7/21 18:34, Matthew Wilcox wrote:
> > On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
> >> +++ b/mm/vmscan.c
> >> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >>  	int size = map_size + defer_size;
> >>  
> >>  	for_each_node(nid) {
> >> +		int tmp = nid;
> >>  		pn = memcg->nodeinfo[nid];
> >>  		old = shrinker_info_protected(memcg, nid);
> >>  		/* Not yet online memcg */
> >>  		if (!old)
> >>  			return 0;
> >>  
> >> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> >> +		if(!node_online(nid))
> >> +			tmp = numa_mem_id();
> >> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >>  		if (!new)
> > 
> > Why should this be fixed here and not in, say, kvmalloc_node()?
> 
> according to Michal, the caller should be responsible for making sure it is
> allocating on a correct node. This avoids adding branches to hot-paths and
> wasting cycles. Im not opposed to moving it to kvmalloc_node, but it may result
> in masking other issues from other callers.
> > 

Yes, kvmalloc_node(nid) should allocate on `nid', or should fail.

A new kvmalloc_try_node() or whatever would express this idea.
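
A minimal sketch of such a helper (kvmalloc_try_node() does not exist in the
tree; the name, signature, and fall-back-to-local-node policy are all
assumptions here):

	/* Hypothetical: try 'node' first, fall back instead of failing. */
	static inline void *kvmalloc_try_node(size_t size, gfp_t flags, int node)
	{
		/* A node that was never onlined has no pgdat to allocate from. */
		if (node != NUMA_NO_NODE && !node_online(node))
			node = numa_mem_id();	/* nearest node with memory */
		return kvmalloc_node(size, flags, node);
	}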
Michal Hocko Dec. 8, 2021, 7:54 a.m. UTC | #10
On Tue 07-12-21 19:40:33, Nico Pache wrote:
> 
> 
> On 12/7/21 18:44, Andrew Morton wrote:
> > On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> > 
> >> We have run into a panic caused by a shrinker allocation being attempted
> >> on an offlined node.
> >>
> >> Our crash analysis has determined that the issue originates from trying
> >> to allocate pages on an offlined node in expand_one_shrinker_info. This
> >> function makes the incorrect assumption that we can allocate on any node.
> >> To correct this we make sure the node is online before tempting an
> >> allocation. If it is not online choose the closest node.
> > 
> > This isn't fully accurate, is it?  We could allocate on a node which is
> > presently offline but which was previously onlined, by testing
> > NODE_DATA(nid).
> 
> Thanks for the review! I took your changes below into consideration for my V3.
> 
> My knowledge of offlined/onlined nodes is quite limited but after looking into
> it it doesnt seem like anything clears the state of NODE_DATA(nid) after a
> try_offline_node is attempted. So theoretically the panic we saw would not
> happen. What is the expected behavior of trying to allocate a page on a offline
> node?

To fall back (in zonelist order) to another node. If
__GFP_THISNODE is specified, then simply fail the allocation.
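
At a call site, the two policies look like this (sketch only;
alloc_pages_node() and __GFP_THISNODE are existing APIs, but the call site
here is made up):

	/* May fall back to other nodes in zonelist order: */
	page = alloc_pages_node(nid, GFP_KERNEL, 0);

	/* Must be satisfied from 'nid' or fail outright: */
	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);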
Michal Hocko Dec. 8, 2021, 7:59 a.m. UTC | #11
On Tue 07-12-21 17:26:32, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> > >
> > >
> > >
> > > On 12/7/21 19:26, Yang Shi wrote:
[...]
> > > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > > proposals at least:
> > > >
> > > > 1. From Michal, allocate node data for all possible nodes.
> > > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > > >
> > > > 2. What this patch does. Proposed originally from
> > > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> > >
> > > Correct me if im wrong, but isn't that a different caller? This patch fixes the
> > > issue in expand_one_shrinker_info.
> >
> > Yes, different caller, but same approach. The cons with this approach
> 
> And the same underlying problem.
> 
> > is we have to fix all the places. It seems Michal and David are not
> > fans for this approach IIRC.

Yes, agreed. We definitely do not want to spread this node_offline
oddity all over the place. There are two different ways to approach this.
Either we handle offline nodes at the page allocator level when
setting the proper zonelist (ideally protected by a static key for
setups which have such nodes), or we allocate a pgdat for all possible
nodes (sketched below). I would prefer the second because it is more
robust; it is less likely to blow up when somebody does:
	for_each_node(nid)
		something(NODE_DATA(nid))

The discussion is ongoing at the original thread where Alexey Makhalov
reported a similar problem (the subthread is
http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz)
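
A sketch of that second approach (the init-time hook and the x86-style
node_data[] naming are assumptions; an eventual upstream fix along these
lines would differ in detail):

	/* Give every possible-but-offline node a pgdat at boot so that
	 * NODE_DATA(nid) is never NULL; zonelists built for such a node
	 * would contain only the real nodes to fall back to. */
	static void __init alloc_offline_node_data(void)
	{
		int nid;

		for_each_node(nid) {
			if (node_online(nid))
				continue;
			node_data[nid] = memblock_alloc(sizeof(pg_data_t),
							SMP_CACHE_BYTES);
			node_data[nid]->node_id = nid;
		}
	}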
Yang Shi Dec. 13, 2021, 7:10 p.m. UTC | #12
On Tue, Dec 7, 2021 at 11:59 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 07-12-21 17:26:32, Yang Shi wrote:
> > On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> > > >
> > > >
> > > >
> > > > On 12/7/21 19:26, Yang Shi wrote:
> [...]
> > > > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > > > proposals at least:
> > > > >
> > > > > 1. From Michal, allocate node data for all possible nodes.
> > > > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > > > >
> > > > > 2. What this patch does. Proposed originally from
> > > > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> > > >
> > > > Correct me if im wrong, but isn't that a different caller? This patch fixes the
> > > > issue in expand_one_shrinker_info.
> > >
> > > Yes, different caller, but same approach. The cons with this approach
> >
> > And the same underlying problem.
> >
> > > is we have to fix all the places. It seems Michal and David are not
> > > fans for this approach IIRC.
>
> Yes, agreed. We definitely do not want to spread this node_offline
> oddity all over the place. There are two different way to approach this.
> Either we handle node_offline nodes at the page allocator level when
> setting the proper zonelist (ideally protect that by a static key for
> setups which have these nodes) or we allocate pgdat for all possible
> nodes. I would prefer the second because that is more robust (less
> likely to blow up when somebody does
>         for_each_node(nid)
>                 something(NODE_DATA(nid))
>
> The discussion is ongoing at the original thread where Alexey Makhalov
> reported a similar problem (the subthread is
> http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz)

Thanks, Michal. Yeah, it seems more straightforward and more robust to
me. I'm not familiar with memory hotplug; hopefully it doesn't break
hotplug.

> --
> Michal Hocko
> SUSE Labs
Rafael Aquini Jan. 10, 2022, 5:09 p.m. UTC | #13
On Tue, Dec 07, 2021 at 04:26:28PM -0800, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> >
> > > We have run into a panic caused by a shrinker allocation being attempted
> > > on an offlined node.
> > >
> > > Our crash analysis has determined that the issue originates from trying
> > > to allocate pages on an offlined node in expand_one_shrinker_info. This
> > > function makes the incorrect assumption that we can allocate on any node.
> > > To correct this we make sure the node is online before tempting an
> > > allocation. If it is not online choose the closest node.
> >
> > This isn't fully accurate, is it?  We could allocate on a node which is
> > presently offline but which was previously onlined, by testing
> > NODE_DATA(nid).
> >
> > It isn't entirely clear to me from the v1 discussion why this approach
> > isn't being taken?
> >
> > AFAICT the proposed patch is *already* taking this approach, by having
> > no protection against a concurrent or subsequent node offlining?
> 
> AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> proposals at least:
> 
> 1. From Michal, allocate node data for all possible nodes.
> https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> 
> 2. What this patch does. Proposed originally from
> https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
>

This patch doesn't cut it, because it only fixes up one caller. The issue,
however, boils down to any iteration like

...
   for_each_possible_cpu(cpu) {
      struct page *page = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
...

where the topology allows for possible CPUs on a yet-to-be-online node to
cause node_zonelist() to produce a bogus zonelist pointer when
populating ac->zonelist within prepare_alloc_pages().
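
A per-call-site guard in the style of this patch would then have to be
repeated at every such loop, e.g. (sketch only, hypothetical caller):

...
   for_each_possible_cpu(cpu) {
      int nid = cpu_to_node(cpu);

      /* same fallback this patch adds in expand_one_shrinker_info() */
      if (!node_online(nid))
         nid = numa_mem_id();
      struct page *page = alloc_pages_node(nid, gfp, 0);
...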

I just bisected a PPC host crash at boot, while initializing the cgroup
subsystem, down to the following commit:

    commit bd0e7491a931f5a2960555b10b9551464ff8cc8e
    Author: Vlastimil Babka <vbabka@suse.cz>
    Date:   Sat May 22 01:59:38 2021 +0200

        mm, slub: convert kmem_cpu_slab protection to local_lock

The commit, obviously, is not responsible for the crash; the only "sin"
committed by commit bd0e7491a93, as far as the crash issue is concerned, is to
increase the size of struct kmem_cache_cpu by one local_lock_t, which in the
worst-case scenario -- !CONFIG_PREEMPT_RT && CONFIG_DEBUG_LOCK_ALLOC --
ends up creating an overhead of 56 bytes for each kmem_cache_cpu created,
burning faster through the bootstrap pre-allocated embedded per-cpu pool.

---8<---
[    0.001888] BUG: Kernel NULL pointer dereference on read at 0x00001508
[    0.001900] Faulting instruction address: 0xc0000000005277fc
[    0.001905] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.001909] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[    0.001915] Modules linked in:
[    0.001919] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.14.0+ #24
[    0.001925] NIP:  c0000000005277fc LR: c0000000005276d8 CTR: 0000000000000000
[    0.001930] REGS: c000000002c936d0 TRAP: 0380   Not tainted  (5.14.0+)
[    0.001935] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24024222  XER: 00000002
[    0.001948] CFAR: c0000000005276e0 IRQMASK: 0
[    0.001948] GPR00: c0000000005276d8 c000000002c93970 c000000002c95500 0000000000001500
[    0.001948] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.001948] GPR08: 0000000000000081 0000000000000001 0000000000001500 0000000044024222
[    0.001948] GPR12: 00000003fa6a0000 c0000000049d0000 0000000000000000 c0000000053f0780
[    0.001948] GPR16: 0000000000000000 0000000000000000 0000000000000008 c000000002cec160
[    0.001948] GPR20: 0000000000000000 0000000000000cc2 c000000002cf0a30 c000000002cf02c0
[    0.001948] GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[    0.001948] GPR28: 0000000000000000 c000000002c939e8 c000000002c939f0 0000000000000000
[    0.002007] NIP [c0000000005277fc] prepare_alloc_pages.constprop.0+0x20c/0x280
[    0.002014] LR [c0000000005276d8] prepare_alloc_pages.constprop.0+0xe8/0x280
[    0.002020] Call Trace:
[    0.002022] [c000000002c93970] [c0000000005276c4] prepare_alloc_pages.constprop.0+0xd4/0x280 (unreliable)
[    0.002030] [c000000002c939c0] [c000000000531270] __alloc_pages+0xb0/0x3b0
[    0.002036] [c000000002c93a50] [c0000000004cd4d4] pcpu_alloc_pages.constprop.0+0x144/0x2a0
[    0.002044] [c000000002c93ae0] [c0000000004cd714] pcpu_populate_chunk+0x64/0x2b0
[    0.002050] [c000000002c93b90] [c0000000004d1a14] pcpu_alloc+0x944/0xd80
[    0.002056] [c000000002c93cb0] [c0000000005abcd4] mem_cgroup_alloc+0x144/0x470
[    0.002063] [c000000002c93d30] [c0000000005bcedc] mem_cgroup_css_alloc+0xac/0x390
[    0.002070] [c000000002c93d80] [c00000000203181c] cgroup_init_subsys+0xf4/0x260
[    0.002078] [c000000002c93e30] [c000000002031d40] cgroup_init+0x1f4/0x528
[    0.002085] [c000000002c93f00] [c0000000020051ec] start_kernel+0x64c/0x6b0
[    0.002091] [c000000002c93f90] [c00000000000d39c] start_here_common+0x1c/0x600
[    0.002098] Instruction dump:
[    0.002101] 61280080 2c0a0001 7d28489e 913d0000 57ffa7fe 9bfe0020 809e001c e8be0008
[    0.002112] e87e0000 2c250000 7c6a1b78 40820058 <81230008> 7c044840 4180004c f95e0010
[    0.002124] ---[ end trace 7392155448beabaa ]---
[    0.002127]
[    1.002133] Kernel panic - not syncing: Attempted to kill the idle task!
[    1.002195] ------------[ cut here ]------------
--->8---


> 3. From David, fix in node_zonelist().
> https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u

It seems to me that (3) is the simplest and most effective way to cope with this case.

Cheers,
-- Rafael

> >
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > >       int size = map_size + defer_size;
> > >
> > >       for_each_node(nid) {
> > > +             int tmp = nid;
> >
> > Not `tmp', please.  Better to use an identifier which explains the
> > variable's use.  target_nid?
> >
> > And a newline after defining locals, please.
> >
> > >               pn = memcg->nodeinfo[nid];
> > >               old = shrinker_info_protected(memcg, nid);
> > >               /* Not yet online memcg */
> > >               if (!old)
> > >                       return 0;
> > >
> > > -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > > +             if(!node_online(nid))
> >
> > s/if(/if (/
> >
> > > +                     tmp = numa_mem_id();
> > > +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> > >               if (!new)
> > >                       return -ENOMEM;
> > >
> >
> > And a code comment fully explaining what's going on here?
>
Michal Hocko Jan. 10, 2022, 5:16 p.m. UTC | #14
On Mon 10-01-22 12:09:50, Rafael Aquini wrote:
[...]
> > 3. From David, fix in node_zonelist().
> > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> 
> It seems to me that (3) is the simplest and effective way to cope with this case

Did you have chance to look at http://lkml.kernel.org/r/20211214100732.26335-1-mhocko@kernel.org
Rafael Aquini Jan. 10, 2022, 5:21 p.m. UTC | #15
On Mon, Jan 10, 2022 at 06:16:23PM +0100, Michal Hocko wrote:
> On Mon 10-01-22 12:09:50, Rafael Aquini wrote:
> [...]
> > > 3. From David, fix in node_zonelist().
> > > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> > 
> > It seems to me that (3) is the simplest and effective way to cope with this case
> 
> Did you have chance to look at http://lkml.kernel.org/r/20211214100732.26335-1-mhocko@kernel.org

I'll take a closer look at it. 

Thanks Michal.

-- Rafael

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..5ed5b5d29d47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -222,13 +222,16 @@  static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 	int size = map_size + defer_size;
 
 	for_each_node(nid) {
+		int tmp = nid;
 		pn = memcg->nodeinfo[nid];
 		old = shrinker_info_protected(memcg, nid);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
 
-		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if(!node_online(nid))
+			tmp = numa_mem_id();
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
 		if (!new)
 			return -ENOMEM;