
[RFC,0/2] mm: fix OOMs for binding workloads to movable zone only node

Message ID 1604470210-124827-1-git-send-email-feng.tang@intel.com

Message

Feng Tang Nov. 4, 2020, 6:10 a.m. UTC
Hi,

This patchset reports a problem and asks for suggestions/review of the
RFC fix patches.

We recently received an OOM report: when a user binds a docker (container)
instance to a memory node which only has movable zones, OOM killing still
cannot resolve the page allocation failure.

The callstack was:

	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
	[ 1387.877570] Call Trace:
	[ 1387.877579]  dump_stack+0x6b/0x88
	[ 1387.877584]  dump_header+0x4a/0x1e2
	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
	[ 1387.877591]  out_of_memory+0x3d/0x80
	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
	[ 1387.877602]  pipe_write+0x322/0x590
	[ 1387.877607]  new_sync_write+0x196/0x1b0
	[ 1387.877609]  vfs_write+0x1c3/0x1f0
	[ 1387.877611]  ksys_write+0xa7/0xe0
	[ 1387.877617]  do_syscall_64+0x52/0xd0
	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The meminfo log only shows the movable-only node, which has plenty
of free memory. And in our reproduction with patch 1/2, the normal
node (with DMA/DMA32/Normal zones) also has a lot of free memory when
the OOM happens.

If we hack the kernel to let this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
a page, the following full docker run (like installing and running the
'stress-ng' stress test) will hit more allocation failures from other
kinds of requests (gfp_masks). Patch 2/2 detects the case where the
allowed target nodes only have movable zones and loosens the binding
check; otherwise an OOM kill is triggered which cannot help, as the
problem is not a lack of free memory.
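
A minimal sketch of the kind of "allowed nodes are movable-only" test that
2/2 describes could look roughly like the below (illustrative only, not the
actual patch; the helper name is made up):

	static bool nodemask_has_movable_zones_only(const nodemask_t *nodes)
	{
		enum zone_type zid;
		int nid;

		for_each_node_mask(nid, *nodes) {
			/* any managed zone below ZONE_MOVABLE can serve kernel pages */
			for (zid = 0; zid < ZONE_MOVABLE; zid++)
				if (managed_zone(&NODE_DATA(nid)->node_zones[zid]))
					return false;
		}
		return true;
	}

When such a test is true for an unmovable request, 2/2 loosens the node
binding check (e.g. retries without the binding) instead of triggering a
helpless OOM kill.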

Feng Tang (2):
  mm, oom: dump meminfo for all memory nodes
  mm, page_alloc: loose the node binding check to avoid helpless oom
    killing

 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)
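
(Note: the oom_kill.c hunk itself is not shown on this page. Assuming the
stock dump_header() path, the one-line change plausibly looks something
like the following illustrative guess, not the actual diff:)

	-		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
	+		show_mem(0, NULL);	/* dump meminfo for all memory nodes */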

Comments

Michal Hocko Nov. 4, 2020, 7:13 a.m. UTC | #1
On Wed 04-11-20 14:10:08, Feng Tang wrote:
> Hi,
> 
> This patchset tries to report a problem and get suggestion/review
> for the RFC fix patches.
> 
> We recently got a OOM report, that when user try to bind a docker(container)
> instance to a memory node which only has movable zones, and OOM killing
> still can't solve the page allocation failure.

This is a cpuset node binding right?

> The callstack was:
> 
> 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> 	[ 1387.877570] Call Trace:
> 	[ 1387.877579]  dump_stack+0x6b/0x88
> 	[ 1387.877584]  dump_header+0x4a/0x1e2
> 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> 	[ 1387.877591]  out_of_memory+0x3d/0x80
> 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> 	[ 1387.877602]  pipe_write+0x322/0x590
> 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> 	[ 1387.877611]  ksys_write+0xa7/0xe0
> 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The meminfo log only shows the movable only node, which has plenty
> of free memory. And in our reproducing with 1/2 patch, the normal
> node (has DMA/DMA32/Normal) also has lot of free memory when OOM
> happens. 

OK, so you are binding to a movable node only and your above request is
for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones
because that memory is not movable. So the system behaves as expected.
Your cpuset is misconfigured IMHO. Movable only nodes come with their
risk and configuration price.

> If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
> a page, and following full docker run (like installing and running
> 'stress-ng' stress test) will see more allocation failures due to
> different kinds of request(gfp_masks). And the 2/2 patch will detect
> such cases that the allowed target nodes only have movable zones
> and loose the binding check, otherwise it will trigger OOM while
> the OOM won't do any help, as the problem is not lack of free memory.

Well, this breaks the cpuset containment, right? I consider this quite
unexpected for something that looks like a misconfiguration. I do agree
that this is unexpected for anybody who is not really familiar with the
concept of the movable zone, but we should probably call out all these
details rather than tweak the existing semantics.

Could you be more specific about the usecase here? Why do you need a
binding to a pure movable node?
Feng Tang Nov. 4, 2020, 7:38 a.m. UTC | #2
Hi Michal,

Thanks for the prompt review!

On Wed, Nov 04, 2020 at 08:13:08AM +0100, Michal Hocko wrote:
> On Wed 04-11-20 14:10:08, Feng Tang wrote:
> > Hi,
> > 
> > This patchset tries to report a problem and get suggestion/review
> > for the RFC fix patches.
> > 
> > We recently got a OOM report, that when user try to bind a docker(container)
> > instance to a memory node which only has movable zones, and OOM killing
> > still can't solve the page allocation failure.
> 
> This is a cpuset node binding right?

Yes.

A simple test command is 'docker run -it --cpuset-mems 3 ubuntu:latest'
where node 3 is a movable-only PMEM node.
 
> > The callstack was:
> > 
> > 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> > 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> > 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> > 	[ 1387.877570] Call Trace:
> > 	[ 1387.877579]  dump_stack+0x6b/0x88
> > 	[ 1387.877584]  dump_header+0x4a/0x1e2
> > 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> > 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> > 	[ 1387.877591]  out_of_memory+0x3d/0x80
> > 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> > 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> > 	[ 1387.877602]  pipe_write+0x322/0x590
> > 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> > 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> > 	[ 1387.877611]  ksys_write+0xa7/0xe0
> > 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> > 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The meminfo log only shows the movable only node, which has plenty
> > of free memory. And in our reproducing with 1/2 patch, the normal
> > node (has DMA/DMA32/Normal) also has lot of free memory when OOM
> > happens. 
> 
> OK, so you are bidning to a movable node only and your above request is
> for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones
> because that memory is not movable. So the system behaves as expected.
> Your cpuset is misconfigured IMHO. Movable only nodes come with their
> risk and configuration price.

Aha, this is what we told the reporter at first. Their platform is a
2-socket one, where each socket has one DRAM node plus one persistent
memory node, and we suggested binding the docker instance to one DRAM
node plus one PMEM node.

> > If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
> > a page, and following full docker run (like installing and running
> > 'stress-ng' stress test) will see more allocation failures due to
> > different kinds of request(gfp_masks). And the 2/2 patch will detect
> > such cases that the allowed target nodes only have movable zones
> > and loose the binding check, otherwise it will trigger OOM while
> > the OOM won't do any help, as the problem is not lack of free memory.
> 
> Well, this breaks the cpuset containment, right? I consider this quite
> unexpected for something that looks like a misconfiguration. I do agree
> that this is unexpected for anybody who is not really familiar with
> concept of movable zone but we should probably call out all these
> details rather than tweak the existing semantic.

Yes, it does break the cpuset containment.

> Could you be more specific about the usecase here? Why do you need a
> binding to a pure movable node? 

One common platform configuration is a small amount of DRAM plus a huge
amount of PMEM (which is slower but cheaper), and my guess at their usage
is that they try to steer the bulk of user space allocations
(GFP_HIGHUSER_MOVABLE) to the PMEM node, and let DRAM be used as little
as possible.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 4, 2020, 7:58 a.m. UTC | #3
On Wed 04-11-20 15:38:26, Feng Tang wrote:
[...]
> > Could you be more specific about the usecase here? Why do you need a
> > binding to a pure movable node? 
> 
> One common configuration for a platform is small size of DRAM plus huge
> size of PMEM (which is slower but cheaper), and my guess of their use
> is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE)
> to PMEM node, and only let DRAM be used as less as possible. 

While this is possible, it is a tricky configuration. It essentially
gets us back to 32b and highmem...

As I've said in reply to your second patch, I think we can make the oom
killer behavior more sensible in these misconfigured cases, but I do not
think we want to break the cpuset isolation for such a configuration.
Feng Tang Nov. 4, 2020, 8:40 a.m. UTC | #4
On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> On Wed 04-11-20 15:38:26, Feng Tang wrote:
> [...]
> > > Could you be more specific about the usecase here? Why do you need a
> > > binding to a pure movable node? 
> > 
> > One common configuration for a platform is small size of DRAM plus huge
> > size of PMEM (which is slower but cheaper), and my guess of their use
> > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE)
> > to PMEM node, and only let DRAM be used as less as possible. 
> 
> While this is possible, it is a tricky configuration. It is essentially 
> get us back to 32b and highmem...

:) Another possible case is a similar binding on a memory-hotpluggable
platform, which has one unpluggable node and several other nodes configured
as movable-only so they can be hot-removed when needed.

> As I've said in reply to your second patch. I think we can make the oom
> killer behavior more sensible in this misconfigured cases but I do not
> think we want break the cpuset isolation for such a configuration.

Do you mean we skip the killing and just let the allocation fail? We
checked the oom killer code first: when the oom happens, both the DRAM
node and the unmovable node have lots of free memory, and killing a
process won't improve the situation.

(The following is copied from your comments on 2/2)
> This allows to spill memory allocations over to any other node which
> has Normal (or other lower) zones and as such it breaks cpuset isolation.
> As I've pointed out in the reply to your cover letter it seems that
> this is more of a misconfiguration than a bug.

For this use case (docker container running), the spilling is already
happening. I traced its memory allocation requests: many of them are
movable and fall back to the normal node naturally with the current
code; only a few get blocked, as many __alloc_pages_nodemask() calls
are made with a 'NULL' nodemask parameter.

And I made this RFC patch inspired by code in __alloc_pages_may_oom():

	if (gfp_mask & __GFP_NOFAIL)
		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
				ALLOC_NO_WATERMARKS, ac);
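
For reference, that fallback helper (5.9-era mm/page_alloc.c, quoted here
from memory, so treat it as approximate) first tries within the cpuset and
only then drops the restriction:

	static inline struct page *
	__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
				      unsigned int alloc_flags,
				      const struct alloc_context *ac)
	{
		struct page *page;

		page = get_page_from_freelist(gfp_mask, order,
				alloc_flags|ALLOC_CPUSET, ac);
		/*
		 * fallback to ignore cpuset restriction if our nodes
		 * are depleted
		 */
		if (!page)
			page = get_page_from_freelist(gfp_mask, order,
					alloc_flags, ac);

		return page;
	}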

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 4, 2020, 8:53 a.m. UTC | #5
On Wed 04-11-20 16:40:21, Feng Tang wrote:
> On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> > On Wed 04-11-20 15:38:26, Feng Tang wrote:
> > [...]
> > > > Could you be more specific about the usecase here? Why do you need a
> > > > binding to a pure movable node? 
> > > 
> > > One common configuration for a platform is small size of DRAM plus huge
> > > size of PMEM (which is slower but cheaper), and my guess of their use
> > > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE)
> > > to PMEM node, and only let DRAM be used as less as possible. 
> > 
> > While this is possible, it is a tricky configuration. It is essentially 
> > get us back to 32b and highmem...
> 
> :) Another possible case is similar binding on a memory hotplugable
> platform, which has one unplugable node and several other nodes configured
> as movable only to be hot removable when needed

Yes, another way to shoot your foot ;)

> > As I've said in reply to your second patch. I think we can make the oom
> > killer behavior more sensible in this misconfigured cases but I do not
> > think we want break the cpuset isolation for such a configuration.
> 
> Do you mean we skip the killing and just let the allocation fail? We've
> checked the oom killer code first, when the oom happens, both DRAM
> node and unmovable node have lots of free memory, and killing process
> won't improve the situation.

We already skip the oom killer and fail lowmem allocation requests.
This is similar in some sense. Another option would be to kill the
allocating context, which will potentially have fewer corner cases,
because some allocation failures might be unexpected.
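
For reference, the lowmem case mentioned above is the early bail-out in
__alloc_pages_may_oom(), roughly (as of 5.9):

	/* The OOM killer does not needlessly kill tasks for lowmem */
	if (ac->highest_zoneidx < ZONE_NORMAL)
		goto out;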

> (Folloing is copied from your comments for 2/2) 
> > This allows to spill memory allocations over to any other node which
> > has Normal (or other lower) zones and as such it breaks cpuset isolation.
> > As I've pointed out in the reply to your cover letter it seems that
> > this is more of a misconfiguration than a bug.
> 
> For the usage case (docker container running), the spilling is already
> happening, I traced its memory allocation requests, many of them are
> movable, and got fallback to the normal node naturally with current

Could you be more specific? This sounds like a bug. Allocations
shouldn't spill over to a node which is not in the cpuset. There are a few
exceptions like IRQ context, but that shouldn't happen regularly.

> code, only a few got blocked, as many of __alloc_pages_nodemask are
> called witih 'NULL' nodemask parameter.
> 
> And I made this RFC patch inspired by code in __alloc_pages_may_oom():
> 
> 	if (gfp_mask & __GFP_NOFAIL)
> 		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
> 				ALLOC_NO_WATERMARKS, ac);

I am not really sure I follow here. __GFP_NOFAIL is a special beast
because such an allocation must not fail. Breaking node affinity is the
only option left. This shouldn't be something used for regular
allocation requests.
Michal Hocko Nov. 5, 2020, 12:08 p.m. UTC | #6
On Thu 05-11-20 09:40:28, Feng Tang wrote:
> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
>  
> > > > As I've said in reply to your second patch. I think we can make the oom
> > > > killer behavior more sensible in this misconfigured cases but I do not
> > > > think we want break the cpuset isolation for such a configuration.
> > > 
> > > Do you mean we skip the killing and just let the allocation fail? We've
> > > checked the oom killer code first, when the oom happens, both DRAM
> > > node and unmovable node have lots of free memory, and killing process
> > > won't improve the situation.
> > 
> > We already do skip oom killer and fail for lowmem allocation requests already.
> > This is similar in some sense. Another option would be to kill the
> > allocating context which will have less corner cases potentially because
> > some allocation failures might be unexpected.
> 
> Yes, this can avoid the helpless oom killing to kill a good process(no
> memory pressure at all)
> 
> And I think the important thing is to judge whether this usage (binding
> docker like workload to unmovable node) is a valid case :) 

I am confused. Why would an unmovable node be a problem? Movable
allocations can be satisfied from Zone Normal just fine. It is the other
way around that is a problem.

> Initially, I thought it invalid too, but later think it still makes some
> sense for the 2 cases:
>     * user want to bind his workload to one node(most of user space
>       memory) to avoid cross-node traffic, and that node happens to
>       be configured as unmovable

See above

>     * one small DRAM node + big PMEM node, and memory latency insensitive
>       workload could be bound to the cheaper unmovable PMEM node

Please elaborate some more. As long as you have movable and normal nodes
then this should be possible with a good deal of care - most notably the
movable:kernel memory ratio shouldn't be too big.

Besides that why does PMEM node have to be MOVABLE only in the first
place?

> > > (Folloing is copied from your comments for 2/2) 
> > > > This allows to spill memory allocations over to any other node which
> > > > has Normal (or other lower) zones and as such it breaks cpuset isolation.
> > > > As I've pointed out in the reply to your cover letter it seems that
> > > > this is more of a misconfiguration than a bug.
> > > 
> > > For the usage case (docker container running), the spilling is already
> > > happening, I traced its memory allocation requests, many of them are
> > > movable, and got fallback to the normal node naturally with current
> > 
> > Could you be more specific? This sounds like a bug. Allocations
> > shouldn't spill over to a node which is not in the cpuset. There are few
> > exceptions like IRQ context but that shouldn't happen regurarly.
> 
> I mean when the docker starts, it will spawn many processes which obey
> the mem binding set, and they have some kernel page requests, which got
> successfully allocated, like the following callstack:
> 
> 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> 	[  567.044958] Call Trace:
> 	[  567.044972]  dump_stack+0x74/0x9a
> 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> 	[  567.044986]  alloc_pages_current+0x87/0xe0
> 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> 	[  567.045021]  __slab_alloc+0x20/0x40
> 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> 	[  567.045041]  alloc_inode+0x22/0xa0
> 	[  567.045045]  new_inode_pseudo+0x12/0x60
> 	[  567.045049]  new_inode+0x17/0x30
> 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> 	[  567.045060]  mqueue_fill_super+0x41/0x70
> 	[  567.045067]  vfs_get_super+0x7f/0x100
> 	[  567.045074]  get_tree_keyed+0x1d/0x20
> 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> 	[  567.045092]  fc_mount+0x13/0x50
> 	[  567.045099]  mq_create_mount+0x92/0xe0
> 	[  567.045102]  mq_init_ns+0x3b/0x50
> 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> 	[  567.045124]  ksys_unshare+0x19f/0x360
> 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> 	[  567.045135]  do_syscall_64+0x38/0x90
> 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> For it, the __alloc_pages_nodemask() will first try process's targed
> nodemask(unmovable node here), and there is no availabe zone, so it
> goes with the NULL nodemask, and get a page in the slowpath.

OK, I see your point now. I was not aware of the slab allocator not
following cpusets. Sounds like a bug to me.
Vlastimil Babka Nov. 5, 2020, 12:53 p.m. UTC | #7
On 11/5/20 1:08 PM, Michal Hocko wrote:
> On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > 
>> > Could you be more specific? This sounds like a bug. Allocations
>> > shouldn't spill over to a node which is not in the cpuset. There are few
>> > exceptions like IRQ context but that shouldn't happen regurarly.
>> 
>> I mean when the docker starts, it will spawn many processes which obey
>> the mem binding set, and they have some kernel page requests, which got
>> successfully allocated, like the following callstack:
>> 
>> 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
>> 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> 	[  567.044958] Call Trace:
>> 	[  567.044972]  dump_stack+0x74/0x9a
>> 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> 	[  567.045021]  __slab_alloc+0x20/0x40
>> 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> 	[  567.045041]  alloc_inode+0x22/0xa0
>> 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> 	[  567.045049]  new_inode+0x17/0x30
>> 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> 	[  567.045067]  vfs_get_super+0x7f/0x100
>> 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> 	[  567.045092]  fc_mount+0x13/0x50
>> 	[  567.045099]  mq_create_mount+0x92/0xe0
>> 	[  567.045102]  mq_init_ns+0x3b/0x50
>> 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> 	[  567.045124]  ksys_unshare+0x19f/0x360
>> 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> 	[  567.045135]  do_syscall_64+0x38/0x90
>> 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> For it, the __alloc_pages_nodemask() will first try process's targed
>> nodemask(unmovable node here), and there is no availabe zone, so it
>> goes with the NULL nodemask, and get a page in the slowpath.
> 
> OK, I see your point now. I was not aware of the slab allocator not
> following cpusets. Sounds like a bug to me.

SLAB and SLUB seem to not care about cpusets in the fast path. But this stack 
shows that it went all the way to the page allocator, so the cpusets should have 
been obeyed there at least.
Michal Hocko Nov. 5, 2020, 12:58 p.m. UTC | #8
On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> On 11/5/20 1:08 PM, Michal Hocko wrote:
> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > shouldn't spill over to a node which is not in the cpuset. There are few
> > > > exceptions like IRQ context but that shouldn't happen regurarly.
> > > 
> > > I mean when the docker starts, it will spawn many processes which obey
> > > the mem binding set, and they have some kernel page requests, which got
> > > successfully allocated, like the following callstack:
> > > 
> > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> > > 	[  567.044958] Call Trace:
> > > 	[  567.044972]  dump_stack+0x74/0x9a
> > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
> > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> > > 	[  567.045021]  __slab_alloc+0x20/0x40
> > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> > > 	[  567.045041]  alloc_inode+0x22/0xa0
> > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
> > > 	[  567.045049]  new_inode+0x17/0x30
> > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
> > > 	[  567.045067]  vfs_get_super+0x7f/0x100
> > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
> > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> > > 	[  567.045092]  fc_mount+0x13/0x50
> > > 	[  567.045099]  mq_create_mount+0x92/0xe0
> > > 	[  567.045102]  mq_init_ns+0x3b/0x50
> > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> > > 	[  567.045124]  ksys_unshare+0x19f/0x360
> > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> > > 	[  567.045135]  do_syscall_64+0x38/0x90
> > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > For it, the __alloc_pages_nodemask() will first try process's targed
> > > nodemask(unmovable node here), and there is no availabe zone, so it
> > > goes with the NULL nodemask, and get a page in the slowpath.
> > 
> > OK, I see your point now. I was not aware of the slab allocator not
> > following cpusets. Sounds like a bug to me.
> 
> SLAB and SLUB seem to not care about cpusets in the fast path.

Is a fallback to a different node which is outside of the cpuset
possible?

> But this
> stack shows that it went all the way to the page allocator, so the cpusets
> should have been obeyed there at least.

Looking closer, what is this dump_stack actually saying?
Feng Tang Nov. 5, 2020, 1:07 p.m. UTC | #9
On Thu, Nov 05, 2020 at 01:58:28PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> > On 11/5/20 1:08 PM, Michal Hocko wrote:
> > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > > shouldn't spill over to a node which is not in the cpuset. There are few
> > > > > exceptions like IRQ context but that shouldn't happen regurarly.
> > > > 
> > > > I mean when the docker starts, it will spawn many processes which obey
> > > > the mem binding set, and they have some kernel page requests, which got
> > > > successfully allocated, like the following callstack:
> > > > 
> > > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> > > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> > > > 	[  567.044958] Call Trace:
> > > > 	[  567.044972]  dump_stack+0x74/0x9a
> > > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
> > > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> > > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> > > > 	[  567.045021]  __slab_alloc+0x20/0x40
> > > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> > > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> > > > 	[  567.045041]  alloc_inode+0x22/0xa0
> > > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
> > > > 	[  567.045049]  new_inode+0x17/0x30
> > > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> > > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
> > > > 	[  567.045067]  vfs_get_super+0x7f/0x100
> > > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
> > > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> > > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> > > > 	[  567.045092]  fc_mount+0x13/0x50
> > > > 	[  567.045099]  mq_create_mount+0x92/0xe0
> > > > 	[  567.045102]  mq_init_ns+0x3b/0x50
> > > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> > > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> > > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> > > > 	[  567.045124]  ksys_unshare+0x19f/0x360
> > > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> > > > 	[  567.045135]  do_syscall_64+0x38/0x90
> > > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > For it, the __alloc_pages_nodemask() will first try process's targed
> > > > nodemask(unmovable node here), and there is no availabe zone, so it
> > > > goes with the NULL nodemask, and get a page in the slowpath.
> > > 
> > > OK, I see your point now. I was not aware of the slab allocator not
> > > following cpusets. Sounds like a bug to me.
> > 
> > SLAB and SLUB seem to not care about cpusets in the fast path.
> 
> Is a fallback to a different node which is outside of the cpuset
> possible?
 
My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'.

And actually in this usage, I have seen other types of kernel allocation
requests fall back to the normal node which is not in the cpuset mems
nodemask, like:

	[  567.510901] CPU: 3 PID: 2022 Comm: runc:[2:INIT] Tainted: G        W I       5.9.0-rc8+ #6
	[  567.510902] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
	[  567.510903] Call Trace:
	[  567.510909]  dump_stack+0x74/0x9a
	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.510913]  alloc_pages_current+0x87/0xe0
	[  567.510914]  __vmalloc_node_range+0x14c/0x240
	[  567.510918]  module_alloc+0x82/0xe0
	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
	[  567.510925]  bpf_int_jit_compile+0x145/0x424
	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
	[  567.510928]  bpf_prepare_filter+0x44c/0x4b0
	[  567.510932]  bpf_prog_create_from_user+0xc7/0x120
	[  567.510934]  do_seccomp+0x118/0x990
	[  567.510937]  __x64_sys_seccomp+0x1a/0x20
	[  567.510939]  do_syscall_64+0x38/0x90

And its gfp_mask is (GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN)

Thanks,
Feng

> > But this
> > stack shows that it went all the way to the page allocator, so the cpusets
> > should have been obeyed there at least.
> 
> Looking closer what is this dump_stack saying actually?
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 5, 2020, 1:12 p.m. UTC | #10
On Thu 05-11-20 21:07:10, Feng Tang wrote:
[...]
> My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'

Can you provide the full information please? Which node has been
requested? Which cpuset does the calling process run in, and from which
node did the allocation succeed? A bare dump_stack without any further
context is not really helpful.
Vlastimil Babka Nov. 5, 2020, 1:14 p.m. UTC | #11
On 11/5/20 1:58 PM, Michal Hocko wrote:
> On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
>> On 11/5/20 1:08 PM, Michal Hocko wrote:
>> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > > > > Could you be more specific? This sounds like a bug. Allocations
>> > > > shouldn't spill over to a node which is not in the cpuset. There are few
>> > > > exceptions like IRQ context but that shouldn't happen regurarly.
>> > > 
>> > > I mean when the docker starts, it will spawn many processes which obey
>> > > the mem binding set, and they have some kernel page requests, which got
>> > > successfully allocated, like the following callstack:
>> > > 
>> > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
>> > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> > > 	[  567.044958] Call Trace:
>> > > 	[  567.044972]  dump_stack+0x74/0x9a
>> > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> > > 	[  567.045021]  __slab_alloc+0x20/0x40
>> > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> > > 	[  567.045041]  alloc_inode+0x22/0xa0
>> > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> > > 	[  567.045049]  new_inode+0x17/0x30
>> > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> > > 	[  567.045067]  vfs_get_super+0x7f/0x100
>> > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> > > 	[  567.045092]  fc_mount+0x13/0x50
>> > > 	[  567.045099]  mq_create_mount+0x92/0xe0
>> > > 	[  567.045102]  mq_init_ns+0x3b/0x50
>> > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> > > 	[  567.045124]  ksys_unshare+0x19f/0x360
>> > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> > > 	[  567.045135]  do_syscall_64+0x38/0x90
>> > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > > 
>> > > For it, the __alloc_pages_nodemask() will first try process's targed
>> > > nodemask(unmovable node here), and there is no availabe zone, so it
>> > > goes with the NULL nodemask, and get a page in the slowpath.
>> > 
>> > OK, I see your point now. I was not aware of the slab allocator not
>> > following cpusets. Sounds like a bug to me.
>> 
>> SLAB and SLUB seem to not care about cpusets in the fast path.
> 
> Is a fallback to a different node which is outside of the cpuset
> possible?

AFAICS anything in the per-cpu cache will be allocated without looking at the
cpuset, so it can be outside of the cpuset. In the SLUB slowpath,
get_partial_node() looking for a fallback on the same node will also not look
at the cpuset. get_any_partial(), looking for a fallback allocation on any node,
does check cpuset_zone_allowed() and obeys it strictly. A fallback to the page
allocator will obey whatever the page allocator obeys.

So if a process cannot allocate from node X via cpuset *and* also cannot be
executed on CPUs from node X via taskset, then AFAICS it effectively cannot
violate the cpuset in SLUB, because it won't reach the percpu or per-node
caches that don't check cpusets.
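
For reference, the cross-node fallback mentioned above looks roughly like
this (paraphrased and abridged from 5.9-era mm/slub.c, get_any_partial()):

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		zonelist = node_zonelist(mempolicy_slab_node(), flags);
		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
			struct kmem_cache_node *n = get_node(s, zone_to_nid(zone));

			/* the cross-node fallback does consult cpusets */
			if (n && cpuset_zone_allowed(zone, flags) &&
					n->nr_partial > s->min_partial) {
				object = get_partial_node(s, n, c, flags);
				if (object)
					return object;
			}
		}
	} while (read_mems_allowed_retry(cpuset_mems_cookie));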

>> But this
>> stack shows that it went all the way to the page allocator, so the cpusets
>> should have been obeyed there at least.
> 
> Looking closer what is this dump_stack saying actually?

Yes, is that a dump of successful allocation (that violates cpusets?) or a 
failing one?
Michal Hocko Nov. 5, 2020, 1:19 p.m. UTC | #12
On Thu 05-11-20 14:14:25, Vlastimil Babka wrote:
> On 11/5/20 1:58 PM, Michal Hocko wrote:
> > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> > > On 11/5/20 1:08 PM, Michal Hocko wrote:
> > > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > > > shouldn't spill over to a node which is not in the cpuset. There are few
> > > > > > exceptions like IRQ context but that shouldn't happen regurarly.
> > > > > > > I mean when the docker starts, it will spawn many processes
> > > which obey
> > > > > the mem binding set, and they have some kernel page requests, which got
> > > > > successfully allocated, like the following callstack:
> > > > > > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD]
> > > Tainted: G        W I       5.9.0-rc8+ #6
> > > > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> > > > > 	[  567.044958] Call Trace:
> > > > > 	[  567.044972]  dump_stack+0x74/0x9a
> > > > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
> > > > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> > > > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> > > > > 	[  567.045021]  __slab_alloc+0x20/0x40
> > > > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> > > > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> > > > > 	[  567.045041]  alloc_inode+0x22/0xa0
> > > > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
> > > > > 	[  567.045049]  new_inode+0x17/0x30
> > > > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> > > > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
> > > > > 	[  567.045067]  vfs_get_super+0x7f/0x100
> > > > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
> > > > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> > > > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> > > > > 	[  567.045092]  fc_mount+0x13/0x50
> > > > > 	[  567.045099]  mq_create_mount+0x92/0xe0
> > > > > 	[  567.045102]  mq_init_ns+0x3b/0x50
> > > > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> > > > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> > > > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> > > > > 	[  567.045124]  ksys_unshare+0x19f/0x360
> > > > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> > > > > 	[  567.045135]  do_syscall_64+0x38/0x90
> > > > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > For it, the __alloc_pages_nodemask() will first try
> > > process's targed
> > > > > nodemask(unmovable node here), and there is no availabe zone, so it
> > > > > goes with the NULL nodemask, and get a page in the slowpath.
> > > > > OK, I see your point now. I was not aware of the slab allocator
> > > not
> > > > following cpusets. Sounds like a bug to me.
> > > 
> > > SLAB and SLUB seem to not care about cpusets in the fast path.
> > 
> > Is a fallback to a different node which is outside of the cpuset
> > possible?
> 
> AFAICS anything in per-cpu cache will be allocated without looking at the
> cpuset, so it can be outside of the cpuset. In SLUB slowpath,
> get_partial_node() looking for fallback on the same node will also not look
> at cpuset. get_any_partial() looking for a fallback allocation on any node
> does check cpuset_zone_allowed() and obey it strictly. A fallback to page
> allocator will obey whatever page allocator obeys.

IIUC this means that if there is no strong CPU binding to cpuset nodes
then a runaway is possible. Albeit only partially and relying on
somebody to fill up pcp object caches, right?

Is that an oversight, a design decision, or a performance optimization?
Vlastimil Babka Nov. 5, 2020, 1:34 p.m. UTC | #13
On 11/5/20 2:19 PM, Michal Hocko wrote:
> On Thu 05-11-20 14:14:25, Vlastimil Babka wrote:
>> On 11/5/20 1:58 PM, Michal Hocko wrote:
>> > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
>> > > On 11/5/20 1:08 PM, Michal Hocko wrote:
>> > > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > > > > > > Could you be more specific? This sounds like a bug. Allocations
>> > > > > > shouldn't spill over to a node which is not in the cpuset. There are few
>> > > > > > exceptions like IRQ context but that shouldn't happen regurarly.
>> > > > > > > I mean when the docker starts, it will spawn many processes
>> > > which obey
>> > > > > the mem binding set, and they have some kernel page requests, which got
>> > > > > successfully allocated, like the following callstack:
>> > > > > > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD]
>> > > Tainted: G        W I       5.9.0-rc8+ #6
>> > > > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> > > > > 	[  567.044958] Call Trace:
>> > > > > 	[  567.044972]  dump_stack+0x74/0x9a
>> > > > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> > > > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> > > > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> > > > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> > > > > 	[  567.045021]  __slab_alloc+0x20/0x40
>> > > > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> > > > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> > > > > 	[  567.045041]  alloc_inode+0x22/0xa0
>> > > > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> > > > > 	[  567.045049]  new_inode+0x17/0x30
>> > > > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> > > > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> > > > > 	[  567.045067]  vfs_get_super+0x7f/0x100
>> > > > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> > > > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> > > > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> > > > > 	[  567.045092]  fc_mount+0x13/0x50
>> > > > > 	[  567.045099]  mq_create_mount+0x92/0xe0
>> > > > > 	[  567.045102]  mq_init_ns+0x3b/0x50
>> > > > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> > > > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> > > > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> > > > > 	[  567.045124]  ksys_unshare+0x19f/0x360
>> > > > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> > > > > 	[  567.045135]  do_syscall_64+0x38/0x90
>> > > > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > > > > > > For it, the __alloc_pages_nodemask() will first try
>> > > process's targed
>> > > > > nodemask(unmovable node here), and there is no availabe zone, so it
>> > > > > goes with the NULL nodemask, and get a page in the slowpath.
>> > > > > OK, I see your point now. I was not aware of the slab allocator
>> > > not
>> > > > following cpusets. Sounds like a bug to me.
>> > > 
>> > > SLAB and SLUB seem to not care about cpusets in the fast path.
>> > 
>> > Is a fallback to a different node which is outside of the cpuset
>> > possible?
>> 
>> AFAICS anything in per-cpu cache will be allocated without looking at the
>> cpuset, so it can be outside of the cpuset. In SLUB slowpath,
>> get_partial_node() looking for fallback on the same node will also not look
>> at cpuset. get_any_partial() looking for a fallback allocation on any node
>> does check cpuset_zone_allowed() and obey it strictly. A fallback to page
>> allocator will obey whatever page allocator obeys.
> 
> IIUC this means that if there is no strong CPU binding to cpuset nodes
> then a runaway is possible. Albeit only partially and relying on
> somebody to fill up pcp object caches, right?

Seems so.

> Is that an overlook or a decision design or a performance optimization?
  ... yes :)

More seriously, let's ask Christoph, as that code goes all the way back to the
first SLUB commit.
On the SLAB side, it would seem percpu caches came first, then cpuset support
was added to the page allocator, but SLAB was oblivious until it learned some
of it in commit 765c4507af71c to properly support __GFP_THISNODE.

My guess is that the amount of cpuset constraint violation through percpu caches 
was never considered to be a problem serious enough to justify making the 
fastpaths slower.
Feng Tang Nov. 5, 2020, 1:43 p.m. UTC | #14
On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 21:07:10, Feng Tang wrote:
> [...]
> > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'
> 
> Can you provide the full information please? Which node has been
> requested. Which cpuset the calling process run in and which node has
> the allocation succeeded from? A bare dump_stack without any further
> context is not really helpful.

I don't have the same platform as the original report, so I simulated
a similar setup (with fakenuma and movablecore) which has 2 memory
nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only the
Movable zone. With it, I could get the same error and the same oom
callstack as the original report (as in the cover letter).

The test command is:
	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

To debug, I only added some tracing in __alloc_pages_nodemask(), and
for the callstack which gets the page successfully:

	[  567.510903] Call Trace:
	[  567.510909]  dump_stack+0x74/0x9a
	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.510913]  alloc_pages_current+0x87/0xe0
	[  567.510914]  __vmalloc_node_range+0x14c/0x240
	[  567.510918]  module_alloc+0x82/0xe0
	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
	[  567.510925]  bpf_int_jit_compile+0x145/0x424
	[  567.510926]  bpf_prog_select_runtime+0xac/0x130

The incoming nodemask parameter is NULL, and the function will first try the
cpuset nodemask (1 here); the allocation is only granted zoneidx up to 2,
which makes the 'ac' preferred zone NULL. So it goes into
__alloc_pages_slowpath(), which first resets the nodemask to 'NULL', and
this time it gets a preferred zone: zone DMA32 from node 0, and the
following get_page_from_freelist() allocates one page from that zone.
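
For reference, the hand-off described above is this part of
__alloc_pages_nodemask() (roughly, as of 5.9):

	/*
	 * Restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 */
	ac.nodemask = nodemask;

	page = __alloc_pages_slowpath(alloc_mask, order, &ac);

With a NULL incoming nodemask, the slowpath therefore retries with
nodemask == NULL rather than the cpuset-derived mask used by the fast path.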

Thanks,
Feng

> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 5, 2020, 4:16 p.m. UTC | #15
On Thu 05-11-20 21:43:05, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > [...]
> > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'
> > 
> > Can you provide the full information please? Which node has been
> > requested. Which cpuset the calling process run in and which node has
> > the allocation succeeded from? A bare dump_stack without any further
> > context is not really helpful.
> 
> I don't have the same platform as the original report, so I simulated
> one similar setup (with fakenuma and movablecore), which has 2 memory
> nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only
> Movable zone. With it, I can got the same error and same oom callstack
> as the original report (as in the cover-letter).
> 
> The test command is:
> 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> 
> To debug I only added some trace in the __alloc_pages_nodemask(), and
> for the callstack which get the page successfully:
> 
> 	[  567.510903] Call Trace:
> 	[  567.510909]  dump_stack+0x74/0x9a
> 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> 	[  567.510913]  alloc_pages_current+0x87/0xe0
> 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> 	[  567.510918]  module_alloc+0x82/0xe0
> 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130

As already said this doesn't really tell much without the additional
information.

> The incomming parameter nodemask is NULL, and the function will first try the
> cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the
> 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(),
> which will first set the nodemask to 'NULL', and this time it got a preferred
> zone: zone DMA32 from node 0, following get_page_from_freelist will allocate
> one page from that zone. 

I do not follow. Both the hot and slow paths of the allocator set
ALLOC_CPUSET, or emulate it via mems_allowed when cpusets are enabled,
IIRC. This is later enforced in get_page_from_freelist. There are some
exceptions when the allocating process can run away from its cpusets -
e.g. IRQs, OOM victims and a few other cases but definitely not a random
allocation. There might be some subtle details that have changed or I
might have forgotten, but
Huang, Ying Nov. 6, 2020, 4:32 a.m. UTC | #16
Michal Hocko <mhocko@suse.com> writes:

> On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
>>  
>> > > > As I've said in reply to your second patch. I think we can make the oom
>> > > > killer behavior more sensible in this misconfigured cases but I do not
>> > > > think we want break the cpuset isolation for such a configuration.
>> > > 
>> > > Do you mean we skip the killing and just let the allocation fail? We've
>> > > checked the oom killer code first, when the oom happens, both DRAM
>> > > node and unmovable node have lots of free memory, and killing process
>> > > won't improve the situation.
>> > 
>> > We already do skip oom killer and fail for lowmem allocation requests already.
>> > This is similar in some sense. Another option would be to kill the
>> > allocating context which will have less corner cases potentially because
>> > some allocation failures might be unexpected.
>> 
>> Yes, this can avoid the helpless oom killing to kill a good process(no
>> memory pressure at all)
>> 
>> And I think the important thing is to judge whether this usage (binding
>> docker like workload to unmovable node) is a valid case :) 
>
> I am confused. Why wouldbe an unmovable node a problem. Movable
> allocations can be satisfied from the Zone Normal just fine. It is other
> way around that is a problem.
>
>> Initially, I thought it invalid too, but later think it still makes some
>> sense for the 2 cases:
>>     * user want to bind his workload to one node(most of user space
>>       memory) to avoid cross-node traffic, and that node happens to
>>       be configured as unmovable
>
> See above
>
>>     * one small DRAM node + big PMEM node, and memory latency insensitive
>>       workload could be bound to the cheaper unmovable PMEM node
>
> Please elaborate some more. As long as you have movable and normal nodes
> then this should be possible with a deal of care - most notably the
> movable:kernel ratio memory shouldn't be too big.
>
> Besides that why does PMEM node have to be MOVABLE only in the first
> place?

The performance of PMEM is much worse than that of DRAM.  If we find
that some pages on PMEM are accessed frequently (hot), we may want to
move them to DRAM to optimize system performance.  If unmovable pages
are allocated on PMEM and become hot, it's possible that we cannot move
them to DRAM without rebooting the system.  So we think we should make
the PMEM nodes MOVABLE only.

Best Regards,
Huang, Ying
Feng Tang Nov. 6, 2020, 7:06 a.m. UTC | #17
On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > [...]
> > > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'
> > > 
> > > Can you provide the full information please? Which node has been
> > > requested. Which cpuset the calling process run in and which node has
> > > the allocation succeeded from? A bare dump_stack without any further
> > > context is not really helpful.
> > 
> > I don't have the same platform as the original report, so I simulated
> > one similar setup (with fakenuma and movablecore), which has 2 memory
> > nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only
> > Movable zone. With it, I can got the same error and same oom callstack
> > as the original report (as in the cover-letter).
> > 
> > The test command is:
> > 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > 
> > To debug I only added some trace in the __alloc_pages_nodemask(), and
> > for the callstack which get the page successfully:
> > 
> > 	[  567.510903] Call Trace:
> > 	[  567.510909]  dump_stack+0x74/0x9a
> > 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > 	[  567.510913]  alloc_pages_current+0x87/0xe0
> > 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> > 	[  567.510918]  module_alloc+0x82/0xe0
> > 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> > 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
> 
> As already said this doesn't really tell much without the additional
> information.
> 
> > The incomming parameter nodemask is NULL, and the function will first try the
> > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the
> > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(),
> > which will first set the nodemask to 'NULL', and this time it got a preferred
> > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate
> > one page from that zone. 
> 
> I do not follow. Both hot and slow paths of the allocator set
> ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled
> IIRC. This is later enforced in get_page_from_free_list. There are some
> exceptions when the allocating process can run away from its cpusets -
> e.g. IRQs, OOM victims and few other cases but definitely not a random
> allocation. There might be some subtle details that have changed or I
> might have forgot but 

yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
is 

	if (cpusets_enabled() &&
		(alloc_flags & ALLOC_CPUSET) &&
		!__cpuset_zone_allowed(zone, gfp_mask))

In our case (a kernel page got allocated), the first 2 conditions are true,
and for __cpuset_zone_allowed(), the place that can return true is the
check against the parent cpuset's nodemask:

	cs = nearest_hardwall_ancestor(task_cs(current));
	allowed = node_isset(node, cs->mems_allowed);

This will override the ALLOC_CPUSET check.
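
For completeness, the surrounding logic is roughly this (abridged from
memory, kernel/cgroup/cpuset.c, __cpuset_node_allowed(), which
__cpuset_zone_allowed() wraps):

	if (in_interrupt())
		return true;
	if (node_isset(node, current->mems_allowed))
		return true;
	/* OOM victims may allocate anywhere */
	if (unlikely(tsk_is_oom_victim(current)))
		return true;
	if (gfp_mask & __GFP_HARDWALL)	/* hardwall request: stop here */
		return false;
	if (current->flags & PF_EXITING)	/* let dying tasks have memory */
		return true;

	/* not hardwalled and node not in mems_allowed: scan up the hierarchy */
	cs = nearest_hardwall_ancestor(task_cs(current));
	allowed = node_isset(node, cs->mems_allowed);

So a GFP_KERNEL request (no __GFP_HARDWALL) reaches the ancestor check and
can be allowed on a node outside the task's own mems_allowed.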

Thanks,
Feng
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 6, 2020, 7:43 a.m. UTC | #18
On Fri 06-11-20 12:32:44, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> >> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
> >>  
> >> > > > As I've said in reply to your second patch. I think we can make the oom
> >> > > > killer behavior more sensible in this misconfigured cases but I do not
> >> > > > think we want break the cpuset isolation for such a configuration.
> >> > > 
> >> > > Do you mean we skip the killing and just let the allocation fail? We've
> >> > > checked the oom killer code first, when the oom happens, both DRAM
> >> > > node and unmovable node have lots of free memory, and killing process
> >> > > won't improve the situation.
> >> > 
> >> > We already do skip oom killer and fail for lowmem allocation requests already.
> >> > This is similar in some sense. Another option would be to kill the
> >> > allocating context which will have less corner cases potentially because
> >> > some allocation failures might be unexpected.
> >> 
> >> Yes, this can avoid the helpless oom killing to kill a good process(no
> >> memory pressure at all)
> >> 
> >> And I think the important thing is to judge whether this usage (binding
> >> docker like workload to unmovable node) is a valid case :) 
> >
> > I am confused. Why wouldbe an unmovable node a problem. Movable
> > allocations can be satisfied from the Zone Normal just fine. It is other
> > way around that is a problem.
> >
> >> Initially, I thought it invalid too, but later think it still makes some
> >> sense for the 2 cases:
> >>     * user want to bind his workload to one node(most of user space
> >>       memory) to avoid cross-node traffic, and that node happens to
> >>       be configured as unmovable
> >
> > See above
> >
> >>     * one small DRAM node + big PMEM node, and memory latency insensitive
> >>       workload could be bound to the cheaper unmovable PMEM node
> >
> > Please elaborate some more. As long as you have movable and normal nodes
> > then this should be possible with a deal of care - most notably the
> > movable:kernel ratio memory shouldn't be too big.
> >
> > Besides that why does PMEM node have to be MOVABLE only in the first
> > place?
> 
> The performance of PMEM is much worse than that of DRAM.  If we found
> that some pages on PMEM are accessed frequently (hot), we may want to
> move them to DRAM to optimize the system performance.  If the unmovable
> pages are allocated on PMEM and hot, it's possible that we cannot move
> the pages to DRAM unless rebooting the system.  So we think we should
> make the PMEM nodes to be MOVABLE only.

That is fair, but then you really need a fallback node too. So this is
a mere optimization rather than a fundamental restriction.
Michal Hocko Nov. 6, 2020, 8:10 a.m. UTC | #19
On Fri 06-11-20 15:06:56, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > > [...]
> > > > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'
> > > > 
> > > > Can you provide the full information please? Which node has been
> > > > requested. Which cpuset the calling process run in and which node has
> > > > the allocation succeeded from? A bare dump_stack without any further
> > > > context is not really helpful.
> > > 
> > > I don't have the same platform as the original report, so I simulated
> > > one similar setup (with fakenuma and movablecore), which has 2 memory
> > > nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only
> > > Movable zone. With it, I can got the same error and same oom callstack
> > > as the original report (as in the cover-letter).
> > > 
> > > The test command is:
> > > 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > > 
> > > To debug I only added some trace in the __alloc_pages_nodemask(), and
> > > for the callstack which get the page successfully:
> > > 
> > > 	[  567.510903] Call Trace:
> > > 	[  567.510909]  dump_stack+0x74/0x9a
> > > 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > 	[  567.510913]  alloc_pages_current+0x87/0xe0
> > > 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> > > 	[  567.510918]  module_alloc+0x82/0xe0
> > > 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > > 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > > 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> > > 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
> > 
> > As already said this doesn't really tell much without the additional
> > information.
> > 
> > > The incomming parameter nodemask is NULL, and the function will first try the
> > > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the
> > > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(),
> > > which will first set the nodemask to 'NULL', and this time it got a preferred
> > > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate
> > > one page from that zone. 
> > 
> > I do not follow. Both hot and slow paths of the allocator set
> > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled
> > IIRC. This is later enforced in get_page_from_free_list. There are some
> > exceptions when the allocating process can run away from its cpusets -
> > e.g. IRQs, OOM victims and few other cases but definitely not a random
> > allocation. There might be some subtle details that have changed or I
> > might have forgot but 
> 
> yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
> is 
> 
> 	if (cpusets_enabled() &&
> 		(alloc_flags & ALLOC_CPUSET) &&
> 		!__cpuset_zone_allowed(zone, gfp_mask))
> 
> In our case (kernel page got allocated), the first 2 conditions are true,
> and for __cpuset_zone_allowed(), the possible place to return true is
> checking parent cpuset's nodemask
> 
> 	cs = nearest_hardwall_ancestor(task_cs(current));
> 	allowed = node_isset(node, cs->mems_allowed);
> 
> This will override the ALLOC_CPUSET check.

Yes, and this is OK because that is the defined hierarchical semantic of
cpusets, which applies to any !hardwalled allocation. Cpusets are quite
non-intuitive. Re-reading the previous discussion I have realized that my
trying not to go into those details might have misled you. Let me
try again and clarify that now.

I was talking in the context of the patch you are proposing, and that is a
clear violation of the cpuset isolation, especially for hardwalled
setups, because it allows spilling over to other nodes, which shouldn't be
possible except for a few exceptions which shouldn't generate a lot of
allocations (e.g. an oom victim exiting, IRQ context).

What I was not talking about, and should have been clearer about, is
that without hardwall resp. exclusive nodes the isolation is best effort
only for most kernel allocation requests (or more specifically those
without __GFP_HARDWALL). Your patch doesn't distinguish between those
and other non-movable allocations, and effectively allows a runaway even
for hardwalled allocations which are not movable. Those can be controlled
by userspace very easily.
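
For reference, the hardwall distinction comes from the gfp mask definitions
(include/linux/gfp.h, 5.9-era, quoted from memory):

	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
	#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
	#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)

So the GFP_HIGHUSER request from pipe_write in the cover letter is
hardwalled and fails on the movable-only cpuset, while GFP_KERNEL slab and
vmalloc requests are not hardwalled and can fall back via the ancestor
check discussed earlier.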

I hope this clarifies it a bit more, and sorry if I misled you.
Feng Tang Nov. 6, 2020, 9:08 a.m. UTC | #20
On Fri, Nov 06, 2020 at 09:10:26AM +0100, Michal Hocko wrote:
> > > > The incomming parameter nodemask is NULL, and the function will first try the
> > > > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the
> > > > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(),
> > > > which will first set the nodemask to 'NULL', and this time it got a preferred
> > > > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate
> > > > one page from that zone. 
> > > 
> > > I do not follow. Both hot and slow paths of the allocator set
> > > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled
> > > IIRC. This is later enforced in get_page_from_free_list. There are some
> > > exceptions when the allocating process can run away from its cpusets -
> > > e.g. IRQs, OOM victims and few other cases but definitely not a random
> > > allocation. There might be some subtle details that have changed or I
> > > might have forgot but 
> > 
> > yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
> > is 
> > 
> > 	if (cpusets_enabled() &&
> > 		(alloc_flags & ALLOC_CPUSET) &&
> > 		!__cpuset_zone_allowed(zone, gfp_mask))
> > 
> > In our case (kernel page got allocated), the first 2 conditions are true,
> > and for __cpuset_zone_allowed(), the possible place to return true is
> > checking parent cpuset's nodemask
> > 
> > 	cs = nearest_hardwall_ancestor(task_cs(current));
> > 	allowed = node_isset(node, cs->mems_allowed);
> > 
> > This will override the ALLOC_CPUSET check.
> 
> Yes and this is ok because that is defined hierarchical semantic of the
> cpusets which applies to any !hardwalled allocation. Cpusets are quite
> non intuitive. Re-reading the previous discussion I have realized that
> me trying to not go into those details might have mislead you. Let me
> try again and clarify that now.
> 
> I was talking in context of the patch you are proposing and that is a
> clear violation of the cpuset isolation. Especially for hardwalled
> setups because it allows to spill over to other nodes which shouldn't be
> possible except for few exceptions which shouldn't generate a lot of
> allocations (e.g. oom victim exiting, IRQ context).

I agree my patch is pretty hacky. As said in the cover letter, I wanted to
bring up this usage case and get suggestions on how to support it.
 
> What I was not talking about, and should have been more clear about, is
> that without hardwall resp. exclusive nodes the isolation is best effort
> only for most kernel allocation requests (or more specifically those
> without __GFP_HARDWALL). Your patch doesn't distinguish between those
> and any non movable allocations and effectively allowed to runaway even
> for hardwalled allocations which are not movable. Those can be controlled
> by userspace very easily.

You are right, there are quite a few types of page allocation failures.
The callstack in patch 2/2 is a GFP_HIGHUSER allocation from pipe_write,
and there are more types of kernel allocation requests which get blocked
by the different checks. My RFC patch just gave the easiest one-for-all
hack to let them bypass it.

Do we need to tackle them case by case?

> I hope this clarifies it a bit more and sorry if I mislead you.

Yes, it does and many thanks for the clarifying!

- Feng

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 6, 2020, 10:35 a.m. UTC | #21
On Fri 06-11-20 17:08:57, Feng Tang wrote:
[...]
> You are right, there are quiet several types of page allocations failures.
> The callstack in patch 2/2 is a GFP_HIGHUSER from pipe_write, and there
> are more types of kernel allocation requests which will got blocked by
> the differnt  check. My RFC patch just gave a easiest one-for-all hack to
> let them bypass it.
> 
> Do we need to tackle them case by case?

No. I do not see how we can change those __GFP_HARDWALL allocations without
breaking the isolation.