Message ID | 1604470210-124827-1-git-send-email-feng.tang@intel.com (mailing list archive) |
---|---|
Series | mm: fix OOMs for binding workloads to movable zone only node |
On Wed 04-11-20 14:10:08, Feng Tang wrote: > Hi, > > This patchset tries to report a problem and get suggestion/review > for the RFC fix patches. > > We recently got a OOM report, that when user try to bind a docker(container) > instance to a memory node which only has movable zones, and OOM killing > still can't solve the page allocation failure. This is a cpuset node binding right? > The callstack was: > > [ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 > [ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) > [ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 > [ 1387.877570] Call Trace: > [ 1387.877579] dump_stack+0x6b/0x88 > [ 1387.877584] dump_header+0x4a/0x1e2 > [ 1387.877586] oom_kill_process.cold+0xb/0x10 > [ 1387.877588] out_of_memory.part.0+0xaf/0x230 > [ 1387.877591] out_of_memory+0x3d/0x80 > [ 1387.877595] __alloc_pages_slowpath.constprop.0+0x954/0xa20 > [ 1387.877599] __alloc_pages_nodemask+0x2d3/0x300 > [ 1387.877602] pipe_write+0x322/0x590 > [ 1387.877607] new_sync_write+0x196/0x1b0 > [ 1387.877609] vfs_write+0x1c3/0x1f0 > [ 1387.877611] ksys_write+0xa7/0xe0 > [ 1387.877617] do_syscall_64+0x52/0xd0 > [ 1387.877621] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > The meminfo log only shows the movable only node, which has plenty > of free memory. And in our reproducing with 1/2 patch, the normal > node (has DMA/DMA32/Normal) also has lot of free memory when OOM > happens. OK, so you are bidning to a movable node only and your above request is for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones because that memory is not movable. So the system behaves as expected. Your cpuset is misconfigured IMHO. Movable only nodes come with their risk and configuration price. > If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get > a page, and following full docker run (like installing and running > 'stress-ng' stress test) will see more allocation failures due to > different kinds of request(gfp_masks). And the 2/2 patch will detect > such cases that the allowed target nodes only have movable zones > and loose the binding check, otherwise it will trigger OOM while > the OOM won't do any help, as the problem is not lack of free memory. Well, this breaks the cpuset containment, right? I consider this quite unexpected for something that looks like a misconfiguration. I do agree that this is unexpected for anybody who is not really familiar with concept of movable zone but we should probably call out all these details rather than tweak the existing semantic. Could you be more specific about the usecase here? Why do you need a binding to a pure movable node?
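For readers less familiar with the zone selection Michal refers to: only requests carrying __GFP_MOVABLE are eligible for ZONE_MOVABLE, so a cpuset that contains nothing but a movable-only node leaves a GFP_HIGHUSER request with no usable zone at all. A minimal user-space sketch of that rule (illustrative flag values, not the kernel's real gfp_zone()):

```c
#include <stdbool.h>
#include <stdio.h>

/* Zone ordering as in the kernel: ZONE_MOVABLE sits above the others. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };

#define __GFP_MOVABLE        0x1u   /* placeholder bit value */
#define GFP_HIGHUSER         0x0u   /* no __GFP_MOVABLE */
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)

/* Highest zone a request may be served from (simplified gfp_zone() idea). */
static enum zone_type highest_allowed_zone(unsigned int gfp)
{
	return (gfp & __GFP_MOVABLE) ? ZONE_MOVABLE : ZONE_NORMAL;
}

/* A node can serve the request only if it has a zone at or below that limit. */
static bool node_can_satisfy(const enum zone_type *zones, int nr, unsigned int gfp)
{
	enum zone_type limit = highest_allowed_zone(gfp);

	for (int i = 0; i < nr; i++)
		if (zones[i] <= limit)
			return true;
	return false;
}

int main(void)
{
	enum zone_type movable_only[] = { ZONE_MOVABLE };  /* e.g. a PMEM node */

	printf("GFP_HIGHUSER          -> %s\n",
	       node_can_satisfy(movable_only, 1, GFP_HIGHUSER) ?
	       "served" : "no eligible zone, OOM path");
	printf("GFP_HIGHUSER_MOVABLE  -> %s\n",
	       node_can_satisfy(movable_only, 1, GFP_HIGHUSER_MOVABLE) ?
	       "served" : "no eligible zone");
	return 0;
}
```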
Hi Michal, Thanks for the prompt review! On Wed, Nov 04, 2020 at 08:13:08AM +0100, Michal Hocko wrote: > On Wed 04-11-20 14:10:08, Feng Tang wrote: > > Hi, > > > > This patchset tries to report a problem and get suggestion/review > > for the RFC fix patches. > > > > We recently got a OOM report, that when user try to bind a docker(container) > > instance to a memory node which only has movable zones, and OOM killing > > still can't solve the page allocation failure. > > This is a cpuset node binding right? Yes. A simple test command is 'docker run -it --cpuset-mems 3 ubuntu:latest' while the node 3 is a movable only PMEM node. > > The callstack was: > > > > [ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 > > [ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) > > [ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 > > [ 1387.877570] Call Trace: > > [ 1387.877579] dump_stack+0x6b/0x88 > > [ 1387.877584] dump_header+0x4a/0x1e2 > > [ 1387.877586] oom_kill_process.cold+0xb/0x10 > > [ 1387.877588] out_of_memory.part.0+0xaf/0x230 > > [ 1387.877591] out_of_memory+0x3d/0x80 > > [ 1387.877595] __alloc_pages_slowpath.constprop.0+0x954/0xa20 > > [ 1387.877599] __alloc_pages_nodemask+0x2d3/0x300 > > [ 1387.877602] pipe_write+0x322/0x590 > > [ 1387.877607] new_sync_write+0x196/0x1b0 > > [ 1387.877609] vfs_write+0x1c3/0x1f0 > > [ 1387.877611] ksys_write+0xa7/0xe0 > > [ 1387.877617] do_syscall_64+0x52/0xd0 > > [ 1387.877621] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > The meminfo log only shows the movable only node, which has plenty > > of free memory. And in our reproducing with 1/2 patch, the normal > > node (has DMA/DMA32/Normal) also has lot of free memory when OOM > > happens. > > OK, so you are bidning to a movable node only and your above request is > for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones > because that memory is not movable. So the system behaves as expected. > Your cpuset is misconfigured IMHO. Movable only nodes come with their > risk and configuration price. Aha, this is what we told the reporter at first. Their platform is 2S platform, and each socket has one DRAM node + one persistent memory node, and we suggested to bind the docker to one DRAM + one PMEM node. > > If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get > > a page, and following full docker run (like installing and running > > 'stress-ng' stress test) will see more allocation failures due to > > different kinds of request(gfp_masks). And the 2/2 patch will detect > > such cases that the allowed target nodes only have movable zones > > and loose the binding check, otherwise it will trigger OOM while > > the OOM won't do any help, as the problem is not lack of free memory. > > Well, this breaks the cpuset containment, right? I consider this quite > unexpected for something that looks like a misconfiguration. I do agree > that this is unexpected for anybody who is not really familiar with > concept of movable zone but we should probably call out all these > details rather than tweak the existing semantic. Yes, it does break the cpuset containment > Could you be more specific about the usecase here? Why do you need a > binding to a pure movable node? 
One common configuration for a platform is small size of DRAM plus huge size of PMEM (which is slower but cheaper), and my guess of their use is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE) to PMEM node, and only let DRAM be used as less as possible. Thanks, Feng > -- > Michal Hocko > SUSE Labs
On Wed 04-11-20 15:38:26, Feng Tang wrote: [...] > > Could you be more specific about the usecase here? Why do you need a > > binding to a pure movable node? > > One common configuration for a platform is small size of DRAM plus huge > size of PMEM (which is slower but cheaper), and my guess of their use > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE) > to PMEM node, and only let DRAM be used as less as possible. While this is possible, it is a tricky configuration. It is essentially get us back to 32b and highmem... As I've said in reply to your second patch. I think we can make the oom killer behavior more sensible in this misconfigured cases but I do not think we want break the cpuset isolation for such a configuration.
On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote: > On Wed 04-11-20 15:38:26, Feng Tang wrote: > [...] > > > Could you be more specific about the usecase here? Why do you need a > > > binding to a pure movable node? > > > > One common configuration for a platform is small size of DRAM plus huge > > size of PMEM (which is slower but cheaper), and my guess of their use > > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE) > > to PMEM node, and only let DRAM be used as less as possible. > > While this is possible, it is a tricky configuration. It is essentially > get us back to 32b and highmem... :) Another possible case is similar binding on a memory hotplugable platform, which has one unplugable node and several other nodes configured as movable only to be hot removable when needed > As I've said in reply to your second patch. I think we can make the oom > killer behavior more sensible in this misconfigured cases but I do not > think we want break the cpuset isolation for such a configuration. Do you mean we skip the killing and just let the allocation fail? We've checked the oom killer code first, when the oom happens, both DRAM node and unmovable node have lots of free memory, and killing process won't improve the situation. (Folloing is copied from your comments for 2/2) > This allows to spill memory allocations over to any other node which > has Normal (or other lower) zones and as such it breaks cpuset isolation. > As I've pointed out in the reply to your cover letter it seems that > this is more of a misconfiguration than a bug. For the usage case (docker container running), the spilling is already happening, I traced its memory allocation requests, many of them are movable, and got fallback to the normal node naturally with current code, only a few got blocked, as many of __alloc_pages_nodemask are called witih 'NULL' nodemask parameter. And I made this RFC patch inspired by code in __alloc_pages_may_oom(): if (gfp_mask & __GFP_NOFAIL) page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_NO_WATERMARKS, ac); Thanks, Feng > -- > Michal Hocko > SUSE Labs
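For context on the __GFP_NOFAIL fallback Feng quotes: the pattern in __alloc_pages_may_oom() is to try once with the cpuset enforced and only then drop the restriction, because such an allocation must not fail. A simplified sketch of that shape (abbreviated from the mm/page_alloc.c of that era, not the exact source):

```c
/*
 * Simplified sketch of the cpuset-fallback pattern referenced above:
 * first honour the cpuset (ALLOC_CPUSET); only if the allowed nodes are
 * depleted retry while ignoring the restriction.
 */
static struct page *cpuset_fallback_alloc(gfp_t gfp_mask, unsigned int order,
					  unsigned int alloc_flags,
					  const struct alloc_context *ac)
{
	struct page *page;

	page = get_page_from_freelist(gfp_mask, order,
				      alloc_flags | ALLOC_CPUSET, ac);
	if (!page)
		/* allowed nodes are depleted: break out of the cpuset */
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	return page;
}
```

Applying a fallback of this shape to ordinary (non __GFP_NOFAIL) requests is exactly what removes the ALLOC_CPUSET containment, which is the concern Michal raises about the RFC.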
On Wed 04-11-20 16:40:21, Feng Tang wrote: > On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote: > > On Wed 04-11-20 15:38:26, Feng Tang wrote: > > [...] > > > > Could you be more specific about the usecase here? Why do you need a > > > > binding to a pure movable node? > > > > > > One common configuration for a platform is small size of DRAM plus huge > > > size of PMEM (which is slower but cheaper), and my guess of their use > > > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE) > > > to PMEM node, and only let DRAM be used as less as possible. > > > > While this is possible, it is a tricky configuration. It is essentially > > get us back to 32b and highmem... > > :) Another possible case is similar binding on a memory hotplugable > platform, which has one unplugable node and several other nodes configured > as movable only to be hot removable when needed Yes, another way to shoot your foot ;) > > As I've said in reply to your second patch. I think we can make the oom > > killer behavior more sensible in this misconfigured cases but I do not > > think we want break the cpuset isolation for such a configuration. > > Do you mean we skip the killing and just let the allocation fail? We've > checked the oom killer code first, when the oom happens, both DRAM > node and unmovable node have lots of free memory, and killing process > won't improve the situation. We already do skip oom killer and fail for lowmem allocation requests already. This is similar in some sense. Another option would be to kill the allocating context which will have less corner cases potentially because some allocation failures might be unexpected. > (Folloing is copied from your comments for 2/2) > > This allows to spill memory allocations over to any other node which > > has Normal (or other lower) zones and as such it breaks cpuset isolation. > > As I've pointed out in the reply to your cover letter it seems that > > this is more of a misconfiguration than a bug. > > For the usage case (docker container running), the spilling is already > happening, I traced its memory allocation requests, many of them are > movable, and got fallback to the normal node naturally with current Could you be more specific? This sounds like a bug. Allocations shouldn't spill over to a node which is not in the cpuset. There are few exceptions like IRQ context but that shouldn't happen regurarly. > code, only a few got blocked, as many of __alloc_pages_nodemask are > called witih 'NULL' nodemask parameter. > > And I made this RFC patch inspired by code in __alloc_pages_may_oom(): > > if (gfp_mask & __GFP_NOFAIL) > page = __alloc_pages_cpuset_fallback(gfp_mask, order, > ALLOC_NO_WATERMARKS, ac); I am not really sure I follow here. __GFP_NOFAIL is a special beast because such an allocation must not fail. Breaking node affinity is the only option left. This shouldn't be something used for regular allocation requests.
On Thu 05-11-20 09:40:28, Feng Tang wrote: > On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote: > > > > > As I've said in reply to your second patch. I think we can make the oom > > > > killer behavior more sensible in this misconfigured cases but I do not > > > > think we want break the cpuset isolation for such a configuration. > > > > > > Do you mean we skip the killing and just let the allocation fail? We've > > > checked the oom killer code first, when the oom happens, both DRAM > > > node and unmovable node have lots of free memory, and killing process > > > won't improve the situation. > > > > We already do skip oom killer and fail for lowmem allocation requests already. > > This is similar in some sense. Another option would be to kill the > > allocating context which will have less corner cases potentially because > > some allocation failures might be unexpected. > > Yes, this can avoid the helpless oom killing to kill a good process(no > memory pressure at all) > > And I think the important thing is to judge whether this usage (binding > docker like workload to unmovable node) is a valid case :) I am confused. Why wouldbe an unmovable node a problem. Movable allocations can be satisfied from the Zone Normal just fine. It is other way around that is a problem. > Initially, I thought it invalid too, but later think it still makes some > sense for the 2 cases: > * user want to bind his workload to one node(most of user space > memory) to avoid cross-node traffic, and that node happens to > be configured as unmovable See above > * one small DRAM node + big PMEM node, and memory latency insensitive > workload could be bound to the cheaper unmovable PMEM node Please elaborate some more. As long as you have movable and normal nodes then this should be possible with a deal of care - most notably the movable:kernel ratio memory shouldn't be too big. Besides that why does PMEM node have to be MOVABLE only in the first place? > > > (Folloing is copied from your comments for 2/2) > > > > This allows to spill memory allocations over to any other node which > > > > has Normal (or other lower) zones and as such it breaks cpuset isolation. > > > > As I've pointed out in the reply to your cover letter it seems that > > > > this is more of a misconfiguration than a bug. > > > > > > For the usage case (docker container running), the spilling is already > > > happening, I traced its memory allocation requests, many of them are > > > movable, and got fallback to the normal node naturally with current > > > > Could you be more specific? This sounds like a bug. Allocations > > shouldn't spill over to a node which is not in the cpuset. There are few > > exceptions like IRQ context but that shouldn't happen regurarly. 
> > I mean when the docker starts, it will spawn many processes which obey > the mem binding set, and they have some kernel page requests, which got > successfully allocated, like the following callstack: > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6 > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 > [ 567.044958] Call Trace: > [ 567.044972] dump_stack+0x74/0x9a > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 > [ 567.044986] alloc_pages_current+0x87/0xe0 > [ 567.044991] allocate_slab+0x2e5/0x4f0 > [ 567.044996] ___slab_alloc+0x380/0x5d0 > [ 567.045021] __slab_alloc+0x20/0x40 > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 > [ 567.045041] alloc_inode+0x22/0xa0 > [ 567.045045] new_inode_pseudo+0x12/0x60 > [ 567.045049] new_inode+0x17/0x30 > [ 567.045052] mqueue_get_inode+0x45/0x3b0 > [ 567.045060] mqueue_fill_super+0x41/0x70 > [ 567.045067] vfs_get_super+0x7f/0x100 > [ 567.045074] get_tree_keyed+0x1d/0x20 > [ 567.045080] mqueue_get_tree+0x1c/0x20 > [ 567.045086] vfs_get_tree+0x2a/0xc0 > [ 567.045092] fc_mount+0x13/0x50 > [ 567.045099] mq_create_mount+0x92/0xe0 > [ 567.045102] mq_init_ns+0x3b/0x50 > [ 567.045106] copy_ipcs+0x10a/0x1b0 > [ 567.045113] create_new_namespaces+0xa6/0x2b0 > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 > [ 567.045124] ksys_unshare+0x19f/0x360 > [ 567.045129] __x64_sys_unshare+0x12/0x20 > [ 567.045135] do_syscall_64+0x38/0x90 > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > For it, the __alloc_pages_nodemask() will first try process's targed > nodemask(unmovable node here), and there is no availabe zone, so it > goes with the NULL nodemask, and get a page in the slowpath. OK, I see your point now. I was not aware of the slab allocator not following cpusets. Sounds like a bug to me.
On 11/5/20 1:08 PM, Michal Hocko wrote: > On Thu 05-11-20 09:40:28, Feng Tang wrote: >> > >> > Could you be more specific? This sounds like a bug. Allocations >> > shouldn't spill over to a node which is not in the cpuset. There are few >> > exceptions like IRQ context but that shouldn't happen regurarly. >> >> I mean when the docker starts, it will spawn many processes which obey >> the mem binding set, and they have some kernel page requests, which got >> successfully allocated, like the following callstack: >> >> [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6 >> [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 >> [ 567.044958] Call Trace: >> [ 567.044972] dump_stack+0x74/0x9a >> [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 >> [ 567.044986] alloc_pages_current+0x87/0xe0 >> [ 567.044991] allocate_slab+0x2e5/0x4f0 >> [ 567.044996] ___slab_alloc+0x380/0x5d0 >> [ 567.045021] __slab_alloc+0x20/0x40 >> [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 >> [ 567.045033] mqueue_alloc_inode+0x1a/0x30 >> [ 567.045041] alloc_inode+0x22/0xa0 >> [ 567.045045] new_inode_pseudo+0x12/0x60 >> [ 567.045049] new_inode+0x17/0x30 >> [ 567.045052] mqueue_get_inode+0x45/0x3b0 >> [ 567.045060] mqueue_fill_super+0x41/0x70 >> [ 567.045067] vfs_get_super+0x7f/0x100 >> [ 567.045074] get_tree_keyed+0x1d/0x20 >> [ 567.045080] mqueue_get_tree+0x1c/0x20 >> [ 567.045086] vfs_get_tree+0x2a/0xc0 >> [ 567.045092] fc_mount+0x13/0x50 >> [ 567.045099] mq_create_mount+0x92/0xe0 >> [ 567.045102] mq_init_ns+0x3b/0x50 >> [ 567.045106] copy_ipcs+0x10a/0x1b0 >> [ 567.045113] create_new_namespaces+0xa6/0x2b0 >> [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 >> [ 567.045124] ksys_unshare+0x19f/0x360 >> [ 567.045129] __x64_sys_unshare+0x12/0x20 >> [ 567.045135] do_syscall_64+0x38/0x90 >> [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> For it, the __alloc_pages_nodemask() will first try process's targed >> nodemask(unmovable node here), and there is no availabe zone, so it >> goes with the NULL nodemask, and get a page in the slowpath. > > OK, I see your point now. I was not aware of the slab allocator not > following cpusets. Sounds like a bug to me. SLAB and SLUB seem to not care about cpusets in the fast path. But this stack shows that it went all the way to the page allocator, so the cpusets should have been obeyed there at least.
On Thu 05-11-20 13:53:24, Vlastimil Babka wrote: > On 11/5/20 1:08 PM, Michal Hocko wrote: > > On Thu 05-11-20 09:40:28, Feng Tang wrote: > > > > > Could you be more specific? This sounds like a bug. Allocations > > > > shouldn't spill over to a node which is not in the cpuset. There are few > > > > exceptions like IRQ context but that shouldn't happen regurarly. > > > > > > I mean when the docker starts, it will spawn many processes which obey > > > the mem binding set, and they have some kernel page requests, which got > > > successfully allocated, like the following callstack: > > > > > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6 > > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 > > > [ 567.044958] Call Trace: > > > [ 567.044972] dump_stack+0x74/0x9a > > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 > > > [ 567.044986] alloc_pages_current+0x87/0xe0 > > > [ 567.044991] allocate_slab+0x2e5/0x4f0 > > > [ 567.044996] ___slab_alloc+0x380/0x5d0 > > > [ 567.045021] __slab_alloc+0x20/0x40 > > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 > > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 > > > [ 567.045041] alloc_inode+0x22/0xa0 > > > [ 567.045045] new_inode_pseudo+0x12/0x60 > > > [ 567.045049] new_inode+0x17/0x30 > > > [ 567.045052] mqueue_get_inode+0x45/0x3b0 > > > [ 567.045060] mqueue_fill_super+0x41/0x70 > > > [ 567.045067] vfs_get_super+0x7f/0x100 > > > [ 567.045074] get_tree_keyed+0x1d/0x20 > > > [ 567.045080] mqueue_get_tree+0x1c/0x20 > > > [ 567.045086] vfs_get_tree+0x2a/0xc0 > > > [ 567.045092] fc_mount+0x13/0x50 > > > [ 567.045099] mq_create_mount+0x92/0xe0 > > > [ 567.045102] mq_init_ns+0x3b/0x50 > > > [ 567.045106] copy_ipcs+0x10a/0x1b0 > > > [ 567.045113] create_new_namespaces+0xa6/0x2b0 > > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 > > > [ 567.045124] ksys_unshare+0x19f/0x360 > > > [ 567.045129] __x64_sys_unshare+0x12/0x20 > > > [ 567.045135] do_syscall_64+0x38/0x90 > > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > For it, the __alloc_pages_nodemask() will first try process's targed > > > nodemask(unmovable node here), and there is no availabe zone, so it > > > goes with the NULL nodemask, and get a page in the slowpath. > > > > OK, I see your point now. I was not aware of the slab allocator not > > following cpusets. Sounds like a bug to me. > > SLAB and SLUB seem to not care about cpusets in the fast path. Is a fallback to a different node which is outside of the cpuset possible? > But this > stack shows that it went all the way to the page allocator, so the cpusets > should have been obeyed there at least. Looking closer what is this dump_stack saying actually?
On Thu, Nov 05, 2020 at 01:58:28PM +0100, Michal Hocko wrote: > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote: > > On 11/5/20 1:08 PM, Michal Hocko wrote: > > > On Thu 05-11-20 09:40:28, Feng Tang wrote: > > > > > > Could you be more specific? This sounds like a bug. Allocations > > > > > shouldn't spill over to a node which is not in the cpuset. There are few > > > > > exceptions like IRQ context but that shouldn't happen regurarly. > > > > > > > > I mean when the docker starts, it will spawn many processes which obey > > > > the mem binding set, and they have some kernel page requests, which got > > > > successfully allocated, like the following callstack: > > > > > > > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6 > > > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 > > > > [ 567.044958] Call Trace: > > > > [ 567.044972] dump_stack+0x74/0x9a > > > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 > > > > [ 567.044986] alloc_pages_current+0x87/0xe0 > > > > [ 567.044991] allocate_slab+0x2e5/0x4f0 > > > > [ 567.044996] ___slab_alloc+0x380/0x5d0 > > > > [ 567.045021] __slab_alloc+0x20/0x40 > > > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 > > > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 > > > > [ 567.045041] alloc_inode+0x22/0xa0 > > > > [ 567.045045] new_inode_pseudo+0x12/0x60 > > > > [ 567.045049] new_inode+0x17/0x30 > > > > [ 567.045052] mqueue_get_inode+0x45/0x3b0 > > > > [ 567.045060] mqueue_fill_super+0x41/0x70 > > > > [ 567.045067] vfs_get_super+0x7f/0x100 > > > > [ 567.045074] get_tree_keyed+0x1d/0x20 > > > > [ 567.045080] mqueue_get_tree+0x1c/0x20 > > > > [ 567.045086] vfs_get_tree+0x2a/0xc0 > > > > [ 567.045092] fc_mount+0x13/0x50 > > > > [ 567.045099] mq_create_mount+0x92/0xe0 > > > > [ 567.045102] mq_init_ns+0x3b/0x50 > > > > [ 567.045106] copy_ipcs+0x10a/0x1b0 > > > > [ 567.045113] create_new_namespaces+0xa6/0x2b0 > > > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 > > > > [ 567.045124] ksys_unshare+0x19f/0x360 > > > > [ 567.045129] __x64_sys_unshare+0x12/0x20 > > > > [ 567.045135] do_syscall_64+0x38/0x90 > > > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > > > For it, the __alloc_pages_nodemask() will first try process's targed > > > > nodemask(unmovable node here), and there is no availabe zone, so it > > > > goes with the NULL nodemask, and get a page in the slowpath. > > > > > > OK, I see your point now. I was not aware of the slab allocator not > > > following cpusets. Sounds like a bug to me. > > > > SLAB and SLUB seem to not care about cpusets in the fast path. > > Is a fallback to a different node which is outside of the cpuset > possible? 
My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL' And actually in this usage, I seen other types of kernel allocation request got fallback to the normal node which is not in cpuset mem nodemasks, like [ 567.510901] CPU: 3 PID: 2022 Comm: runc:[2:INIT] Tainted: G W I 5.9.0-rc8+ #6 [ 567.510902] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 [ 567.510903] Call Trace: [ 567.510909] dump_stack+0x74/0x9a [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5 [ 567.510913] alloc_pages_current+0x87/0xe0 [ 567.510914] __vmalloc_node_range+0x14c/0x240 [ 567.510918] module_alloc+0x82/0xe0 [ 567.510921] bpf_jit_alloc_exec+0xe/0x10 [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120 [ 567.510925] bpf_int_jit_compile+0x145/0x424 [ 567.510926] bpf_prog_select_runtime+0xac/0x130 [ 567.510928] bpf_prepare_filter+0x44c/0x4b0 [ 567.510932] bpf_prog_create_from_user+0xc7/0x120 [ 567.510934] do_seccomp+0x118/0x990 [ 567.510937] __x64_sys_seccomp+0x1a/0x20 [ 567.510939] do_syscall_64+0x38/0x90 And its gfp_mask is (GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN) Thanks, Feng > > But this > > stack shows that it went all the way to the page allocator, so the cpusets > > should have been obeyed there at least. > > Looking closer what is this dump_stack saying actually? > > -- > Michal Hocko > SUSE Labs
On Thu 05-11-20 21:07:10, Feng Tang wrote:
[...]
> My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL'
Can you provide the full information please? Which node has been
requested? Which cpuset does the calling process run in, and from which
node did the allocation succeed? A bare dump_stack without any further
context is not really helpful.
On 11/5/20 1:58 PM, Michal Hocko wrote: > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote: >> On 11/5/20 1:08 PM, Michal Hocko wrote: >> > On Thu 05-11-20 09:40:28, Feng Tang wrote: >> > > > > Could you be more specific? This sounds like a bug. Allocations >> > > > shouldn't spill over to a node which is not in the cpuset. There are few >> > > > exceptions like IRQ context but that shouldn't happen regurarly. >> > > >> > > I mean when the docker starts, it will spawn many processes which obey >> > > the mem binding set, and they have some kernel page requests, which got >> > > successfully allocated, like the following callstack: >> > > >> > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6 >> > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 >> > > [ 567.044958] Call Trace: >> > > [ 567.044972] dump_stack+0x74/0x9a >> > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 >> > > [ 567.044986] alloc_pages_current+0x87/0xe0 >> > > [ 567.044991] allocate_slab+0x2e5/0x4f0 >> > > [ 567.044996] ___slab_alloc+0x380/0x5d0 >> > > [ 567.045021] __slab_alloc+0x20/0x40 >> > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 >> > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 >> > > [ 567.045041] alloc_inode+0x22/0xa0 >> > > [ 567.045045] new_inode_pseudo+0x12/0x60 >> > > [ 567.045049] new_inode+0x17/0x30 >> > > [ 567.045052] mqueue_get_inode+0x45/0x3b0 >> > > [ 567.045060] mqueue_fill_super+0x41/0x70 >> > > [ 567.045067] vfs_get_super+0x7f/0x100 >> > > [ 567.045074] get_tree_keyed+0x1d/0x20 >> > > [ 567.045080] mqueue_get_tree+0x1c/0x20 >> > > [ 567.045086] vfs_get_tree+0x2a/0xc0 >> > > [ 567.045092] fc_mount+0x13/0x50 >> > > [ 567.045099] mq_create_mount+0x92/0xe0 >> > > [ 567.045102] mq_init_ns+0x3b/0x50 >> > > [ 567.045106] copy_ipcs+0x10a/0x1b0 >> > > [ 567.045113] create_new_namespaces+0xa6/0x2b0 >> > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 >> > > [ 567.045124] ksys_unshare+0x19f/0x360 >> > > [ 567.045129] __x64_sys_unshare+0x12/0x20 >> > > [ 567.045135] do_syscall_64+0x38/0x90 >> > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> > > >> > > For it, the __alloc_pages_nodemask() will first try process's targed >> > > nodemask(unmovable node here), and there is no availabe zone, so it >> > > goes with the NULL nodemask, and get a page in the slowpath. >> > >> > OK, I see your point now. I was not aware of the slab allocator not >> > following cpusets. Sounds like a bug to me. >> >> SLAB and SLUB seem to not care about cpusets in the fast path. > > Is a fallback to a different node which is outside of the cpuset > possible? AFAICS anything in per-cpu cache will be allocated without looking at the cpuset, so it can be outside of the cpuset. In SLUB slowpath, get_partial_node() looking for fallback on the same node will also not look at cpuset. get_any_partial() looking for a fallback allocation on any node does check cpuset_zone_allowed() and obey it strictly. A fallback to page allocator will obey whatever page allocator obeys. So if a process cannot is restricted to allocate from node X via cpuset *and* also cannot be executed on CPU's from node X via taskset, then it AFAICS effectively cannot violate the cpuset in SLUB because it won't reach the percpu or per-node caches that don't check cpusets. >> But this >> stack shows that it went all the way to the page allocator, so the cpusets >> should have been obeyed there at least. > > Looking closer what is this dump_stack saying actually? 
Yes, is that a dump of a successful allocation (one that violates cpusets?) or of a failing one?
On Thu 05-11-20 14:14:25, Vlastimil Babka wrote: > On 11/5/20 1:58 PM, Michal Hocko wrote: > > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote: > > > On 11/5/20 1:08 PM, Michal Hocko wrote: > > > > On Thu 05-11-20 09:40:28, Feng Tang wrote: > > > > > > > Could you be more specific? This sounds like a bug. Allocations > > > > > > shouldn't spill over to a node which is not in the cpuset. There are few > > > > > > exceptions like IRQ context but that shouldn't happen regurarly. > > > > > > > I mean when the docker starts, it will spawn many processes > > > which obey > > > > > the mem binding set, and they have some kernel page requests, which got > > > > > successfully allocated, like the following callstack: > > > > > > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] > > > Tainted: G W I 5.9.0-rc8+ #6 > > > > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 > > > > > [ 567.044958] Call Trace: > > > > > [ 567.044972] dump_stack+0x74/0x9a > > > > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 > > > > > [ 567.044986] alloc_pages_current+0x87/0xe0 > > > > > [ 567.044991] allocate_slab+0x2e5/0x4f0 > > > > > [ 567.044996] ___slab_alloc+0x380/0x5d0 > > > > > [ 567.045021] __slab_alloc+0x20/0x40 > > > > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 > > > > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 > > > > > [ 567.045041] alloc_inode+0x22/0xa0 > > > > > [ 567.045045] new_inode_pseudo+0x12/0x60 > > > > > [ 567.045049] new_inode+0x17/0x30 > > > > > [ 567.045052] mqueue_get_inode+0x45/0x3b0 > > > > > [ 567.045060] mqueue_fill_super+0x41/0x70 > > > > > [ 567.045067] vfs_get_super+0x7f/0x100 > > > > > [ 567.045074] get_tree_keyed+0x1d/0x20 > > > > > [ 567.045080] mqueue_get_tree+0x1c/0x20 > > > > > [ 567.045086] vfs_get_tree+0x2a/0xc0 > > > > > [ 567.045092] fc_mount+0x13/0x50 > > > > > [ 567.045099] mq_create_mount+0x92/0xe0 > > > > > [ 567.045102] mq_init_ns+0x3b/0x50 > > > > > [ 567.045106] copy_ipcs+0x10a/0x1b0 > > > > > [ 567.045113] create_new_namespaces+0xa6/0x2b0 > > > > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 > > > > > [ 567.045124] ksys_unshare+0x19f/0x360 > > > > > [ 567.045129] __x64_sys_unshare+0x12/0x20 > > > > > [ 567.045135] do_syscall_64+0x38/0x90 > > > > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > > For it, the __alloc_pages_nodemask() will first try > > > process's targed > > > > > nodemask(unmovable node here), and there is no availabe zone, so it > > > > > goes with the NULL nodemask, and get a page in the slowpath. > > > > > OK, I see your point now. I was not aware of the slab allocator > > > not > > > > following cpusets. Sounds like a bug to me. > > > > > > SLAB and SLUB seem to not care about cpusets in the fast path. > > > > Is a fallback to a different node which is outside of the cpuset > > possible? > > AFAICS anything in per-cpu cache will be allocated without looking at the > cpuset, so it can be outside of the cpuset. In SLUB slowpath, > get_partial_node() looking for fallback on the same node will also not look > at cpuset. get_any_partial() looking for a fallback allocation on any node > does check cpuset_zone_allowed() and obey it strictly. A fallback to page > allocator will obey whatever page allocator obeys. IIUC this means that if there is no strong CPU binding to cpuset nodes then a runaway is possible. Albeit only partially and relying on somebody to fill up pcp object caches, right? Is that an overlook or a decision design or a performance optimization?
On 11/5/20 2:19 PM, Michal Hocko wrote: > On Thu 05-11-20 14:14:25, Vlastimil Babka wrote: >> On 11/5/20 1:58 PM, Michal Hocko wrote: >> > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote: >> > > On 11/5/20 1:08 PM, Michal Hocko wrote: >> > > > On Thu 05-11-20 09:40:28, Feng Tang wrote: >> > > > > > > Could you be more specific? This sounds like a bug. Allocations >> > > > > > shouldn't spill over to a node which is not in the cpuset. There are few >> > > > > > exceptions like IRQ context but that shouldn't happen regurarly. >> > > > > > > I mean when the docker starts, it will spawn many processes >> > > which obey >> > > > > the mem binding set, and they have some kernel page requests, which got >> > > > > successfully allocated, like the following callstack: >> > > > > > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] >> > > Tainted: G W I 5.9.0-rc8+ #6 >> > > > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016 >> > > > > [ 567.044958] Call Trace: >> > > > > [ 567.044972] dump_stack+0x74/0x9a >> > > > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5 >> > > > > [ 567.044986] alloc_pages_current+0x87/0xe0 >> > > > > [ 567.044991] allocate_slab+0x2e5/0x4f0 >> > > > > [ 567.044996] ___slab_alloc+0x380/0x5d0 >> > > > > [ 567.045021] __slab_alloc+0x20/0x40 >> > > > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0 >> > > > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30 >> > > > > [ 567.045041] alloc_inode+0x22/0xa0 >> > > > > [ 567.045045] new_inode_pseudo+0x12/0x60 >> > > > > [ 567.045049] new_inode+0x17/0x30 >> > > > > [ 567.045052] mqueue_get_inode+0x45/0x3b0 >> > > > > [ 567.045060] mqueue_fill_super+0x41/0x70 >> > > > > [ 567.045067] vfs_get_super+0x7f/0x100 >> > > > > [ 567.045074] get_tree_keyed+0x1d/0x20 >> > > > > [ 567.045080] mqueue_get_tree+0x1c/0x20 >> > > > > [ 567.045086] vfs_get_tree+0x2a/0xc0 >> > > > > [ 567.045092] fc_mount+0x13/0x50 >> > > > > [ 567.045099] mq_create_mount+0x92/0xe0 >> > > > > [ 567.045102] mq_init_ns+0x3b/0x50 >> > > > > [ 567.045106] copy_ipcs+0x10a/0x1b0 >> > > > > [ 567.045113] create_new_namespaces+0xa6/0x2b0 >> > > > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0 >> > > > > [ 567.045124] ksys_unshare+0x19f/0x360 >> > > > > [ 567.045129] __x64_sys_unshare+0x12/0x20 >> > > > > [ 567.045135] do_syscall_64+0x38/0x90 >> > > > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> > > > > > > For it, the __alloc_pages_nodemask() will first try >> > > process's targed >> > > > > nodemask(unmovable node here), and there is no availabe zone, so it >> > > > > goes with the NULL nodemask, and get a page in the slowpath. >> > > > > OK, I see your point now. I was not aware of the slab allocator >> > > not >> > > > following cpusets. Sounds like a bug to me. >> > > >> > > SLAB and SLUB seem to not care about cpusets in the fast path. >> > >> > Is a fallback to a different node which is outside of the cpuset >> > possible? >> >> AFAICS anything in per-cpu cache will be allocated without looking at the >> cpuset, so it can be outside of the cpuset. In SLUB slowpath, >> get_partial_node() looking for fallback on the same node will also not look >> at cpuset. get_any_partial() looking for a fallback allocation on any node >> does check cpuset_zone_allowed() and obey it strictly. A fallback to page >> allocator will obey whatever page allocator obeys. > > IIUC this means that if there is no strong CPU binding to cpuset nodes > then a runaway is possible. 
Albeit only partially and relying on > somebody to fill up pcp object caches, right? Seems so. > Is that an overlook or a decision design or a performance optimization? ... yes :) More seriously, let's ask Christoph, as that code goes all the way to first SLUB commit. On SLAB side, it would seem percpu caches came first, then cpuset support was added for page allocator, but SLAB was oblivious until it learned some of it in commit 765c4507af71c to properly support __GFP_THISNODE. My guess is that the amount of cpuset constraint violation through percpu caches was never considered to be a problem serious enough to justify making the fastpaths slower.
On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote: > On Thu 05-11-20 21:07:10, Feng Tang wrote: > [...] > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL' > > Can you provide the full information please? Which node has been > requested. Which cpuset the calling process run in and which node has > the allocation succeeded from? A bare dump_stack without any further > context is not really helpful. I don't have the same platform as the original report, so I simulated one similar setup (with fakenuma and movablecore), which has 2 memory nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only Movable zone. With it, I can got the same error and same oom callstack as the original report (as in the cover-letter). The test command is: # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" To debug I only added some trace in the __alloc_pages_nodemask(), and for the callstack which get the page successfully: [ 567.510903] Call Trace: [ 567.510909] dump_stack+0x74/0x9a [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5 [ 567.510913] alloc_pages_current+0x87/0xe0 [ 567.510914] __vmalloc_node_range+0x14c/0x240 [ 567.510918] module_alloc+0x82/0xe0 [ 567.510921] bpf_jit_alloc_exec+0xe/0x10 [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120 [ 567.510925] bpf_int_jit_compile+0x145/0x424 [ 567.510926] bpf_prog_select_runtime+0xac/0x130 The incomming parameter nodemask is NULL, and the function will first try the cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(), which will first set the nodemask to 'NULL', and this time it got a preferred zone: zone DMA32 from node 0, following get_page_from_freelist will allocate one page from that zone. Thanks, Feng > > -- > Michal Hocko > SUSE Labs
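The fakenuma/movablecore setup Feng describes can be approximated with kernel boot parameters along these lines; the movablecore size is machine dependent and the value here is only a placeholder:

```sh
# Hypothetical boot command line for reproducing without PMEM hardware:
# split the machine into two fake NUMA nodes and reserve enough memory for
# ZONE_MOVABLE that the second node ends up with a Movable zone only.
numa=fake=2 movablecore=2G

# Then run the failing case from the report:
docker run -it --rm --cpuset-mems 1 ubuntu:latest \
        bash -c "grep Mems_allowed /proc/self/status"
```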
On Thu 05-11-20 21:43:05, Feng Tang wrote: > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote: > > On Thu 05-11-20 21:07:10, Feng Tang wrote: > > [...] > > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL' > > > > Can you provide the full information please? Which node has been > > requested. Which cpuset the calling process run in and which node has > > the allocation succeeded from? A bare dump_stack without any further > > context is not really helpful. > > I don't have the same platform as the original report, so I simulated > one similar setup (with fakenuma and movablecore), which has 2 memory > nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only > Movable zone. With it, I can got the same error and same oom callstack > as the original report (as in the cover-letter). > > The test command is: > # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" > > To debug I only added some trace in the __alloc_pages_nodemask(), and > for the callstack which get the page successfully: > > [ 567.510903] Call Trace: > [ 567.510909] dump_stack+0x74/0x9a > [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5 > [ 567.510913] alloc_pages_current+0x87/0xe0 > [ 567.510914] __vmalloc_node_range+0x14c/0x240 > [ 567.510918] module_alloc+0x82/0xe0 > [ 567.510921] bpf_jit_alloc_exec+0xe/0x10 > [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120 > [ 567.510925] bpf_int_jit_compile+0x145/0x424 > [ 567.510926] bpf_prog_select_runtime+0xac/0x130 As already said this doesn't really tell much without the additional information. > The incomming parameter nodemask is NULL, and the function will first try the > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(), > which will first set the nodemask to 'NULL', and this time it got a preferred > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate > one page from that zone. I do not follow. Both hot and slow paths of the allocator set ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled IIRC. This is later enforced in get_page_from_free_list. There are some exceptions when the allocating process can run away from its cpusets - e.g. IRQs, OOM victims and few other cases but definitely not a random allocation. There might be some subtle details that have changed or I might have forgot but
Michal Hocko <mhocko@suse.com> writes: > On Thu 05-11-20 09:40:28, Feng Tang wrote: >> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote: >> >> > > > As I've said in reply to your second patch. I think we can make the oom >> > > > killer behavior more sensible in this misconfigured cases but I do not >> > > > think we want break the cpuset isolation for such a configuration. >> > > >> > > Do you mean we skip the killing and just let the allocation fail? We've >> > > checked the oom killer code first, when the oom happens, both DRAM >> > > node and unmovable node have lots of free memory, and killing process >> > > won't improve the situation. >> > >> > We already do skip oom killer and fail for lowmem allocation requests already. >> > This is similar in some sense. Another option would be to kill the >> > allocating context which will have less corner cases potentially because >> > some allocation failures might be unexpected. >> >> Yes, this can avoid the helpless oom killing to kill a good process(no >> memory pressure at all) >> >> And I think the important thing is to judge whether this usage (binding >> docker like workload to unmovable node) is a valid case :) > > I am confused. Why wouldbe an unmovable node a problem. Movable > allocations can be satisfied from the Zone Normal just fine. It is other > way around that is a problem. > >> Initially, I thought it invalid too, but later think it still makes some >> sense for the 2 cases: >> * user want to bind his workload to one node(most of user space >> memory) to avoid cross-node traffic, and that node happens to >> be configured as unmovable > > See above > >> * one small DRAM node + big PMEM node, and memory latency insensitive >> workload could be bound to the cheaper unmovable PMEM node > > Please elaborate some more. As long as you have movable and normal nodes > then this should be possible with a deal of care - most notably the > movable:kernel ratio memory shouldn't be too big. > > Besides that why does PMEM node have to be MOVABLE only in the first > place? The performance of PMEM is much worse than that of DRAM. If we found that some pages on PMEM are accessed frequently (hot), we may want to move them to DRAM to optimize the system performance. If the unmovable pages are allocated on PMEM and hot, it's possible that we cannot move the pages to DRAM unless rebooting the system. So we think we should make the PMEM nodes to be MOVABLE only. Best Regards, Huang, Ying
On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote: > On Thu 05-11-20 21:43:05, Feng Tang wrote: > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote: > > > On Thu 05-11-20 21:07:10, Feng Tang wrote: > > > [...] > > > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL' > > > > > > Can you provide the full information please? Which node has been > > > requested. Which cpuset the calling process run in and which node has > > > the allocation succeeded from? A bare dump_stack without any further > > > context is not really helpful. > > > > I don't have the same platform as the original report, so I simulated > > one similar setup (with fakenuma and movablecore), which has 2 memory > > nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only > > Movable zone. With it, I can got the same error and same oom callstack > > as the original report (as in the cover-letter). > > > > The test command is: > > # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" > > > > To debug I only added some trace in the __alloc_pages_nodemask(), and > > for the callstack which get the page successfully: > > > > [ 567.510903] Call Trace: > > [ 567.510909] dump_stack+0x74/0x9a > > [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5 > > [ 567.510913] alloc_pages_current+0x87/0xe0 > > [ 567.510914] __vmalloc_node_range+0x14c/0x240 > > [ 567.510918] module_alloc+0x82/0xe0 > > [ 567.510921] bpf_jit_alloc_exec+0xe/0x10 > > [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120 > > [ 567.510925] bpf_int_jit_compile+0x145/0x424 > > [ 567.510926] bpf_prog_select_runtime+0xac/0x130 > > As already said this doesn't really tell much without the additional > information. > > > The incomming parameter nodemask is NULL, and the function will first try the > > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the > > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(), > > which will first set the nodemask to 'NULL', and this time it got a preferred > > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate > > one page from that zone. > > I do not follow. Both hot and slow paths of the allocator set > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled > IIRC. This is later enforced in get_page_from_free_list. There are some > exceptions when the allocating process can run away from its cpusets - > e.g. IRQs, OOM victims and few other cases but definitely not a random > allocation. There might be some subtle details that have changed or I > might have forgot but yes, I was confused too. IIUC, the key check inside get_page_from_freelist() is if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) && !__cpuset_zone_allowed(zone, gfp_mask)) In our case (kernel page got allocated), the first 2 conditions are true, and for __cpuset_zone_allowed(), the possible place to return true is checking parent cpuset's nodemask cs = nearest_hardwall_ancestor(task_cs(current)); allowed = node_isset(node, cs->mems_allowed); This will override the ALLOC_CPUSET check. Thanks, Feng > -- > Michal Hocko > SUSE Labs
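The hardwall lookup Feng quotes can be illustrated with a small stand-alone model of a cpuset hierarchy. This is a user-space sketch of the semantics only (walk up to the nearest hardwalled ancestor and test its mems_allowed), not the kernel's __cpuset_zone_allowed():

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy cpuset node: parent link, hardwall flag and an allowed-node bitmask. */
struct cpuset {
	struct cpuset *parent;
	bool mem_hardwall;            /* cpuset.mem_hardwall / mem_exclusive */
	unsigned long mems_allowed;   /* bit n set => NUMA node n allowed */
};

/* Walk towards the root until a hardwalled cpuset (or the root) is found. */
static const struct cpuset *nearest_hardwall_ancestor(const struct cpuset *cs)
{
	while (!cs->mem_hardwall && cs->parent)
		cs = cs->parent;
	return cs;
}

/* !__GFP_HARDWALL semantics: allowed if the nearest hardwalled ancestor
 * (not necessarily the task's own cpuset) allows the node. */
static bool node_allowed(const struct cpuset *task_cs, int node)
{
	const struct cpuset *cs = nearest_hardwall_ancestor(task_cs);

	return cs->mems_allowed & (1UL << node);
}

int main(void)
{
	struct cpuset root   = { .parent = NULL,  .mem_hardwall = false,
				 .mems_allowed = 0x3 };   /* nodes 0 and 1 */
	struct cpuset docker = { .parent = &root, .mem_hardwall = false,
				 .mems_allowed = 0x2 };   /* node 1 only */

	/* No hardwalled ancestor exists below the root, so the lookup ends at
	 * the root cpuset, which still allows node 0: a GFP_KERNEL request
	 * from the docker cpuset may therefore be served from node 0. */
	printf("node 0 allowed for docker cpuset: %s\n",
	       node_allowed(&docker, 0) ? "yes" : "no");
	return 0;
}
```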
On Fri 06-11-20 12:32:44, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 05-11-20 09:40:28, Feng Tang wrote: > >> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote: > >> > >> > > > As I've said in reply to your second patch. I think we can make the oom > >> > > > killer behavior more sensible in this misconfigured cases but I do not > >> > > > think we want break the cpuset isolation for such a configuration. > >> > > > >> > > Do you mean we skip the killing and just let the allocation fail? We've > >> > > checked the oom killer code first, when the oom happens, both DRAM > >> > > node and unmovable node have lots of free memory, and killing process > >> > > won't improve the situation. > >> > > >> > We already do skip oom killer and fail for lowmem allocation requests already. > >> > This is similar in some sense. Another option would be to kill the > >> > allocating context which will have less corner cases potentially because > >> > some allocation failures might be unexpected. > >> > >> Yes, this can avoid the helpless oom killing to kill a good process(no > >> memory pressure at all) > >> > >> And I think the important thing is to judge whether this usage (binding > >> docker like workload to unmovable node) is a valid case :) > > > > I am confused. Why wouldbe an unmovable node a problem. Movable > > allocations can be satisfied from the Zone Normal just fine. It is other > > way around that is a problem. > > > >> Initially, I thought it invalid too, but later think it still makes some > >> sense for the 2 cases: > >> * user want to bind his workload to one node(most of user space > >> memory) to avoid cross-node traffic, and that node happens to > >> be configured as unmovable > > > > See above > > > >> * one small DRAM node + big PMEM node, and memory latency insensitive > >> workload could be bound to the cheaper unmovable PMEM node > > > > Please elaborate some more. As long as you have movable and normal nodes > > then this should be possible with a deal of care - most notably the > > movable:kernel ratio memory shouldn't be too big. > > > > Besides that why does PMEM node have to be MOVABLE only in the first > > place? > > The performance of PMEM is much worse than that of DRAM. If we found > that some pages on PMEM are accessed frequently (hot), we may want to > move them to DRAM to optimize the system performance. If the unmovable > pages are allocated on PMEM and hot, it's possible that we cannot move > the pages to DRAM unless rebooting the system. So we think we should > make the PMEM nodes to be MOVABLE only. That is fair but then you really need a fallback node too. So this is mere optimization rather than a fundamental restriction.
On Fri 06-11-20 15:06:56, Feng Tang wrote: > On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote: > > On Thu 05-11-20 21:43:05, Feng Tang wrote: > > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote: > > > > On Thu 05-11-20 21:07:10, Feng Tang wrote: > > > > [...] > > > > > My debug traces shows it is, and its gfp_mask is 'GFP_KERNEL' > > > > > > > > Can you provide the full information please? Which node has been > > > > requested. Which cpuset the calling process run in and which node has > > > > the allocation succeeded from? A bare dump_stack without any further > > > > context is not really helpful. > > > > > > I don't have the same platform as the original report, so I simulated > > > one similar setup (with fakenuma and movablecore), which has 2 memory > > > nodes: node 0 has DMA0/DMA32/Movable zones, while node 1 has only > > > Movable zone. With it, I can got the same error and same oom callstack > > > as the original report (as in the cover-letter). > > > > > > The test command is: > > > # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" > > > > > > To debug I only added some trace in the __alloc_pages_nodemask(), and > > > for the callstack which get the page successfully: > > > > > > [ 567.510903] Call Trace: > > > [ 567.510909] dump_stack+0x74/0x9a > > > [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5 > > > [ 567.510913] alloc_pages_current+0x87/0xe0 > > > [ 567.510914] __vmalloc_node_range+0x14c/0x240 > > > [ 567.510918] module_alloc+0x82/0xe0 > > > [ 567.510921] bpf_jit_alloc_exec+0xe/0x10 > > > [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120 > > > [ 567.510925] bpf_int_jit_compile+0x145/0x424 > > > [ 567.510926] bpf_prog_select_runtime+0xac/0x130 > > > > As already said this doesn't really tell much without the additional > > information. > > > > > The incomming parameter nodemask is NULL, and the function will first try the > > > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the > > > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(), > > > which will first set the nodemask to 'NULL', and this time it got a preferred > > > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate > > > one page from that zone. > > > > I do not follow. Both hot and slow paths of the allocator set > > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled > > IIRC. This is later enforced in get_page_from_free_list. There are some > > exceptions when the allocating process can run away from its cpusets - > > e.g. IRQs, OOM victims and few other cases but definitely not a random > > allocation. There might be some subtle details that have changed or I > > might have forgot but > > yes, I was confused too. IIUC, the key check inside get_page_from_freelist() > is > > if (cpusets_enabled() && > (alloc_flags & ALLOC_CPUSET) && > !__cpuset_zone_allowed(zone, gfp_mask)) > > In our case (kernel page got allocated), the first 2 conditions are true, > and for __cpuset_zone_allowed(), the possible place to return true is > checking parent cpuset's nodemask > > cs = nearest_hardwall_ancestor(task_cs(current)); > allowed = node_isset(node, cs->mems_allowed); > > This will override the ALLOC_CPUSET check. Yes and this is ok because that is defined hierarchical semantic of the cpusets which applies to any !hardwalled allocation. Cpusets are quite non intuitive. 
Re-reading the previous discussion I have realized that me trying to not go into those details might have mislead you. Let me try again and clarify that now. I was talking in context of the patch you are proposing and that is a clear violation of the cpuset isolation. Especially for hardwalled setups because it allows to spill over to other nodes which shouldn't be possible except for few exceptions which shouldn't generate a lot of allocations (e.g. oom victim exiting, IRQ context). What I was not talking about, and should have been more clear about, is that without hardwall resp. exclusive nodes the isolation is best effort only for most kernel allocation requests (or more specifically those without __GFP_HARDWALL). Your patch doesn't distinguish between those and any non movable allocations and effectively allowed to runaway even for hardwalled allocations which are not movable. Those can be controlled by userspace very easily. I hope this clarifies it a bit more and sorry if I mislead you.
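The hardwall distinction Michal draws maps directly to how the common gfp masks are composed; a simplified reminder (as in include/linux/gfp.h around v5.9, with placeholder bit values):

```c
/* Placeholder bit values for illustration only. */
#define ___GFP_RECLAIM   0x01u
#define ___GFP_IO        0x02u
#define ___GFP_FS        0x04u
#define ___GFP_HARDWALL  0x08u
#define ___GFP_HIGHMEM   0x10u
#define ___GFP_MOVABLE   0x20u

#define GFP_KERNEL           (___GFP_RECLAIM | ___GFP_IO | ___GFP_FS)
#define GFP_USER             (GFP_KERNEL | ___GFP_HARDWALL)      /* hardwalled */
#define GFP_HIGHUSER         (GFP_USER | ___GFP_HIGHMEM)         /* hardwalled */
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | ___GFP_MOVABLE)     /* hardwalled + movable */

/*
 * Requests carrying __GFP_HARDWALL (typical user-backed memory such as the
 * GFP_HIGHUSER pipe_write() case in this report) are confined to the task's
 * own cpuset; plain GFP_KERNEL requests may fall back up to the nearest
 * hardwalled/exclusive ancestor, which is the "best effort" isolation
 * described above.
 */
```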
On Fri, Nov 06, 2020 at 09:10:26AM +0100, Michal Hocko wrote: > > > > The incomming parameter nodemask is NULL, and the function will first try the > > > > cpuset nodemask (1 here), and the zoneidx is only granted 2, which makes the > > > > 'ac's preferred zone to be NULL. so it goes into __alloc_pages_slowpath(), > > > > which will first set the nodemask to 'NULL', and this time it got a preferred > > > > zone: zone DMA32 from node 0, following get_page_from_freelist will allocate > > > > one page from that zone. > > > > > > I do not follow. Both hot and slow paths of the allocator set > > > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are nebaled > > > IIRC. This is later enforced in get_page_from_free_list. There are some > > > exceptions when the allocating process can run away from its cpusets - > > > e.g. IRQs, OOM victims and few other cases but definitely not a random > > > allocation. There might be some subtle details that have changed or I > > > might have forgot but > > > > yes, I was confused too. IIUC, the key check inside get_page_from_freelist() > > is > > > > if (cpusets_enabled() && > > (alloc_flags & ALLOC_CPUSET) && > > !__cpuset_zone_allowed(zone, gfp_mask)) > > > > In our case (kernel page got allocated), the first 2 conditions are true, > > and for __cpuset_zone_allowed(), the possible place to return true is > > checking parent cpuset's nodemask > > > > cs = nearest_hardwall_ancestor(task_cs(current)); > > allowed = node_isset(node, cs->mems_allowed); > > > > This will override the ALLOC_CPUSET check. > > Yes and this is ok because that is defined hierarchical semantic of the > cpusets which applies to any !hardwalled allocation. Cpusets are quite > non intuitive. Re-reading the previous discussion I have realized that > me trying to not go into those details might have mislead you. Let me > try again and clarify that now. > > I was talking in context of the patch you are proposing and that is a > clear violation of the cpuset isolation. Especially for hardwalled > setups because it allows to spill over to other nodes which shouldn't be > possible except for few exceptions which shouldn't generate a lot of > allocations (e.g. oom victim exiting, IRQ context). I agree my patch is pretty hacky. As said in the cover-letter, I would bring up this usage case, and get suggestions on how to support it. > What I was not talking about, and should have been more clear about, is > that without hardwall resp. exclusive nodes the isolation is best effort > only for most kernel allocation requests (or more specifically those > without __GFP_HARDWALL). Your patch doesn't distinguish between those > and any non movable allocations and effectively allowed to runaway even > for hardwalled allocations which are not movable. Those can be controlled > by userspace very easily. You are right, there are quiet several types of page allocations failures. The callstack in patch 2/2 is a GFP_HIGHUSER from pipe_write, and there are more types of kernel allocation requests which will got blocked by the differnt check. My RFC patch just gave a easiest one-for-all hack to let them bypass it. Do we need to tackle them case by case? > I hope this clarifies it a bit more and sorry if I mislead you. Yes, it does and many thanks for the clarifying! - Feng > -- > Michal Hocko > SUSE Labs
On Fri 06-11-20 17:08:57, Feng Tang wrote: [...] > You are right, there are quiet several types of page allocations failures. > The callstack in patch 2/2 is a GFP_HIGHUSER from pipe_write, and there > are more types of kernel allocation requests which will got blocked by > the differnt check. My RFC patch just gave a easiest one-for-all hack to > let them bypass it. > > Do we need to tackle them case by case? No, I do not see how we can change those __GFP_HARDWALL allocations without breaking the isolation.