mmotm 2015-01-22-15-04: qemu failure due to 'mm: memcontrol: remove unnecessary soft limit tree node test'

Message ID	20150123141817.GA22926@phnom.home.cmpxchg.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Fri, 23 Jan 2015 09:18:17 -0500 From: Johannes Weiner <hannes@cmpxchg.org> To: Guenter Roeck <linux@roeck-us.net> Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-next@vger.kernel.org, sfr@canb.auug.org.au, mhocko@suse.cz, cl@linux.com Subject: Re: mmotm 2015-01-22-15-04: qemu failure due to 'mm: memcontrol: remove unnecessary soft limit tree node test' Message-ID: <20150123141817.GA22926@phnom.home.cmpxchg.org> References: <54c1822d.RtdGfWPekQVAw8Ly%akpm@linux-foundation.org> <20150123050802.GB22751@roeck-us.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150123050802.GB22751@roeck-us.net> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Johannes Weiner Jan. 23, 2015, 2:18 p.m. UTC

Hi Guenter,

CC'ing Christoph for slub-stuff:

On Thu, Jan 22, 2015 at 09:08:02PM -0800, Guenter Roeck wrote:
> On Thu, Jan 22, 2015 at 03:05:17PM -0800, akpm@linux-foundation.org wrote:
> > The mm-of-the-moment snapshot 2015-01-22-15-04 has been uploaded to
> > 
> >    http://www.ozlabs.org/~akpm/mmotm/
> > 
> qemu test for ppc64 fails with
> 
> Unable to handle kernel paging request for data at address 0x0000af50
> Faulting instruction address: 0xc00000000089d5d4
> Oops: Kernel access of bad area, sig: 11 [#1]
> 
> with the following call stack:
> 
> Call Trace:
> [c00000003d32f920] [c00000000089d588] .__slab_alloc.isra.44+0x7c/0x6f4
> (unreliable)
> [c00000003d32fa90] [c00000000020cf8c] .kmem_cache_alloc_node_trace+0x12c/0x3b0
> [c00000003d32fb60] [c000000000bceeb4] .mem_cgroup_init+0x128/0x1b0
> [c00000003d32fbf0] [c00000000000a2b4] .do_one_initcall+0xd4/0x260
> [c00000003d32fce0] [c000000000ba26a8] .kernel_init_freeable+0x244/0x32c
> [c00000003d32fdb0] [c00000000000ac24] .kernel_init+0x24/0x140
> [c00000003d32fe30] [c000000000009564] .ret_from_kernel_thread+0x58/0x74
> 
> bisect log:

[...]

> # first bad commit: [a40d0d2cf21e2714e9a6c842085148c938bf36ab] mm: memcontrol: remove unnecessary soft limit tree node test

The change in question is this:

    mm: memcontrol: remove unnecessary soft limit tree node test
    
    kzalloc_node() automatically falls back to nodes with suitable memory.
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.cz>
    Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


--

Is the assumption of this patch wrong?  Does the specified node have
to be online for the fallback to work?

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) Jan. 23, 2015, 3:17 p.m. UTC | #1

On Fri, 23 Jan 2015, Johannes Weiner wrote:

> Is the assumption of this patch wrong?  Does the specified node have
> to be online for the fallback to work?

Nodes that are offline have no control structures allocated and thus
allocations will likely segfault when the address of the controls
structure for the node is accessed.

If we wanted to prevent that then every allocation would have to add a
check to see if the nodes are online which would impact performance.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Guenter Roeck Jan. 23, 2015, 3:46 p.m. UTC | #2

On 01/23/2015 06:18 AM, Johannes Weiner wrote:
> Hi Guenter,
>
> CC'ing Christoph for slub-stuff:
>
> On Thu, Jan 22, 2015 at 09:08:02PM -0800, Guenter Roeck wrote:
>> On Thu, Jan 22, 2015 at 03:05:17PM -0800, akpm@linux-foundation.org wrote:
>>> The mm-of-the-moment snapshot 2015-01-22-15-04 has been uploaded to
>>>
>>>     http://www.ozlabs.org/~akpm/mmotm/
>>>
>> qemu test for ppc64 fails with
>>
>> Unable to handle kernel paging request for data at address 0x0000af50
>> Faulting instruction address: 0xc00000000089d5d4
>> Oops: Kernel access of bad area, sig: 11 [#1]
>>
>> with the following call stack:
>>
>> Call Trace:
>> [c00000003d32f920] [c00000000089d588] .__slab_alloc.isra.44+0x7c/0x6f4
>> (unreliable)
>> [c00000003d32fa90] [c00000000020cf8c] .kmem_cache_alloc_node_trace+0x12c/0x3b0
>> [c00000003d32fb60] [c000000000bceeb4] .mem_cgroup_init+0x128/0x1b0
>> [c00000003d32fbf0] [c00000000000a2b4] .do_one_initcall+0xd4/0x260
>> [c00000003d32fce0] [c000000000ba26a8] .kernel_init_freeable+0x244/0x32c
>> [c00000003d32fdb0] [c00000000000ac24] .kernel_init+0x24/0x140
>> [c00000003d32fe30] [c000000000009564] .ret_from_kernel_thread+0x58/0x74
>>
>> bisect log:
>
> [...]
>
>> # first bad commit: [a40d0d2cf21e2714e9a6c842085148c938bf36ab] mm: memcontrol: remove unnecessary soft limit tree node test
>
> The change in question is this:
>
>      mm: memcontrol: remove unnecessary soft limit tree node test
>
>      kzalloc_node() automatically falls back to nodes with suitable memory.
>
>      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>      Acked-by: Michal Hocko <mhocko@suse.cz>
>      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
>      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fb9788af4a3e..10db4a654d68 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4539,13 +4539,10 @@ static void __init mem_cgroup_soft_limit_tree_init(void)
>   {
>          struct mem_cgroup_tree_per_node *rtpn;
>          struct mem_cgroup_tree_per_zone *rtpz;
> -       int tmp, node, zone;
> +       int node, zone;
>
>          for_each_node(node) {
> -               tmp = node;
> -               if (!node_state(node, N_NORMAL_MEMORY))
> -                       tmp = -1;
> -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> +               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
>                  BUG_ON(!rtpn);
>
>                  soft_limit_tree.rb_tree_per_node[node] = rtpn;
>
> --
>
> Is the assumption of this patch wrong?  Does the specified node have
> to be online for the fallback to work?
>

I added some debugging. First, the problem is only seen with SMP disabled.
Second, there is only one online node.

Without your patch:

Node 0 online 1 high 1 memory 1 cpu 0 normal 1 tmp 0 rtpn c00000003d240600
Node 1 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240640
Node 2 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240680

[ and so on up to node 255 ]

With your patch:

Node 0 online 1 high 1 memory 1 cpu 0 normal 1 rtpn c00000003d240600
Unable to handle kernel paging request for data at address 0x0000af50
Faulting instruction address: 0xc000000000895a3c
Oops: Kernel access of bad area, sig: 11 [#1]

The log message is after the call to kzalloc_node.

So it doesn't look like the fallback is working, at least not with ppc64
in non-SMP mode.

Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Johannes Weiner Jan. 23, 2015, 4:02 p.m. UTC | #3

On Fri, Jan 23, 2015 at 09:17:44AM -0600, Christoph Lameter wrote:
> On Fri, 23 Jan 2015, Johannes Weiner wrote:
> 
> > Is the assumption of this patch wrong?  Does the specified node have
> > to be online for the fallback to work?
> 
> Nodes that are offline have no control structures allocated and thus
> allocations will likely segfault when the address of the controls
> structure for the node is accessed.
> 
> If we wanted to prevent that then every allocation would have to add a
> check to see if the nodes are online which would impact performance.

Okay, that makes sense, thank you.

Andrew, can you please drop this patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Johannes Weiner Jan. 23, 2015, 4:03 p.m. UTC | #4

On Fri, Jan 23, 2015 at 07:46:43AM -0800, Guenter Roeck wrote:
> I added some debugging. First, the problem is only seen with SMP disabled.
> Second, there is only one online node.
> 
> Without your patch:
> 
> Node 0 online 1 high 1 memory 1 cpu 0 normal 1 tmp 0 rtpn c00000003d240600
> Node 1 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240640
> Node 2 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240680
> 
> [ and so on up to node 255 ]
> 
> With your patch:
> 
> Node 0 online 1 high 1 memory 1 cpu 0 normal 1 rtpn c00000003d240600
> Unable to handle kernel paging request for data at address 0x0000af50
> Faulting instruction address: 0xc000000000895a3c
> Oops: Kernel access of bad area, sig: 11 [#1]
> 
> The log message is after the call to kzalloc_node.
> 
> So it doesn't look like the fallback is working, at least not with ppc64
> in non-SMP mode.

Yep, and Christoph confirmed that it's not meant to work like that.
The patch is flawed.

Thanks for testing and sorry for breaking your setup.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Guenter Roeck Jan. 23, 2015, 4:59 p.m. UTC | #5

On 01/23/2015 08:02 AM, Johannes Weiner wrote:
> On Fri, Jan 23, 2015 at 09:17:44AM -0600, Christoph Lameter wrote:
>> On Fri, 23 Jan 2015, Johannes Weiner wrote:
>>
>>> Is the assumption of this patch wrong?  Does the specified node have
>>> to be online for the fallback to work?
>>
>> Nodes that are offline have no control structures allocated and thus
>> allocations will likely segfault when the address of the controls
>> structure for the node is accessed.
>>
>> If we wanted to prevent that then every allocation would have to add a
>> check to see if the nodes are online which would impact performance.
>
> Okay, that makes sense, thank you.
>
> Andrew, can you please drop this patch?
>
Problem is that there are three patches.

2537ffb mm: memcontrol: consolidate swap controller code
2f9b346 mm: memcontrol: consolidate memory controller initialization
a40d0d2 mm: memcontrol: remove unnecessary soft limit tree node test

Reverting (or dropping) a40d0d2 alone is not possible since it modifies
mem_cgroup_soft_limit_tree_init which is removed by 2f9b346.

Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) Jan. 23, 2015, 8:20 p.m. UTC | #6

On Fri, 23 Jan 2015, Johannes Weiner wrote:

>         struct mem_cgroup_tree_per_node *rtpn;
>         struct mem_cgroup_tree_per_zone *rtpz;
> -       int tmp, node, zone;
> +       int node, zone;
>
>         for_each_node(node) {

Do for_each_online_node(node) {

instead?

> -               tmp = node;
> -               if (!node_state(node, N_NORMAL_MEMORY))
> -                       tmp = -1;
> -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> +               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
>                 BUG_ON(!rtpn);
>
>                 soft_limit_tree.rb_tree_per_node[node] = rtpn;
>
> --
>
> Is the assumption of this patch wrong?  Does the specified node have
> to be online for the fallback to work?
>
> Thanks
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Guenter Roeck Jan. 23, 2015, 8:33 p.m. UTC | #7

On 01/23/2015 12:20 PM, Christoph Lameter wrote:
> On Fri, 23 Jan 2015, Johannes Weiner wrote:
>
>>          struct mem_cgroup_tree_per_node *rtpn;
>>          struct mem_cgroup_tree_per_zone *rtpz;
>> -       int tmp, node, zone;
>> +       int node, zone;
>>
>>          for_each_node(node) {
>
> Do for_each_online_node(node) {
>
> instead?
>

Wouldn't that have unintended consequences ? So far
rb tree nodes are allocated even if a node not online;
the above would change that. Are you saying it is
unnecessary to initialize rb tree nodes if the node
is not online ?

Not that I have any idea what is correct, it just seems odd
that the existing code would do all this allocation if it is not
necessary.

Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) Jan. 23, 2015, 9:09 p.m. UTC | #8

On Fri, 23 Jan 2015, Guenter Roeck wrote:

> Wouldn't that have unintended consequences ? So far
> rb tree nodes are allocated even if a node not online;
> the above would change that. Are you saying it is
> unnecessary to initialize rb tree nodes if the node
> is not online ?

It is not advisable to allocate since an offline node means that the
structure cannot be allocated on the node where it would be most
beneficial. Typically subsystems allocate the per node data structures
when the node is brought online.

> Not that I have any idea what is correct, it just seems odd
> that the existing code would do all this allocation if it is not
> necessary.

Not sure how the code there works just guessing from other subsystems.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Johannes Weiner Jan. 24, 2015, 7:16 a.m. UTC | #9

On Fri, Jan 23, 2015 at 03:09:20PM -0600, Christoph Lameter wrote:
> On Fri, 23 Jan 2015, Guenter Roeck wrote:
> 
> > Wouldn't that have unintended consequences ? So far
> > rb tree nodes are allocated even if a node not online;
> > the above would change that. Are you saying it is
> > unnecessary to initialize rb tree nodes if the node
> > is not online ?
> 
> It is not advisable to allocate since an offline node means that the
> structure cannot be allocated on the node where it would be most
> beneficial. Typically subsystems allocate the per node data structures
> when the node is brought online.

I would generally agree, but this code, which implements a userspace
interface, is already grotesquely inefficient and heavyhanded.  It's
also superseded in the next release, so we can just keep this simple
at this point.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Valdis Kl ē tnieks Jan. 25, 2015, 9:36 p.m. UTC | #10

On Sat, 24 Jan 2015 02:16:23 -0500, Johannes Weiner said:

> I would generally agree, but this code, which implements a userspace
> interface, is already grotesquely inefficient and heavyhanded.  It's
> also superseded in the next release, so we can just keep this simple
> at this point.

Wait, what?  Userspace interface that's superceded in the next release?

I *hope* this was intended as "Yes,it's ugly in v4 of the patch, v5 is
a lot cleaner..."

Johannes Weiner Jan. 26, 2015, 1:37 p.m. UTC | #11

On Sun, Jan 25, 2015 at 04:36:28PM -0500, Valdis.Kletnieks@vt.edu wrote:
> On Sat, 24 Jan 2015 02:16:23 -0500, Johannes Weiner said:
> 
> > I would generally agree, but this code, which implements a userspace
> > interface, is already grotesquely inefficient and heavyhanded.  It's
> > also superseded in the next release, so we can just keep this simple
> > at this point.
> 
> Wait, what?  Userspace interface that's superceded in the next release?

The existing interface and its implementation are going to remain in
place, obviously, we can't break userspace.  But the semantics are
ill-defined and the implementation bad to a point where we decided to
fix both by adding a second interface and encouraging users to switch.

Now if a user were to report that these off-node allocations are
actually creating problems in real life I would fix it.  But I'm
fairly certain that remote access costs are overshadowed by the
reclaim stalls this mechanism creates.

So what I was trying to say above is that I don't see a point in
complicating the v1 implementation for a questionable minor
optimization when v2 is already being added to address much more
severe shortcomings in v1.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Hocko Jan. 27, 2015, 5:24 p.m. UTC | #12

On Fri 23-01-15 09:17:44, Christoph Lameter wrote:
> On Fri, 23 Jan 2015, Johannes Weiner wrote:
> 
> > Is the assumption of this patch wrong?  Does the specified node have
> > to be online for the fallback to work?

Admittedly, I was checking only SLAB allocator when reviewing and
assuming SLUB would behave in the same way :/
But maybe I have misinterpreted the slab code as well and
get_node(struct kmem_cache *, int node) returns non-NULL for !online
nodes.

> Nodes that are offline have no control structures allocated and thus
> allocations will likely segfault when the address of the controls
> structure for the node is accessed.
> 
> If we wanted to prevent that then every allocation would have to add a
> check to see if the nodes are online which would impact performance.

I have briefly checked the code and it seems that many users are aware
of this and use the same construct Johannes used in the end or they use
cpu_to_node. But then there are other users doing:
net/openvswitch/flow_table.c:
        /* Initialize the default stat node. */
        stats = kmem_cache_alloc_node(flow_stats_cache,
                                      GFP_KERNEL | __GFP_ZERO, 0);

and this can blow up if Node0 is not online. I haven't checked other
callers but are we sure they all are aware of !online nodes? E.g.
dev_to_node() will return a node which is assigned to a device. I do not
see where exactly this is set to anything else than -1 (I got quickly
lost in set_dev_node callers). E.g. PCI bus sets its affinity from
bus->sysdata which seems to be initialized in pci_acpi_scan_root and
that is checking for an online node. Is it possible that some devices
will get the node from BIOS or by other means?

That being said I have no problem with checking node_online in the memcg
code which was reported to blow up here. I am just thinking whether it
is safe to simply blow up like that.

Christoph Lameter (Ampere) Jan. 28, 2015, 3:03 p.m. UTC | #13

On Tue, 27 Jan 2015, Michal Hocko wrote:

> Admittedly, I was checking only SLAB allocator when reviewing and
> assuming SLUB would behave in the same way :/
> But maybe I have misinterpreted the slab code as well and
> get_node(struct kmem_cache *, int node) returns non-NULL for !online
> nodes.

Oh. Just allocate from node 12345 in SLAB and you will also have a strange
failure.

> I have briefly checked the code and it seems that many users are aware
> of this and use the same construct Johannes used in the end or they use
> cpu_to_node. But then there are other users doing:
> net/openvswitch/flow_table.c:
>         /* Initialize the default stat node. */
>         stats = kmem_cache_alloc_node(flow_stats_cache,
>                                       GFP_KERNEL | __GFP_ZERO, 0);
>
> and this can blow up if Node0 is not online. I haven't checked other

Node 0 is special in many architectures and is guaranteed to exist.
PowerPC is a notable exception which causes frequent issues with NUMA
changes.

> That being said I have no problem with checking node_online in the memcg
> code which was reported to blow up here. I am just thinking whether it
> is safe to simply blow up like that.

Node numbers must be legitimate in order to be used. Same thing with
processor numbers. We cannot always check if they are online. The numbers
in use must be sane. We have notifier subsystems that do callbacks to
allow subsystems to add and remove new nodes and processors. Those should
be used to ensure that only legitimate node and processor numbers are
used.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

mmotm 2015-01-22-15-04: qemu failure due to 'mm: memcontrol: remove unnecessary soft limit tree node test'

Commit Message

Comments

Patch