BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
diff mbox series

Message ID 4a5108b4-5a2f-f83c-e6a8-5e0c9074ac69@linux.ibm.com
State New
Headers show
Series
  • BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
Related show

Commit Message

Karsten Graul Oct. 2, 2019, 2:50 p.m. UTC
net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")

I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
but I don't know if this is the correct way to fix it.

(Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)


kmem_cache_destroy() 
  -> shutdown_memcg_caches()
    -> shutdown_cache()
      -> __kmem_cache_shutdown()  (slub.c)
        -> free_partial()
          -> discard_slab()
	    -> free_slab()                                      -- call to __free_slab() is delayed
	      -> call_rcu(rcu_free_slab)
    -> memcg_unlink_cache()
      -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
    -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
    -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
                                                            so work_fn is not further involved here...
... rcu grace period ...
rcu_batch()
  ...
  -> rcu_free_slab()   (slub.c)
    -> __free_slab()
      -> uncharge_slab_page()
        -> memcg_uncharge_slab()
	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
	  -> mem_cgroup_lruvec(memcg)
	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
	    -> lruvec->pgdat = pgdat;                           -- *crash*

The crash log:

349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
349.361223¨ Fault in home space mode while using kernel ASCE.
349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
349.361619¨ Call Trace:
349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
349.361761¨  Ý<0000000000000000>¨ 0x0
349.361765¨ INFO: lockdep is turned off.
349.361769¨ Last Breaking-Event-Address:
349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt


A fix that works for me (RFC):

Comments

Roman Gushchin Oct. 2, 2019, 7:41 p.m. UTC | #1
Hello, Karsten!

Thank you for the report!

On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> 
> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> 
> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> but I don't know if this is the correct way to fix it.
> 
> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> 
> 
> kmem_cache_destroy() 
>   -> shutdown_memcg_caches()
>     -> shutdown_cache()
>       -> __kmem_cache_shutdown()  (slub.c)
>         -> free_partial()
>           -> discard_slab()
> 	    -> free_slab()                                      -- call to __free_slab() is delayed
> 	      -> call_rcu(rcu_free_slab)
>     -> memcg_unlink_cache()
>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
>                                                             so work_fn is not further involved here...
> ... rcu grace period ...
> rcu_batch()
>   ...
>   -> rcu_free_slab()   (slub.c)
>     -> __free_slab()
>       -> uncharge_slab_page()
>         -> memcg_uncharge_slab()
> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
> 	  -> mem_cgroup_lruvec(memcg)
> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
> 
> The crash log:

Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
called from the rcu path races with kmem_cache_destroy(), which is supposed
to be called when there are no outstanding allocations (and corresponding pages).
Any charged slab page actually holds a reference to the kmem_cache, which prevents
its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
it.

If my thoughts are correct, the commit you've mentioned didn't introduced this
issue, it just made it easier to reproduce.

The proposed fix looks dubious to me: the problem isn't in the memcg pointer
(it's just a luck that it crashes on it), and it seems incorrect to not decrease
the slab statistics of the original memory cgroup.

What we probably need to do instead is to extend flush_memcg_workqueue() to
wait for all outstanding rcu free callbacks. I have to think a bit what's the best
way to fix it. How easy is to reproduce the problem?

> 
> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space

Btw, haven't you noticed anything suspicious in dmesg before this line?

Thank you!

Roman

> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
> 349.361223¨ Fault in home space mode while using kernel ASCE.
> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> 349.361619¨ Call Trace:
> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
> 349.361761¨  Ý<0000000000000000>¨ 0x0
> 349.361765¨ INFO: lockdep is turned off.
> 349.361769¨ Last Breaking-Event-Address:
> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> A fix that works for me (RFC):
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index a62372d0f271..b19a3f940338 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> 
>         rcu_read_lock();
>         memcg = READ_ONCE(s->memcg_params.memcg);
> -       if (likely(!mem_cgroup_is_root(memcg))) {
> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>                 memcg_kmem_uncharge_memcg(page, order, memcg);
> 
> -- 
> Karsten
> 
> (I'm a dude)
> 
>
Roman Gushchin Oct. 3, 2019, 3:35 a.m. UTC | #2
On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> Hello, Karsten!
> 
> Thank you for the report!
> 
> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> > 
> > net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> > Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> > calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> > Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> > fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> > 
> > I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> > but I don't know if this is the correct way to fix it.
> > 
> > (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> > 
> > 
> > kmem_cache_destroy() 
> >   -> shutdown_memcg_caches()
> >     -> shutdown_cache()
> >       -> __kmem_cache_shutdown()  (slub.c)
> >         -> free_partial()
> >           -> discard_slab()
> > 	    -> free_slab()                                      -- call to __free_slab() is delayed
> > 	      -> call_rcu(rcu_free_slab)
> >     -> memcg_unlink_cache()
> >       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
> >     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
> >                                                             so work_fn is not further involved here...
> > ... rcu grace period ...
> > rcu_batch()
> >   ...
> >   -> rcu_free_slab()   (slub.c)
> >     -> __free_slab()
> >       -> uncharge_slab_page()
> >         -> memcg_uncharge_slab()
> > 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
> > 	  -> mem_cgroup_lruvec(memcg)
> > 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> > 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
> > 	    -> lruvec->pgdat = pgdat;                           -- *crash*
> > 
> > The crash log:
> 
> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> called from the rcu path races with kmem_cache_destroy(), which is supposed
> to be called when there are no outstanding allocations (and corresponding pages).
> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> it.
> 
> If my thoughts are correct, the commit you've mentioned didn't introduced this
> issue, it just made it easier to reproduce.
> 
> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
> the slab statistics of the original memory cgroup.
> 
> What we probably need to do instead is to extend flush_memcg_workqueue() to
> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
> way to fix it. How easy is to reproduce the problem?

After a second thought, flush_memcg_workqueue() already contains
a rcu_barrier() call, so now first suspicion is that the last free() call
occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
a case?

Thanks!

> 
> > 
> > 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
> 
> Btw, haven't you noticed anything suspicious in dmesg before this line?
> 
> Thank you!
> 
> Roman
> 
> > 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
> > 349.361223¨ Fault in home space mode while using kernel ASCE.
> > 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> > 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
> > 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> > 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> > 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> > 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> > 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> > 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> > 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> > 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> > 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> > 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> > 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> > 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> > 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> > 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> > 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> > 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> > 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
> > 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> > 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> > 349.361619¨ Call Trace:
> > 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
> > 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
> > 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
> > 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
> > 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
> > 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
> > 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
> > 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
> > 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
> > 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
> > 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
> > 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
> > 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
> > 349.361761¨  Ý<0000000000000000>¨ 0x0
> > 349.361765¨ INFO: lockdep is turned off.
> > 349.361769¨ Last Breaking-Event-Address:
> > 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
> > 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > 
> > A fix that works for me (RFC):
> > 
> > diff --git a/mm/slab.h b/mm/slab.h
> > index a62372d0f271..b19a3f940338 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> > 
> >         rcu_read_lock();
> >         memcg = READ_ONCE(s->memcg_params.memcg);
> > -       if (likely(!mem_cgroup_is_root(memcg))) {
> > +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >                 memcg_kmem_uncharge_memcg(page, order, memcg);
> > 
> > -- 
> > Karsten
> > 
> > (I'm a dude)
> > 
> >
Karsten Graul Oct. 3, 2019, 8:48 a.m. UTC | #3
Hello, Roman!

On 03/10/2019 05:35, Roman Gushchin wrote:
> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
>> Hello, Karsten!
>>
>> Thank you for the report!
>>
>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
>>>
>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
>>>
>>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
>>> but I don't know if this is the correct way to fix it.
>>>
>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
>>>
>>>
>>> kmem_cache_destroy() 
>>>   -> shutdown_memcg_caches()
>>>     -> shutdown_cache()
>>>       -> __kmem_cache_shutdown()  (slub.c)
>>>         -> free_partial()
>>>           -> discard_slab()
>>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
>>> 	      -> call_rcu(rcu_free_slab)
>>>     -> memcg_unlink_cache()
>>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
>>>                                                             so work_fn is not further involved here...
>>> ... rcu grace period ...
>>> rcu_batch()
>>>   ...
>>>   -> rcu_free_slab()   (slub.c)
>>>     -> __free_slab()
>>>       -> uncharge_slab_page()
>>>         -> memcg_uncharge_slab()
>>> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
>>> 	  -> mem_cgroup_lruvec(memcg)
>>> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
>>> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
>>> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
>>>
>>> The crash log:
>>
>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
>> called from the rcu path races with kmem_cache_destroy(), which is supposed
>> to be called when there are no outstanding allocations (and corresponding pages).
>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
>> it.

I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_cache_destroy()
is already done when __free_slab() is called. Maybe the below trace shows you the order of calls on my
system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then 
it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_cache_destroy() is done.
You see that the smc code is getting control again after that.
The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
Ages later (see time difference) the rcu grace period ends and rcu_free_slab() is called, and it
crashes. 
I hope that helps!

[  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
[  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
[  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
[  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00

[  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
[  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
[  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900

[  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00

[  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
[  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
[  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00

[  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
   kmem_cache_destroy is done
[  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
   slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
[  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
   smc module exit code gets back control ...
[  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
[  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
   much later the rcu callbacks are invoked, and will crash

>>
>> If my thoughts are correct, the commit you've mentioned didn't introduced this
>> issue, it just made it easier to reproduce.
>>
>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
>> the slab statistics of the original memory cgroup.

I was quite sure that my approach is way to simple, it's better when the mm experts
work on that.

>>
>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
>> way to fix it. How easy is to reproduce the problem?

I can reproduce this at will and I am happy to test any fixes you provide.

> 
> After a second thought, flush_memcg_workqueue() already contains
> a rcu_barrier() call, so now first suspicion is that the last free() call
> occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
> a case?
> 

In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
that shutdown_memcg_caches() is called which setup the rcu callbacks. 
So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above 
using the RCU path in slub.c you can see when the callbacks are set up and why no warning 
is printed.


Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
readers should be done now, so the rcu callbacks and the worker are not needed?
What am I missing here (I am sure I miss something, I am completely new in the mm area)?

> Thanks!
> 
>>
>>>
>>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
>>
>> Btw, haven't you noticed anything suspicious in dmesg before this line?

There is no error or warning line in dmesg before this line. Actually, I think that
all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
delayed in any case when RCU is in use, right?


Karsten

>>
>> Thank you!
>>
>> Roman
>>
>>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
>>> 349.361223¨ Fault in home space mode while using kernel ASCE.
>>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
>>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
>>> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
>>> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
>>> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
>>> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
>>> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
>>> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
>>> 349.361619¨ Call Trace:
>>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
>>> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
>>> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
>>> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
>>> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
>>> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
>>> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
>>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
>>> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
>>> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
>>> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
>>> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
>>> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
>>> 349.361761¨  Ý<0000000000000000>¨ 0x0
>>> 349.361765¨ INFO: lockdep is turned off.
>>> 349.361769¨ Last Breaking-Event-Address:
>>> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
>>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
>>>
>>>
>>> A fix that works for me (RFC):
>>>
>>> diff --git a/mm/slab.h b/mm/slab.h
>>> index a62372d0f271..b19a3f940338 100644
>>> --- a/mm/slab.h
>>> +++ b/mm/slab.h
>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>
>>>         rcu_read_lock();
>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>
>>> -- 
>>> Karsten
>>>
>>> (I'm a dude)
>>>
>>>
Roman Gushchin Oct. 3, 2019, 4:11 p.m. UTC | #4
On Thu, Oct 03, 2019 at 10:48:15AM +0200, Karsten Graul wrote:
> Hello, Roman!
> 
> On 03/10/2019 05:35, Roman Gushchin wrote:
> > On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> >> Hello, Karsten!
> >>
> >> Thank you for the report!
> >>
> >> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> >>>
> >>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> >>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> >>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> >>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> >>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> >>>
> >>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> >>> but I don't know if this is the correct way to fix it.
> >>>
> >>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> >>>
> >>>
> >>> kmem_cache_destroy() 
> >>>   -> shutdown_memcg_caches()
> >>>     -> shutdown_cache()
> >>>       -> __kmem_cache_shutdown()  (slub.c)
> >>>         -> free_partial()
> >>>           -> discard_slab()
> >>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
> >>> 	      -> call_rcu(rcu_free_slab)
> >>>     -> memcg_unlink_cache()
> >>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
> >>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
> >>>                                                             so work_fn is not further involved here...
> >>> ... rcu grace period ...
> >>> rcu_batch()
> >>>   ...
> >>>   -> rcu_free_slab()   (slub.c)
> >>>     -> __free_slab()
> >>>       -> uncharge_slab_page()
> >>>         -> memcg_uncharge_slab()
> >>> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
> >>> 	  -> mem_cgroup_lruvec(memcg)
> >>> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> >>> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
> >>> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
> >>>
> >>> The crash log:
> >>
> >> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> >> called from the rcu path races with kmem_cache_destroy(), which is supposed
> >> to be called when there are no outstanding allocations (and corresponding pages).
> >> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> >> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> >> it.
> 
> I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_cache_destroy()
> is already done when __free_slab() is called. Maybe the below trace shows you the order of calls on my
> system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then 
> it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_cache_destroy() is done.

Right, and this is the problem. The question when call_rcu() in free_slab() has
been called: if it did happen before the kmem_cache_destroy(), it's clearly
a bug inside the slab allocator. Otherwise it's probably an incorrect API
invocation.

> You see that the smc code is getting control again after that.
> The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
> Ages later (see time difference) the rcu grace period ends and rcu_free_slab() is called, and it
> crashes. 
> I hope that helps!
> 
> [  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
> [  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
> [  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
> [  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00

Does it mean that call_rcu() has been called after kmem_cache_destroy()?
In this case, do you know who called it?

I'd add an atomic flag to the root kmem_cache, set it at the beginning of the
kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
Just please make sure you're looking at the root kmem_cache flag, not the memcg
one.

Thank you!

Roman

> 
> [  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
> [  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
> [  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
> 
> [  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
> 
> [  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
> [  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
> [  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
> 
> [  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>    kmem_cache_destroy is done
> [  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>    slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
> [  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>    smc module exit code gets back control ...
> [  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
> [  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>    much later the rcu callbacks are invoked, and will crash
> 
> >>
> >> If my thoughts are correct, the commit you've mentioned didn't introduced this
> >> issue, it just made it easier to reproduce.
> >>
> >> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> >> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
> >> the slab statistics of the original memory cgroup.
> 
> I was quite sure that my approach is way to simple, it's better when the mm experts
> work on that.
> 
> >>
> >> What we probably need to do instead is to extend flush_memcg_workqueue() to
> >> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
> >> way to fix it. How easy is to reproduce the problem?
> 
> I can reproduce this at will and I am happy to test any fixes you provide.
> 
> > 
> > After a second thought, flush_memcg_workqueue() already contains
> > a rcu_barrier() call, so now first suspicion is that the last free() call
> > occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
> > a case?
> > 
> 
> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
> that shutdown_memcg_caches() is called which setup the rcu callbacks.

These are callbacks to destroy kmem_caches, not pages.

> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above 
> using the RCU path in slub.c you can see when the callbacks are set up and why no warning 
> is printed.
> 
> 
> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
> an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
> readers should be done now, so the rcu callbacks and the worker are not needed?
> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
> 
> > Thanks!
> > 
> >>
> >>>
> >>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
> >>
> >> Btw, haven't you noticed anything suspicious in dmesg before this line?
> 
> There is no error or warning line in dmesg before this line. Actually, I think that
> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
> delayed in any case when RCU is in use, right?
> 
> 
> Karsten
> 
> >>
> >> Thank you!
> >>
> >> Roman
> >>
> >>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
> >>> 349.361223¨ Fault in home space mode while using kernel ASCE.
> >>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> >>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
> >>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> >>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> >>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> >>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> >>> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> >>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> >>> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> >>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> >>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> >>> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> >>> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> >>> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> >>> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> >>> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> >>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
> >>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> >>> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> >>> 349.361619¨ Call Trace:
> >>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
> >>> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
> >>> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
> >>> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
> >>> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
> >>> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
> >>> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
> >>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
> >>> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
> >>> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
> >>> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
> >>> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
> >>> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
> >>> 349.361761¨  Ý<0000000000000000>¨ 0x0
> >>> 349.361765¨ INFO: lockdep is turned off.
> >>> 349.361769¨ Last Breaking-Event-Address:
> >>> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
> >>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
> >>>
> >>>
> >>> A fix that works for me (RFC):
> >>>
> >>> diff --git a/mm/slab.h b/mm/slab.h
> >>> index a62372d0f271..b19a3f940338 100644
> >>> --- a/mm/slab.h
> >>> +++ b/mm/slab.h
> >>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >>>
> >>>         rcu_read_lock();
> >>>         memcg = READ_ONCE(s->memcg_params.memcg);
> >>> -       if (likely(!mem_cgroup_is_root(memcg))) {
> >>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
> >>>
> >>> -- 
> >>> Karsten
> >>>
> >>> (I'm a dude)
> >>>
> >>>
>
Karsten Graul Oct. 3, 2019, 4:24 p.m. UTC | #5
On 03/10/2019 18:11, Roman Gushchin wrote:
> On Thu, Oct 03, 2019 at 10:48:15AM +0200, Karsten Graul wrote:
>> Hello, Roman!
>>
>> On 03/10/2019 05:35, Roman Gushchin wrote:
>>> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
>>>> Hello, Karsten!
>>>>
>>>> Thank you for the report!
>>>>
>>>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
>>>>>
>>>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
>>>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
>>>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
>>>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
>>>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
>>>>>
>>>>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
>>>>> but I don't know if this is the correct way to fix it.
>>>>>
>>>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
>>>>>
>>>>>
>>>>> kmem_cache_destroy() 
>>>>>   -> shutdown_memcg_caches()
>>>>>     -> shutdown_cache()
>>>>>       -> __kmem_cache_shutdown()  (slub.c)
>>>>>         -> free_partial()
>>>>>           -> discard_slab()
>>>>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
>>>>> 	      -> call_rcu(rcu_free_slab)
>>>>>     -> memcg_unlink_cache()
>>>>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>>>>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>>>>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
>>>>>                                                             so work_fn is not further involved here...
>>>>> ... rcu grace period ...
>>>>> rcu_batch()
>>>>>   ...
>>>>>   -> rcu_free_slab()   (slub.c)
>>>>>     -> __free_slab()
>>>>>       -> uncharge_slab_page()
>>>>>         -> memcg_uncharge_slab()
>>>>> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
>>>>> 	  -> mem_cgroup_lruvec(memcg)
>>>>> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
>>>>> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
>>>>> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
>>>>>
>>>>> The crash log:
>>>>
>>>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
>>>> called from the rcu path races with kmem_cache_destroy(), which is supposed
>>>> to be called when there are no outstanding allocations (and corresponding pages).
>>>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
>>>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
>>>> it.
>>
>> I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_cache_destroy()
>> is already done when __free_slab() is called. Maybe the below trace shows you the order of calls on my
>> system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then 
>> it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_cache_destroy() is done.
> 
> Right, and this is the problem. The question when call_rcu() in free_slab() has
> been called: if it did happen before the kmem_cache_destroy(), it's clearly
> a bug inside the slab allocator. Otherwise it's probably an incorrect API
> invocation.
> 
>> You see that the smc code is getting control again after that.
>> The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
>> Ages later (see time difference) the rcu grace period ends and rcu_free_slab() is called, and it
>> crashes. 
>> I hope that helps!
>>
>> [  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
>> [  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
>> [  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
>> [  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00
> 
> Does it mean that call_rcu() has been called after kmem_cache_destroy()?
> In this case, do you know who called it?
> 

Let me show the 'call graph' again, call_rcu() is called by free_slab()
as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
the memcg reference.

kmem_cache_destroy() 
  -> shutdown_memcg_caches()
    -> shutdown_cache()
      -> __kmem_cache_shutdown()  (slub.c)
        -> free_partial()
          -> discard_slab()
	    -> free_slab()                                      -- call to __free_slab() is delayed
	      -> call_rcu(rcu_free_slab)
    -> memcg_unlink_cache()
      -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!


> I'd add an atomic flag to the root kmem_cache, set it at the beginning of the
> kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
> Just please make sure you're looking at the root kmem_cache flag, not the memcg
> one.
> 
> Thank you!
> 
> Roman
> 
>>
>> [  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
>> [  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
>> [  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
>>
>> [  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
>>
>> [  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
>> [  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
>> [  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
>>
>> [  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>>    kmem_cache_destroy is done
>> [  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>>    slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
>> [  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>>    smc module exit code gets back control ...
>> [  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
>> [  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>>    much later the rcu callbacks are invoked, and will crash
>>
>>>>
>>>> If my thoughts are correct, the commit you've mentioned didn't introduced this
>>>> issue, it just made it easier to reproduce.
>>>>
>>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>>>> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
>>>> the slab statistics of the original memory cgroup.
>>
>> I was quite sure that my approach is way to simple, it's better when the mm experts
>> work on that.
>>
>>>>
>>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
>>>> way to fix it. How easy is to reproduce the problem?
>>
>> I can reproduce this at will and I am happy to test any fixes you provide.
>>
>>>
>>> After a second thought, flush_memcg_workqueue() already contains
>>> a rcu_barrier() call, so now first suspicion is that the last free() call
>>> occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
>>> a case?
>>>
>>
>> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
>> that shutdown_memcg_caches() is called which setup the rcu callbacks.
> 
> These are callbacks to destroy kmem_caches, not pages.
> 
>> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above 
>> using the RCU path in slub.c you can see when the callbacks are set up and why no warning 
>> is printed.
>>
>>
>> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
>> an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
>> readers should be done now, so the rcu callbacks and the worker are not needed?
>> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
>>
>>> Thanks!
>>>
>>>>
>>>>>
>>>>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
>>>>
>>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
>>
>> There is no error or warning line in dmesg before this line. Actually, I think that
>> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
>> delayed in any case when RCU is in use, right?
>>
>>
>> Karsten
>>
>>>>
>>>> Thank you!
>>>>
>>>> Roman
>>>>
>>>>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
>>>>> 349.361223¨ Fault in home space mode while using kernel ASCE.
>>>>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>>>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
>>>>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>>>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>>>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>>>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>>>> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>>>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>>>> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>>>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
>>>>> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
>>>>> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
>>>>> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
>>>>> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
>>>>> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
>>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
>>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
>>>>> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
>>>>> 349.361619¨ Call Trace:
>>>>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
>>>>> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
>>>>> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
>>>>> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
>>>>> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
>>>>> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
>>>>> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
>>>>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
>>>>> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
>>>>> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
>>>>> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
>>>>> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
>>>>> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
>>>>> 349.361761¨  Ý<0000000000000000>¨ 0x0
>>>>> 349.361765¨ INFO: lockdep is turned off.
>>>>> 349.361769¨ Last Breaking-Event-Address:
>>>>> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
>>>>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
>>>>>
>>>>>
>>>>> A fix that works for me (RFC):
>>>>>
>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>> index a62372d0f271..b19a3f940338 100644
>>>>> --- a/mm/slab.h
>>>>> +++ b/mm/slab.h
>>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>>>
>>>>>         rcu_read_lock();
>>>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>>>
>>>>> -- 
>>>>> Karsten
>>>>>
>>>>> (I'm a dude)
>>>>>
>>>>>
>>
Roman Gushchin Oct. 3, 2019, 5:34 p.m. UTC | #6
On Thu, Oct 03, 2019 at 06:24:27PM +0200, Karsten Graul wrote:
> On 03/10/2019 18:11, Roman Gushchin wrote:
> > On Thu, Oct 03, 2019 at 10:48:15AM +0200, Karsten Graul wrote:
> >> Hello, Roman!
> >>
> >> On 03/10/2019 05:35, Roman Gushchin wrote:
> >>> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> >>>> Hello, Karsten!
> >>>>
> >>>> Thank you for the report!
> >>>>
> >>>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> >>>>>
> >>>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> >>>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> >>>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> >>>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> >>>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> >>>>>
> >>>>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> >>>>> but I don't know if this is the correct way to fix it.
> >>>>>
> >>>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> >>>>>
> >>>>>
> >>>>> kmem_cache_destroy() 
> >>>>>   -> shutdown_memcg_caches()
> >>>>>     -> shutdown_cache()
> >>>>>       -> __kmem_cache_shutdown()  (slub.c)
> >>>>>         -> free_partial()
> >>>>>           -> discard_slab()
> >>>>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
> >>>>> 	      -> call_rcu(rcu_free_slab)
> >>>>>     -> memcg_unlink_cache()
> >>>>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
> >>>>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >>>>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
> >>>>>                                                             so work_fn is not further involved here...
> >>>>> ... rcu grace period ...
> >>>>> rcu_batch()
> >>>>>   ...
> >>>>>   -> rcu_free_slab()   (slub.c)
> >>>>>     -> __free_slab()
> >>>>>       -> uncharge_slab_page()
> >>>>>         -> memcg_uncharge_slab()
> >>>>> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
> >>>>> 	  -> mem_cgroup_lruvec(memcg)
> >>>>> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> >>>>> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
> >>>>> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
> >>>>>
> >>>>> The crash log:
> >>>>
> >>>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> >>>> called from the rcu path races with kmem_cache_destroy(), which is supposed
> >>>> to be called when there are no outstanding allocations (and corresponding pages).
> >>>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> >>>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> >>>> it.
> >>
> >> I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_cache_destroy()
> >> is already done when __free_slab() is called. Maybe the below trace shows you the order of calls on my
> >> system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then 
> >> it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_cache_destroy() is done.
> > 
> > Right, and this is the problem. The question when call_rcu() in free_slab() has
> > been called: if it did happen before the kmem_cache_destroy(), it's clearly
> > a bug inside the slab allocator. Otherwise it's probably an incorrect API
> > invocation.
> > 
> >> You see that the smc code is getting control again after that.
> >> The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
> >> Ages later (see time difference) the rcu grace period ends and rcu_free_slab() is called, and it
> >> crashes. 
> >> I hope that helps!
> >>
> >> [  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
> >> [  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
> >> [  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
> >> [  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00
> > 
> > Does it mean that call_rcu() has been called after kmem_cache_destroy()?
> > In this case, do you know who called it?
> > 
> 
> Let me show the 'call graph' again, call_rcu() is called by free_slab()
> as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
> the memcg reference.
> 
> kmem_cache_destroy() 
>   -> shutdown_memcg_caches()
>     -> shutdown_cache()
>       -> __kmem_cache_shutdown()  (slub.c)
>         -> free_partial()
>           -> discard_slab()
> 	    -> free_slab()                                      -- call to __free_slab() is delayed
> 	      -> call_rcu(rcu_free_slab)
>     -> memcg_unlink_cache()
>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!

Ah, got it, thank you!

Then something like this should work. Can you, please, confirm that is solves
the problem?

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..d916e986f094 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -180,8 +180,11 @@ static void destroy_memcg_params(struct kmem_cache *s)
 {
        if (is_root_cache(s))
                kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-       else
+       else {
+               mem_cgroup_put(s->memcg_params.memcg);
+               WRITE_ONCE(s->memcg_params.memcg, NULL);
                percpu_ref_exit(&s->memcg_params.refcnt);
+       }
 }
 
 static void free_memcg_params(struct rcu_head *rcu)
@@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
        } else {
                list_del(&s->memcg_params.children_node);
                list_del(&s->memcg_params.kmem_caches_node);
-               mem_cgroup_put(s->memcg_params.memcg);
-               WRITE_ONCE(s->memcg_params.memcg, NULL);
        }
 }
 #else


--

Thank you!

Roman


> 
> 
> > I'd add an atomic flag to the root kmem_cache, set it at the beginning of the
> > kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
> > Just please make sure you're looking at the root kmem_cache flag, not the memcg
> > one.
> > 
> > Thank you!
> > 
> > Roman
> > 
> >>
> >> [  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
> >> [  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
> >> [  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
> >>
> >> [  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
> >>
> >> [  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
> >> [  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
> >> [  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
> >>
> >> [  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
> >>    kmem_cache_destroy is done
> >> [  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
> >>    slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
> >> [  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
> >>    smc module exit code gets back control ...
> >> [  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
> >> [  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
> >>    much later the rcu callbacks are invoked, and will crash
> >>
> >>>>
> >>>> If my thoughts are correct, the commit you've mentioned didn't introduced this
> >>>> issue, it just made it easier to reproduce.
> >>>>
> >>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> >>>> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
> >>>> the slab statistics of the original memory cgroup.
> >>
> >> I was quite sure that my approach is way to simple, it's better when the mm experts
> >> work on that.
> >>
> >>>>
> >>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
> >>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
> >>>> way to fix it. How easy is to reproduce the problem?
> >>
> >> I can reproduce this at will and I am happy to test any fixes you provide.
> >>
> >>>
> >>> After a second thought, flush_memcg_workqueue() already contains
> >>> a rcu_barrier() call, so now first suspicion is that the last free() call
> >>> occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
> >>> a case?
> >>>
> >>
> >> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
> >> that shutdown_memcg_caches() is called which setup the rcu callbacks.
> > 
> > These are callbacks to destroy kmem_caches, not pages.
> > 
> >> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above 
> >> using the RCU path in slub.c you can see when the callbacks are set up and why no warning 
> >> is printed.
> >>
> >>
> >> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
> >> an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
> >> readers should be done now, so the rcu callbacks and the worker are not needed?
> >> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
> >>
> >>> Thanks!
> >>>
> >>>>
> >>>>>
> >>>>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
> >>>>
> >>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
> >>
> >> There is no error or warning line in dmesg before this line. Actually, I think that
> >> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
> >> delayed in any case when RCU is in use, right?
> >>
> >>
> >> Karsten
> >>
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Roman
> >>>>
> >>>>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
> >>>>> 349.361223¨ Fault in home space mode while using kernel ASCE.
> >>>>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> >>>>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
> >>>>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> >>>>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> >>>>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> >>>>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> >>>>> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> >>>>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> >>>>> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> >>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>>>> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> >>>>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> >>>>> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> >>>>> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> >>>>> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> >>>>> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> >>>>> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> >>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
> >>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> >>>>> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> >>>>> 349.361619¨ Call Trace:
> >>>>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
> >>>>> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
> >>>>> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
> >>>>> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
> >>>>> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
> >>>>> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
> >>>>> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
> >>>>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
> >>>>> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
> >>>>> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
> >>>>> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
> >>>>> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
> >>>>> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
> >>>>> 349.361761¨  Ý<0000000000000000>¨ 0x0
> >>>>> 349.361765¨ INFO: lockdep is turned off.
> >>>>> 349.361769¨ Last Breaking-Event-Address:
> >>>>> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
> >>>>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
> >>>>>
> >>>>>
> >>>>> A fix that works for me (RFC):
> >>>>>
> >>>>> diff --git a/mm/slab.h b/mm/slab.h
> >>>>> index a62372d0f271..b19a3f940338 100644
> >>>>> --- a/mm/slab.h
> >>>>> +++ b/mm/slab.h
> >>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >>>>>
> >>>>>         rcu_read_lock();
> >>>>>         memcg = READ_ONCE(s->memcg_params.memcg);
> >>>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
> >>>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >>>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >>>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >>>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
> >>>>>
> >>>>> -- 
> >>>>> Karsten
> >>>>>
> >>>>> (I'm a dude)
> >>>>>
> >>>>>
> >>
> 
> -- 
> Karsten
> 
> (I'm a dude)
>
Karsten Graul Oct. 4, 2019, 8:59 a.m. UTC | #7
>> Let me show the 'call graph' again, call_rcu() is called by free_slab()
>> as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
>> the memcg reference.
>>
>> kmem_cache_destroy() 
>>   -> shutdown_memcg_caches()
>>     -> shutdown_cache()
>>       -> __kmem_cache_shutdown()  (slub.c)
>>         -> free_partial()
>>           -> discard_slab()
>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
>> 	      -> call_rcu(rcu_free_slab)
>>     -> memcg_unlink_cache()
>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
> 
> Ah, got it, thank you!
> 
> Then something like this should work. Can you, please, confirm that is solves
> the problem?
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807490fe217a..d916e986f094 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -180,8 +180,11 @@ static void destroy_memcg_params(struct kmem_cache *s)
>  {
>         if (is_root_cache(s))
>                 kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> -       else
> +       else {
> +               mem_cgroup_put(s->memcg_params.memcg);
> +               WRITE_ONCE(s->memcg_params.memcg, NULL);
>                 percpu_ref_exit(&s->memcg_params.refcnt);
> +       }
>  }
>  
>  static void free_memcg_params(struct rcu_head *rcu)
> @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> -               mem_cgroup_put(s->memcg_params.memcg);
> -               WRITE_ONCE(s->memcg_params.memcg, NULL);
>         }
>  }
>  #else
> 

I tested this fix and can confirm that it solved the problem!

> 
> --
> 
> Thank you!
> 
> Roman
> 
> 
>>
>>
>>> I'd add an atomic flag to the root kmem_cache, set it at the beginning of the
>>> kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
>>> Just please make sure you're looking at the root kmem_cache flag, not the memcg
>>> one.
>>>
>>> Thank you!
>>>
>>> Roman
>>>
>>>>
>>>> [  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
>>>> [  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
>>>> [  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
>>>>
>>>> [  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
>>>>
>>>> [  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
>>>> [  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
>>>> [  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
>>>>
>>>> [  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>>>>    kmem_cache_destroy is done
>>>> [  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>>>>    slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
>>>> [  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>>>>    smc module exit code gets back control ...
>>>> [  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
>>>> [  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>>>>    much later the rcu callbacks are invoked, and will crash
>>>>
>>>>>>
>>>>>> If my thoughts are correct, the commit you've mentioned didn't introduced this
>>>>>> issue, it just made it easier to reproduce.
>>>>>>
>>>>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>>>>>> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
>>>>>> the slab statistics of the original memory cgroup.
>>>>
>>>> I was quite sure that my approach is way to simple, it's better when the mm experts
>>>> work on that.
>>>>
>>>>>>
>>>>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>>>>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
>>>>>> way to fix it. How easy is to reproduce the problem?
>>>>
>>>> I can reproduce this at will and I am happy to test any fixes you provide.
>>>>
>>>>>
>>>>> After a second thought, flush_memcg_workqueue() already contains
>>>>> a rcu_barrier() call, so now first suspicion is that the last free() call
>>>>> occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
>>>>> a case?
>>>>>
>>>>
>>>> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
>>>> that shutdown_memcg_caches() is called which setup the rcu callbacks.
>>>
>>> These are callbacks to destroy kmem_caches, not pages.
>>>
>>>> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above 
>>>> using the RCU path in slub.c you can see when the callbacks are set up and why no warning 
>>>> is printed.
>>>>
>>>>
>>>> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
>>>> an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
>>>> readers should be done now, so the rcu callbacks and the worker are not needed?
>>>> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
>>>>
>>>>> Thanks!
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
>>>>>>
>>>>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
>>>>
>>>> There is no error or warning line in dmesg before this line. Actually, I think that
>>>> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
>>>> delayed in any case when RCU is in use, right?
>>>>
>>>>
>>>> Karsten
>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Roman
>>>>>>
>>>>>>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
>>>>>>> 349.361223¨ Fault in home space mode while using kernel ASCE.
>>>>>>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>>>>>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
>>>>>>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>>>>>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>>>>>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>>>>>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>>>>>> 349.361464¨            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>>>>>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>>>>>> 349.361475¨            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>>>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>>>> 349.361481¨            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>>>> 349.361486¨            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>>>>>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
>>>>>>> 349.361500¨            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
>>>>>>> 349.361500¨           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
>>>>>>> 349.361500¨           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
>>>>>>> 349.361500¨            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
>>>>>>> 349.361500¨            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
>>>>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa
>>>>>>> 349.361500¨            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
>>>>>>> 349.361500¨            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
>>>>>>> 349.361619¨ Call Trace:
>>>>>>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
>>>>>>> 349.361634¨  Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
>>>>>>> 349.361643¨  Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
>>>>>>> 349.361652¨  Ý<0000000000152644>¨ irq_exit+0x104/0x130
>>>>>>> 349.361659¨  Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
>>>>>>> 349.361667¨  Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
>>>>>>> 349.361674¨  Ý<0000000000103648>¨ enabled_wait+0x58/0x128
>>>>>>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
>>>>>>> 349.361688¨  Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
>>>>>>> 349.361695¨  Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
>>>>>>> 349.361704¨  Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
>>>>>>> 349.361748¨  Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
>>>>>>> 349.361756¨  Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
>>>>>>> 349.361761¨  Ý<0000000000000000>¨ 0x0
>>>>>>> 349.361765¨ INFO: lockdep is turned off.
>>>>>>> 349.361769¨ Last Breaking-Event-Address:
>>>>>>> 349.361774¨  Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
>>>>>>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
>>>>>>>
>>>>>>>
>>>>>>> A fix that works for me (RFC):
>>>>>>>
>>>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>>>> index a62372d0f271..b19a3f940338 100644
>>>>>>> --- a/mm/slab.h
>>>>>>> +++ b/mm/slab.h
>>>>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>>>>>
>>>>>>>         rcu_read_lock();
>>>>>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>>>>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>>>>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>>>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>>>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>>>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>>>>>
>>>>>>> -- 
>>>>>>> Karsten
>>>>>>>
>>>>>>> (I'm a dude)
>>>>>>>
>>>>>>>
>>>>
>>
>> -- 
>> Karsten
>>
>> (I'm a dude)
>>

Patch
diff mbox series

diff --git a/mm/slab.h b/mm/slab.h
index a62372d0f271..b19a3f940338 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -328,7 +328,7 @@  static __always_inline void memcg_uncharge_slab(struct page *page, int order,

        rcu_read_lock();
        memcg = READ_ONCE(s->memcg_params.memcg);
-       if (likely(!mem_cgroup_is_root(memcg))) {
+       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
                lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
                mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
                memcg_kmem_uncharge_memcg(page, order, memcg);