
[v2] cgroup: serialize css kill and release paths

Message ID 20220603181321.443716-1-tadeusz.struk@linaro.org (mailing list archive)
State Not Applicable
Series [v2] cgroup: serialize css kill and release paths

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Tadeusz Struk June 3, 2022, 6:13 p.m. UTC
Syzbot found a corrupted-list bug that can be triggered from
cgroup_subtree_control_write(cgrp). The reproducer writes to the
cgroup.subtree_control file, which invokes:
cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
then fails with a fault-injected -ENOMEM.
In this scenario css_killed_work_fn() will be enqueued via
cgroup_apply_control_disable(cgrp)->kill_css(css), and the code will
bail out to cgroup_kn_unlock(). cgroup_kn_unlock() then calls
cgroup_put(cgrp)->css_put(&cgrp->self), which tries to enqueue
css_release_work_fn() for the same css instance, causing the list_add
corruption seen in the syzkaller report [1].

Fix this by serializing the css ref_kill and css_release jobs.
css_release() checks whether css_killed_work_fn() has already been
scheduled for the css and enqueues css_release_work_fn() only if it
hasn't. Otherwise css_release() sets the CSS_REL_LATER flag for that
css, which causes css_release_work_fn() to be executed after
css_killed_work_fn() finishes.

Two css flags have been introduced to implement this serialization mechanism:

 * CSS_KILL_ENQED, set when css_killed_work_fn() is enqueued, and
 * CSS_REL_LATER, which, if set, causes css_release_work_fn() to be
   scheduled after css_killed_work_fn() finishes.

A new per-css lock protects the integrity of these flags.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: <cgroups@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>

Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
v2: Use correct lock in css_killed_work_fn()
---
 include/linux/cgroup-defs.h |  4 ++++
 kernel/cgroup/cgroup.c      | 35 ++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

Comments

Tadeusz Struk June 3, 2022, 6:17 p.m. UTC | #1
On 6/3/22 11:13, Tadeusz Struk wrote:
> Syzbot found a corrupted list bug scenario that can be triggered from
> cgroup_subtree_control_write(cgrp). The reproduces writes to
> cgroup.subtree_control file, which invokes:
> cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
> then fails with a fault injected -ENOMEM.
> In such scenario the css_killed_work_fn will be en-queued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].
> 
> Fix this by synchronizing the css ref_kill and css_release jobs.
> css_release() function will check if the css_killed_work_fn() has been
> scheduled for the css and only en-queue the css_release_work_fn()
> if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will
> set the CSS_REL_LATER flag for that css. This will cause the
> css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
> 
> Two scc flags have been introduced to implement this serialization mechanizm:
> 
>   * CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and
>   * CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be
>     scheduled after the css_killed_work_fn is finished.
> 
> There is also a new lock, which will protect the integrity of the css flags.
> 
> [1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

This also fixes a similar, cgroup-related list corruption issue:
https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd5
Michal Koutný June 6, 2022, 12:39 p.m. UTC | #2
Hello.

On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> In such scenario the css_killed_work_fn will be en-queued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].

This hypothesis doesn't add up to me (I am sorry).

The kill_css(css) would be a css associated with a subsys (css.ss !=
NULL) whereas css_put(&cgrp->self) is a different css just for the
cgroup (css.ss == NULL).

Michal
Tadeusz Struk June 7, 2022, 6:47 p.m. UTC | #3
On 6/6/22 05:39, Michal Koutný wrote:
> On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk<tadeusz.struk@linaro.org>  wrote:
>> In such scenario the css_killed_work_fn will be en-queued via
>> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
>> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
>> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
>> css_release_work_fn for the same css instance, causing a list_add
>> corruption bug, as can be seen in the syzkaller report [1].
> This hypothesis doesn't add up to me (I am sorry).
> 
> The kill_css(css) would be a css associated with a subsys (css.ss !=
> NULL) whereas css_put(&cgrp->self) is a different css just for the
> cgroup (css.ss == NULL).

Yes, you are right. I couldn't figure out where the extra css_put()
was called from, and the only place that fit my theory was
cgroup_kn_unlock() in cgroup_apply_control_disable().
After some more debugging I can see that, as you said, cgrp->self
is a different css. The offending _put() is actually called by
percpu_ref_kill_and_confirm(), which not only calls the passed
confirm_kill percpu_ref_func_t, but also puts the refcount itself.
Because cgroup_apply_control_disable() loops for_each_live_descendant
and calls kill_css() on all css'es, and css_killed_work_fn() also loops
and calls css_put() on all parents, css_release() will be called on the
first parent prematurely, causing the BUG(). What I think should be done
to balance the puts/gets is to call css_get() for all the parents in
kill_css():

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
   */
  static void kill_css(struct cgroup_subsys_state *css)
  {
+       struct cgroup_subsys_state *_css = css;
+
         lockdep_assert_held(&cgroup_mutex);
  
         if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
         css_clear_dir(css);
  
         /*
-        * Killing would put the base ref, but we need to keep it alive
-        * until after ->css_offline().
+        * Killing would put the base ref, but we need to keep it alive,
+        * and all its parents, until after ->css_offline().
          */
-       css_get(css);
+       do {
+               css_get(_css);
+               _css = _css->parent;
+       } while (_css && atomic_read(&_css->online_cnt));
  
         /*
          * cgroup core guarantees that, by the time ->css_offline() is

This will then be "reverted" in css_killed_work_fn().
Please let me know if it makes sense to you.
I'm still testing it, but syzbot is very slow today.
kernel test robot June 9, 2022, 8:56 a.m. UTC | #4
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths")
url: https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132
base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro.org

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363) 
[   55.822745][    C1] Modules linked in: fuse ip_tables
[   55.823837][    C1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.18.0-rc5-00048-g3c87862ca131 #1
[   55.825505][    C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 55.827516][ C1] RIP: 0010:__local_bh_enable_ip (kernel/softirq.c:363) 
[ 55.828671][ C1] Code: 47 65 8b 05 b1 d9 a8 47 a9 00 ff ff 00 74 30 65 ff 0d a3 d9 a8 47 e8 9e c6 39 00 fb 5b 5d c3 65 8b 05 8f e2 a8 47 85 c0 75 b0 <0f> 0b eb ac e8 c6 c7 39 00 eb ad 48 89 ef e8 3c 89 16 00 eb b6 65
All code
========
   0:	47                   	rex.RXB
   1:	65 8b 05 b1 d9 a8 47 	mov    %gs:0x47a8d9b1(%rip),%eax        # 0x47a8d9b9
   8:	a9 00 ff ff 00       	test   $0xffff00,%eax
   d:	74 30                	je     0x3f
   f:	65 ff 0d a3 d9 a8 47 	decl   %gs:0x47a8d9a3(%rip)        # 0x47a8d9b9
  16:	e8 9e c6 39 00       	callq  0x39c6b9
  1b:	fb                   	sti    
  1c:	5b                   	pop    %rbx
  1d:	5d                   	pop    %rbp
  1e:	c3                   	retq   
  1f:	65 8b 05 8f e2 a8 47 	mov    %gs:0x47a8e28f(%rip),%eax        # 0x47a8e2b5
  26:	85 c0                	test   %eax,%eax
  28:	75 b0                	jne    0xffffffffffffffda
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	eb ac                	jmp    0xffffffffffffffda
  2e:	e8 c6 c7 39 00       	callq  0x39c7f9
  33:	eb ad                	jmp    0xffffffffffffffe2
  35:	48 89 ef             	mov    %rbp,%rdi
  38:	e8 3c 89 16 00       	callq  0x168979
  3d:	eb b6                	jmp    0xfffffffffffffff5
  3f:	65                   	gs

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	eb ac                	jmp    0xffffffffffffffb0
   4:	e8 c6 c7 39 00       	callq  0x39c7cf
   9:	eb ad                	jmp    0xffffffffffffffb8
   b:	48 89 ef             	mov    %rbp,%rdi
   e:	e8 3c 89 16 00       	callq  0x16894f
  13:	eb b6                	jmp    0xffffffffffffffcb
  15:	65                   	gs
[   55.832215][    C1] RSP: 0018:ffffc90000178d60 EFLAGS: 00010046
[   55.833390][    C1] RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff7a3ca41
[   55.834990][    C1] RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffb88332da
[   55.836576][    C1] RBP: ffffffffb88332da R08: 0000000000000000 R09: ffff8881c0af245b
[   55.838123][    C1] R10: ffffed103815e48b R11: 0000000000000001 R12: 0000000000000008
[   55.839742][    C1] R13: ffff8881a593a000 R14: dffffc0000000000 R15: ffff8881c0af2418
[   55.841328][    C1] FS:  00007fd6571cc900(0000) GS:ffff88839d500000(0000) knlGS:0000000000000000
[   55.843084][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.844302][    C1] CR2: 00007fd51c623028 CR3: 00000001bfc72000 CR4: 00000000000406e0
[   55.845905][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   55.847510][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   55.849058][    C1] Call Trace:
[   55.849769][    C1]  <IRQ>
[ 55.850460][ C1] put_css_set_locked (include/linux/percpu-refcount.h:335 include/linux/percpu-refcount.h:351 include/linux/cgroup.h:404 kernel/cgroup/cgroup.c:971 kernel/cgroup/cgroup.c:955) 
[ 55.851549][ C1] cgroup_free (include/linux/spinlock.h:404 kernel/cgroup/cgroup-internal.h:212 kernel/cgroup/cgroup-internal.h:198 kernel/cgroup/cgroup.c:6480) 
[ 55.852470][ C1] __put_task_struct (kernel/fork.c:842) 
[ 55.853436][ C1] rcu_do_batch (include/linux/rcupdate.h:273 kernel/rcu/tree.c:2537) 
[ 55.854361][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) 
[ 55.855400][ C1] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2474) 
[ 55.856581][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4493) 
[ 55.857836][ C1] ? note_gp_changes (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) kernel/rcu/tree.c:1698 (discriminator 1)) 
[ 55.858828][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) 
[ 55.859889][ C1] rcu_core (kernel/rcu/tree.c:2788) 
[ 55.860786][ C1] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:559) 
[ 55.861747][ C1] __irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637) 
[ 55.862673][ C1] irq_exit_rcu (kernel/softirq.c:651) 
[ 55.863601][ C1] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 14)) 
[   55.864709][    C1]  </IRQ>
[   55.865399][    C1]  <TASK>
[ 55.866087][ C1] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) 
[ 55.867251][ C1] RIP: 0010:walk_component (fs/namei.c:2027) 
[ 55.868365][ C1] Code: c7 43 08 00 00 00 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 99 03 00 00 48 81 c4 88 00 00 00 4c 89 e0 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 b8 00 00 00 00 00 fc ff df 48 8d 7d
All code
========
   0:	c7 43 08 00 00 00 00 	movl   $0x0,0x8(%rbx)
   7:	48 8b 84 24 80 00 00 	mov    0x80(%rsp),%rax
   e:	00 
   f:	65 48 2b 04 25 28 00 	sub    %gs:0x28,%rax
  16:	00 00 
  18:	0f 85 99 03 00 00    	jne    0x3b7
  1e:	48 81 c4 88 00 00 00 	add    $0x88,%rsp
  25:	4c 89 e0             	mov    %r12,%rax
  28:	5b                   	pop    %rbx
  29:	5d                   	pop    %rbp
  2a:*	41 5c                	pop    %r12		<-- trapping instruction
  2c:	41 5d                	pop    %r13
  2e:	41 5e                	pop    %r14
  30:	41 5f                	pop    %r15
  32:	c3                   	retq   
  33:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  3a:	fc ff df 
  3d:	48                   	rex.W
  3e:	8d                   	.byte 0x8d
  3f:	7d                   	.byte 0x7d

Code starting with the faulting instruction
===========================================
   0:	41 5c                	pop    %r12
   2:	41 5d                	pop    %r13
   4:	41 5e                	pop    %r14
   6:	41 5f                	pop    %r15
   8:	c3                   	retq   
   9:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  10:	fc ff df 
  13:	48                   	rex.W
  14:	8d                   	.byte 0x8d
  15:	7d                   	.byte 0x7d
[   55.871899][    C1] RSP: 0018:ffffc9000001f988 EFLAGS: 00000286
[   55.873081][    C1] RAX: 0000000000000000 RBX: 000000063dbe831e RCX: ffffc9000001f690
[   55.874697][    C1] RDX: 0000000000000000 RSI: ffffffffbc798f20 RDI: ffffc9000001fbe0
[   55.876281][    C1] RBP: 0000000000000006 R08: ffff888100310dd8 R09: ffffffffbd1de4e7
[   55.877803][    C1] R10: ffffed102003531b R11: 0000000000000001 R12: 0000000000000000
[   55.879386][    C1] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[ 55.881044][ C1] link_path_walk+0x563/0xc80 
[ 55.882258][ C1] ? path_init (fs/namei.c:2411) 
[ 55.883192][ C1] ? __raw_spin_lock_init (kernel/locking/spinlock_debug.c:26) 
[ 55.884260][ C1] ? open_last_lookups (fs/namei.c:2265) 
[ 55.885290][ C1] ? debug_mutex_init (kernel/locking/mutex-debug.c:89) 
[ 55.886305][ C1] ? __alloc_file (fs/file_table.c:153) 
[ 55.887298][ C1] path_openat (fs/namei.c:3605) 
[ 55.888247][ C1] ? __lock_acquire (kernel/locking/lockdep.c:5029) 
[ 55.889229][ C1] ? path_lookupat (fs/namei.c:3591) 
[ 55.890215][ C1] do_filp_open (fs/namei.c:3636) 
[ 55.893219][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) 
[ 55.894154][ C1] ? may_open_dev (fs/namei.c:3630) 
[ 55.895168][ C1] ? __lock_release (kernel/locking/lockdep.c:5317) 
[ 55.896219][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) 
[ 55.897250][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) 
[ 55.898122][ C1] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:85 include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186) 
[ 55.899114][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) 
[ 55.900031][ C1] ? getname_flags (fs/namei.c:204) 
[ 55.901118][ C1] do_sys_openat2 (fs/open.c:1213) 
[ 55.902081][ C1] ? rcu_nocb_cb_kthread (kernel/rcu/tree_nocb.h:382) 
[ 55.903138][ C1] ? mntput_no_expire (fs/namespace.c:1208) 
[ 55.904198][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) 
[ 55.905231][ C1] ? build_open_flags (fs/open.c:1199) 
[ 55.906242][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) 
[ 55.907155][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) 
[ 55.908186][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) 
[ 55.909104][ C1] __x64_sys_openat (fs/open.c:1240) 
[ 55.910087][ C1] ? __ia32_compat_sys_open (fs/open.c:1240) 
[ 55.911242][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) 
[ 55.912481][ C1] ? syscall_enter_from_user_mode (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 kernel/entry/common.c:109) 
[ 55.913602][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) 
[ 55.914638][ C1] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[ 55.915572][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) 
[ 55.916880][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) 
[ 55.917809][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) 
[ 55.918770][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) 
[ 55.919737][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) 
[ 55.920694][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) 


To reproduce:

        # build kernel
        cd linux
        cp config-5.18.0-rc5-00048-g3c87862ca131 .config
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
        cd <mod-install-dir>
        find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if you come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Tadeusz Struk June 9, 2022, 2:30 p.m. UTC | #5
On 6/9/22 01:56, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-11):
> 
> commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths")
> url:https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132
> base:https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git  for-next
> patch link:https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro.org
> 
> in testcase: boot
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot<oliver.sang@intel.com>
> 
> 
> [ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363)

Looks like this will need to be spin_lock_irq(&css->lock) instead of
spin_lock_bh(&css->lock). I can respin the patch, but I would like to
request some feedback on it first.

Tejun, Michal,
Are you interested in fixing this syzbot issue at all? Do you have any more feedback on this?
Michal Koutný June 9, 2022, 2:55 p.m. UTC | #6
Hello Tadeusz.

On Thu, Jun 09, 2022 at 07:30:41AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> Are you interested in fixing this at syzbot issue all?

The (original) syzbot report is conditioned on an allocation failure
that's unlikely under normal conditions (AFAIU). Hence I don't treat it
as especially urgent.
OTOH, it's interesting and it points to some disparity worth fixing --
so I'm trying to help (as time permits; so far I can only run the
reproducers via syzbot).

> Do you have any more feedback on this?

Regarding patch v2 with a spinlock per css -- that looks like overkill
to me; I didn't look deeper into it.

Regarding the in-thread patch with ancestry css_get(), the ->parent ref:
  - is inc'd in init_and_link_css(),
  - is dec'd in css_free_rwork_fn()
and thanks to ->online_cnt, offlining is ordered (child before parent).

Where does your patch dec these ancestry references? (Or why would they
be missing in the first place?)

Thanks for digging into this!
Michal

Patch

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..8dc8b4edb242 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -53,6 +53,8 @@  enum {
 	CSS_RELEASED	= (1 << 2), /* refcnt reached zero, released */
 	CSS_VISIBLE	= (1 << 3), /* css is visible to userland */
 	CSS_DYING	= (1 << 4), /* css is dying */
+	CSS_KILL_ENQED	= (1 << 5), /* kill work enqueued for the css */
+	CSS_REL_LATER	= (1 << 6), /* release needs to be done after kill */
 };
 
 /* bits in struct cgroup flags field */
@@ -162,6 +164,8 @@  struct cgroup_subsys_state {
 	 */
 	int id;
 
+	/* lock to protect flags */
+	spinlock_t lock;
 	unsigned int flags;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..b1bbd438d426 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,23 @@  static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	spin_lock_bh(&css->lock);
+
+	/*
+	 * Check if the css_killed_work_fn work has been scheduled for this
+	 * css and enqueue css_release_work_fn only if it wasn't.
+	 * Otherwise set the CSS_REL_LATER flag, which will cause
+	 * release to be enqueued after css_killed_work_fn is finished.
+	 * This is to prevent list corruption by en-queuing two instance
+	 * of the same work struct on the same WQ, namely cgroup_destroy_wq.
+	 */
+	if (!(css->flags & CSS_KILL_ENQED)) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	} else {
+		css->flags |= CSS_REL_LATER;
+	}
+	spin_unlock_bh(&css->lock);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5230,6 +5245,7 @@  static void init_and_link_css(struct cgroup_subsys_state *css,
 	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
+	spin_lock_init(&css->lock);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -5545,10 +5561,12 @@  int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
  */
 static void css_killed_work_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
+	struct cgroup_subsys_state *css_killed, *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
 	mutex_lock(&cgroup_mutex);
+	css_killed = css;
+	css_killed->flags &= ~CSS_KILL_ENQED;
 
 	do {
 		offline_css(css);
@@ -5557,6 +5575,14 @@  static void css_killed_work_fn(struct work_struct *work)
 		css = css->parent;
 	} while (css && atomic_dec_and_test(&css->online_cnt));
 
+	spin_lock_bh(&css_killed->lock);
+	if (css_killed->flags & CSS_REL_LATER) {
+		/* If css_release work was delayed for the css enqueue it now. */
+		INIT_WORK(&css_killed->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css_killed->destroy_work);
+		css_killed->flags &= ~CSS_REL_LATER;
+	}
+	spin_unlock_bh(&css_killed->lock);
 	mutex_unlock(&cgroup_mutex);
 }
 
@@ -5566,10 +5592,13 @@  static void css_killed_ref_fn(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	spin_lock_bh(&css->lock);
 	if (atomic_dec_and_test(&css->online_cnt)) {
+		css->flags |= CSS_KILL_ENQED;
 		INIT_WORK(&css->destroy_work, css_killed_work_fn);
 		queue_work(cgroup_destroy_wq, &css->destroy_work);
 	}
+	spin_unlock_bh(&css->lock);
 }
 
 /**