
[RFC,bpf-next,v2,1/9] cgroup: Make operations on the cgroup root_list RCU safe

Message ID 20231017124546.24608-2-laoar.shao@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series bpf, cgroup: Add BPF support for cgroup1 hierarchy

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10624 this patch: 10624
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 2240 this patch: 2240
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 11746 this patch: 11746
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 74 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Yafang Shao Oct. 17, 2023, 12:45 p.m. UTC
At present, when we perform operations on the cgroup root_list, we must
hold the cgroup_mutex, which is a relatively heavyweight lock. In reality,
we can make operations on this list RCU-safe, eliminating the need to hold
the cgroup_mutex during traversal. Modifications to the list only occur in
the cgroup root setup and destroy paths, which should be infrequent in a
production environment. In contrast, traversal may occur frequently.
Therefore, making it RCU-safe would be beneficial.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/cgroup-defs.h     |  1 +
 kernel/cgroup/cgroup-internal.h |  3 ++-
 kernel/cgroup/cgroup.c          | 17 ++++++++++-------
 3 files changed, 13 insertions(+), 8 deletions(-)
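
To make the intended locking split concrete, here is a minimal sketch of the
pattern the commit message describes (an editorial illustration in kernel-style
C that mirrors the patch at the bottom of this page rather than quoting it):
modifications to cgroup_roots stay serialized by cgroup_mutex but switch to the
_rcu list helpers and kfree_rcu(), while readers may traverse under
rcu_read_lock() alone.

	/* Writer side: root setup/destroy paths, still under cgroup_mutex. */
	lockdep_assert_held(&cgroup_mutex);
	list_add_rcu(&root->root_list, &cgroup_roots);
	/* ... and on teardown ... */
	list_del_rcu(&root->root_list);
	kfree_rcu(root, rcu);	/* defer the free past RCU readers */

	/* Reader side: traversal no longer needs cgroup_mutex. */
	rcu_read_lock();
	for_each_root(root) {
		/* root stays valid for the whole read-side critical
		 * section, even if an umount is in flight. */
	}
	rcu_read_unlock();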

Comments

Michal Koutný Oct. 17, 2023, 1:20 p.m. UTC | #1
On Tue, Oct 17, 2023 at 12:45:38PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> Therefore, making it RCU-safe would be beneficial.

Notice that the whole cgroup_destroy_root() is actually an RCU callback (via
css_free_rwork_fn()). So the list traversal under RCU should already be
OK wrt its stability. Do you see a loophole in this argument?


>  /* iterate across the hierarchies */
>  #define for_each_root(root)						\
> -	list_for_each_entry((root), &cgroup_roots, root_list)
> +	list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
> +				!lockdep_is_held(&cgroup_mutex))

The extra condition should be constant false (regardless of
cgroup_mutex). IOW, RCU read lock is always required.
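
(For reference, the optional cond argument of list_for_each_entry_rcu() feeds a
lockdep check of roughly this shape -- paraphrased from include/linux/rculist.h,
not quoted verbatim:

	RCU_LOCKDEP_WARN(!(cond) && !rcu_read_lock_any_held(),
			 "RCU-list traversed in non-reader section!");

so the negated form posted here suppresses the check exactly when cgroup_mutex
is not held, and fires when the mutex is held without an RCU read lock -- which
is what the kernel test robot reports later in this thread.)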

> @@ -1386,13 +1386,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
>  		}
>  	}
>  
> -	BUG_ON(!res_cgroup);
> +	WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
>  	return res_cgroup;

Hm, this would benefit from a comment why !res_cgroup is conditionally
acceptable.

>  }
>  
>  /*
>   * look up cgroup associated with current task's cgroup namespace on the
> - * specified hierarchy
> + * specified hierarchy. Umount synchronization is ensured via VFS layer,
> + * so we don't have to hold cgroup_mutex to prevent the root from being
> + * destroyed.

I tried something similar via an explicit lockdep invocation (in a thread I
linked to you previously) and the VFS folks weren't enthusiastic about it.

You cannot hide this synchronization assumption in a mere comment.

Thanks,
Michal
Yafang Shao Oct. 18, 2023, 2:51 a.m. UTC | #2
On Tue, Oct 17, 2023 at 9:20 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Oct 17, 2023 at 12:45:38PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> > Therefore, making it RCU-safe would be beneficial.
>
> Notice that the whole cgroup_destroy_root() is actually an RCU callback (via
> css_free_rwork_fn()). So the list traversal under RCU should already be
> OK wrt its stability. Do you see a loophole in this argument?

css_free_rwork_fn() is designed for handling the cgroup's CSS. When
the RCU callback is executed, this specific CSS is not accessible to
other tasks. However, the cgroup root remains accessible to other
tasks. For instance, other tasks can still traverse the cgroup
root_list and access the cgroup root that is currently being
destroyed. Hence, we must take explicit measures to make access to the
cgroup root RCU-safe.

>
>
> >  /* iterate across the hierarchies */
> >  #define for_each_root(root)                                          \
> > -     list_for_each_entry((root), &cgroup_roots, root_list)
> > +     list_for_each_entry_rcu((root), &cgroup_roots, root_list,       \
> > +                             !lockdep_is_held(&cgroup_mutex))
>
> The extra condition should be constant false (regardless of
> cgroup_mutex). IOW, RCU read lock is always required.

IIUC, the RCU read lock is not required when the cgroup_mutex is already
held while traversing the cgroup root_list, such as in the case of
cgroup_attach_task_all.

>
> > @@ -1386,13 +1386,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
> >               }
> >       }
> >
> > -     BUG_ON(!res_cgroup);
> > +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> >       return res_cgroup;
>
> Hm, this would benefit from a comment why !res_cgroup is conditionally
> acceptable.

The cgrp_cset_link is freed before we remove the cgroup root from the
root_list. Consequently, when accessing a cgroup root, the cset_link
may have already been freed, resulting in a NULL res_cgroup. However,
by holding the cgroup_mutex, we ensure that res_cgroup cannot be NULL.

Will add the comments in the next version.
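
A comment of roughly the shape described above might read as follows (a sketch
based on this explanation, not the wording that was eventually posted):

	/*
	 * If cgroup_mutex is not held, the root may be undergoing
	 * destruction: its cgrp_cset_links are torn down before the
	 * root is removed from cgroup_roots, so a lockless lookup can
	 * legitimately come back with res_cgroup == NULL here.
	 */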

>
> >  }
> >
> >  /*
> >   * look up cgroup associated with current task's cgroup namespace on the
> > - * specified hierarchy
> > + * specified hierarchy. Umount synchronization is ensured via VFS layer,
> > + * so we don't have to hold cgroup_mutex to prevent the root from being
> > + * destroyed.
>
> I tried something similar via an explicit lockdep invocation (in a thread I
> linked to you previously) and the VFS folks weren't enthusiastic about it.

Thanks for your information. will take a look at it.

>
> You cannot hide this synchronization assumption in a mere comment.

will think about how to address it.

--
Regards
Yafang
Tejun Heo Oct. 18, 2023, 9:35 a.m. UTC | #3
On Tue, Oct 17, 2023 at 12:45:38PM +0000, Yafang Shao wrote:
>  #define for_each_root(root)						\
> -	list_for_each_entry((root), &cgroup_roots, root_list)
> +	list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
> +				!lockdep_is_held(&cgroup_mutex))

Shouldn't that be lockdep_is_held() without the leading negation?

> @@ -1386,13 +1386,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
>  		}
>  	}
>  
> -	BUG_ON(!res_cgroup);
> +	WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));

This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.

>  	return res_cgroup;
>  }
>  
>  /*
>   * look up cgroup associated with current task's cgroup namespace on the
> - * specified hierarchy
> + * specified hierarchy. Umount synchronization is ensured via VFS layer,
> + * so we don't have to hold cgroup_mutex to prevent the root from being
> + * destroyed.
>   */

Yeah, as Michal said, let's not do it this way.

Thanks.
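
To spell out why the check does not work: when PROVE_LOCKING is disabled,
lockdep_is_held() evaluates to a constant true (assuming the usual
!CONFIG_LOCKDEP fallback), so on production kernels the proposed line
degenerates as sketched below and would warn on every legitimate lockless
lookup miss.

	WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
	/* ... with lockdep disabled this is effectively ... */
	WARN_ON_ONCE(!res_cgroup);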
Yafang Shao Oct. 19, 2023, 6:38 a.m. UTC | #4
On Wed, Oct 18, 2023 at 5:35 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Tue, Oct 17, 2023 at 12:45:38PM +0000, Yafang Shao wrote:
> >  #define for_each_root(root)                                          \
> > -     list_for_each_entry((root), &cgroup_roots, root_list)
> > +     list_for_each_entry_rcu((root), &cgroup_roots, root_list,       \
> > +                             !lockdep_is_held(&cgroup_mutex))
>
> Shouldn't that be lockdep_is_held() without the leading negation?

right. will fix it.

>
> > @@ -1386,13 +1386,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
> >               }
> >       }
> >
> > -     BUG_ON(!res_cgroup);
> > +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
>
> This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.

will use mutex_is_locked() instead.

>
> >       return res_cgroup;
> >  }
> >
> >  /*
> >   * look up cgroup associated with current task's cgroup namespace on the
> > - * specified hierarchy
> > + * specified hierarchy. Umount synchronization is ensured via VFS layer,
> > + * so we don't have to hold cgroup_mutex to prevent the root from being
> > + * destroyed.
> >   */
>
> Yeah, as Michal said, let's not do it this way.

sure, will do it.
Tejun Heo Oct. 19, 2023, 7:08 p.m. UTC | #5
On Thu, Oct 19, 2023 at 02:38:52PM +0800, Yafang Shao wrote:
> > > -     BUG_ON(!res_cgroup);
> > > +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> >
> > This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.
> 
> will use mutex_is_locked() instead.

But then, someone else can hold the lock and trigger the condition
spuriously. The kernel doesn't track who's holding the lock unless lockdep
is enabled.

Thanks.
Waiman Long Oct. 19, 2023, 7:43 p.m. UTC | #6
On 10/19/23 15:08, Tejun Heo wrote:
> On Thu, Oct 19, 2023 at 02:38:52PM +0800, Yafang Shao wrote:
>>>> -     BUG_ON(!res_cgroup);
>>>> +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
>>> This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.
>> will use mutex_is_locked() instead.
> But then, someone else can hold the lock and trigger the condition
> spuriously. The kernel doesn't track who's holding the lock unless lockdep
> is enabled.

It is actually possible to detect if the current process is the owner of 
a mutex since there is an owner field in the mutex structure. However, 
the owner field also contains additional information which needs to be 
masked off before comparing with "current". If such functionality is 
really needed, we will have to add a helper function mutex_is_held(), 
for example, to kernel/locking/mutex.c.

Cheers,
Longman
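
A hypothetical helper along the lines Waiman describes might look like the
sketch below. It is not part of this series or of mainline, and the 0x07 flag
mask is an assumption about how kernel/locking/mutex.c encodes flag bits in
the owner word.

	/* Hypothetical: true only if current is the owner of the mutex. */
	static inline bool mutex_is_held(struct mutex *lock)
	{
		/* The low bits of ->owner carry flag bits (waiters,
		 * handoff, pickup); mask them off before comparing the
		 * owning task pointer with current. */
		return (atomic_long_read(&lock->owner) & ~0x07UL) ==
		       (unsigned long)current;
	}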
Yafang Shao Oct. 20, 2023, 9:36 a.m. UTC | #7
On Fri, Oct 20, 2023 at 3:43 AM Waiman Long <longman@redhat.com> wrote:
>
> On 10/19/23 15:08, Tejun Heo wrote:
> > On Thu, Oct 19, 2023 at 02:38:52PM +0800, Yafang Shao wrote:
> >>>> -     BUG_ON(!res_cgroup);
> >>>> +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> >>> This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.
> >> will use mutex_is_locked() instead.
> > But then, someone else can hold the lock and trigger the condition
> > spuriously. The kernel doesn't track who's holding the lock unless lockdep
> > is enabled.
>
> It is actually possible to detect if the current process is the owner of
> a mutex since there is an owner field in the mutex structure. However,
> the owner field also contains additional information which needs to be
> masked off before comparing with "current". If such functionality is
> really needed, we will have to add a helper function mutex_is_held(),
> for example, to kernel/locking/mutex.c.
Agreed. We should first introduce mutex_is_held(). Thanks for your suggestion.
Tejun Heo Oct. 20, 2023, 5:51 p.m. UTC | #8
On Fri, Oct 20, 2023 at 05:36:57PM +0800, Yafang Shao wrote:
> On Fri, Oct 20, 2023 at 3:43 AM Waiman Long <longman@redhat.com> wrote:
> >
> > On 10/19/23 15:08, Tejun Heo wrote:
> > > On Thu, Oct 19, 2023 at 02:38:52PM +0800, Yafang Shao wrote:
> > >>>> -     BUG_ON(!res_cgroup);
> > >>>> +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> > >>> This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.
> > >> will use mutex_is_locked() instead.
> > > But then, someone else can hold the lock and trigger the condition
> > > spuriously. The kernel doesn't track who's holding the lock unless lockdep
> > > is enabled.
> >
> > It is actually possible to detect if the current process is the owner of
> > a mutex since there is an owner field in the mutex structure. However,
> > the owner field also contains additional information which needs to be
> > masked off before comparing with "current". If such functionality is
> > really needed, we will have to add a helper function mutex_is_held(),
> > for example, to kernel/locking/mutex.c.
> Agreed. We should first introduce mutex_is_held(). Thanks for your suggestion.

I'm not sure this is the right occasion to add such a thing. It's just a
warn_on; we can either pass in the necessary condition from the callers or
just drop the warning.

Thanks.
Yafang Shao Oct. 22, 2023, 9:32 a.m. UTC | #9
On Sat, Oct 21, 2023 at 1:51 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Oct 20, 2023 at 05:36:57PM +0800, Yafang Shao wrote:
> > On Fri, Oct 20, 2023 at 3:43 AM Waiman Long <longman@redhat.com> wrote:
> > >
> > > On 10/19/23 15:08, Tejun Heo wrote:
> > > > On Thu, Oct 19, 2023 at 02:38:52PM +0800, Yafang Shao wrote:
> > > >>>> -     BUG_ON(!res_cgroup);
> > > >>>> +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> > > >>> This doesn't work. lockdep_is_held() is always true if !PROVE_LOCKING.
> > > >> will use mutex_is_locked() instead.
> > > > But then, someone else can hold the lock and trigger the condition
> > > > spuriously. The kernel doesn't track who's holding the lock unless lockdep
> > > > is enabled.
> > >
> > > It is actually possible to detect if the current process is the owner of
> > > a mutex since there is an owner field in the mutex structure. However,
> > > the owner field also contains additional information which needs to be
> > > masked off before comparing with "current". If such functionality is
> > > really needed, we will have to add a helper function mutex_is_held(),
> > > for example, to kernel/locking/mutex.c.
> > Agreed. We should first introduce mutex_is_held(). Thanks for your suggestion.
>
> I'm not sure this is the right occasion to add such a thing. It's just a
> warn_on; we can either pass in the necessary condition from the callers or
> just drop the warning.

OK. will just drop the warning.
kernel test robot Oct. 25, 2023, 8:06 a.m. UTC | #10
Hello,

kernel test robot noticed "WARNING:suspicious_RCU_usage" on:

commit: bf652e038501425f21e30c78c210ce892b9747c5 ("[RFC PATCH bpf-next v2 1/9] cgroup: Make operations on the cgroup root_list RCU safe")
url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/cgroup-Make-operations-on-the-cgroup-root_list-RCU-safe/20231017-204729
base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/all/20231017124546.24608-2-laoar.shao@gmail.com/
patch subject: [RFC PATCH bpf-next v2 1/9] cgroup: Make operations on the cgroup root_list RCU safe

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+------------------------------------------------------------------+------------+------------+
|                                                                  | 137df1189d | bf652e0385 |
+------------------------------------------------------------------+------------+------------+
| WARNING:suspicious_RCU_usage                                     | 0          | 6          |
| kernel/cgroup/cgroup#c:#RCU-list_traversed_in_non-reader_section | 0          | 6          |
+------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310251500.9683d034-oliver.sang@intel.com


[  245.792697][    T1] WARNING: suspicious RCU usage
[  245.793945][    T1] 6.6.0-rc5-01395-gbf652e038501 #1 Not tainted
[  245.795294][    T1] -----------------------------
[  245.796458][    T1] kernel/cgroup/cgroup-v1.c:1171 RCU-list traversed in non-reader section!!
[  245.798005][    T1]
[  245.798005][    T1] other info that might help us debug this:
[  245.798005][    T1]
[  245.800905][    T1]
[  245.800905][    T1] rcu_scheduler_active = 2, debug_locks = 1
[  245.815563][    T1] 1 lock held by init/1:
[ 245.816666][ T1] #0: c3bfffb8 (cgroup_mutex){+.+.}-{3:3}, at: cgroup_lock_and_drain_offline (kernel/cgroup/cgroup.c:3066) 
[  245.818306][    T1]
[  245.818306][    T1] stack backtrace:
[  245.820360][    T1] CPU: 0 PID: 1 Comm: init Not tainted 6.6.0-rc5-01395-gbf652e038501 #1
[  245.821860][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  245.823506][    T1] Call Trace:
[ 245.824459][ T1] dump_stack_lvl (lib/dump_stack.c:107) 
[ 245.825654][ T1] dump_stack (lib/dump_stack.c:114) 
[ 245.826665][ T1] lockdep_rcu_suspicious (include/linux/context_tracking.h:153 kernel/locking/lockdep.c:6712) 
[ 245.827845][ T1] cgroup1_root_to_use (kernel/cgroup/cgroup-v1.c:1171 (discriminator 9)) 
[ 245.828999][ T1] cgroup1_get_tree (kernel/cgroup/cgroup-v1.c:1245) 
[ 245.830101][ T1] vfs_get_tree (fs/super.c:1750) 
[ 245.831172][ T1] do_new_mount (fs/namespace.c:3335) 
[ 245.832199][ T1] path_mount (fs/namespace.c:3662) 
[ 245.833147][ T1] ? user_path_at_empty (fs/namei.c:2914) 
[ 245.834198][ T1] __ia32_sys_mount (fs/namespace.c:3675 fs/namespace.c:3884 fs/namespace.c:3861 fs/namespace.c:3861) 
[ 245.835242][ T1] do_int80_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:132) 
[ 245.836343][ T1] ? kfree (mm/slab_common.c:1073) 
[ 245.837335][ T1] ? __ia32_sys_mount (fs/namespace.c:3861) 
[ 245.838404][ T1] ? do_int80_syscall_32 (arch/x86/entry/common.c:136) 
[ 245.839479][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:131 kernel/entry/common.c:298) 
[ 245.840593][ T1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) 
[ 245.841824][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:299) 
[ 245.842967][ T1] ? do_int80_syscall_32 (arch/x86/entry/common.c:136) 
[ 245.843964][ T1] ? __ia32_sys_mount (fs/namespace.c:3861) 
[ 245.844935][ T1] ? do_int80_syscall_32 (arch/x86/entry/common.c:136) 
[ 245.845958][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:131 kernel/entry/common.c:298) 
[ 245.847004][ T1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) 
[ 245.848103][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:299) 
[ 245.849159][ T1] ? do_int80_syscall_32 (arch/x86/entry/common.c:136) 
[ 245.850163][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:299) 
[ 245.851209][ T1] ? do_int80_syscall_32 (arch/x86/entry/common.c:136) 
[ 245.852179][ T1] ? lock_release (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5775) 
[ 245.853126][ T1] ? irqentry_exit (kernel/entry/common.c:445) 
[ 245.858431][ T1] ? irqentry_exit_to_user_mode (kernel/entry/common.c:131 kernel/entry/common.c:311) 
[ 245.859731][ T1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) 
[ 245.860985][ T1] ? irqentry_exit_to_user_mode (kernel/entry/common.c:312) 
[ 245.862110][ T1] ? irqentry_exit (kernel/entry/common.c:445) 
[ 245.863099][ T1] ? exc_page_fault (arch/x86/mm/fault.c:1565) 
[ 245.864073][ T1] entry_INT80_32 (arch/x86/entry/entry_32.S:947) 
[  245.865070][    T1] EIP: 0xb7f895ed
[ 245.865966][ T1] Code: 8b 7c 24 0c 50 e8 06 00 00 00 89 da 5b 5b 5f c3 8b 04 24 05 77 ec 04 00 8b 00 85 c0 74 06 50 8b 44 24 08 c3 8b 44 24 04 cd 80 <c3> 55 50 8b 6c 24 0c 8b 45 00 8b 6d 04 50 8b 44 24 04 e8 b9 ff ff
All code
========
   0:	8b 7c 24 0c          	mov    0xc(%rsp),%edi
   4:	50                   	push   %rax
   5:	e8 06 00 00 00       	call   0x10
   a:	89 da                	mov    %ebx,%edx
   c:	5b                   	pop    %rbx
   d:	5b                   	pop    %rbx
   e:	5f                   	pop    %rdi
   f:	c3                   	ret
  10:	8b 04 24             	mov    (%rsp),%eax
  13:	05 77 ec 04 00       	add    $0x4ec77,%eax
  18:	8b 00                	mov    (%rax),%eax
  1a:	85 c0                	test   %eax,%eax
  1c:	74 06                	je     0x24
  1e:	50                   	push   %rax
  1f:	8b 44 24 08          	mov    0x8(%rsp),%eax
  23:	c3                   	ret
  24:	8b 44 24 04          	mov    0x4(%rsp),%eax
  28:	cd 80                	int    $0x80
  2a:*	c3                   	ret		<-- trapping instruction
  2b:	55                   	push   %rbp
  2c:	50                   	push   %rax
  2d:	8b 6c 24 0c          	mov    0xc(%rsp),%ebp
  31:	8b 45 00             	mov    0x0(%rbp),%eax
  34:	8b 6d 04             	mov    0x4(%rbp),%ebp
  37:	50                   	push   %rax
  38:	8b 44 24 04          	mov    0x4(%rsp),%eax
  3c:	e8                   	.byte 0xe8
  3d:	b9                   	.byte 0xb9
  3e:	ff                   	(bad)
  3f:	ff                   	.byte 0xff

Code starting with the faulting instruction
===========================================
   0:	c3                   	ret
   1:	55                   	push   %rbp
   2:	50                   	push   %rax
   3:	8b 6c 24 0c          	mov    0xc(%rsp),%ebp
   7:	8b 45 00             	mov    0x0(%rbp),%eax
   a:	8b 6d 04             	mov    0x4(%rbp),%ebp
   d:	50                   	push   %rax
   e:	8b 44 24 04          	mov    0x4(%rsp),%eax
  12:	e8                   	.byte 0xe8
  13:	b9                   	.byte 0xb9
  14:	ff                   	(bad)
  15:	ff                   	.byte 0xff
[  245.868936][    T1] EAX: ffffffda EBX: 0804a431 ECX: 0804a429 EDX: 0804a431
[  245.870279][    T1] ESI: 0000000e EDI: 00000000 EBP: bffa7bd8 ESP: bffa7bb8
[  245.871623][    T1] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[  245.935209][    T1] init: Console is alive



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231025/202310251500.9683d034-oliver.sang@intel.com
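
This warning is the runtime counterpart of the cond discussion earlier in the
thread: cgroup1_root_to_use() walks for_each_root() while holding cgroup_mutex
but outside an RCU read-side section, and the negated condition in the patch
does not cover that case. A sketch of the iteration with the negation dropped,
as Tejun suggested (illustrative, not the posted patch):

	/* iterate across the hierarchies */
	#define for_each_root(root)						\
		list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
					lockdep_is_held(&cgroup_mutex))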

Patch

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f1b3151ac30b..8505eeae6e41 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -558,6 +558,7 @@  struct cgroup_root {
 
 	/* A list running through the active hierarchies */
 	struct list_head root_list;
+	struct rcu_head rcu;
 
 	/* Hierarchy-specific flags */
 	unsigned int flags;
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c56071f150f2..321af20ea15f 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -170,7 +170,8 @@  extern struct list_head cgroup_roots;
 
 /* iterate across the hierarchies */
 #define for_each_root(root)						\
-	list_for_each_entry((root), &cgroup_roots, root_list)
+	list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
+				!lockdep_is_held(&cgroup_mutex))
 
 /**
  * for_each_subsys - iterate all enabled cgroup subsystems
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1fb7f562289d..bae8f9f27792 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1313,7 +1313,7 @@  static void cgroup_exit_root_id(struct cgroup_root *root)
 
 void cgroup_free_root(struct cgroup_root *root)
 {
-	kfree(root);
+	kfree_rcu(root, rcu);
 }
 
 static void cgroup_destroy_root(struct cgroup_root *root)
@@ -1346,7 +1346,7 @@  static void cgroup_destroy_root(struct cgroup_root *root)
 	spin_unlock_irq(&css_set_lock);
 
 	if (!list_empty(&root->root_list)) {
-		list_del(&root->root_list);
+		list_del_rcu(&root->root_list);
 		cgroup_root_count--;
 	}
 
@@ -1386,13 +1386,15 @@  static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 		}
 	}
 
-	BUG_ON(!res_cgroup);
+	WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
 	return res_cgroup;
 }
 
 /*
  * look up cgroup associated with current task's cgroup namespace on the
- * specified hierarchy
+ * specified hierarchy. Umount synchronization is ensured via VFS layer,
+ * so we don't have to hold cgroup_mutex to prevent the root from being
+ * destroyed.
  */
 static struct cgroup *
 current_cgns_cgroup_from_root(struct cgroup_root *root)
@@ -1445,7 +1447,6 @@  static struct cgroup *current_cgns_cgroup_dfl(void)
 static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 					    struct cgroup_root *root)
 {
-	lockdep_assert_held(&cgroup_mutex);
 	lockdep_assert_held(&css_set_lock);
 
 	return __cset_cgroup_from_root(cset, root);
@@ -1453,7 +1454,9 @@  static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 
 /*
  * Return the cgroup for "task" from the given hierarchy. Must be
- * called with cgroup_mutex and css_set_lock held.
+ * called with css_set_lock held to prevent task's groups from being modified.
+ * Must be called with either cgroup_mutex or rcu read lock to prevent the
+ * cgroup root from being destroyed.
  */
 struct cgroup *task_cgroup_from_root(struct task_struct *task,
 				     struct cgroup_root *root)
@@ -2097,7 +2100,7 @@  int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	 * care of subsystems' refcounts, which are explicitly dropped in
 	 * the failure exit path.
 	 */
-	list_add(&root->root_list, &cgroup_roots);
+	list_add_rcu(&root->root_list, &cgroup_roots);
 	cgroup_root_count++;
 
 	/*