[3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()

Message ID 20200210204651.21674-4-longman@redhat.com (mailing list archive)
State New, archived
Series locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems

Commit Message

Waiman Long Feb. 10, 2020, 8:46 p.m. UTC
The following lockdep splat was seen:

    [  176.241923] ======================================================
    [  176.241924] WARNING: possible circular locking dependency detected
    [  176.241926] 4.18.0-172.rt13.29.el8.x86_64+debug #1 Not tainted
    [  176.241927] ------------------------------------------------------
    [  176.241929] slub_cpu_partia/5371 is trying to acquire lock:
    [  176.241930] ffffffffa0b83718 (slab_mutex){+.+.}, at: slab_attr_store+0x6b/0xe0
    [  176.241941] but task is already holding lock:
    [  176.241942] ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
    [  176.241947] which lock already depends on the new lock.

    [  176.241949] the existing dependency chain (in reverse order) is:
    [  176.241949] -> #1 (kn->count#103){++++}:
    [  176.241955]        __kernfs_remove+0x616/0x800
    [  176.241957]        kernfs_remove_by_name_ns+0x3e/0x80
    [  176.241959]        sysfs_slab_add+0x1c6/0x330
    [  176.241961]        __kmem_cache_create+0x15f/0x1b0
    [  176.241964]        create_cache+0xe1/0x220
    [  176.241966]        kmem_cache_create_usercopy+0x1a3/0x260
    [  176.241967]        kmem_cache_create+0x12/0x20
    [  176.242076]        mlx5_init_fs+0x18d/0x1a00 [mlx5_core]
    [  176.242100]        mlx5_load_one+0x3b4/0x1730 [mlx5_core]
    [  176.242124]        init_one+0x901/0x11b0 [mlx5_core]
    [  176.242127]        local_pci_probe+0xd4/0x180
    [  176.242131]        work_for_cpu_fn+0x51/0xa0
    [  176.242133]        process_one_work+0x91a/0x1ac0
    [  176.242134]        worker_thread+0x536/0xb40
    [  176.242136]        kthread+0x30c/0x3d0
    [  176.242140]        ret_from_fork+0x27/0x50
    [  176.242140] -> #0 (slab_mutex){+.+.}:
    [  176.242145]        __lock_acquire+0x22cb/0x48c0
    [  176.242146]        lock_acquire+0x134/0x4c0
    [  176.242148]        _mutex_lock+0x28/0x40
    [  176.242150]        slab_attr_store+0x6b/0xe0
    [  176.242151]        kernfs_fop_write+0x251/0x400
    [  176.242154]        vfs_write+0x157/0x460
    [  176.242155]        ksys_write+0xb8/0x170
    [  176.242158]        do_syscall_64+0x13c/0x710
    [  176.242160]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
    [  176.242161]
                   other info that might help us debug this:

    [  176.242161]  Possible unsafe locking scenario:

    [  176.242162]        CPU0                    CPU1
    [  176.242163]        ----                    ----
    [  176.242163]   lock(kn->count#103);
    [  176.242165]                                lock(slab_mutex);
    [  176.242166]                                lock(kn->count#103);
    [  176.242167]   lock(slab_mutex);
    [  176.242169]
                    *** DEADLOCK ***

    [  176.242170] 3 locks held by slub_cpu_partia/5371:
    [  176.242170]  #0: ffff888705e3a800 (sb_writers#4){.+.+}, at: vfs_write+0x31c/0x460
    [  176.242174]  #1: ffff889aeec4d658 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1a9/0x400
    [  176.242177]  #2: ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
    [  176.242180]
                   stack backtrace:
    [  176.242183] CPU: 36 PID: 5371 Comm: slub_cpu_partia Not tainted 4.18.0-172.rt13.29.el8.x86_64+debug #1
    [  176.242184] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
    [  176.242185] Call Trace:
    [  176.242190]  dump_stack+0x9a/0xf0
    [  176.242193]  check_noncircular+0x317/0x3c0
    [  176.242195]  ? print_circular_bug+0x1e0/0x1e0
    [  176.242199]  ? native_sched_clock+0x32/0x1e0
    [  176.242202]  ? sched_clock+0x5/0x10
    [  176.242205]  ? sched_clock_cpu+0x238/0x340
    [  176.242208]  __lock_acquire+0x22cb/0x48c0
    [  176.242213]  ? trace_hardirqs_on+0x10/0x10
    [  176.242215]  ? trace_hardirqs_on+0x10/0x10
    [  176.242218]  lock_acquire+0x134/0x4c0
    [  176.242220]  ? slab_attr_store+0x6b/0xe0
    [  176.242223]  _mutex_lock+0x28/0x40
    [  176.242225]  ? slab_attr_store+0x6b/0xe0
    [  176.242227]  slab_attr_store+0x6b/0xe0
    [  176.242229]  ? sysfs_file_ops+0x160/0x160
    [  176.242230]  kernfs_fop_write+0x251/0x400
    [  176.242232]  ? __sb_start_write+0x26a/0x3f0
    [  176.242234]  vfs_write+0x157/0x460
    [  176.242237]  ksys_write+0xb8/0x170
    [  176.242239]  ? __ia32_sys_read+0xb0/0xb0
    [  176.242242]  ? do_syscall_64+0xb9/0x710
    [  176.242245]  do_syscall_64+0x13c/0x710
    [  176.242247]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

In order to fix this circular lock dependency problem, we need to do a
mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
trylock, however, can easily fail, causing user dissatisfaction. So the
new mutex_timed_lock() function is now used to do a trylock with a
100ms timeout.
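
To make the inversion concrete, here is a user-space analogue of the
two paths (illustration only: the lock names are stand-ins, and in the
kernel kn->count is a kernfs active-reference count, not a plain mutex):

/* Thread A mimics the sysfs write path: kn->count, then slab_mutex.
 * Thread B mimics cache creation: slab_mutex, then kn->count.
 * Build with: cc -pthread abba.c
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t kn_count = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t slab_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *thread_a(void *arg)	/* kernfs_fop_write -> slab_attr_store */
{
	(void)arg;
	pthread_mutex_lock(&kn_count);
	usleep(1000);				/* widen the race window */
	pthread_mutex_lock(&slab_mutex);	/* blocks while B holds it */
	pthread_mutex_unlock(&slab_mutex);
	pthread_mutex_unlock(&kn_count);
	return NULL;
}

static void *thread_b(void *arg)	/* kmem_cache_create -> sysfs_slab_add */
{
	(void)arg;
	pthread_mutex_lock(&slab_mutex);
	usleep(1000);
	pthread_mutex_lock(&kn_count);		/* blocks while A holds it: ABBA */
	pthread_mutex_unlock(&kn_count);
	pthread_mutex_unlock(&slab_mutex);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_join(a, NULL);	/* never returns once both take their first lock */
	pthread_join(b, NULL);
	return 0;
}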

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slub.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
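
mutex_timed_lock() itself is added earlier in this series and is not
shown on this page. The sketch below only emulates the calling
convention this patch relies on (a negative error on timeout, as the
caller's "< 0" check implies, and 0 on success) on top of
mutex_trylock(); it is not the real locking/core implementation:

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/mutex.h>

/* Emulation of the calling convention only -- NOT the series' code. */
static int mutex_timed_lock_emulated(struct mutex *lock, unsigned int ms)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(ms);

	do {
		if (mutex_trylock(lock))
			return 0;		/* acquired */
		usleep_range(50, 100);		/* brief backoff; may sleep */
	} while (time_before(jiffies, timeout));

	return -ETIMEDOUT;
}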

Comments

Andrew Morton Feb. 10, 2020, 10:03 p.m. UTC | #1
On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:

> In order to fix this circular lock dependency problem, we need to do a
> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
> trylock, however, can easily fail, causing user dissatisfaction. So the
> new mutex_timed_lock() function is now used to do a trylock with a
> 100ms timeout.
> 
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>  		struct kmem_cache *c;
>  
> -		mutex_lock(&slab_mutex);
> +		/*
> +		 * Timeout after 100ms
> +		 */
> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> +			return -EBUSY;
> +

Oh dear.  Surely there's a better fix here.  Does slab really need to
hold slab_mutex while creating that sysfs file?  Why?

If the issue is two threads trying to create the same sysfs file
(unlikely, given that both will need to have created the same cache)
then can we add a new mutex specifically for this purpose?

Or something else.
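
A sketch of that dedicated-mutex idea (entirely hypothetical: the
helper and lock names are not from this thread, and it only helps if
the cache-creation path can drop slab_mutex before publishing the
sysfs files):

/* Hypothetical: serialize only the sysfs add/remove of slab caches so
 * that kernfs removal no longer nests inside slab_mutex. */
static DEFINE_MUTEX(slab_sysfs_mutex);

static int kmem_cache_publish(struct kmem_cache *s)	/* hypothetical helper */
{
	int err;

	/* called after the caller has released slab_mutex */
	mutex_lock(&slab_sysfs_mutex);
	err = sysfs_slab_add(s);
	mutex_unlock(&slab_sysfs_mutex);
	return err;
}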
Waiman Long Feb. 10, 2020, 10:14 p.m. UTC | #2
On 2/10/20 5:03 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:
>
>> In order to fix this circular lock dependency problem, we need to do a
>> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
>> trylock, however, can easily fail, causing user dissatisfaction. So the
>> new mutex_timed_lock() function is now used to do a trylock with a
>> 100ms timeout.
>>
>> ...
>>
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>  		struct kmem_cache *c;
>>  
>> -		mutex_lock(&slab_mutex);
>> +		/*
>> +		 * Timeout after 100ms
>> +		 */
>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>> +			return -EBUSY;
>> +
> Oh dear.  Surely there's a better fix here.  Does slab really need to
> hold slab_mutex while creating that sysfs file?  Why?
>
> If the issue is two threads trying to create the same sysfs file
> (unlikely, given that both will need to have created the same cache)
> then can we add a new mutex specifically for this purpose?
>
> Or something else.
>
Well, the current code iterates all the memory cgroups to set the same
value in all of them. I believe the reason for holding the slab mutex is
to make sure that memcg hierarchy is stable during this iteration
process. Of course, we can argue whether the attribute value should be
duplicated in all memcg's.

Cheers,
Longman
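
For reference, the iteration referred to above is the propagation loop
in slab_attr_store(), quoted in full in the kernel test robot excerpts
below:

	/* slab_mutex keeps the memcg cache list stable while the parent
	 * cache's attribute value is propagated to every child cache. */
	mutex_lock(&slab_mutex);
	for_each_memcg_cache(c, s)
		attribute->store(c, buf, len);
	mutex_unlock(&slab_mutex);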
Andrew Morton Feb. 10, 2020, 11:10 p.m. UTC | #3
On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:

> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
> >>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> >>  		struct kmem_cache *c;
> >>  
> >> -		mutex_lock(&slab_mutex);
> >> +		/*
> >> +		 * Timeout after 100ms
> >> +		 */
> >> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> >> +			return -EBUSY;
> >> +
> > Oh dear.  Surely there's a better fix here.  Does slab really need to
> > hold slab_mutex while creating that sysfs file?  Why?
> >
> > If the issue is two threads trying to create the same sysfs file
> > (unlikely, given that both will need to have created the same cache)
> > then can we add a new mutex specifically for this purpose?
> >
> > Or something else.
> >
> Well, the current code iterates all the memory cgroups to set the same
> value in all of them. I believe the reason for holding the slab mutex is
> to make sure that memcg hierarchy is stable during this iteration
> process.

But that is unrelated to creation of the sysfs file?
Waiman Long Feb. 11, 2020, 11:30 p.m. UTC | #4
On 2/10/20 6:10 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>  		struct kmem_cache *c;
>>>>  
>>>> -		mutex_lock(&slab_mutex);
>>>> +		/*
>>>> +		 * Timeout after 100ms
>>>> +		 */
>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>> +			return -EBUSY;
>>>> +
>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>> hold slab_mutex while creating that sysfs file?  Why?
>>>
>>> If the issue is two threads trying to create the same sysfs file
>>> (unlikely, given that both will need to have created the same cache)
>>> then can we add a new mutex specifically for this purpose?
>>>
>>> Or something else.
>>>
>> Well, the current code iterates all the memory cgroups to set the same
>> value in all of them. I believe the reason for holding the slab mutex is
>> to make sure that memcg hierarchy is stable during this iteration
>> process.
> But that is unrelated to creation of the sysfs file?
>
OK, I will take a closer look at that.

Cheers,
Longman
Waiman Long Feb. 12, 2020, 8:40 p.m. UTC | #5
On 2/11/20 6:30 PM, Waiman Long wrote:
> On 2/10/20 6:10 PM, Andrew Morton wrote:
>> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>>>> --- a/mm/slub.c
>>>>> +++ b/mm/slub.c
>>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>>  		struct kmem_cache *c;
>>>>>  
>>>>> -		mutex_lock(&slab_mutex);
>>>>> +		/*
>>>>> +		 * Timeout after 100ms
>>>>> +		 */
>>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>>> +			return -EBUSY;
>>>>> +
>>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>>> hold slab_mutex while creating that sysfs file?  Why?
>>>>
>>>> If the issue is two threads trying to create the same sysfs file
>>>> (unlikely, given that both will need to have created the same cache)
>>>> then can we add a new mutex specifically for this purpose?
>>>>
>>>> Or something else.
>>>>
>>> Well, the current code iterates all the memory cgroups to set the same
>>> value in all of them. I believe the reason for holding the slab mutex is
>>> to make sure that memcg hierarchy is stable during this iteration
>>> process.
>> But that is unrelated to creation of the sysfs file?
>>
> OK, I will take a closer look at that.

During the creation of a sysfs file:

static int sysfs_slab_add(struct kmem_cache *s)
{
  :
        if (unmergeable) {
                /*
                 * Slabcache can never be merged so we can use the name proper.
                 * This is typically the case for debug situations. In that
                 * case we can catch duplicate names easily.
                 */
                sysfs_remove_link(&slab_kset->kobj, s->name);
                name = s->name;

The code is trying to remove the sysfs files of a cache with a
conflicting name, so it seems that kmem_cache_create() was called with
a name that had been used before. If a write to one of the sysfs files
being removed happens at the same time, a deadlock can occur.

In this particular case, the kmem_cache_create() call comes from the
mlx5_core module.

        steering->fgs_cache = kmem_cache_create("mlx5_fs_fgs",
                                                sizeof(struct mlx5_flow_group),
                                                0, 0, NULL);

Perhaps the module was somehow unloaded and then loaded again.
Unfortunately, this lockdep error has been seen only once. It is hard
to figure out how to fix it without an easy way to reproduce it.

So I will table this for now until there is a way to reproduce it.

Thanks,
Longman
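
If someone later wants to chase a reproducer, one half of it might look
like the sketch below: hammer the attribute named in the splat while
mlx5_core is unloaded and reloaded in another shell. The sysfs path and
value are examples only, and the window is tiny, which is consistent
with the splat having been seen just once.

/* Writes /sys/kernel/slab/mlx5_fs_gs/cpu_partial in a tight loop.
 * Run "modprobe -r mlx5_core; modprobe mlx5_core" concurrently. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		int fd = open("/sys/kernel/slab/mlx5_fs_fgs/cpu_partial",
			      O_WRONLY);

		if (fd < 0)
			continue;	/* cache gone while module unloaded */
		(void)!write(fd, "30", 2);	/* value is arbitrary */
		close(fd);
	}
}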
kernel test robot Feb. 13, 2020, 12:22 p.m. UTC | #6
Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on tip/auto-latest linus/master v5.6-rc1 next-20200213]
[cannot apply to arm-perf/for-next/perf]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/locking-mutex-Add-mutex_timed_lock-to-solve-potential-deadlock-problems/20200213-063401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 41f0e29190ac9e38099a37abd1a8a4cb1dc21233
config: x86_64-randconfig-s2-20200213 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-4) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/slub.c: In function 'slab_attr_store':
>> mm/slub.c:5542:7: error: implicit declaration of function 'mutex_timed_lock'; did you mean 'mutex_trylock'? [-Werror=implicit-function-declaration]
      if (mutex_timed_lock(&slab_mutex, 100) < 0)
          ^~~~~~~~~~~~~~~~
          mutex_trylock
   cc1: some warnings being treated as errors

vim +5542 mm/slub.c

  5519	
  5520	static ssize_t slab_attr_store(struct kobject *kobj,
  5521					struct attribute *attr,
  5522					const char *buf, size_t len)
  5523	{
  5524		struct slab_attribute *attribute;
  5525		struct kmem_cache *s;
  5526		int err;
  5527	
  5528		attribute = to_slab_attr(attr);
  5529		s = to_slab(kobj);
  5530	
  5531		if (!attribute->store)
  5532			return -EIO;
  5533	
  5534		err = attribute->store(s, buf, len);
  5535	#ifdef CONFIG_MEMCG
  5536		if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
  5537			struct kmem_cache *c;
  5538	
  5539			/*
  5540			 * Timeout after 100ms
  5541			 */
> 5542			if (mutex_timed_lock(&slab_mutex, 100) < 0)
  5543				return -EBUSY;
  5544	
  5545			if (s->max_attr_size < len)
  5546				s->max_attr_size = len;
  5547	
  5548			/*
  5549			 * This is a best effort propagation, so this function's return
  5550			 * value will be determined by the parent cache only. This is
  5551			 * basically because not all attributes will have a well
  5552			 * defined semantics for rollbacks - most of the actions will
  5553			 * have permanent effects.
  5554			 *
  5555			 * Returning the error value of any of the children that fail
  5556			 * is not 100 % defined, in the sense that users seeing the
  5557			 * error code won't be able to know anything about the state of
  5558			 * the cache.
  5559			 *
  5560			 * Only returning the error code for the parent cache at least
  5561			 * has well defined semantics. The cache being written to
  5562			 * directly either failed or succeeded, in which case we loop
  5563			 * through the descendants with best-effort propagation.
  5564			 */
  5565			for_each_memcg_cache(c, s)
  5566				attribute->store(c, buf, len);
  5567			mutex_unlock(&slab_mutex);
  5568		}
  5569	#endif
  5570		return err;
  5571	}
  5572	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 13, 2020, 4:48 p.m. UTC | #7
Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on tip/auto-latest linus/master v5.6-rc1 next-20200213]
[cannot apply to arm-perf/for-next/perf]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/locking-mutex-Add-mutex_timed_lock-to-solve-potential-deadlock-problems/20200213-063401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 41f0e29190ac9e38099a37abd1a8a4cb1dc21233
config: x86_64-randconfig-s2-20200213 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/slub.c: In function 'slab_attr_store':
>> mm/slub.c:5542:7: error: implicit declaration of function 'mutex_timed_lock' [-Werror=implicit-function-declaration]
      if (mutex_timed_lock(&slab_mutex, 100) < 0)
          ^~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/mutex_timed_lock +5542 mm/slub.c

  5519	
  5520	static ssize_t slab_attr_store(struct kobject *kobj,
  5521					struct attribute *attr,
  5522					const char *buf, size_t len)
  5523	{
  5524		struct slab_attribute *attribute;
  5525		struct kmem_cache *s;
  5526		int err;
  5527	
  5528		attribute = to_slab_attr(attr);
  5529		s = to_slab(kobj);
  5530	
  5531		if (!attribute->store)
  5532			return -EIO;
  5533	
  5534		err = attribute->store(s, buf, len);
  5535	#ifdef CONFIG_MEMCG
  5536		if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
  5537			struct kmem_cache *c;
  5538	
  5539			/*
  5540			 * Timeout after 100ms
  5541			 */
> 5542			if (mutex_timed_lock(&slab_mutex, 100) < 0)
  5543				return -EBUSY;
  5544	
  5545			if (s->max_attr_size < len)
  5546				s->max_attr_size = len;
  5547	
  5548			/*
  5549			 * This is a best effort propagation, so this function's return
  5550			 * value will be determined by the parent cache only. This is
  5551			 * basically because not all attributes will have a well
  5552			 * defined semantics for rollbacks - most of the actions will
  5553			 * have permanent effects.
  5554			 *
  5555			 * Returning the error value of any of the children that fail
  5556			 * is not 100 % defined, in the sense that users seeing the
  5557			 * error code won't be able to know anything about the state of
  5558			 * the cache.
  5559			 *
  5560			 * Only returning the error code for the parent cache at least
  5561			 * has well defined semantics. The cache being written to
  5562			 * directly either failed or succeeded, in which case we loop
  5563			 * through the descendants with best-effort propagation.
  5564			 */
  5565			for_each_memcg_cache(c, s)
  5566				attribute->store(c, buf, len);
  5567			mutex_unlock(&slab_mutex);
  5568		}
  5569	#endif
  5570		return err;
  5571	}
  5572	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..495bec9b66ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
 		struct kmem_cache *c;
 
-		mutex_lock(&slab_mutex);
+		/*
+		 * Timeout after 100ms
+		 */
+		if (mutex_timed_lock(&slab_mutex, 100) < 0)
+			return -EBUSY;
+
 		if (s->max_attr_size < len)
 			s->max_attr_size = len;