Message ID: 20200210204651.21674-4-longman@redhat.com (mailing list archive)
State: New, archived
Series: locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems
On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:

> In order to fix this circular lock dependency problem, we need to do a
> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
> trylock, however, is easy to fail causing user dissatisfaction. So the
> new mutex_timed_lock() function is now used to do a trylock with a
> 100ms timeout.
>
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>  		struct kmem_cache *c;
>
> -		mutex_lock(&slab_mutex);
> +		/*
> +		 * Timeout after 100ms
> +		 */
> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> +			return -EBUSY;
> +

Oh dear.  Surely there's a better fix here.  Does slab really need to
hold slab_mutex while creating that sysfs file?  Why?

If the issue is two threads trying to create the same sysfs file
(unlikely, given that both will need to have created the same cache)
then can we add a new mutex specifically for this purpose?

Or something else.
On 2/10/20 5:03 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:
>
>> In order to fix this circular lock dependency problem, we need to do a
>> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
>> trylock, however, is easy to fail causing user dissatisfaction. So the
>> new mutex_timed_lock() function is now used to do a trylock with a
>> 100ms timeout.
>>
>> ...
>>
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>  		struct kmem_cache *c;
>>
>> -		mutex_lock(&slab_mutex);
>> +		/*
>> +		 * Timeout after 100ms
>> +		 */
>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>> +			return -EBUSY;
>> +
> Oh dear.  Surely there's a better fix here.  Does slab really need to
> hold slab_mutex while creating that sysfs file?  Why?
>
> If the issue is two threads trying to create the same sysfs file
> (unlikely, given that both will need to have created the same cache)
> then can we add a new mutex specifically for this purpose?
>
> Or something else.

Well, the current code iterates all the memory cgroups to set the same
value in all of them. I believe the reason for holding the slab mutex is
to make sure that the memcg hierarchy is stable during this iteration
process.

Of course, we can argue if the attribute value should be duplicated in
all memcg's.

Cheers,
Longman
On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:

> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
> >>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> >>  		struct kmem_cache *c;
> >>
> >> -		mutex_lock(&slab_mutex);
> >> +		/*
> >> +		 * Timeout after 100ms
> >> +		 */
> >> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> >> +			return -EBUSY;
> >> +
> > Oh dear.  Surely there's a better fix here.  Does slab really need to
> > hold slab_mutex while creating that sysfs file?  Why?
> >
> > If the issue is two threads trying to create the same sysfs file
> > (unlikely, given that both will need to have created the same cache)
> > then can we add a new mutex specifically for this purpose?
> >
> > Or something else.
> >
> Well, the current code iterates all the memory cgroups to set the same
> value in all of them. I believe the reason for holding the slab mutex is
> to make sure that memcg hierarchy is stable during this iteration
> process.

But that is unrelated to creation of the sysfs file?
On 2/10/20 6:10 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>  		struct kmem_cache *c;
>>>>
>>>> -		mutex_lock(&slab_mutex);
>>>> +		/*
>>>> +		 * Timeout after 100ms
>>>> +		 */
>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>> +			return -EBUSY;
>>>> +
>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>> hold slab_mutex while creating that sysfs file?  Why?
>>>
>>> If the issue is two threads trying to create the same sysfs file
>>> (unlikely, given that both will need to have created the same cache)
>>> then can we add a new mutex specifically for this purpose?
>>>
>>> Or something else.
>>>
>> Well, the current code iterates all the memory cgroups to set the same
>> value in all of them. I believe the reason for holding the slab mutex is
>> to make sure that memcg hierarchy is stable during this iteration
>> process.
> But that is unrelated to creation of the sysfs file?
>
OK, I will take a closer look at that.

Cheers,
Longman
On 2/11/20 6:30 PM, Waiman Long wrote:
> On 2/10/20 6:10 PM, Andrew Morton wrote:
>> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>>>> --- a/mm/slub.c
>>>>> +++ b/mm/slub.c
>>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>>  		struct kmem_cache *c;
>>>>>
>>>>> -		mutex_lock(&slab_mutex);
>>>>> +		/*
>>>>> +		 * Timeout after 100ms
>>>>> +		 */
>>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>>> +			return -EBUSY;
>>>>> +
>>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>>> hold slab_mutex while creating that sysfs file?  Why?
>>>>
>>>> If the issue is two threads trying to create the same sysfs file
>>>> (unlikely, given that both will need to have created the same cache)
>>>> then can we add a new mutex specifically for this purpose?
>>>>
>>>> Or something else.
>>>>
>>> Well, the current code iterates all the memory cgroups to set the same
>>> value in all of them. I believe the reason for holding the slab mutex is
>>> to make sure that memcg hierarchy is stable during this iteration
>>> process.
>> But that is unrelated to creation of the sysfs file?
>>
> OK, I will take a closer look at that.

During the creation of a sysfs file:

static int sysfs_slab_add(struct kmem_cache *s)
{
	:
	if (unmergeable) {
		/*
		 * Slabcache can never be merged so we can use the name proper.
		 * This is typically the case for debug situations. In that
		 * case we can catch duplicate names easily.
		 */
		sysfs_remove_link(&slab_kset->kobj, s->name);
		name = s->name;

The code is trying to remove the sysfs files of a cache with a conflicting
name. So it seems like kmem_cache_create() is called with a name that has
been used before. If a write to one of the sysfs files to be removed
happens at the same time, a deadlock can happen.

In this particular case, the kmem_cache_create() call comes from the
mlx5_core module.

	steering->fgs_cache = kmem_cache_create("mlx5_fs_fgs",
						sizeof(struct mlx5_flow_group),
						0, 0, NULL);

Perhaps the module is somehow unloaded and then loaded again. Unfortunately
this lockdep error was seen only once. It is hard to find out how to fix it
without an easy way to reproduce it. So I will table this for now until
there is a way to reproduce it.

Thanks,
Longman
Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on tip/auto-latest linus/master v5.6-rc1 next-20200213]
[cannot apply to arm-perf/for-next/perf]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/locking-mutex-Add-mutex_timed_lock-to-solve-potential-deadlock-problems/20200213-063401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 41f0e29190ac9e38099a37abd1a8a4cb1dc21233
config: x86_64-randconfig-s2-20200213 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-4) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/slub.c: In function 'slab_attr_store':
>> mm/slub.c:5542:7: error: implicit declaration of function 'mutex_timed_lock'; did you mean 'mutex_trylock'? [-Werror=implicit-function-declaration]
      if (mutex_timed_lock(&slab_mutex, 100) < 0)
          ^~~~~~~~~~~~~~~~
          mutex_trylock
   cc1: some warnings being treated as errors

vim +5542 mm/slub.c

  5519
  5520	static ssize_t slab_attr_store(struct kobject *kobj,
  5521				struct attribute *attr,
  5522				const char *buf, size_t len)
  5523	{
  5524		struct slab_attribute *attribute;
  5525		struct kmem_cache *s;
  5526		int err;
  5527
  5528		attribute = to_slab_attr(attr);
  5529		s = to_slab(kobj);
  5530
  5531		if (!attribute->store)
  5532			return -EIO;
  5533
  5534		err = attribute->store(s, buf, len);
  5535	#ifdef CONFIG_MEMCG
  5536		if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
  5537			struct kmem_cache *c;
  5538
  5539			/*
  5540			 * Timeout after 100ms
  5541			 */
> 5542			if (mutex_timed_lock(&slab_mutex, 100) < 0)
  5543				return -EBUSY;
  5544
  5545			if (s->max_attr_size < len)
  5546				s->max_attr_size = len;
  5547
  5548			/*
  5549			 * This is a best effort propagation, so this function's return
  5550			 * value will be determined by the parent cache only. This is
  5551			 * basically because not all attributes will have a well
  5552			 * defined semantics for rollbacks - most of the actions will
  5553			 * have permanent effects.
  5554			 *
  5555			 * Returning the error value of any of the children that fail
  5556			 * is not 100 % defined, in the sense that users seeing the
  5557			 * error code won't be able to know anything about the state of
  5558			 * the cache.
  5559			 *
  5560			 * Only returning the error code for the parent cache at least
  5561			 * has well defined semantics. The cache being written to
  5562			 * directly either failed or succeeded, in which case we loop
  5563			 * through the descendants with best-effort propagation.
  5564			 */
  5565			for_each_memcg_cache(c, s)
  5566				attribute->store(c, buf, len);
  5567			mutex_unlock(&slab_mutex);
  5568		}
  5569	#endif
  5570		return err;
  5571	}
  5572

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..495bec9b66ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
 		struct kmem_cache *c;
 
-		mutex_lock(&slab_mutex);
+		/*
+		 * Timeout after 100ms
+		 */
+		if (mutex_timed_lock(&slab_mutex, 100) < 0)
+			return -EBUSY;
+
 		if (s->max_attr_size < len)
 			s->max_attr_size = len;
The following lockdep splat was seen:

[  176.241923] ======================================================
[  176.241924] WARNING: possible circular locking dependency detected
[  176.241926] 4.18.0-172.rt13.29.el8.x86_64+debug #1 Not tainted
[  176.241927] ------------------------------------------------------
[  176.241929] slub_cpu_partia/5371 is trying to acquire lock:
[  176.241930] ffffffffa0b83718 (slab_mutex){+.+.}, at: slab_attr_store+0x6b/0xe0
[  176.241941] but task is already holding lock:
[  176.241942] ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
[  176.241947] which lock already depends on the new lock.
[  176.241949] the existing dependency chain (in reverse order) is:
[  176.241949] -> #1 (kn->count#103){++++}:
[  176.241955]        __kernfs_remove+0x616/0x800
[  176.241957]        kernfs_remove_by_name_ns+0x3e/0x80
[  176.241959]        sysfs_slab_add+0x1c6/0x330
[  176.241961]        __kmem_cache_create+0x15f/0x1b0
[  176.241964]        create_cache+0xe1/0x220
[  176.241966]        kmem_cache_create_usercopy+0x1a3/0x260
[  176.241967]        kmem_cache_create+0x12/0x20
[  176.242076]        mlx5_init_fs+0x18d/0x1a00 [mlx5_core]
[  176.242100]        mlx5_load_one+0x3b4/0x1730 [mlx5_core]
[  176.242124]        init_one+0x901/0x11b0 [mlx5_core]
[  176.242127]        local_pci_probe+0xd4/0x180
[  176.242131]        work_for_cpu_fn+0x51/0xa0
[  176.242133]        process_one_work+0x91a/0x1ac0
[  176.242134]        worker_thread+0x536/0xb40
[  176.242136]        kthread+0x30c/0x3d0
[  176.242140]        ret_from_fork+0x27/0x50
[  176.242140] -> #0 (slab_mutex){+.+.}:
[  176.242145]        __lock_acquire+0x22cb/0x48c0
[  176.242146]        lock_acquire+0x134/0x4c0
[  176.242148]        _mutex_lock+0x28/0x40
[  176.242150]        slab_attr_store+0x6b/0xe0
[  176.242151]        kernfs_fop_write+0x251/0x400
[  176.242154]        vfs_write+0x157/0x460
[  176.242155]        ksys_write+0xb8/0x170
[  176.242158]        do_syscall_64+0x13c/0x710
[  176.242160]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  176.242161] other info that might help us debug this:
[  176.242161]  Possible unsafe locking scenario:
[  176.242162]        CPU0                    CPU1
[  176.242163]        ----                    ----
[  176.242163]   lock(kn->count#103);
[  176.242165]                               lock(slab_mutex);
[  176.242166]                               lock(kn->count#103);
[  176.242167]   lock(slab_mutex);
[  176.242169]  *** DEADLOCK ***
[  176.242170] 3 locks held by slub_cpu_partia/5371:
[  176.242170]  #0: ffff888705e3a800 (sb_writers#4){.+.+}, at: vfs_write+0x31c/0x460
[  176.242174]  #1: ffff889aeec4d658 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1a9/0x400
[  176.242177]  #2: ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
[  176.242180] stack backtrace:
[  176.242183] CPU: 36 PID: 5371 Comm: slub_cpu_partia Not tainted 4.18.0-172.rt13.29.el8.x86_64+debug #1
[  176.242184] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
[  176.242185] Call Trace:
[  176.242190]  dump_stack+0x9a/0xf0
[  176.242193]  check_noncircular+0x317/0x3c0
[  176.242195]  ? print_circular_bug+0x1e0/0x1e0
[  176.242199]  ? native_sched_clock+0x32/0x1e0
[  176.242202]  ? sched_clock+0x5/0x10
[  176.242205]  ? sched_clock_cpu+0x238/0x340
[  176.242208]  __lock_acquire+0x22cb/0x48c0
[  176.242213]  ? trace_hardirqs_on+0x10/0x10
[  176.242215]  ? trace_hardirqs_on+0x10/0x10
[  176.242218]  lock_acquire+0x134/0x4c0
[  176.242220]  ? slab_attr_store+0x6b/0xe0
[  176.242223]  _mutex_lock+0x28/0x40
[  176.242225]  ? slab_attr_store+0x6b/0xe0
[  176.242227]  slab_attr_store+0x6b/0xe0
[  176.242229]  ? sysfs_file_ops+0x160/0x160
[  176.242230]  kernfs_fop_write+0x251/0x400
[  176.242232]  ? __sb_start_write+0x26a/0x3f0
[  176.242234]  vfs_write+0x157/0x460
[  176.242237]  ksys_write+0xb8/0x170
[  176.242239]  ? __ia32_sys_read+0xb0/0xb0
[  176.242242]  ? do_syscall_64+0xb9/0x710
[  176.242245]  do_syscall_64+0x13c/0x710
[  176.242247]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

In order to fix this circular lock dependency problem, we need to do a
mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
trylock, however, is easy to fail causing user dissatisfaction. So the
new mutex_timed_lock() function is now used to do a trylock with a
100ms timeout.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slub.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)