[v2,4/4] mm/slub: Fix sysfs shrink circular locking dependency

Message ID 20200427235621.7823-5-longman@redhat.com (mailing list archive)
State New, archived
Series mm/slub: Fix sysfs circular locking dependency

Commit Message

Waiman Long April 27, 2020, 11:56 p.m. UTC
A lockdep splat is observed by echoing "1" to the shrink sysfs file
and then shutting down the system:

[  167.473392] Chain exists of:
[  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
[  167.473392]
[  167.484323]  Possible unsafe locking scenario:
[  167.484323]
[  167.490273]        CPU0                    CPU1
[  167.494825]        ----                    ----
[  167.499376]   lock(slab_mutex);
[  167.502530]                                lock(mem_hotplug_lock.rw_sem);
[  167.509356]                                lock(slab_mutex);
[  167.515044]   lock(kn->count#279);
[  167.518462]
[  167.518462]  *** DEADLOCK ***

This is caused by the get_online_cpus() and get_online_mems() calls in
kmem_cache_shrink(), which is invoked via the shrink sysfs file. To fix
that, we have to use trylock to acquire the memory and cpu hotplug read
locks. Since hotplug events are rare, it should be fine to refuse a
kmem cache shrink operation while a hotplug event is in progress.
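
For reference, the blocking path that creates the problematic edge looks
roughly like this (a paraphrase of kmem_cache_shrink() in mm/slub.c of
this era, shown for illustration only; it is not part of the patch):

int kmem_cache_shrink(struct kmem_cache *s)
{
	int ret;

	get_online_cpus();	/* blocking read lock of the cpu hotplug lock */
	get_online_mems();	/* blocking read lock of mem_hotplug_lock */
	kasan_cache_shrink(s);
	ret = __kmem_cache_shrink(s);
	put_online_mems();
	put_online_cpus();
	return ret;
}

Since the sysfs write into shrink_store() already holds kn->count, these
blocking acquisitions record the kn->count --> mem_hotplug_lock.rw_sem
edge that completes the circular chain above.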

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memory_hotplug.h |  2 ++
 mm/memory_hotplug.c            |  5 +++++
 mm/slub.c                      | 19 +++++++++++++++----
 3 files changed, 22 insertions(+), 4 deletions(-)

Comments

Qian Cai April 28, 2020, 12:13 a.m. UTC | #1
> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> 
> A lockdep splat is observed by echoing "1" to the shrink sysfs file
> and then shutting down the system:
> 
> [  167.473392] Chain exists of:
> [  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
> [  167.473392]
> [  167.484323]  Possible unsafe locking scenario:
> [  167.484323]
> [  167.490273]        CPU0                    CPU1
> [  167.494825]        ----                    ----
> [  167.499376]   lock(slab_mutex);
> [  167.502530]                                lock(mem_hotplug_lock.rw_sem);
> [  167.509356]                                lock(slab_mutex);
> [  167.515044]   lock(kn->count#279);
> [  167.518462]
> [  167.518462]  *** DEADLOCK ***
> 
> This is caused by the get_online_cpus() and get_online_mems() calls in
> kmem_cache_shrink(), which is invoked via the shrink sysfs file. To fix
> that, we have to use trylock to acquire the memory and cpu hotplug read
> locks. Since hotplug events are rare, it should be fine to refuse a
> kmem cache shrink operation while a hotplug event is in progress.

I don’t understand how trylock could prevent a splat. The fundamental issue is that in the sysfs slab store case, the locking order (once the trylock succeeds) is,

kn->count --> cpu/memory_hotplug

But we have the existing reverse chain everywhere.

cpu/memory_hotplug --> slab_mutex --> kn->count
Waiman Long April 28, 2020, 1:39 a.m. UTC | #2
On 4/27/20 8:13 PM, Qian Cai wrote:
>
>> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> A lockdep splat is observed by echoing "1" to the shrink sysfs file
>> and then shutting down the system:
>>
>> [  167.473392] Chain exists of:
>> [  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
>> [  167.473392]
>> [  167.484323]  Possible unsafe locking scenario:
>> [  167.484323]
>> [  167.490273]        CPU0                    CPU1
>> [  167.494825]        ----                    ----
>> [  167.499376]   lock(slab_mutex);
>> [  167.502530]                                lock(mem_hotplug_lock.rw_sem);
>> [  167.509356]                                lock(slab_mutex);
>> [  167.515044]   lock(kn->count#279);
>> [  167.518462]
>> [  167.518462]  *** DEADLOCK ***
>>
>> This is caused by the get_online_cpus() and get_online_mems() calls in
>> kmem_cache_shrink(), which is invoked via the shrink sysfs file. To fix
>> that, we have to use trylock to acquire the memory and cpu hotplug read
>> locks. Since hotplug events are rare, it should be fine to refuse a
>> kmem cache shrink operation while a hotplug event is in progress.
> I don’t understand how trylock could prevent a splat. The fundamental issue is that in the sysfs slab store case, the locking order (once the trylock succeeds) is,
>
> kn->count --> cpu/memory_hotplug
>
> But we have the existing reverse chain everywhere.
>
> cpu/memory_hotplug --> slab_mutex --> kn->count
>
The sequence that this patch prevents is "kn->count -->
mem_hotplug_lock.rw_sem". This sequence isn't directly in the splat.
Once this link is broken, the 3-lock circular loop cannot be formed.
Maybe I should modify the commit log to make this point clearer.
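
Spelled out, the three edges involved are (an illustrative summary in
comment form, not actual lockdep output):

/*
 * slab_mutex --> kn->count                [cache shutdown removes the
 *                                          sysfs files under slab_mutex]
 * mem_hotplug_lock.rw_sem --> slab_mutex  [memory hotplug callbacks
 *                                          take slab_mutex]
 * kn->count --> mem_hotplug_lock.rw_sem   [shrink_store(); this is the
 *                                          edge the trylock removes]
 */

With the last edge gone, the remaining two edges cannot form a cycle.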

Cheers,
Longman
Qian Cai April 28, 2020, 2:11 a.m. UTC | #3
> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
> 
> The sequence that this patch prevents is "kn->count --> mem_hotplug_lock.rw_sem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point clearer.

I don’t know what you are talking about. Once a trylock succeeds even once, you will have kn->count --> cpu/memory_hotplug_lock.
Waiman Long April 28, 2020, 2:06 p.m. UTC | #4
On 4/27/20 10:11 PM, Qian Cai wrote:
>
>> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> The sequence that this patch prevents is "kn->count --> mem_hotplug_lock.rw_sem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point clearer.
> I don’t know what you are talking about. Once a trylock succeeds even once, you will have kn->count --> cpu/memory_hotplug_lock.
>
Trylock is handled differently from lockdep's perspective because a
trylock can fail. When the trylock succeeds, the critical section is
executed. As long as it doesn't try to acquire another lock in the
circular chain, the execution will finish at some point and release the
lock. On the other hand, if another task already holds all those locks,
the trylock will fail and the locks already held should be released.
Either way, no deadlock will happen.
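
For illustration, here is the same back-off pattern with two plain
mutexes (a hypothetical, self-contained example, not the actual kernel
code; lock_a stands in for kn->count and lock_b for mem_hotplug_lock):

#include <linux/mutex.h>

static DEFINE_MUTEX(lock_a);
static DEFINE_MUTEX(lock_b);

static int path_a_then_b(void)
{
	mutex_lock(&lock_a);
	if (!mutex_trylock(&lock_b)) {
		/* Back off instead of waiting: drop what we hold. */
		mutex_unlock(&lock_a);
		return -EBUSY;
	}
	/* ... critical section under both locks ... */
	mutex_unlock(&lock_b);
	mutex_unlock(&lock_a);
	return 0;
}

Even if another task takes lock_b and then lock_a, path_a_then_b() never
sits waiting on lock_b while holding lock_a, so the two tasks can never
block each other permanently.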

Regards,
Longman
Qian Cai April 29, 2020, 2:52 a.m. UTC | #5
> On Apr 28, 2020, at 10:06 AM, Waiman Long <longman@redhat.com> wrote:
> 
> On 4/27/20 10:11 PM, Qian Cai wrote:
>> 
>>> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
>>> 
>>> The sequence that this patch prevents is "kn->count --> mem_hotplug_lock.rw_sem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point clearer.
>> I don’t know what you are talking about. Once a trylock succeeds even once, you will have kn->count --> cpu/memory_hotplug_lock.
>> 
> Trylock is handled differently from lockdep's perspective because a trylock can fail. When the trylock succeeds, the critical section is executed. As long as it doesn't try to acquire another lock in the circular chain, the execution will finish at some point and release the lock. On the other hand, if another task already holds all those locks, the trylock will fail and the locks already held should be released. Either way, no deadlock will happen.

So once,

CPU0 (trylock succeeds):
kn->count --> cpu/memory_hotplug_lock

Did you mean that lockdep will not record this existing chain?

If it did, then later, are you still sure that CPU1 (via the memcg path below) still cannot trigger a splat, just because lockdep will be able to tell that those are non-exclusive (cpu/memory_hotplug_lock) locks instead?

 cpu/memory_hotplug_lock --> kn->count

[  290.805818] -> #3 (kn->count#86){++++}-{0:0}:
[  290.811954]        __kernfs_remove+0x455/0x4c0
[  290.816428]        kernfs_remove+0x23/0x40
[  290.820554]        sysfs_remove_dir+0x74/0x80
[  290.824947]        kobject_del+0x57/0xa0
[  290.828905]        sysfs_slab_unlink+0x1c/0x20
[  290.833377]        shutdown_cache+0x15d/0x1c0
[  290.837964]        kmemcg_cache_shutdown_fn+0xe/0x20
[  290.842963]        kmemcg_workfn+0x35/0x50   <-- cpu/memory_hotplug_lock
[  290.847095]        process_one_work+0x57e/0xb90
[  290.851658]        worker_thread+0x63/0x5b0
[  290.855872]        kthread+0x1f7/0x220
[  290.859653]        ret_from_fork+0x27/0x50

Patch

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 93d9ada74ddd..4ec4b0a2f0fa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -231,6 +231,7 @@  extern void get_page_bootmem(unsigned long ingo, struct page *page,
 
 void get_online_mems(void);
 void put_online_mems(void);
+int  tryget_online_mems(void);
 
 void mem_hotplug_begin(void);
 void mem_hotplug_done(void);
@@ -274,6 +275,7 @@  static inline int try_online_node(int nid)
 
 static inline void get_online_mems(void) {}
 static inline void put_online_mems(void) {}
+static inline int  tryget_online_mems(void) { return 1; }
 
 static inline void mem_hotplug_begin(void) {}
 static inline void mem_hotplug_done(void) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc0aad0bc1f5..38f9ccec9259 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -59,6 +59,11 @@  void get_online_mems(void)
 	percpu_down_read(&mem_hotplug_lock);
 }
 
+int tryget_online_mems(void)
+{
+	return percpu_down_read_trylock(&mem_hotplug_lock);
+}
+
 void put_online_mems(void)
 {
 	percpu_up_read(&mem_hotplug_lock);
diff --git a/mm/slub.c b/mm/slub.c
index cf2114ca27f7..c4977ac3271b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5343,10 +5343,20 @@  static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1')
-		kmem_cache_shrink(s);
-	else
+	if (buf[0] != '1')
 		return -EINVAL;
+
+	if (!cpus_read_trylock())
+		return -EBUSY;
+	if (!tryget_online_mems()) {
+		length = -EBUSY;
+		goto cpus_unlock_out;
+	}
+	kasan_cache_shrink(s);
+	__kmem_cache_shrink(s);
+	put_online_mems();
+cpus_unlock_out:
+	cpus_read_unlock();
 	return length;
 }
 SLAB_ATTR(shrink);
@@ -5654,7 +5664,8 @@  static ssize_t slab_attr_store(struct kobject *kobj,
 
 		for (idx = 0; idx < cnt; idx++) {
 			c = pcaches[idx];
-			attribute->store(c, buf, len);
+			if (attribute->store(c, buf, len) == -EBUSY)
+				err = -EBUSY;
 			percpu_ref_put(&c->memcg_params.refcnt);
 		}
 		kfree(pcaches);
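
With this change, a shrink request that races with a hotplug operation
fails visibly instead of risking a deadlock. A minimal userspace sketch
of the observable behaviour (the "dentry" cache name is only an example;
any directory under /sys/kernel/slab works):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/slab/dentry/shrink", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Writing "1" triggers shrink_store(); with this patch it fails
	 * with EBUSY while a cpu or memory hotplug operation is in
	 * progress. */
	if (write(fd, "1", 1) < 0 && errno == EBUSY)
		fprintf(stderr, "shrink refused: hotplug in progress\n");
	close(fd);
	return 0;
}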