Message ID | 20190321214512.11524-3-longman@redhat.com (mailing list archive)
---|---
State | Not Applicable
Series | Signal: Fix hard lockup problem in flush_sigqueue()
On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
> It was found that if a process had many pending signals (e.g. millions),
> the act of exiting that process might cause its parent to have a hard
> lockup especially on a debug kernel with features like KASAN enabled.
> It was because the flush_sigqueue() was called in release_task() with
> tasklist_lock held and irq disabled.

This rather apocalyptic language is a bit uncalled for.  I appreciate
the warning is scary, but all that's really happening is that the debug
code is taking too long to execute.

> To avoid this dire condition and reduce lock hold time of tasklist_lock,
> flush_sigqueue() is modified to pass in a freeing queue pointer so that
> the actual freeing of memory objects can be deferred until after the
> tasklist_lock is released and irq re-enabled.

I think this is a really bad solution.  It looks kind of generic,
but isn't.  It's terribly inefficient, and all it's really doing is
deferring the debugging code until we've re-enabled interrupts.

We'd be much better off just having a list_head in the caller
and list_splice() the queue->list onto that caller.  Then call
__sigqueue_free() for each signal on the queue.
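[Editorial note: the splice-then-free pattern proposed above can be sketched in plain userspace C with a miniature stand-in for the kernel's intrusive list (the helpers below mimic `<linux/list.h>` but are not the kernel's code). The point is that detaching the whole queue under the lock is O(1), and all the per-object freeing happens after the lock is dropped.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-ins for the kernel's intrusive circular list. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Move everything on @from onto @to and leave @from empty, like list_splice_init(). */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
	if (list_empty(from))
		return;
	from->next->prev = to;           /* first entry now points back at @to  */
	from->prev->next = to->next;     /* last entry links to old first of @to */
	to->next->prev = from->prev;
	to->next = from->next;
	INIT_LIST_HEAD(from);
}

struct sigqueue { struct list_head list; int sig; };

/* Under the lock: just detach the whole pending queue, no per-object work. */
static void flush_sigqueue_detach(struct list_head *pending, struct list_head *out)
{
	list_splice_init(pending, out);
}

/* After the lock is dropped and irqs re-enabled: free each detached entry. */
static int free_detached(struct list_head *out)
{
	int freed = 0;
	while (!list_empty(out)) {
		struct list_head *n = out->next;
		struct sigqueue *q = (struct sigqueue *)
			((char *)n - offsetof(struct sigqueue, list));
		n->prev->next = n->next;   /* unlink */
		n->next->prev = n->prev;
		free(q);
		freed++;
	}
	return freed;
}
```

As Oleg points out in the follow-up, in the real kernel this is complicated by the `sigqueue_free()` race with SIGQUEUE_PREALLOC entries, so the sketch only captures the lock-hold-time argument, not the full locking story.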
On 03/21, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
> >
> > To avoid this dire condition and reduce lock hold time of tasklist_lock,
> > flush_sigqueue() is modified to pass in a freeing queue pointer so that
> > the actual freeing of memory objects can be deferred until after the
> > tasklist_lock is released and irq re-enabled.
>
> I think this is a really bad solution.  It looks kind of generic,
> but isn't.  It's terribly inefficient, and all it's really doing is
> deferring the debugging code until we've re-enabled interrupts.

Agreed.

> We'd be much better off just having a list_head in the caller
> and list_splice() the queue->list onto that caller.  Then call
> __sigqueue_free() for each signal on the queue.

This won't work, note the comment which explains the race with sigqueue_free().

Let me think about it... at least we can do something like

	close_the_race_with_sigqueue_free(struct sigpending *queue)
	{
		struct sigqueue *q, *t;

		list_for_each_entry_safe(q, t, ...) {
			if (q->flags & SIGQUEUE_PREALLOC)
				list_del_init(&q->list);
		}
	}

called with ->siglock held, tasklist_lock is not needed.

After that flush_sigqueue() can be called lockless in release_task().

I'll try to make the patch tomorrow.

Oleg.
On 03/22/2019 07:16 AM, Oleg Nesterov wrote:
> On 03/21, Matthew Wilcox wrote:
>> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>>
>>> To avoid this dire condition and reduce lock hold time of tasklist_lock,
>>> flush_sigqueue() is modified to pass in a freeing queue pointer so that
>>> the actual freeing of memory objects can be deferred until after the
>>> tasklist_lock is released and irq re-enabled.
>> I think this is a really bad solution.  It looks kind of generic,
>> but isn't.  It's terribly inefficient, and all it's really doing is
>> deferring the debugging code until we've re-enabled interrupts.
> Agreed.

Thanks for looking into that. As I am not knowledgeable enough about the
signal handling code path, I chose the lowest-risk approach of not trying
to change the code flow while deferring memory deallocation until after
the tasklist_lock is released.

>> We'd be much better off just having a list_head in the caller
>> and list_splice() the queue->list onto that caller.  Then call
>> __sigqueue_free() for each signal on the queue.
> This won't work, note the comment which explains the race with sigqueue_free().
>
> Let me think about it... at least we can do something like
>
> 	close_the_race_with_sigqueue_free(struct sigpending *queue)
> 	{
> 		struct sigqueue *q, *t;
>
> 		list_for_each_entry_safe(q, t, ...) {
> 			if (q->flags & SIGQUEUE_PREALLOC)
> 				list_del_init(&q->list);
> 		}
> 	}
>
> called with ->siglock held, tasklist_lock is not needed.
>
> After that flush_sigqueue() can be called lockless in release_task().
>
> I'll try to make the patch tomorrow.
>
> Oleg.

I am looking forward to it.

Thanks,
Longman
On Fri, 22 Mar 2019, Waiman Long wrote:
> I am looking forward to it.
There is also already rcu being used in these paths. kfree_rcu() would not
be enough? It is an established mechanism that is mature and well
understood.
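[Editorial note: for readers unfamiliar with the mechanism being suggested, kfree_rcu() defers the actual free until after an RCU grace period by chaining dead objects through an `rcu_head` embedded in the object itself, so deferral costs no extra allocation. A loose userspace analog of that chaining trick (names invented here; this is not the kernel's RCU machinery):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct rcu_head: a link embedded in each freeable object. */
struct rcu_head { struct rcu_head *next; };

struct sigqueue_like {
	int sig;
	struct rcu_head rcu;   /* safe to reuse once the object is dead */
};

static struct rcu_head *deferred;  /* singly linked batch of dead objects */

/* Queue an object for later freeing; like kfree_rcu(), allocates nothing. */
static void defer_free(struct rcu_head *head)
{
	head->next = deferred;
	deferred = head;
}

/* Later (standing in for "after a grace period"): really free the batch. */
static int reap_deferred(size_t rcu_offset)
{
	int n = 0;
	while (deferred) {
		struct rcu_head *next = deferred->next;
		/* Recover the object start from the embedded link's offset. */
		free((char *)deferred - rcu_offset);
		deferred = next;
		n++;
	}
	return n;
}
```

The offset arithmetic is the detail that matters for the SLOB discussion that follows: recovering the cache (or size) from a bare object pointer is exactly what SLOB cannot do.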
On 03/22/2019 01:50 PM, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
>> I am looking forward to it.
> There is also already rcu being used in these paths. kfree_rcu() would not
> be enough? It is an established mechanism that is mature and well
> understood.
>
In this case, the memory objects are from kmem caches, so they can't be
freed using kfree_rcu(). There is certainly overhead in using kfree_rcu(),
or a kfree_rcu()-like mechanism. Also I think the actual freeing is done
in SoftIRQ context, which can be a problem if there are too many memory
objects to free.

I think what Oleg is trying to do is probably the most efficient way.

Cheers,
Longman
On Fri, 22 Mar 2019, Waiman Long wrote:
> >
> >> I am looking forward to it.
> > There is also already rcu being used in these paths. kfree_rcu() would not
> > be enough? It is an established mechanism that is mature and well
> > understood.
> >
> In this case, the memory objects are from kmem caches, so they can't
> be freed using kfree_rcu().

Oh they can. kfree() can free memory from any slab cache.

> There is certainly overhead in using kfree_rcu(), or a
> kfree_rcu()-like mechanism. Also I think the actual freeing is done
> in SoftIRQ context, which can be a problem if there are too many memory
> objects to free.

No, there is none that I am aware of. And this has survived testing of
HPC loads with gazillions of objects that have to be freed from multiple
processors.

We really should not rearchitect this stuff... It took us quite a long
time to have this scale well under all loads. Please use kfree_rcu().
On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
> > In this case, the memory objects are from kmem caches, so they can't
> > be freed using kfree_rcu().
>
> Oh they can. kfree() can free memory from any slab cache.

Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
slab cache; it has no way to look up the slab cache from the object.
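[Editorial note: the reason SLAB/SLUB can accept kfree() on cache objects is that they can map any object back to its cache through per-page metadata (virt_to_head_page(obj)->slab_cache in the real code). A toy userspace analog of that lookup, with "pages" simulated via aligned_alloc() and all names invented here:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u

struct toy_cache { unsigned int obj_size; };

/* Toy "page": cache metadata lives at the start of each aligned page,
 * objects are carved out after it. */
struct toy_page {
	struct toy_cache *cache;
	unsigned int used;
	unsigned char mem[];
};

static void *toy_page_alloc(struct toy_page *pg)
{
	void *obj = pg->mem + pg->used;
	pg->used += pg->cache->obj_size;
	return obj;
}

/* The lookup SLOB lacks: round the object address down to its page
 * and read back which cache it belongs to. */
static struct toy_cache *cache_of(void *obj)
{
	struct toy_page *pg = (struct toy_page *)
		((uintptr_t)obj & ~(uintptr_t)(TOY_PAGE_SIZE - 1));
	return pg->cache;
}
```

SLOB packs objects of many sizes into the same page with no such per-page cache pointer, which is why it needs the caller to supply the cache (or a stored size, as discussed below).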
On Fri, 22 Mar 2019, Matthew Wilcox wrote:
> On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > On Fri, 22 Mar 2019, Waiman Long wrote:
> > > In this case, the memory objects are from kmem caches, so they can't
> > > be freed using kfree_rcu().
> >
> > Oh they can. kfree() can free memory from any slab cache.
>
> Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> slab cache; it has no way to look up the slab cache from the object.

Well then we could either fix SLOB to conform to the others or add a
kmem_cache_free_rcu() variant.
On Mon, Mar 25, 2019 at 02:15:25PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

The problem with a kmem_cache_free_rcu() variant is that we now have
three pointers to store -- the object pointer, the slab pointer and the
rcu next pointer.

I spent some time looking at how SLOB might be fixed, and I didn't come
up with a good idea.  Currently SLOB stores the size of the object in
the four bytes before the object, unless the object is "allocated from
a slab cache", in which case the size is taken from the cache pointer
instead.  So calling kfree() on a pointer allocated using
kmem_cache_alloc() will cause corruption as it attempts to determine
the length of the object.

Options:

1. Dispense with this optimisation and always store the size of the
   object before the object.

2. Add a kmem_cache flag that says whether objects in this cache may be
   freed with kfree().  Only dispense with this optimisation for slabs
   with this flag set.

3. Change SLOB to segregate objects by size.  If someone has gone to
   the trouble of creating a kmem_cache, this is a pretty good hint
   that there will be a lot of objects of this size allocated, so this
   could help SLOB fight fragmentation.

Any other bright ideas?
On Mon, 25 Mar 2019, Matthew Wilcox wrote:
> Options:
>
> 1. Dispense with this optimisation and always store the size of the
>    object before the object.

I think that's how SLOB handled it at some point in the past. Let's go
back to that setup so it's compatible with the other allocators?
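[Editorial note: option 1 is the classic size-header trick. A hypothetical userspace sketch (helper names invented here, not SLOB's actual code): every allocation prepends the object's size, so a later kfree()-style call can recover the length from the bytes just before the pointer, regardless of how the object was allocated.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate @size usable bytes, storing the size in the bytes just before
 * the object, as SLOB does for kmalloc() allocations. */
static void *hdr_alloc(unsigned int size)
{
	unsigned char *p = malloc(sizeof(unsigned int) + size);
	if (!p)
		return NULL;
	memcpy(p, &size, sizeof(unsigned int));   /* write the header */
	return p + sizeof(unsigned int);          /* hand out the object */
}

/* Recover the stored size: what kfree() needs, and what breaks today when
 * the object came from a kmem_cache that skipped the header. */
static unsigned int hdr_size(const void *obj)
{
	unsigned int size;
	memcpy(&size, (const unsigned char *)obj - sizeof(unsigned int),
	       sizeof(unsigned int));
	return size;
}

static void hdr_free(void *obj)
{
	free((unsigned char *)obj - sizeof(unsigned int));
}
```

The cost of always doing this (the "optimisation" being dispensed with) is the per-object header, which matters most for the small allocations SLOB targets.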
Sorry, I am sick and can't work, hopefully I'll return tomorrow.

On 03/22, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
> > I am looking forward to it.
>
> There is also already rcu being used in these paths. kfree_rcu() would not
> be enough? It is an established mechanism that is mature and well
> understood.

But why do we want to increase the number of rcu callbacks in flight?

For the moment, let's discuss the exiting tasks only. The only reason why
flush_sigqueue(&tsk->pending) needs spin_lock_irq() is the race with
release_posix_timer()->sigqueue_free() from another thread which can
remove a SIGQUEUE_PREALLOC'ed sigqueue from the list.

With the simple patch below flush_sigqueue() can be called lockless with
irqs enabled.

However, this change is not enough, we need to do something similar with
do_sigaction()->flush_sigqueue_mask(), and this is less simple.

So I won't really argue with kfree_rcu() but I am not sure this is the
best option.

Oleg.

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,6 +85,17 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 	list_del_rcu(&p->thread_node);
 }
 
+// Rename me and move into signal.c
+void remove_prealloced(struct sigpending *queue)
+{
+	struct sigqueue *q, *t;
+
+	list_for_each_entry_safe(q, t, &queue->list, list) {
+		if (q->flags & SIGQUEUE_PREALLOC)
+			list_del_init(&q->list);
+	}
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -160,16 +171,15 @@ static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	if (!group_dead)
+		remove_prealloced(&tsk->pending);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
-	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+	if (group_dead)
 		tty_kref_put(tty);
-	}
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -221,6 +231,11 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	cgroup_release(p);
 	release_thread(p);
+
+	flush_sigqueue(&p->pending);
+	if (thread_group_leader(p))
+		flush_sigqueue(&p->signal->shared_pending);
+
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
 	p = leader;
On 03/25, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
>
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

Speaking of struct sigqueue, we can simply use call_rcu(), but see my
previous email, I am not sure this is the best option.

Oleg.
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 9702016734b1..a9562e502122 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -5,6 +5,7 @@
 #include <linux/bug.h>
 #include <linux/signal_types.h>
 #include <linux/string.h>
+#include <linux/slab.h>
 
 struct task_struct;
 
@@ -254,7 +255,8 @@ static inline void init_sigpending(struct sigpending *sig)
 	INIT_LIST_HEAD(&sig->list);
 }
 
-extern void flush_sigqueue(struct sigpending *queue);
+extern void flush_sigqueue(struct sigpending *queue,
+			   struct kmem_free_q_head *head);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2166c2d92ddc..ee707a63edfd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -88,7 +88,8 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 /*
  * This function expects the tasklist_lock write-locked.
  */
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct task_struct *tsk,
+			  struct kmem_free_q_head *free_q)
 {
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
@@ -160,14 +161,14 @@ static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	flush_sigqueue(&tsk->pending, free_q);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+		flush_sigqueue(&sig->shared_pending, free_q);
 		tty_kref_put(tty);
 	}
 }
@@ -186,6 +187,8 @@ void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
 	int zap_leader;
+	DEFINE_KMEM_FREE_Q(free_q);
+
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
@@ -197,7 +200,7 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	__exit_signal(p);
+	__exit_signal(p, &free_q);
 
 	/*
 	 * If we are the last non-leader member of the thread
@@ -219,6 +222,7 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	kmem_free_up_q(&free_q);
 	cgroup_release(p);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..04fb202c16bd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -435,16 +435,19 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 	return q;
 }
 
-static void __sigqueue_free(struct sigqueue *q)
+static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	atomic_dec(&q->user->sigpending);
 	free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	if (free_q)
+		kmem_free_q_add(free_q, sigqueue_cachep, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
 }
 
-void flush_sigqueue(struct sigpending *queue)
+void flush_sigqueue(struct sigpending *queue, struct kmem_free_q_head *free_q)
 {
 	struct sigqueue *q;
 
@@ -452,7 +455,7 @@ void flush_sigqueue(struct sigpending *queue)
 	while (!list_empty(&queue->list)) {
 		q = list_entry(queue->list.next, struct sigqueue , list);
 		list_del_init(&q->list);
-		__sigqueue_free(q);
+		__sigqueue_free(q, free_q);
 	}
 }
 
@@ -462,12 +465,14 @@ void flush_sigqueue(struct sigpending *queue)
 void flush_signals(struct task_struct *t)
 {
 	unsigned long flags;
+	DEFINE_KMEM_FREE_Q(free_q);
 
 	spin_lock_irqsave(&t->sighand->siglock, flags);
 	clear_tsk_thread_flag(t, TIF_SIGPENDING);
-	flush_sigqueue(&t->pending);
-	flush_sigqueue(&t->signal->shared_pending);
+	flush_sigqueue(&t->pending, &free_q);
+	flush_sigqueue(&t->signal->shared_pending, &free_q);
 	spin_unlock_irqrestore(&t->sighand->siglock, flags);
+	kmem_free_up_q(&free_q);
 }
 
 EXPORT_SYMBOL(flush_signals);
 
@@ -488,7 +493,7 @@ static void __flush_itimer_signals(struct sigpending *pending)
 		} else {
 			sigdelset(&signal, sig);
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 
@@ -580,7 +585,7 @@ static void collect_signal(int sig, struct sigpending *list, kernel_siginfo_t *i
 			(info->si_code == SI_TIMER) &&
 			(info->si_sys_private);
 
-		__sigqueue_free(first);
+		__sigqueue_free(first, NULL);
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be
@@ -728,7 +733,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
 still_pending:
 	list_del_init(&sync->list);
 	copy_siginfo(info, &sync->info);
-	__sigqueue_free(sync);
+	__sigqueue_free(sync, NULL);
 	return info->si_signo;
 }
 
@@ -776,7 +781,7 @@ static void flush_sigqueue_mask(sigset_t *mask, struct sigpending *s)
 	list_for_each_entry_safe(q, n, &s->list, list) {
 		if (sigismember(mask, q->info.si_signo)) {
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 }
@@ -1749,7 +1754,7 @@ void sigqueue_free(struct sigqueue *q)
 	spin_unlock_irqrestore(lock, flags);
 
 	if (q)
-		__sigqueue_free(q);
+		__sigqueue_free(q, NULL);
 }
 
 int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1d0b37af2444..8ca571a0b2ac 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2548,6 +2548,8 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 	rc = avc_has_perm(&selinux_state, osid, sid, SECCLASS_PROCESS,
 			  PROCESS__SIGINH, NULL);
 	if (rc) {
+		DEFINE_KMEM_FREE_Q(free_q);
+
 		if (IS_ENABLED(CONFIG_POSIX_TIMERS)) {
 			memset(&itimer, 0, sizeof itimer);
 			for (i = 0; i < 3; i++)
@@ -2555,13 +2557,15 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 		}
 		spin_lock_irq(&current->sighand->siglock);
 		if (!fatal_signal_pending(current)) {
-			flush_sigqueue(&current->pending);
-			flush_sigqueue(&current->signal->shared_pending);
+			flush_sigqueue(&current->pending, &free_q);
+			flush_sigqueue(&current->signal->shared_pending,
+				       &free_q);
 			flush_signal_handlers(current, 1);
 			sigemptyset(&current->blocked);
 			recalc_sigpending();
 		}
 		spin_unlock_irq(&current->sighand->siglock);
+		kmem_free_up_q(&free_q);
 	}
 
 	/* Wake up the parent if it is waiting so that it can recheck
It was found that if a process had many pending signals (e.g. millions),
the act of exiting that process might cause its parent to have a hard
lockup especially on a debug kernel with features like KASAN enabled.
It was because the flush_sigqueue() was called in release_task() with
tasklist_lock held and irq disabled.

  [ 3133.105601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 37
    :
  [ 3133.105709] CPU: 37 PID: 11200 Comm: bash Kdump: loaded Not tainted 4.18.0-80.el8.x86_64+debug #1
    :
  [ 3133.105750]  slab_free_freelist_hook+0xa0/0x120
  [ 3133.105758]  kmem_cache_free+0x9d/0x310
  [ 3133.105762]  flush_sigqueue+0x12b/0x1d0
  [ 3133.105766]  release_task.part.14+0xaf7/0x1520
  [ 3133.105784]  wait_consider_task+0x28da/0x3350
  [ 3133.105804]  do_wait+0x3eb/0x8c0
  [ 3133.105819]  kernel_wait4+0xe4/0x1b0
  [ 3133.105834]  __do_sys_wait4+0xe0/0xf0
  [ 3133.105864]  do_syscall_64+0xa5/0x4a0
  [ 3133.105868]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

  [ All the "?" stack trace entries were removed from above. ]

To avoid this dire condition and reduce lock hold time of tasklist_lock,
flush_sigqueue() is modified to pass in a freeing queue pointer so that
the actual freeing of memory objects can be deferred until after the
tasklist_lock is released and irq re-enabled.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/signal.h   |  4 +++-
 kernel/exit.c            | 12 ++++++++----
 kernel/signal.c          | 27 ++++++++++++++++-----------
 security/selinux/hooks.c |  8 ++++++--
 4 files changed, 33 insertions(+), 18 deletions(-)