
[2/4] signal: Make flush_sigqueue() use free_q to release memory

Message ID 20190321214512.11524-3-longman@redhat.com
State Not Applicable
Series Signal: Fix hard lockup problem in flush_sigqueue()

Commit Message

Waiman Long March 21, 2019, 9:45 p.m. UTC
It was found that if a process had many pending signals (e.g. millions),
the act of exiting that process might cause its parent to have a hard
lockup especially on a debug kernel with features like KASAN enabled.
It was because flush_sigqueue() was called in release_task() with
tasklist_lock held and irq disabled.

  [ 3133.105601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 37
    :
  [ 3133.105709] CPU: 37 PID: 11200 Comm: bash Kdump: loaded Not tainted 4.18.0-80.el8.x86_64+debug #1
    :
  [ 3133.105750]  slab_free_freelist_hook+0xa0/0x120
  [ 3133.105758]  kmem_cache_free+0x9d/0x310
  [ 3133.105762]  flush_sigqueue+0x12b/0x1d0
  [ 3133.105766]  release_task.part.14+0xaf7/0x1520
  [ 3133.105784]  wait_consider_task+0x28da/0x3350
  [ 3133.105804]  do_wait+0x3eb/0x8c0
  [ 3133.105819]  kernel_wait4+0xe4/0x1b0
  [ 3133.105834]  __do_sys_wait4+0xe0/0xf0
  [ 3133.105864]  do_syscall_64+0xa5/0x4a0
  [ 3133.105868]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

[ All the "?" stack trace entries were removed from above. ]

To avoid this dire condition and reduce lock hold time of tasklist_lock,
flush_sigqueue() is modified to pass in a freeing queue pointer so that
the actual freeing of memory objects can be deferred until after the
tasklist_lock is released and irq re-enabled.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/signal.h   |  4 +++-
 kernel/exit.c            | 12 ++++++++----
 kernel/signal.c          | 27 ++++++++++++++++-----------
 security/selinux/hooks.c |  8 ++++++--
 4 files changed, 33 insertions(+), 18 deletions(-)
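
Patch 1/4 of this series (not included on this page) introduces the kmem_free_q
helpers this patch relies on. Inferred purely from the call sites below, such an
API could look roughly like the sketch that follows; the helper names match the
patch, but the bodies are guesses and the real implementation may well differ:

	/*
	 * Deferred-free queue (sketch): objects are queued while locks are
	 * held and only returned to their slab caches via kmem_free_up_q()
	 * once the caller has dropped its locks and re-enabled interrupts.
	 * The queued object's own memory is reused for the link, so no
	 * allocation is needed (assumes objects hold at least two pointers).
	 */
	struct kmem_free_q_node {
		struct kmem_free_q_node *next;
		struct kmem_cache	*cachep;
	};

	struct kmem_free_q_head {
		struct kmem_free_q_node *first;
	};

	#define DEFINE_KMEM_FREE_Q(name)	struct kmem_free_q_head name = { NULL }

	static inline void kmem_free_q_add(struct kmem_free_q_head *q,
					   struct kmem_cache *cachep, void *obj)
	{
		struct kmem_free_q_node *node = obj;

		node->cachep = cachep;
		node->next = q->first;
		q->first = node;
	}

	static inline void kmem_free_up_q(struct kmem_free_q_head *q)
	{
		struct kmem_free_q_node *node, *next;

		for (node = q->first; node; node = next) {
			next = node->next;
			kmem_cache_free(node->cachep, node);
		}
		q->first = NULL;
	}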

Comments

Matthew Wilcox March 22, 2019, 1:52 a.m. UTC | #1
On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
> It was found that if a process had many pending signals (e.g. millions),
> the act of exiting that process might cause its parent to have a hard
> lockup especially on a debug kernel with features like KASAN enabled.
> It was because flush_sigqueue() was called in release_task() with
> tasklist_lock held and irq disabled.

This rather apocalyptic language is a bit uncalled for.  I appreciate the
warning is scary, but all that's really happening is that the debug code
is taking too long to execute.

> To avoid this dire condition and reduce lock hold time of tasklist_lock,
> flush_sigqueue() is modified to pass in a freeing queue pointer so that
> the actual freeing of memory objects can be deferred until after the
> tasklist_lock is released and irq re-enabled.

I think this is a really bad solution.  It looks kind of generic,
but isn't.  It's terribly inefficient, and all it's really doing is
deferring the debugging code until we've re-enabled interrupts.

We'd be much better off just having a list_head in the caller
and list_splice() the queue->list onto that caller.  Then call
__sigqueue_free() for each signal on the queue.
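
A minimal sketch of this suggestion (illustrative only; the helper names are
made up here, and Oleg's reply below explains why splicing alone is not
sufficient because of the race with sigqueue_free()):

	/* With ->siglock (and tasklist_lock) still held, just detach the list: */
	static void splice_sigqueue(struct sigpending *queue,
				    struct list_head *dispose)
	{
		sigemptyset(&queue->signal);
		list_splice_init(&queue->list, dispose);
	}

	/* After the locks are dropped and irqs are re-enabled: */
	static void free_spliced_sigqueue(struct list_head *dispose)
	{
		struct sigqueue *q, *t;

		list_for_each_entry_safe(q, t, dispose, list) {
			list_del_init(&q->list);
			__sigqueue_free(q);
		}
	}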
Oleg Nesterov March 22, 2019, 11:16 a.m. UTC | #2
On 03/21, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>
> > To avoid this dire condition and reduce lock hold time of tasklist_lock,
> > flush_sigqueue() is modified to pass in a freeing queue pointer so that
> > the actual freeing of memory objects can be deferred until after the
> > tasklist_lock is released and irq re-enabled.
>
> I think this is a really bad solution.  It looks kind of generic,
> but isn't.  It's terribly inefficient, and all it's really doing is
> deferring the debugging code until we've re-enabled interrupts.

Agreed.

> We'd be much better off just having a list_head in the caller
> and list_splice() the queue->list onto that caller.  Then call
> __sigqueue_free() for each signal on the queue.

This won't work; note the comment which explains the race with sigqueue_free().

Let me think about it... at least we can do something like

	close_the_race_with_sigqueue_free(struct sigpending *queue)
	{
		struct sigqueue *q, *t;

		list_for_each_entry_safe(q, t, ...) {
			if (q->flags & SIGQUEUE_PREALLOC)
				list_del_init(&q->list);
		}
	}

called with ->siglock held, tasklist_lock is not needed.

After that flush_sigqueue() can be called lockless in release_task().

I'll try to make the patch tomorrow.

Oleg.
Waiman Long March 22, 2019, 4:10 p.m. UTC | #3
On 03/22/2019 07:16 AM, Oleg Nesterov wrote:
> On 03/21, Matthew Wilcox wrote:
>> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>>
>>> To avoid this dire condition and reduce lock hold time of tasklist_lock,
>>> flush_sigqueue() is modified to pass in a freeing queue pointer so that
>>> the actual freeing of memory objects can be deferred until after the
>>> tasklist_lock is released and irq re-enabled.
>> I think this is a really bad solution.  It looks kind of generic,
>> but isn't.  It's terribly inefficient, and all it's really doing is
>> deferring the debugging code until we've re-enabled interrupts.
> Agreed.

Thanks for looking into that. As I am not knowledgeable enough about the
signal handling code path, I chose the lowest-risk approach of not
changing the code flow and instead deferring the memory deallocation
until after the tasklist_lock is released.

>> We'd be much better off just having a list_head in the caller
>> and list_splice() the queue->list onto that caller.  Then call
>> __sigqueue_free() for each signal on the queue.
> This won't work; note the comment which explains the race with sigqueue_free().
>
> Let me think about it... at least we can do something like
>
> 	close_the_race_with_sigqueue_free(struct sigpending *queue)
> 	{
> 		struct sigqueue *q, *t;
>
> 		list_for_each_entry_safe(q, t, ...) {
> 			if (q->flags & SIGQUEUE_PREALLOC)
> 				list_del_init(&q->list);
> 		}
> 	}
>
> called with ->siglock held, tasklist_lock is not needed.
>
> After that flush_sigqueue() can be called lockless in release_task().
>
> I'll try to make the patch tomorrow.
>
> Oleg.
>
I am looking forward to it.

Thanks,
Longman
Christoph Lameter (Ampere) March 22, 2019, 5:50 p.m. UTC | #4
On Fri, 22 Mar 2019, Waiman Long wrote:

> I am looking forward to it.

There is also already rcu being used in these paths. Would kfree_rcu() not
be enough? It is an established mechanism that is mature and well
understood.
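
For reference, the kfree_rcu() pattern being suggested would look roughly like
this; the rcu_head member is hypothetical (struct sigqueue has none today), and
whether the cache-allocated object may be freed with kfree() at all is debated
below:

	struct sigqueue_example {		/* stand-in for struct sigqueue */
		struct list_head list;
		/* ... remaining fields ... */
		struct rcu_head rcu;		/* hypothetical new member */
	};

	static void drop_example(struct sigqueue_example *q)
	{
		/* queues an RCU callback that kfree()s q after a grace period */
		kfree_rcu(q, rcu);
	}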
Waiman Long March 22, 2019, 6:12 p.m. UTC | #5
On 03/22/2019 01:50 PM, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
>> I am looking forward to it.
> There is also already rcu being used in these paths. Would kfree_rcu() not
> be enough? It is an established mechanism that is mature and well
> understood.
>
In this case, the memory objects are from kmem caches, so they can't
be freed using kfree_rcu().

There is certainly overhead in using kfree_rcu(), or a
kfree_rcu()-like mechanism. Also, I think the actual freeing is done in
SoftIRQ context, which can be a problem if there are too many memory
objects to free.

I think what Oleg is trying to do is probably the most efficient way.

Cheers,
Longman
Christoph Lameter (Ampere) March 22, 2019, 7:39 p.m. UTC | #6
On Fri, 22 Mar 2019, Waiman Long wrote:

> >
> >> I am looking forward to it.
> > There is also already rcu being used in these paths. Would kfree_rcu() not
> > be enough? It is an established mechanism that is mature and well
> > understood.
> >
> In this case, the memory objects are from kmem caches, so they can't
> be freed using kfree_rcu().

Oh they can. kfree() can free memory from any slab cache.

> There is certainly overhead in using kfree_rcu(), or a
> kfree_rcu()-like mechanism. Also, I think the actual freeing is done in
> SoftIRQ context, which can be a problem if there are too many memory
> objects to free.

No, there is none that I am aware of. And this has survived testing of HPC
loads with gazillions of objects that have to be freed from multiple
processors. We really should not rearchitect this stuff... It took us
quite a long time to have this scale well under all loads.

Please use kfree_rcu().
Matthew Wilcox March 22, 2019, 7:59 p.m. UTC | #7
On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
> 
> > >
> > >> I am looking forward to it.
> > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > be enough? It is an established mechanism that is mature and well
> > > understood.
> > >
> > In this case, the memory objects are from kmem caches, so they can't
> > be freed using kfree_rcu().
> 
> Oh they can. kfree() can free memory from any slab cache.

Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
slab cache; it has no way to look up the slab cache from the object.
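
In other words (an illustrative sketch, not code from the thread):

	static void api_contrast(struct kmem_cache *cachep)
	{
		void *obj = kmem_cache_alloc(cachep, GFP_KERNEL);

		if (!obj)
			return;

		/* Always correct, on SLAB, SLUB and SLOB alike. */
		kmem_cache_free(cachep, obj);

		/*
		 * kfree(obj) could have been used instead on SLAB and SLUB,
		 * which can map the object back to its cache, but not on
		 * SLOB, which cannot.
		 */
	}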
Christoph Lameter (Ampere) March 25, 2019, 2:15 p.m. UTC | #8
On Fri, 22 Mar 2019, Matthew Wilcox wrote:

> On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > On Fri, 22 Mar 2019, Waiman Long wrote:
> >
> > > >
> > > >> I am looking forward to it.
> > > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > > be enough? It is an established mechanism that is mature and well
> > > > understood.
> > > >
> > > In this case, the memory objects are from kmem caches, so they can't
> > > be freed using kfree_rcu().
> >
> > Oh they can. kfree() can free memory from any slab cache.
>
> Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> slab cache; it has no way to look up the slab cache from the object.

Well then we could either fix SLOB to conform to the others or add a
kmem_cache_free_rcu() variant.
Matthew Wilcox March 25, 2019, 3:26 p.m. UTC | #9
On Mon, Mar 25, 2019 at 02:15:25PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
> 
> > On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > > On Fri, 22 Mar 2019, Waiman Long wrote:
> > >
> > > > >
> > > > >> I am looking forward to it.
> > > > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > > > be enough? It is an established mechanism that is mature and well
> > > > > understood.
> > > > >
> > > > In this case, the memory objects are from kmem caches, so they can't
> > > > be freed using kfree_rcu().
> > >
> > > Oh they can. kfree() can free memory from any slab cache.
> >
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
> 
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

The problem with a kmem_cache_free_rcu() variant is that we now have
three pointers to store -- the object pointer, the slab pointer and the
rcu next pointer.

I spent some time looking at how SLOB might be fixed, and I didn't come
up with a good idea.  Currently SLOB stores the size of the object in the
four bytes before the object, unless the object is "allocated from a slab
cache", in which case the size is taken from the cache pointer instead.
So calling kfree() on a pointer allocated using kmem_cache_alloc() will
cause corruption as it attempts to determine the length of the object.

Options:

1. Dispense with this optimisation and always store the size of the
object before the object.

2. Add a kmem_cache flag that says whether objects in this cache may be
freed with kfree().  Only dispense with this optimisation for slabs
with this flag set.

3. Change SLOB to segregate objects by size.  If someone has gone to
the trouble of creating a kmem_cache, this is a pretty good hint that
there will be a lot of objects of this size allocated, so this could
help SLOB fight fragmentation.

Any other bright ideas?
Christoph Lameter (Ampere) March 25, 2019, 4:16 p.m. UTC | #10
On Mon, 25 Mar 2019, Matthew Wilcox wrote:

> Options:
>
> 1. Dispense with this optimisation and always store the size of the
> object before the object.

I think that's how SLOB handled it at some point in the past. Let's go back
to that setup so it's compatible with the other allocators?
Oleg Nesterov March 26, 2019, 1:29 p.m. UTC | #11
Sorry, I am sick and can't work; hopefully I'll return tomorrow.

On 03/22, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
> > I am looking forward to it.
>
> There is also already rcu being used in these paths. Would kfree_rcu() not
> be enough? It is an established mechanism that is mature and well
> understood.

But why do we want to increase the number of rcu callbacks in flight?

For the moment, let's discuss the exiting tasks only. The only reason why
flush_sigqueue(&tsk->pending) needs spin_lock_irq() is the race with
release_posix_timer()->sigqueue_free() from another thread, which can remove
a SIGQUEUE_PREALLOC'ed sigqueue from the list. With the simple patch below,
flush_sigqueue() can be called lockless with irqs enabled.

However, this change is not enough, we need to do something similar with
do_sigaction()->flush_sigqueue_mask(), and this is less simple.

So I won't really argue with kfree_rcu(), but I am not sure this is the best
option.

Oleg.


--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,6 +85,17 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 	list_del_rcu(&p->thread_node);
 }
 
+// Rename me and move into signal.c
+void remove_prealloced(struct sigpending *queue)
+{
+	struct sigqueue *q, *t;
+
+	list_for_each_entry_safe(q, t, &queue->list, list) {
+		if (q->flags & SIGQUEUE_PREALLOC)
+			list_del_init(&q->list);
+	}
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -160,16 +171,15 @@ static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	if (!group_dead)
+		remove_prealloced(&tsk->pending);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
-	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+	if (group_dead)
 		tty_kref_put(tty);
-	}
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -221,6 +231,11 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	cgroup_release(p);
 	release_thread(p);
+
+	flush_sigqueue(&p->pending);
+	if (thread_group_leader(p))
+		flush_sigqueue(&p->signal->shared_pending);
+
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
 	p = leader;
Oleg Nesterov March 26, 2019, 1:36 p.m. UTC | #12
On 03/25, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
>
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

Speaking of struct sigqueue, we can simply use call_rcu(), but see my previous
email; I am not sure this is the best option.

Oleg.
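
The call_rcu() route would look roughly like this, again assuming a
hypothetical rcu_head member in struct sigqueue (not part of the posted patch):

	/* RCU callback: return the object to its cache after a grace period. */
	static void sigqueue_rcu_free(struct rcu_head *rhp)
	{
		struct sigqueue *q = container_of(rhp, struct sigqueue, rcu);

		kmem_cache_free(sigqueue_cachep, q);
	}

	static void defer_sigqueue_free(struct sigqueue *q)
	{
		call_rcu(&q->rcu, sigqueue_rcu_free);	/* 'rcu' is the assumed member */
	}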

Patch

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 9702016734b1..a9562e502122 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -5,6 +5,7 @@ 
 #include <linux/bug.h>
 #include <linux/signal_types.h>
 #include <linux/string.h>
+#include <linux/slab.h>
 
 struct task_struct;
 
@@ -254,7 +255,8 @@  static inline void init_sigpending(struct sigpending *sig)
 	INIT_LIST_HEAD(&sig->list);
 }
 
-extern void flush_sigqueue(struct sigpending *queue);
+extern void flush_sigqueue(struct sigpending *queue,
+			   struct kmem_free_q_head *head);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2166c2d92ddc..ee707a63edfd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -88,7 +88,8 @@  static void __unhash_process(struct task_struct *p, bool group_dead)
 /*
  * This function expects the tasklist_lock write-locked.
  */
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct task_struct *tsk,
+			  struct kmem_free_q_head *free_q)
 {
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
@@ -160,14 +161,14 @@  static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	flush_sigqueue(&tsk->pending, free_q);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+		flush_sigqueue(&sig->shared_pending, free_q);
 		tty_kref_put(tty);
 	}
 }
@@ -186,6 +187,8 @@  void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
 	int zap_leader;
+	DEFINE_KMEM_FREE_Q(free_q);
+
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
@@ -197,7 +200,7 @@  void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	__exit_signal(p);
+	__exit_signal(p, &free_q);
 
 	/*
 	 * If we are the last non-leader member of the thread
@@ -219,6 +222,7 @@  void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	kmem_free_up_q(&free_q);
 	cgroup_release(p);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..04fb202c16bd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -435,16 +435,19 @@  __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 	return q;
 }
 
-static void __sigqueue_free(struct sigqueue *q)
+static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	atomic_dec(&q->user->sigpending);
 	free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	if (free_q)
+		kmem_free_q_add(free_q, sigqueue_cachep, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
 }
 
-void flush_sigqueue(struct sigpending *queue)
+void flush_sigqueue(struct sigpending *queue, struct kmem_free_q_head *free_q)
 {
 	struct sigqueue *q;
 
@@ -452,7 +455,7 @@  void flush_sigqueue(struct sigpending *queue)
 	while (!list_empty(&queue->list)) {
 		q = list_entry(queue->list.next, struct sigqueue , list);
 		list_del_init(&q->list);
-		__sigqueue_free(q);
+		__sigqueue_free(q, free_q);
 	}
 }
 
@@ -462,12 +465,14 @@  void flush_sigqueue(struct sigpending *queue)
 void flush_signals(struct task_struct *t)
 {
 	unsigned long flags;
+	DEFINE_KMEM_FREE_Q(free_q);
 
 	spin_lock_irqsave(&t->sighand->siglock, flags);
 	clear_tsk_thread_flag(t, TIF_SIGPENDING);
-	flush_sigqueue(&t->pending);
-	flush_sigqueue(&t->signal->shared_pending);
+	flush_sigqueue(&t->pending, &free_q);
+	flush_sigqueue(&t->signal->shared_pending, &free_q);
 	spin_unlock_irqrestore(&t->sighand->siglock, flags);
+	kmem_free_up_q(&free_q);
 }
 EXPORT_SYMBOL(flush_signals);
 
@@ -488,7 +493,7 @@  static void __flush_itimer_signals(struct sigpending *pending)
 		} else {
 			sigdelset(&signal, sig);
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 
@@ -580,7 +585,7 @@  static void collect_signal(int sig, struct sigpending *list, kernel_siginfo_t *i
 			(info->si_code == SI_TIMER) &&
 			(info->si_sys_private);
 
-		__sigqueue_free(first);
+		__sigqueue_free(first, NULL);
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be
@@ -728,7 +733,7 @@  static int dequeue_synchronous_signal(kernel_siginfo_t *info)
 still_pending:
 	list_del_init(&sync->list);
 	copy_siginfo(info, &sync->info);
-	__sigqueue_free(sync);
+	__sigqueue_free(sync, NULL);
 	return info->si_signo;
 }
 
@@ -776,7 +781,7 @@  static void flush_sigqueue_mask(sigset_t *mask, struct sigpending *s)
 	list_for_each_entry_safe(q, n, &s->list, list) {
 		if (sigismember(mask, q->info.si_signo)) {
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 }
@@ -1749,7 +1754,7 @@  void sigqueue_free(struct sigqueue *q)
 	spin_unlock_irqrestore(lock, flags);
 
 	if (q)
-		__sigqueue_free(q);
+		__sigqueue_free(q, NULL);
 }
 
 int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1d0b37af2444..8ca571a0b2ac 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2548,6 +2548,8 @@  static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 	rc = avc_has_perm(&selinux_state,
 			  osid, sid, SECCLASS_PROCESS, PROCESS__SIGINH, NULL);
 	if (rc) {
+		DEFINE_KMEM_FREE_Q(free_q);
+
 		if (IS_ENABLED(CONFIG_POSIX_TIMERS)) {
 			memset(&itimer, 0, sizeof itimer);
 			for (i = 0; i < 3; i++)
@@ -2555,13 +2557,15 @@  static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 		}
 		spin_lock_irq(&current->sighand->siglock);
 		if (!fatal_signal_pending(current)) {
-			flush_sigqueue(&current->pending);
-			flush_sigqueue(&current->signal->shared_pending);
+			flush_sigqueue(&current->pending, &free_q);
+			flush_sigqueue(&current->signal->shared_pending,
+				       &free_q);
 			flush_signal_handlers(current, 1);
 			sigemptyset(&current->blocked);
 			recalc_sigpending();
 		}
 		spin_unlock_irq(&current->sighand->siglock);
+		kmem_free_up_q(&free_q);
 	}
 
 	/* Wake up the parent if it is waiting so that it can recheck