RFC: console: hack up console_lock more v3

Message ID 20190509120903.28939-1-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series RFC: console: hack up console_lock more v3

Commit Message

Daniel Vetter May 9, 2019, 12:09 p.m. UTC
console_trylock, called from within printk, can be called from pretty
much anywhere, including try_to_wake_up. Note that this isn't common;
usually the box is in pretty bad shape at that point already. But it
really doesn't help when lockdep then jumps in and spams the logs,
potentially obscuring the backtrace we're actually interested in.
One case I've seen (slightly simplified backtrace):

 Call Trace:
  <IRQ>
  console_trylock+0xe/0x60
  vprintk_emit+0xf1/0x320
  printk+0x4d/0x69
  __warn_printk+0x46/0x90
  native_smp_send_reschedule+0x2f/0x40
  check_preempt_curr+0x81/0xa0
  ttwu_do_wakeup+0x14/0x220
  try_to_wake_up+0x218/0x5f0
  pollwake+0x6f/0x90
  credit_entropy_bits+0x204/0x310
  add_interrupt_randomness+0x18f/0x210
  handle_irq+0x67/0x160
  do_IRQ+0x5e/0x130
  common_interrupt+0xf/0xf
  </IRQ>

This alone isn't a problem, but the spinlock in the semaphore is also
still held while waking up waiters (up() -> __up() -> try_to_wake_up()
callchain), which then closes the runqueue vs. semaphore.lock loop,
and upsets lockdep, which issues a circular locking splat to dmesg.
Worse, it upsets developers, since we don't want to spam dmesg with
clutter when the machine is dying already.
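
The two lock chains that close the loop, roughly sketched (an
illustration, not verbatim lockdep output):

  runqueue lock  --> printk() --> semaphore.lock               (trace above)
  semaphore.lock --> __up() --> try_to_wake_up() --> runqueue lock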

Fix this by creating a printk_safe_up() which calls wake_up_process()
outside of the spinlock. This isn't correct in full generality, but
good enough for console_lock:

- console_lock doesn't use interruptible or killable or timeout down()
  calls, hence an up() is the only thing that can wake up a process.
  The process therefore can't get woken, killed, and reaped while we're
  trying to wake it up.

- semaphore.c always updates the waiter list while under the spinlock,
  so there are no other races. Specifically, another process that races
  with a quick console_lock/unlock while we've already dropped the
  spinlock won't see our own waiter.

Note that we only have to break the recursion for the semaphore.lock
spinlock of the console_lock. Recursion within various scheduler
related locks is already prevented by the printk_safe_enter/exit pair
in __up_console_sem().

Also cc'ing John Ogness since perhaps his printk rework fixes this all
properly.

v2: Ditch attempt to fix console_trylock.

v3: Add a comment explaining why the task we're waking won't
disappear (Chris), and improve the commit message to address review
questions.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 include/linux/semaphore.h  |  1 +
 kernel/locking/semaphore.c | 31 +++++++++++++++++++++++++++++++
 kernel/printk/printk.c     |  2 +-
 3 files changed, 33 insertions(+), 1 deletion(-)

Comments

Chris Wilson May 9, 2019, 12:21 p.m. UTC | #1
Quoting Daniel Vetter (2019-05-09 13:09:03)
> console_trylock, called from within printk, can be called from pretty
> much anywhere, including try_to_wake_up. Note that this isn't common;
> usually the box is in pretty bad shape at that point already. But it
> really doesn't help when lockdep then jumps in and spams the logs,
> potentially obscuring the backtrace we're actually interested in.
> One case I've seen (slightly simplified backtrace):
> 
>  Call Trace:
>   <IRQ>
>   console_trylock+0xe/0x60
>   vprintk_emit+0xf1/0x320
>   printk+0x4d/0x69
>   __warn_printk+0x46/0x90
>   native_smp_send_reschedule+0x2f/0x40
>   check_preempt_curr+0x81/0xa0
>   ttwu_do_wakeup+0x14/0x220
>   try_to_wake_up+0x218/0x5f0
>   pollwake+0x6f/0x90
>   credit_entropy_bits+0x204/0x310
>   add_interrupt_randomness+0x18f/0x210
>   handle_irq+0x67/0x160
>   do_IRQ+0x5e/0x130
>   common_interrupt+0xf/0xf
>   </IRQ>
> 
> This alone isn't a problem, but the spinlock in the semaphore is also
> still held while waking up waiters (up() -> __up() -> try_to_wake_up()
> callchain), which then closes the runqueue vs. semaphore.lock loop,
> and upsets lockdep, which issues a circular locking splat to dmesg.
> Worse, it upsets developers, since we don't want to spam dmesg with
> clutter when the machine is dying already.
> 
> Fix this by creating a printk_safe_up() which calls wake_up_process()
> outside of the spinlock. This isn't correct in full generality, but
> good enough for console_lock:
> 
> - console_lock doesn't use interruptible or killable or timeout down()
>   calls, hence an up() is the only thing that can wake up a process.
>   The process therefore can't get woken, killed, and reaped while we're
>   trying to wake it up.
> 
> - semaphore.c always updates the waiter list while under the spinlock,
>   so there are no other races. Specifically, another process that races
>   with a quick console_lock/unlock while we've already dropped the
>   spinlock won't see our own waiter.
> 
> Note that we only have to break the recursion for the semaphore.lock
> spinlock of the console_lock. Recursion within various scheduler
> related locks is already prevented by the printk_safe_enter/exit pair
> in __up_console_sem().
> 
> Also cc'ing John Ogness since perhaps his printk rework fixes this all
> properly.
> 
> v2: Ditch attempt to fix console_trylock.
> 
> v3: Add a comment explaining why the task we're waking won't
> disappear (Chris), and improve the commit message to address review
> questions.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: John Ogness <john.ogness@linutronix.de>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I'm a bit nervous that this is only safe under precisely controlled
conditions, but then again the fact that it is called printk_safe
should deter any other users.

The logic checks out, and you convinced me that the dereference is
protected, so
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Peter Zijlstra May 9, 2019, 12:31 p.m. UTC | #2
On Thu, May 09, 2019 at 02:09:03PM +0200, Daniel Vetter wrote:
> Fix this by creating a printk_safe_up() which calls wake_up_process()
> outside of the spinlock. This isn't correct in full generality, but
> good enough for console_lock:
> 
> - console_lock doesn't use interruptible or killable or timeout down()
>   calls, hence an up() is the only thing that can wake up a process.

Wrong :/ Any task can be woken at any random time. We must, at all
times, assume spurious wakeups will happen.

> +void printk_safe_up(struct semaphore *sem)
> +{
> +	unsigned long flags;
> +	struct semaphore_waiter *waiter = NULL;
> +
> +	raw_spin_lock_irqsave(&sem->lock, flags);
> +	if (likely(list_empty(&sem->wait_list))) {
> +		sem->count++;
> +	} else {
> +		waiter = list_first_entry(&sem->wait_list,
> +					  struct semaphore_waiter, list);
> +		list_del(&waiter->list);
> +		waiter->up = true;
> +	}
> +	raw_spin_unlock_irqrestore(&sem->lock, flags);
> +
> +	if (waiter) /* protected by being sole wake source */
> +		wake_up_process(waiter->task);
> +}
> +EXPORT_SYMBOL(printk_safe_up);

Since it's only used from printk, that EXPORT really isn't needed.

Something like the below might work.

---
 kernel/locking/semaphore.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 561acdd39960..ac0a67e95aac 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -38,7 +38,6 @@ static noinline void __down(struct semaphore *sem);
 static noinline int __down_interruptible(struct semaphore *sem);
 static noinline int __down_killable(struct semaphore *sem);
 static noinline int __down_timeout(struct semaphore *sem, long timeout);
-static noinline void __up(struct semaphore *sem);
 
 /**
  * down - acquire the semaphore
@@ -178,14 +177,24 @@ EXPORT_SYMBOL(down_timeout);
  */
 void up(struct semaphore *sem)
 {
+	struct semaphore_waiter *waiter;
+	DEFINE_WAKE_Q(wake_q);
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
+	if (likely(list_empty(&sem->wait_list))) {
 		sem->count++;
-	else
-		__up(sem);
+		goto unlock;
+	}
+
+	waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list);
+	list_del(&waiter->list);
+	waiter->up = true;
+	wake_q_add(&wake_q, waiter->task);
+unlock:
 	raw_spin_unlock_irqrestore(&sem->lock, flags);
+
+	wake_up_q(&wake_q);
 }
 EXPORT_SYMBOL(up);
 
@@ -252,12 +261,3 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
 {
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
 }
-
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	list_del(&waiter->list);
-	waiter->up = true;
-	wake_up_process(waiter->task);
-}
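
A note on why the deferred wakeup is lifetime-safe: wake_q_add() takes
a reference on the task (get_task_struct()), and wake_up_q() drops it
after the wakeup, so the task cannot be reaped in the window after the
unlock. A rough sketch of the pattern, not the verbatim
kernel/sched/core.c code:

	DEFINE_WAKE_Q(wake_q);

	raw_spin_lock_irqsave(&lock, flags);
	/* wake_q_add() grabs a task reference, so the task cannot
	 * go away between here and wake_up_q() below. */
	wake_q_add(&wake_q, task);
	raw_spin_unlock_irqrestore(&lock, flags);

	wake_up_q(&wake_q);	/* wakes the task, then drops the ref */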
Daniel Vetter May 9, 2019, 1:06 p.m. UTC | #3
On Thu, May 9, 2019 at 2:31 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 09, 2019 at 02:09:03PM +0200, Daniel Vetter wrote:
> > Fix this by creating a printk_safe_up() which calls wake_up_process()
> > outside of the spinlock. This isn't correct in full generality, but
> > good enough for console_lock:
> >
> > - console_lock doesn't use interruptible or killable or timeout down()
> >   calls, hence an up() is the only thing that can wake up a process.
>
> Wrong :/ Any task can be woken at any random time. We must, at all
> times, assume spurious wakeups will happen.

Out of curiosity, where do these come from? I know about the races
where you need to recheck on the waiter side to avoid getting stuck,
but didn't know about this. Are these earlier (possibly spurious)
wakeups that got held up and delayed for a while, then hit the task
much later when it's already continued doing something else? Or even
more random: even if I never put a task on a wait list or anything
else, ever, can it get woken spuriously?

> > +void printk_safe_up(struct semaphore *sem)
> > +{
> > +     unsigned long flags;
> > +     struct semaphore_waiter *waiter = NULL;
> > +
> > +     raw_spin_lock_irqsave(&sem->lock, flags);
> > +     if (likely(list_empty(&sem->wait_list))) {
> > +             sem->count++;
> > +     } else {
> > +             waiter = list_first_entry(&sem->wait_list,
> > +                                       struct semaphore_waiter, list);
> > +             list_del(&waiter->list);
> > +             waiter->up = true;
> > +     }
> > +     raw_spin_unlock_irqrestore(&sem->lock, flags);
> > +
> > +     if (waiter) /* protected by being sole wake source */
> > +             wake_up_process(waiter->task);
> > +}
> > +EXPORT_SYMBOL(printk_safe_up);
>
> Since it's only used from printk, that EXPORT really isn't needed.
>
> Something like the below might work.

Yeah that looks like the proper fix. I guess semaphores are uncritical
enough that we can roll this out for everyone. Thanks for the hint.
-Daniel

>
> ---
>  kernel/locking/semaphore.c | 26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
> index 561acdd39960..ac0a67e95aac 100644
> --- a/kernel/locking/semaphore.c
> +++ b/kernel/locking/semaphore.c
> @@ -38,7 +38,6 @@ static noinline void __down(struct semaphore *sem);
>  static noinline int __down_interruptible(struct semaphore *sem);
>  static noinline int __down_killable(struct semaphore *sem);
>  static noinline int __down_timeout(struct semaphore *sem, long timeout);
> -static noinline void __up(struct semaphore *sem);
>
>  /**
>   * down - acquire the semaphore
> @@ -178,14 +177,24 @@ EXPORT_SYMBOL(down_timeout);
>   */
>  void up(struct semaphore *sem)
>  {
> +       struct semaphore_waiter *waiter;
> +       DEFINE_WAKE_Q(wake_q);
>         unsigned long flags;
>
>         raw_spin_lock_irqsave(&sem->lock, flags);
> -       if (likely(list_empty(&sem->wait_list)))
> +       if (likely(list_empty(&sem->wait_list))) {
>                 sem->count++;
> -       else
> -               __up(sem);
> +               goto unlock;
> +       }
> +
> +       waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list);
> +       list_del(&waiter->list);
> +       waiter->up = true;
> +       wake_q_add(&wake_q, waiter->task);
> +unlock:
>         raw_spin_unlock_irqrestore(&sem->lock, flags);
> +
> +       wake_up_q(&wake_q);
>  }
>  EXPORT_SYMBOL(up);
>
> @@ -252,12 +261,3 @@ static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
>  {
>         return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
>  }
> -
> -static noinline void __sched __up(struct semaphore *sem)
> -{
> -       struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
> -                                               struct semaphore_waiter, list);
> -       list_del(&waiter->list);
> -       waiter->up = true;
> -       wake_up_process(waiter->task);
> -}
Peter Zijlstra May 9, 2019, 1:36 p.m. UTC | #4
On Thu, May 09, 2019 at 03:06:09PM +0200, Daniel Vetter wrote:
> On Thu, May 9, 2019 at 2:31 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, May 09, 2019 at 02:09:03PM +0200, Daniel Vetter wrote:
> > > Fix this by creating a prinkt_safe_up() which calls wake_up_process
> > > outside of the spinlock. This isn't correct in full generality, but
> > > good enough for console_lock:
> > >
> > > - console_lock doesn't use interruptible or killable or timeout down()
> > >   calls, hence an up() is the only thing that can wake up a process.
> >
> > Wrong :/ Any task can be woken at any random time. We must, at all
> > times, assume spurious wakeups will happen.
> 
> Out of curiosity, where do these come from? I know about the races
> where you need to recheck on the waiter side to avoid getting stuck,
> but didn't know about this. Are these earlier (possibly spurious)
> wakeups that got held up and delayed for a while, then hit the task
> much later when it's already continued doing something else?

Yes, this. So they all more or less have the form:

CPU0		CPU1

		enqueue_waiter()
done = true;
if (waiters)
		for (;;) {
		  if (done)
		    break;

		  ...
		}

		dequeue_waiter()

		do something else again

  wake_up_task
		<gets wakeup>


The wake_q thing made the above much more common, but we've had it
forever.
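
Which is why the waiter side must always recheck its condition, along
these lines (a sketch modeled on __down_common(), not the verbatim
code):

	/* Tolerate spurious wakeups: the wakeup itself proves nothing,
	 * only the flag set by the waker under sem->lock does. */
	for (;;) {
		__set_current_state(TASK_UNINTERRUPTIBLE);
		if (waiter.up)			/* set by the waker */
			break;
		raw_spin_unlock_irq(&sem->lock);
		schedule();			/* may return spuriously */
		raw_spin_lock_irq(&sem->lock);
	}
	__set_current_state(TASK_RUNNING);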

> Or even
> more random: even if I never put a task on a wait list or anything
> else, ever, can it get woken spuriously?

I had patches that did that on purpose, but no.

> > Something like the below might work.
> 
> Yeah that looks like the proper fix. I guess semaphores are uncritical
> enough that we can roll this out for everyone. Thanks for the hint.

It's actually an optimization that we never did because semaphores are
so uncritical :-)

The thing is, by delaying the wakeup until after we've released the
spinlock, the waiter will not contend on the spinlock the moment it
wakes.
Petr Mladek May 9, 2019, 2:56 p.m. UTC | #5
On Thu 2019-05-09 14:09:03, Daniel Vetter wrote:
> console_trylock, called from within printk, can be called from pretty
> much anywhere, including try_to_wake_up. Note that this isn't common;
> usually the box is in pretty bad shape at that point already. But it
> really doesn't help when lockdep then jumps in and spams the logs,
> potentially obscuring the backtrace we're actually interested in.
> One case I've seen (slightly simplified backtrace):
> 
>  Call Trace:
>   <IRQ>
>   console_trylock+0xe/0x60
>   vprintk_emit+0xf1/0x320
>   printk+0x4d/0x69
>   __warn_printk+0x46/0x90
>   native_smp_send_reschedule+0x2f/0x40
>   check_preempt_curr+0x81/0xa0
>   ttwu_do_wakeup+0x14/0x220
>   try_to_wake_up+0x218/0x5f0
>   pollwake+0x6f/0x90
>   credit_entropy_bits+0x204/0x310
>   add_interrupt_randomness+0x18f/0x210
>   handle_irq+0x67/0x160
>   do_IRQ+0x5e/0x130
>   common_interrupt+0xf/0xf
>   </IRQ>
> 
> This alone isn't a problem, but the spinlock in the semaphore is also
> still held while waking up waiters (up() -> __up() -> try_to_wake_up()
> callchain), which then closes the runqueue vs. semaphore.lock loop,
> and upsets lockdep, which issues a circular locking splat to dmesg.
> Worse, it upsets developers, since we don't want to spam dmesg with
> clutter when the machine is dying already.
> 
> Fix this by creating a printk_safe_up() which calls wake_up_process()
> outside of the spinlock. This isn't correct in full generality, but
> good enough for console_lock:
> 
> - console_lock doesn't use interruptible or killable or timeout down()
>   calls, hence an up() is the only thing that can wake up a process.
>   The process therefore can't get woken, killed, and reaped while we're
>   trying to wake it up.
> 
> - semaphore.c always updates the waiter list while under the spinlock,
>   so there are no other races. Specifically, another process that races
>   with a quick console_lock/unlock while we've already dropped the
>   spinlock won't see our own waiter.
> 
> Note that we only have to break the recursion for the semaphore.lock
> spinlock of the console_lock. Recursion within various scheduler
> related locks is already prevented by the printk_safe_enter/exit pair
> in __up_console_sem().

This is not fully true. printk_safe() helps only when
the first try_to_wake_up() is called from printk_safe() context.

> --- a/kernel/locking/semaphore.c
> +++ b/kernel/locking/semaphore.c
> @@ -197,6 +197,37 @@ struct semaphore_waiter {
>  	bool up;
>  };
>  
> +/**
> + * printk_safe_up - release the semaphore in console_unlock
> + * @sem: the semaphore to release
> + *
> + * Release the semaphore.  Unlike mutexes, up() may be called from any
> + * context and even by tasks which have never called down().
> + *
> + * NOTE: This is a special version of up() for console_unlock only. It is only
> + * safe if there are no killable, interruptible or timing out down() calls.
> + */
> +void printk_safe_up(struct semaphore *sem)
> +{
> +	unsigned long flags;
> +	struct semaphore_waiter *waiter = NULL;
> +
> +	raw_spin_lock_irqsave(&sem->lock, flags);
> +	if (likely(list_empty(&sem->wait_list))) {
> +		sem->count++;
> +	} else {
> +		waiter = list_first_entry(&sem->wait_list,
> +					  struct semaphore_waiter, list);
> +		list_del(&waiter->list);
> +		waiter->up = true;
> +	}
> +	raw_spin_unlock_irqrestore(&sem->lock, flags);
> +
> +	if (waiter) /* protected by being sole wake source */
> +		wake_up_process(waiter->task);

I still do not see how this could help. Let's take the above
backtrace as an example:

   <IRQ>
   console_trylock+0xe/0x60
   vprintk_emit+0xf1/0x320
   printk+0x4d/0x69
   __warn_printk+0x46/0x90
   native_smp_send_reschedule+0x2f/0x40
   check_preempt_curr+0x81/0xa0
   ttwu_do_wakeup+0x14/0x220
   try_to_wake_up+0x218/0x5f0
   pollwake+0x6f/0x90
   credit_entropy_bits+0x204/0x310
   add_interrupt_randomness+0x18f/0x210
   handle_irq+0x67/0x160
   do_IRQ+0x5e/0x130
   common_interrupt+0xf/0xf
   </IRQ>

We have the following chain of calls:

  + do_IRQ()
    ...
      + try_to_wake_up()    # takes p->pi_lock
        + ttwu_remote()     # takes rq lock
          + ttwu_do_wakeup()
            + check_preempt_curr()
              + native_smp_send_reschedule()
                + __warn_printk()
                  + printk()
                    + vprintk_emit()
                      + console_trylock() # success
                      + console_unlock()
                        + up_console_sem()
                          + up() # wait list is not empty
                            + __up()
                              + wake_up_process()
                                + try_to_wake_up()

!BANG! Deadlock on p->pi_lock.

It does not matter if the nested try_to_wake_up() was called
under sem->lock or outside.

In other words, the patch removed one lockdep warning. But it just
delayed the deadlock. It will not happen on sem->lock but
later on p->pi_lock.

I am repeating myself. But IMHO, the only solution is to introduce
printk deferred context and use it in WARN_DEFERRED().
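
(There is no WARN_DEFERRED() in the tree at this point; a hypothetical
sketch of the idea, ignoring that a real WARN() also dumps a backtrace,
which plain printk_deferred() cannot provide:)

	/*
	 * Hypothetical: printk_deferred() only stores the message in
	 * the logbuf and kicks console output via irq_work, so
	 * console_sem is never taken from the calling context.
	 */
	#define WARN_DEFERRED(condition, fmt...)		\
	({							\
		int __ret_warn_on = !!(condition);		\
		if (unlikely(__ret_warn_on))			\
			printk_deferred(KERN_WARNING fmt);	\
		__ret_warn_on;					\
	})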

Best Regards,
Petr
Daniel Vetter May 9, 2019, 4:43 p.m. UTC | #6
On Thu, May 9, 2019 at 4:56 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Thu 2019-05-09 14:09:03, Daniel Vetter wrote:
> > console_trylock, called from within printk, can be called from pretty
> > much anywhere, including try_to_wake_up. Note that this isn't common;
> > usually the box is in pretty bad shape at that point already. But it
> > really doesn't help when lockdep then jumps in and spams the logs,
> > potentially obscuring the backtrace we're actually interested in.
> > One case I've seen (slightly simplified backtrace):
> >
> >  Call Trace:
> >   <IRQ>
> >   console_trylock+0xe/0x60
> >   vprintk_emit+0xf1/0x320
> >   printk+0x4d/0x69
> >   __warn_printk+0x46/0x90
> >   native_smp_send_reschedule+0x2f/0x40
> >   check_preempt_curr+0x81/0xa0
> >   ttwu_do_wakeup+0x14/0x220
> >   try_to_wake_up+0x218/0x5f0
> >   pollwake+0x6f/0x90
> >   credit_entropy_bits+0x204/0x310
> >   add_interrupt_randomness+0x18f/0x210
> >   handle_irq+0x67/0x160
> >   do_IRQ+0x5e/0x130
> >   common_interrupt+0xf/0xf
> >   </IRQ>
> >
> > This alone isn't a problem, but the spinlock in the semaphore is also
> > still held while waking up waiters (up() -> __up() -> try_to_wake_up()
> > callchain), which then closes the runqueue vs. semaphore.lock loop,
> > and upsets lockdep, which issues a circular locking splat to dmesg.
> > Worse, it upsets developers, since we don't want to spam dmesg with
> > clutter when the machine is dying already.
> >
> > Fix this by creating a printk_safe_up() which calls wake_up_process()
> > outside of the spinlock. This isn't correct in full generality, but
> > good enough for console_lock:
> >
> > - console_lock doesn't use interruptible or killable or timeout down()
> >   calls, hence an up() is the only thing that can wake up a process.
> >   The process therefore can't get woken, killed, and reaped while we're
> >   trying to wake it up.
> >
> > - semaphore.c always updates the waiter list while under the spinlock,
> >   so there are no other races. Specifically, another process that races
> >   with a quick console_lock/unlock while we've already dropped the
> >   spinlock won't see our own waiter.
> >
> > Note that we only have to break the recursion for the semaphore.lock
> > spinlock of the console_lock. Recursion within various scheduler
> > related locks is already prevented by the printk_safe_enter/exit pair
> > in __up_console_sem().
>
> This is not fully true. printk_safe() helps only when
> the first try_to_wake_up() is called from printk_safe() context.
>
> > --- a/kernel/locking/semaphore.c
> > +++ b/kernel/locking/semaphore.c
> > @@ -197,6 +197,37 @@ struct semaphore_waiter {
> >       bool up;
> >  };
> >
> > +/**
> > + * printk_safe_up - release the semaphore in console_unlock
> > + * @sem: the semaphore to release
> > + *
> > + * Release the semaphore.  Unlike mutexes, up() may be called from any
> > + * context and even by tasks which have never called down().
> > + *
> > + * NOTE: This is a special version of up() for console_unlock only. It is only
> > + * safe if there are no killable, interruptible or timing out down() calls.
> > + */
> > +void printk_safe_up(struct semaphore *sem)
> > +{
> > +     unsigned long flags;
> > +     struct semaphore_waiter *waiter = NULL;
> > +
> > +     raw_spin_lock_irqsave(&sem->lock, flags);
> > +     if (likely(list_empty(&sem->wait_list))) {
> > +             sem->count++;
> > +     } else {
> > +             waiter = list_first_entry(&sem->wait_list,
> > +                                       struct semaphore_waiter, list);
> > +             list_del(&waiter->list);
> > +             waiter->up = true;
> > +     }
> > +     raw_spin_unlock_irqrestore(&sem->lock, flags);
> > +
> > +     if (waiter) /* protected by being sole wake source */
> > +             wake_up_process(waiter->task);
>
> I still do not see how this could help. Let's take the above
> backtrace as an example:
>
>    <IRQ>
>    console_trylock+0xe/0x60
>    vprintk_emit+0xf1/0x320
>    printk+0x4d/0x69
>    __warn_printk+0x46/0x90
>    native_smp_send_reschedule+0x2f/0x40
>    check_preempt_curr+0x81/0xa0
>    ttwu_do_wakeup+0x14/0x220
>    try_to_wake_up+0x218/0x5f0
>    pollwake+0x6f/0x90
>    credit_entropy_bits+0x204/0x310
>    add_interrupt_randomness+0x18f/0x210
>    handle_irq+0x67/0x160
>    do_IRQ+0x5e/0x130
>    common_interrupt+0xf/0xf
>    </IRQ>
>
> We have the following chain of calls:
>
>   + do_IRQ()
>     ...
>       + try_to_wake_up()    # takes p->pi_lock
>         + ttwu_remote()     # takes rq lock
>           + ttwu_do_wakeup()
>             + check_preempt_curr()
>               + native_smp_send_reschedule()
>                 + __warn_printk()
>                   + printk()
>                     + vprintk_emit()
>                       + console_trylock() # success
>                       + console_unlock()
>                         + up_console_sem()
>                           + up() # wait list is not empty
>                             + __up()
>                               + wake_up_process()
>                                 + try_to_wake_up()
>
> !BANG! Deadlock on p->pi_lock.

Hm right ... I only looked at this starting with console_unlock.

> It does not matter if the nested try_to_wake_up() was called
> under sem->lock or outside.
>
> In other words, the patch removed one lockdep warning. But it just
> delayed the deadlock. It will not happen on sem->lock but
> later on p->pi_lock.
>
> I am repeating myself. But IMHO, the only solution is to introduce
> printk deferred context and use it in WARN_DEFERRED().

One thing to keep in mind is that the kernel is already dying, and
things will come crashing down later on (I've seen this only in dmesg
tails captured in pstore in our CI, i.e. the box died for good). I just
want to make sure that the useful information isn't overwritten by
more dmesg splats that happen as a consequence of us somehow trying to
run things on an offline cpu. Once console_unlock has completed in
your above backtrace and the important stuff has gone out I'm totally
fine with the kernel just dying. Pulling the wake_up_process out from
under the semaphore.lock is enough to prevent lockdep from dumping
more stuff while we're trying to print the important things, and I
think the untangling of the locking hierarchy is useful irrespective
of this lockdep splat. Plus Peter doesn't sound like he likes to roll
out more printk_deferred kind of things.

But if you think I should do the printk_deferred thing too I can look
into that. Just not quite sure what that's supposed to look like now.
-Daniel





--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Petr Mladek May 10, 2019, 9:15 a.m. UTC | #7
On Thu 2019-05-09 18:43:12, Daniel Vetter wrote:
> One thing to keep in mind is that the kernel is already dying, and
> things will come crashing down later on

This is important information. I haven't seen it mentioned earlier.

> (I've seen this only in dmesg
> tails captured in pstore in our CI, i.e. the box died for good). I just
> want to make sure that the useful information isn't overwritten by
> more dmesg splats that happen as a consequence of us somehow trying to
> run things on an offline cpu. Once console_unlock has completed in
> your above backtrace and the important stuff has gone out I'm totally
> fine with the kernel just dying. Pulling the wake_up_process out from
> under the semaphore.lock is enough to prevent lockdep from dumping
> more stuff while we're trying to print the important things,

By 'more stuff' do you mean the lockdep splat? If yes, it might
make sense to call debug_locks_off() earlier in panic().


> and I think the untangling of the locking hierarchy is useful irrespective
> of this lockdep splat. Plus Peter doesn't sound like he likes to roll
> out more printk_deferred kind of things.
>
> But if you think I should do the printk_deferred thing too I can look
> into that. Just not quite sure what that's supposed to look like now.

Your patch might remove the particular lockdep splat. It might be
worth it (Peter also mentioned an optimization effect). Anyway,
it will not prevent the deadlock.

The only way to avoid the deadlock is to use printk_deferred()
with the current printk() code.


Finally, I have recently worked on a similar problem with a dying system.
I sent the following patch for testing. I wonder if it might be
acceptable upstream:


From: Petr Mladek <pmladek@suse.com>
Subject: sched/x86: Do not warn about offline CPUs when all are being stopped
Patch-mainline: No, just for testing
References: bsc#1104406

The warning about rescheduling offline CPUs causes a deadlock when
the CPUs need to get stopped using NMI. It might happen with
logbuf_lock or locks used by console drivers, especially tty.
But it might also be caused by a registered kernel message
dumper, for example, pstore.

The warnings are pretty common when there is a high load and
CPUs are being stopped by native_stop_other_cpus(). But
they are not really useful in this context. And they scroll
the really important messages off the screen.

We need to fix printk() in the long term. But disabling
the message looks reasonable at least in the meantime.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 arch/x86/kernel/smp.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -124,7 +124,8 @@ static bool smp_no_nmi_ipi = false;
  */
 static void native_smp_send_reschedule(int cpu)
 {
-	if (unlikely(cpu_is_offline(cpu))) {
+	if (unlikely(cpu_is_offline(cpu) &&
+		     atomic_read(&stopping_cpu) < 0)) {
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
Sergey Senozhatsky May 10, 2019, 9:51 a.m. UTC | #8
On (05/10/19 11:15), Petr Mladek wrote:
[..]
>  arch/x86/kernel/smp.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -124,7 +124,8 @@ static bool smp_no_nmi_ipi = false;
>   */
>  static void native_smp_send_reschedule(int cpu)
>  {
> -	if (unlikely(cpu_is_offline(cpu))) {
> +	if (unlikely(cpu_is_offline(cpu) &&
> +		     atomic_read(&stopping_cpu) < 0)) {
>  		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
>  		return;
>  	}

I think we need to remove the CPUs which we IPI_STOP from
idle_cpus_mask. Then the scheduler won't pick up stopped
CPUs (cpumask_first(nohz.idle_cpus_mask)) and attempt
rescheduling to them. It seems that currently
native_stop_other_cpus() doesn't do that.

	-ss
Daniel Vetter May 10, 2019, 3:05 p.m. UTC | #9
On Fri, May 10, 2019 at 11:15 AM Petr Mladek <pmladek@suse.com> wrote:
> On Thu 2019-05-09 18:43:12, Daniel Vetter wrote:
> > One thing to keep in mind is that the kernel is already dying, and
> > things will come crashing down later on
>
> This is important information. I havn't seen it mentioned earlier.

I thought I explained that, sorry if this wasn't clear.

> > (I've seen this only in dmesg
> > tails captured in pstore in our CI, i.e. the box died for good). I just
> > want to make sure that the useful information isn't overwritten by
> > more dmesg splats that happen as a consequence of us somehow trying to
> > run things on an offline cpu. Once console_unlock has completed in
> > your above backtrace and the important stuff has gone out I'm totally
> > fine with the kernel just dying. Pulling the wake_up_process out from
> > under the semaphore.lock is enough to prevent lockdep from dumping
> > more stuff while we're trying to print the important things,
>
> By 'more stuff' do you mean the lockdep splat? If yes, it might
> make sense to call debug_locks_off() earlier in panic().

I'm not sure this only happens in panics (I'm honestly not sure at all
why we're complaining about offline cpus). But in general I've seen
piles of dying systems that dump WARNINGs and then eventually panic.
It would still be good to not have a pointless lockdep splat in between.

> > and I think the untangling of the locking hierarchy is useful irrespective
> > of this lockdep splat. Plus Peter doesn't sound like he likes to roll
> > out more printk_deferred kind of things.
> >
> > But if you think I should do the printk_deferred thing too I can look
> > into that. Just not quite sure what that's supposed to look like now.
>
> Your patch might remove the particular lockdep splat. It might be
> worth it (Peter also mentioned an optimization effect). Anyway,
> it will not prevent the deadlock.
>
> The only way to avoid the deadlock is to use printk_deferred()
> with the current printk() code.
>
>
> Finally, I have recently worked on a similar problem with a dying system.
> I sent the following patch for testing. I wonder if it might be
> acceptable upstream:
>
>
> From: Petr Mladek <pmladek@suse.com>
> Subject: sched/x86: Do not warn about offline CPUs when all are being stopped
> Patch-mainline: No, just for testing
> References: bsc#1104406
>
> The warning about rescheduling offline CPUs causes a deadlock when
> the CPUs need to get stopped using NMI. It might happen with
> logbuf_lock or locks used by console drivers, especially tty.
> But it might also be caused by a registered kernel message
> dumper, for example, pstore.
>
> The warnings are pretty common when there is a high load and
> CPUs are being stopped by native_stop_other_cpus(). But
> they are not really useful in this context. And they scroll
> the really important messages off the screen.
>
> We need to fix printk() in the long term. But disabling
> the message looks reasonable at least in the meantime.
>
> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
>  arch/x86/kernel/smp.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -124,7 +124,8 @@ static bool smp_no_nmi_ipi = false;
>   */
>  static void native_smp_send_reschedule(int cpu)
>  {
> -       if (unlikely(cpu_is_offline(cpu))) {
> +       if (unlikely(cpu_is_offline(cpu) &&
> +                    atomic_read(&stopping_cpu) < 0)) {

Hm yeah this sounds like another fix which would stop things from
getting worse when the machine is dying. I can try and stuff this (or
something like this, I honestly have no idea about cpu handling code
so no idea what's reasonable) into our CI and see what happens.

Thanks, Daniel

>                 WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
>                 return;
>         }

Patch

diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
index 11c86fbfeb98..7e839c72809d 100644
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -41,6 +41,7 @@  extern int __must_check down_interruptible(struct semaphore *sem);
 extern int __must_check down_killable(struct semaphore *sem);
 extern int __must_check down_trylock(struct semaphore *sem);
 extern int __must_check down_timeout(struct semaphore *sem, long jiffies);
+extern void printk_safe_up(struct semaphore *sem);
 extern void up(struct semaphore *sem);
 
 #endif /* __LINUX_SEMAPHORE_H */
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 561acdd39960..55a896f18d91 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -197,6 +197,37 @@  struct semaphore_waiter {
 	bool up;
 };
 
+/**
+ * printk_safe_up - release the semaphore in console_unlock
+ * @sem: the semaphore to release
+ *
+ * Release the semaphore.  Unlike mutexes, up() may be called from any
+ * context and even by tasks which have never called down().
+ *
+ * NOTE: This is a special version of up() for console_unlock only. It is only
+ * safe if there are no killable, interruptible or timing out down() calls.
+ */
+void printk_safe_up(struct semaphore *sem)
+{
+	unsigned long flags;
+	struct semaphore_waiter *waiter = NULL;
+
+	raw_spin_lock_irqsave(&sem->lock, flags);
+	if (likely(list_empty(&sem->wait_list))) {
+		sem->count++;
+	} else {
+		waiter = list_first_entry(&sem->wait_list,
+					  struct semaphore_waiter, list);
+		list_del(&waiter->list);
+		waiter->up = true;
+	}
+	raw_spin_unlock_irqrestore(&sem->lock, flags);
+
+	if (waiter) /* protected by being sole wake source */
+		wake_up_process(waiter->task);
+}
+EXPORT_SYMBOL(printk_safe_up);
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 02ca827b8fac..62303929afda 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -234,7 +234,7 @@  static void __up_console_sem(unsigned long ip)
 	mutex_release(&console_lock_dep_map, 1, ip);
 
 	printk_safe_enter_irqsave(flags);
-	up(&console_sem);
+	printk_safe_up(&console_sem);
 	printk_safe_exit_irqrestore(flags);
 }
 #define up_console_sem() __up_console_sem(_RET_IP_)