
[BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Message ID 1306412511.1200.90.camel@twins (mailing list archive)
State New, archived

Commit Message

Peter Zijlstra May 26, 2011, 12:21 p.m. UTC
On Thu, 2011-05-26 at 13:32 +0200, Peter Zijlstra wrote:
> 
> The bad news is of course that I've got a little more head-scratching to
> do, will keep you informed. 

OK, that wasn't too hard.. (/me crosses fingers and prays Marc doesn't
find more funnies ;-).

Does the below cure all woes?

---
Subject: sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu May 26 14:21:33 CEST 2011

Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
__ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
code was suspect.

Yong found that the interrupt could hit after context_switch() changes
current but before it clears p->on_cpu; if that interrupt were to
attempt a wake-up of p we would indeed find ourselves spinning in IRQ
context.

Sort this by reverting to the old behaviour for this situation and
performing a full remote wake-up.

Cc: Frank Rowand <frank.rowand@am.sony.com>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

Comments

Ingo Molnar May 26, 2011, 12:26 p.m. UTC | #1
* Peter Zijlstra <peterz@infradead.org> wrote:

> Sort this by reverting to the old behaviour for this situation and 
> perform a full remote wake-up.

Btw., ARM should consider switching most of its subarchitectures to 
!__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context 
switches is silly and now expensive as well.

Thanks,

	Ingo
Russell King - ARM Linux May 26, 2011, 12:31 p.m. UTC | #2
On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Sort this by reverting to the old behaviour for this situation and 
> > perform a full remote wake-up.
> 
> Btw., ARM should consider switching most of its subarchitectures to 
> !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context 
> switches is silly and now expensive as well.

Not going to happen.  The reason we do it is because most of the CPUs
have to (slowly) flush their caches during switch_mm(), and to have
IRQs off over the cache flush means that we lose IRQs.

So it's not silly at all, but a technical requirement imposed by the
cache architecture.

If it's become expensive through development, it suggests that the
development did not take account of the issues we have on ARM.
Peter Zijlstra May 26, 2011, 12:37 p.m. UTC | #3
On Thu, 2011-05-26 at 13:31 +0100, Russell King - ARM Linux wrote:
> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Sort this by reverting to the old behaviour for this situation and 
> > > perform a full remote wake-up.
> > 
> > Btw., ARM should consider switching most of its subarchitectures to 
> > !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context 
> > switches is silly and now expensive as well.
> 
> Not going to happen.  The reason we do it is because most of the CPUs
> have to (slowly) flush their caches during switch_mm(), and to have
> IRQs off over the cache flush means that we lose IRQs.
> 
> So it's not silly at all, but a technical requirement imposed by the
> cache architecture.
> 
> If it's become expensive through development, it suggests that the
> development did not take account of the issues we have on ARM.

It's not more expensive than it was before this patch series, and the
case in question is relatively rare (guesstimate, lacking measurements)
so ARM should benefit from most of the optimization provided.
Ingo Molnar May 26, 2011, 12:50 p.m. UTC | #4
* Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Sort this by reverting to the old behaviour for this situation 
> > > and perform a full remote wake-up.
> > 
> > Btw., ARM should consider switching most of its subarchitectures 
> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during 
> > context switches is silly and now expensive as well.
> 
> Not going to happen.  The reason we do it is because most of the 
> CPUs have to (slowly) flush their caches during switch_mm(), and to 
> have IRQs off over the cache flush means that we lose IRQs.

How much time does that take on contemporary ARM hardware, typically 
(and worst-case)?

Thanks,

	Ingo
Russell King - ARM Linux May 26, 2011, 1:36 p.m. UTC | #5
On Thu, May 26, 2011 at 02:50:07PM +0200, Ingo Molnar wrote:
> 
> * Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> 
> > On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > > 
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > Sort this by reverting to the old behaviour for this situation 
> > > > and perform a full remote wake-up.
> > > 
> > > Btw., ARM should consider switching most of its subarchitectures 
> > > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during 
> > > context switches is silly and now expensive as well.
> > 
> > Not going to happen.  The reason we do it is because most of the 
> > CPUs have to (slowly) flush their caches during switch_mm(), and to 
> > have IRQs off over the cache flush means that we lose IRQs.
> 
> How much time does that take on contemporary ARM hardware, typically 
> (and worst-case)?

I can't give you precise figures because it really depends on the hardware
and how stuff is setup.  All I can do is give you examples from platforms
I have here running which rely upon this.

Some ARM CPUs have to read 32K of data into the data cache in order to
ensure that any dirty data is flushed out.  Others have to loop over the
cache segments/entries, cleaning and invalidating each one (that's 8 x 64
for ARM920, so 512 iterations).

If my userspace program is correct, then it looks like StrongARM takes
about 700us to read 32K of data into the cache.

Measuring the context switches per second on the same machine (using an
old version of the Byte Benchmarks) gives about 904 context switches per
second (equating to 1.1ms per switch), so this figure looks about right.

Same CPU but different hardware gives 698 context switches per second -
about 1.4ms per switch.  With IRQs enabled, it's possible to make this
work but you have to read 64K of data instead, which would double the
ctxt switch latency here.

On an ARM920 machine, running the same program gives around 2476 per
second, which is around 400us per switch.

Your typical 16550A with a 16-byte FIFO running at 115200 baud will fill
from completely empty to overrun in 1.1ms.  Realistically, you'll start
getting overruns well below that because of the FIFO thresholds - which
may trigger an IRQ at half-full.  So 600us.

This would mean 16550A's would be entirely unusable with StrongARM, with
an overrun guaranteed at every context switch.

This is not the whole story: if you have timing sensitive peripherals
like UARTs, then 1.1ms - 700us doesn't sound that bad, until you start
considering other IRQ load which can lock out servicing those peripherals
while other interrupt handlers are running.

So all in all, having IRQs off for the order of 700us over a context
switch is a complete non-starter of an idea.
Catalin Marinas May 26, 2011, 2:45 p.m. UTC | #6
On 26 May 2011 13:50, Ingo Molnar <mingo@elte.hu> wrote:
> * Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
>> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
>> >
>> > * Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > Sort this by reverting to the old behaviour for this situation
>> > > and perform a full remote wake-up.
>> >
>> > Btw., ARM should consider switching most of its subarchitectures
>> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
>> > context switches is silly and now expensive as well.
>>
>> Not going to happen.  The reason we do it is because most of the
>> CPUs have to (slowly) flush their caches during switch_mm(), and to
>> have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
context switch as we got VIPT (or PIPT-like) caches.

But modern ARM processors use something called ASID to tag the TLB
entries, and we are limited to 256 of them. The switch_mm() code checks
whether we have run out and restarts the counting. This ASID
roll-over event needs to be broadcast to the other CPUs and issuing
IPIs with the IRQs disabled isn't always safe. Of course, we could
briefly re-enable them at the ASID roll-over time but I'm not sure
what the expectations of the code calling switch_mm() are.
Marc Zyngier May 26, 2011, 2:56 p.m. UTC | #7
On Thu, 2011-05-26 at 14:21 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 13:32 +0200, Peter Zijlstra wrote:
> > 
> > The bad news is of course that I've got a little more head-scratching to
> > do, will keep you informed. 
> 
> OK, that wasn't too hard.. (/me crosses fingers and prays Marc doesn't
> find more funnies ;-).
> 
> Does the below cure all woes?

So far so good. The box just went through its first two iterations of
kernel building without a sweat, carried on, and still feels snappy
enough.

Thanks for having fixed that quickly!

> ---
> Subject: sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu May 26 14:21:33 CEST 2011
> 
> Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
> half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
> __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
> code was suspect.
> 
> Yong found that the interrupt could hit after context_switch() changes
> current but before it clears p->on_cpu; if that interrupt were to
> attempt a wake-up of p we would indeed find ourselves spinning in IRQ
> context.
> 
> Sort this by reverting to the old behaviour for this situation and
> perform a full remote wake-up.
> 
> Cc: Frank Rowand <frank.rowand@am.sony.com>
> Cc: Yong Zhang <yong.zhang0@gmail.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tested-by: Marc Zyngier <marc.zyngier@arm.com>

	M.
Oleg Nesterov May 26, 2011, 3:45 p.m. UTC | #8
On 05/26, Peter Zijlstra wrote:
>
>  static void ttwu_queue(struct task_struct *p, int cpu)
>  {
> @@ -2631,17 +2650,17 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu) {
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>  		/*
> -		 * If called from interrupt context we could have landed in the
> -		 * middle of schedule(), in this case we should take care not
> -		 * to spin on ->on_cpu if p is current, since that would
> -		 * deadlock.
> +		 * In case the architecture enables interrupts in
> +		 * context_switch(), we cannot busy wait, since that
> +		 * would lead to live-locks when an interrupt hits and
> +		 * tries to wake up @prev. So bail and do a complete
> +		 * remote wakeup.
>  		 */
> -		if (p == current) {
> -			ttwu_queue(p, cpu);
> +		if (ttwu_activate_remote(p, wake_flags))

Stupid question. Can't we fix this problem if we do

	-	if (p == current)
	+	if (cpu == raw_smp_processor_id())

?

I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Oleg.
Peter Zijlstra May 26, 2011, 3:59 p.m. UTC | #9
On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> Stupid question. Can't we fix this problem if we do
> 
>         -       if (p == current)
>         +       if (cpu == raw_smp_processor_id())
> 
> ?
> 
> I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Easy enough to test.. brain gave out again, hold on ;-)
Ingo Molnar May 27, 2011, 12:06 p.m. UTC | #10
* Catalin Marinas <catalin.marinas@arm.com> wrote:

> > How much time does that take on contemporary ARM hardware, 
> > typically (and worst-case)?
> 
> On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at 
> context switch as we got VIPT (or PIPT-like) caches.
> 
> But modern ARM processors use something called ASID to tag the TLB 
> entries and we are limited to 256. The switch_mm() code checks for 
> whether we ran out of them to restart the counting. This ASID 
> roll-over event needs to be broadcast to the other CPUs and issuing 
> IPIs with the IRQs disabled isn't always safe. Of course, we could 
> briefly re-enable them at the ASID roll-over time but I'm not sure 
> what the expectations of the code calling switch_mm() are.

The expectations are to have irqs off (we are holding the runqueue 
lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable, I
suspect.

But in theory we could drop the rq lock and restart the scheduler 
task-pick and balancing sequence when the ARM TLB tag rolls over. So 
instead of this fragile and asymmetric method we'd have a 
straightforward retry-in-rare-cases method.

That means some modifications to switch_mm() but should be solvable.

That would make ARM special only in so far that it's one of the few 
architectures that signal 'retry task pickup' via switch_mm() - it 
would use the stock scheduler otherwise and we could remove 
__ARCH_WANT_INTERRUPTS_ON_CTXSW and perhaps even 
__ARCH_WANT_UNLOCKED_CTXSW altogether.

I'd suggest doing this once modern ARM chips get so widespread that 
you can realistically induce ~700 usec irqs-off delays on old, 
virtual-cache ARM chips. Old chips would likely use old kernels 
anyway, right?

Thanks,

	Ingo
Russell King - ARM Linux May 27, 2011, 5:55 p.m. UTC | #11
On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> I'd suggest doing this once modern ARM chips get so widespread that 
> you can realistically induce a ~700 usecs irqs-off delays on old, 
> virtual-cache ARM chips. Old chips would likely use old kernels 
> anyway, right?

Not necessarily.  I have rather a lot of legacy hardware (it outweighs
the more modern stuff) and that legacy hardware is _loads_ more useful
than the modern stuff in that it can actually do stuff like run a network
(such as running kerberos servers, httpd, mtas, etc).  Modern ARM
machines typically don't have ways to attach mass storage to them, which
makes them hellishly limited for such applications.

I'm planning to continue using my old machines, and continue to upgrade
their kernels, especially in order to keep up to date with security
issues.
Nicolas Pitre May 27, 2011, 7:41 p.m. UTC | #12
On Fri, 27 May 2011, Ingo Molnar wrote:

> I'd suggest doing this once modern ARM chips get so widespread that 
> you can realistically induce a ~700 usecs irqs-off delays on old, 
> virtual-cache ARM chips. Old chips would likely use old kernels 
> anyway, right?

Those "old" ARM chips are still largely produced and fitted in new 
designs, and using latest kernels.  They are unlikely to fade away 
before a couple years.


Nicolas
Russell King - ARM Linux May 27, 2011, 8:52 p.m. UTC | #13
On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> The expectations are to have irqs off (we are holding the runqueue 
> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i 
> suspect.

Just a thought, but we _might_ be able to avoid a lot of this hassle if
we had a new arch hook in finish_task_switch(), after finish_lock_switch()
returns but before the old MM is dropped.

For the new ASID-based switch_mm(), we currently do this:

1. check ASID validity
2. flush branch predictor
3. set reserved ASID value
4. set new page tables
5. set new ASID value

This will be shortly changed to:

1. check ASID validity
2. flush branch predictor
3. set swapper_pg_dir tables
4. set new ASID value
5. set new page tables

We could change switch_mm() to only do:

1. flush branch predictor
2. set swapper_pg_dir tables
3. check ASID validity
4. set new ASID value

At this point, we have no user mappings, and so nothing will be using the
ASID.  Then in a new post-finish_lock_switch() arch hook:

5. check whether we need to do flushing as a result of ASID change
6. set new page tables

I think this may simplify the ASID code.  It needs prototyping out,
reviewing and testing, but I think it may work.

And I think it may also be workable with the CPUs which need to flush
the caches on context switches - we can postpone their page table
switch to this new arch hook too, which will mean we wouldn't require
__ARCH_WANT_INTERRUPTS_ON_CTXSW on ARM at all.

Any thoughts (if you've followed what I'm going on about) ?

Patch

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2573,7 +2573,26 @@  static void ttwu_queue_remote(struct tas
 	if (!next)
 		smp_send_reschedule(cpu);
 }
-#endif
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
+{
+	struct rq *rq;
+	int ret = 0;
+
+	rq = __task_rq_lock(p);
+	if (p->on_cpu) {
+		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+		ttwu_do_wakeup(rq, p, wake_flags);
+		ret = 1;
+	}
+	__task_rq_unlock(rq);
+
+	return ret;
+
+}
+#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+#endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu)
 {
@@ -2631,17 +2650,17 @@  try_to_wake_up(struct task_struct *p, un
 	while (p->on_cpu) {
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 		/*
-		 * If called from interrupt context we could have landed in the
-		 * middle of schedule(), in this case we should take care not
-		 * to spin on ->on_cpu if p is current, since that would
-		 * deadlock.
+		 * In case the architecture enables interrupts in
+		 * context_switch(), we cannot busy wait, since that
+		 * would lead to live-locks when an interrupt hits and
+		 * tries to wake up @prev. So bail and do a complete
+		 * remote wakeup.
 		 */
-		if (p == current) {
-			ttwu_queue(p, cpu);
+		if (ttwu_activate_remote(p, wake_flags))
 			goto stat;
-		}
-#endif
+#else
 		cpu_relax();
+#endif
 	}
 	/*
 	 * Pairs with the smp_wmb() in finish_lock_switch().