Message ID | 1306412511.1200.90.camel@twins (mailing list archive) |
---|---|
State | New, archived |
* Peter Zijlstra <peterz@infradead.org> wrote:

> Sort this by reverting to the old behaviour for this situation and
> perform a full remote wake-up.

Btw., ARM should consider switching most of its subarchitectures to
!__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
switches is silly and now expensive as well.

Thanks,

	Ingo
On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Sort this by reverting to the old behaviour for this situation and
> > perform a full remote wake-up.
>
> Btw., ARM should consider switching most of its subarchitectures to
> !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
> switches is silly and now expensive as well.

Not going to happen.  The reason we do it is because most of the CPUs
have to (slowly) flush their caches during switch_mm(), and to have
IRQs off over the cache flush means that we lose IRQs.

So it's not silly at all, but a technical requirement imposed by the
cache architecture.

If it's become expensive through development, it suggests that the
development did not take account of the issues we have on ARM.
On Thu, 2011-05-26 at 13:31 +0100, Russell King - ARM Linux wrote:
> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Sort this by reverting to the old behaviour for this situation and
> > > perform a full remote wake-up.
> >
> > Btw., ARM should consider switching most of its subarchitectures to
> > !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
> > switches is silly and now expensive as well.
>
> Not going to happen.  The reason we do it is because most of the CPUs
> have to (slowly) flush their caches during switch_mm(), and to have
> IRQs off over the cache flush means that we lose IRQs.
>
> So it's not silly at all, but a technical requirement imposed by the
> cache architecture.
>
> If it's become expensive through development, it suggests that the
> development did not take account of the issues we have on ARM.

It's not more expensive than it was before this patch series, and the
case in question is relatively rare (guesstimate, lacking measurements),
so ARM should benefit from most of the optimization provided.
* Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Sort this by reverting to the old behaviour for this situation
> > > and perform a full remote wake-up.
> >
> > Btw., ARM should consider switching most of its subarchitectures
> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
> > context switches is silly and now expensive as well.
>
> Not going to happen.  The reason we do it is because most of the
> CPUs have to (slowly) flush their caches during switch_mm(), and to
> have IRQs off over the cache flush means that we lose IRQs.

How much time does that take on contemporary ARM hardware, typically
(and worst-case)?

Thanks,

	Ingo
On Thu, May 26, 2011 at 02:50:07PM +0200, Ingo Molnar wrote:
> * Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
> > On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > Sort this by reverting to the old behaviour for this situation
> > > > and perform a full remote wake-up.
> > >
> > > Btw., ARM should consider switching most of its subarchitectures
> > > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
> > > context switches is silly and now expensive as well.
> >
> > Not going to happen.  The reason we do it is because most of the
> > CPUs have to (slowly) flush their caches during switch_mm(), and to
> > have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

I can't give you precise figures because it really depends on the
hardware and how stuff is set up.  All I can do is give you examples
from platforms I have here running which rely upon this.

Some ARM CPUs have to read 32K of data into the data cache in order to
ensure that any dirty data is flushed out.  Others have to loop over
the cache segments/entries, cleaning and invalidating each one (that's
8 x 64 for ARM920, so 512 iterations).

If my userspace program is correct, then it looks like StrongARM takes
about 700us to read 32K of data into the cache.

Measuring the context switches per second on the same machine (using
an old version of the Byte Benchmarks) gives about 904 context switches
per second (equating to 1.1ms per switch), so this figure looks about
right.  Same CPU but different hardware gives 698 context switches per
second - about 1.4ms per switch.

With IRQs enabled, it's possible to make this work, but you have to
read 64K of data instead, which would double the context switch latency
here.

On an ARM920 machine, running the same program gives around 2476 per
second, which is around 400us per switch.
Your typical 16550A with a 16-byte FIFO running at 115200 baud will fill
from completely empty to overrun in 1.1ms.  Realistically, you'll start
getting overruns well below that because of the FIFO thresholds - which
may trigger an IRQ at half-full.  So 600us.

This would mean 16550As would be entirely unusable with StrongARM, with
an overrun guaranteed at every context switch.

And this is not the whole story: if you have timing-sensitive
peripherals like UARTs, then 1.1ms - 700us doesn't sound that bad, until
you start considering other IRQ load which can lock out servicing those
peripherals while other interrupt handlers are running.

So, all in all, having IRQs off for the order of 700us over a context
switch is a complete non-starter of an idea.
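[The FIFO arithmetic above is easy to sanity-check.  The helper below is
an editorial illustration, not from the thread: the quoted 1.1ms figure
corresponds to counting 8 bits per character; with full 8N1 framing
(start + 8 data + stop = 10 bits on the wire) the window is nearer
1.4ms, so 1.1ms is if anything generous.]

```c
/* Microseconds for 'bytes' characters to arrive at 'baud', assuming
 * 'bits_per_char' bits on the wire per character. */
static unsigned int fifo_fill_us(unsigned int bytes, unsigned int baud,
				 unsigned int bits_per_char)
{
	return (bytes * bits_per_char * 1000000u) / baud;
}

/* fifo_fill_us(16, 115200, 8)  -> 1111us: full 16-byte FIFO, ~1.1ms
 * fifo_fill_us(8,  115200, 8)  ->  555us: half-full trigger, ~600us
 * fifo_fill_us(16, 115200, 10) -> 1388us: same, with 8N1 framing bits */
```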
On 26 May 2011 13:50, Ingo Molnar <mingo@elte.hu> wrote:
> * Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
>> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
>> >
>> > * Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > Sort this by reverting to the old behaviour for this situation
>> > > and perform a full remote wake-up.
>> >
>> > Btw., ARM should consider switching most of its subarchitectures
>> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
>> > context switches is silly and now expensive as well.
>>
>> Not going to happen.  The reason we do it is because most of the
>> CPUs have to (slowly) flush their caches during switch_mm(), and to
>> have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
context switch as we got VIPT (or PIPT-like) caches.

But modern ARM processors use something called ASID to tag the TLB
entries, and we are limited to 256 of them.  The switch_mm() code checks
whether we ran out of them in order to restart the counting.  This ASID
roll-over event needs to be broadcast to the other CPUs, and issuing
IPIs with the IRQs disabled isn't always safe.

Of course, we could briefly re-enable them at ASID roll-over time, but
I'm not sure what the expectations of the code calling switch_mm() are.
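[The ASID scheme Catalin describes can be sketched roughly as below: a
generation counter in the upper bits and an 8-bit hardware ASID in the
lower bits of one running counter.  Names and layout here are
illustrative only - the real code lives in arch/arm/mm/context.c and,
crucially, must broadcast the roll-over (TLB flush) to the other CPUs,
which is the part that is awkward with IRQs disabled.]

```c
#include <stdint.h>

#define ASID_BITS	8
#define ASID_MASK	((1u << ASID_BITS) - 1)	/* 256 hardware ASIDs */

struct mm_ctx {
	uint32_t id;	/* generation << ASID_BITS | hardware ASID */
};

static uint32_t cpu_last_asid = 1u << ASID_BITS;	/* generation 1 */
static int rollover_events;

/* Called from switch_mm(): reuse the mm's tag if it belongs to the
 * current generation, otherwise allocate a fresh one.  Exhausting the
 * 256 ASIDs bumps the generation - this is the roll-over event that
 * the real kernel must broadcast to all CPUs. */
static uint32_t check_and_switch_context(struct mm_ctx *mm)
{
	if ((mm->id ^ cpu_last_asid) >> ASID_BITS == 0)
		return mm->id & ASID_MASK;	/* still valid, fast path */

	if ((++cpu_last_asid & ASID_MASK) == 0) {
		cpu_last_asid += 1;	/* skip reserved ASID 0 */
		rollover_events++;	/* would flush TLBs / IPI here */
	}
	mm->id = cpu_last_asid;
	return mm->id & ASID_MASK;
}
```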
On Thu, 2011-05-26 at 14:21 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 13:32 +0200, Peter Zijlstra wrote:
> >
> > The bad news is of course that I've got a little more head-scratching to
> > do, will keep you informed.
>
> OK, that wasn't too hard.. (/me crosses fingers and prays Marc doesn't
> find more funnies ;-).
>
> Does the below cure all woes?

So far so good.  The box just went through its first two iterations of
kernel building without a sweat, carried on, and still feels snappy
enough.  Thanks for having fixed that quickly!

> ---
> Subject: sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu May 26 14:21:33 CEST 2011
>
> Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
> half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
> __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
> code was suspect.
>
> Yong found that the interrupt could hit after context_switch() changes
> current but before it clears p->on_cpu, if that interrupt were to
> attempt a wake-up of p we would indeed find ourselves spinning in IRQ
> context.
>
> Sort this by reverting to the old behaviour for this situation and
> perform a full remote wake-up.
>
> Cc: Frank Rowand <frank.rowand@am.sony.com>
> Cc: Yong Zhang <yong.zhang0@gmail.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tested-by: Marc Zyngier <marc.zyngier@arm.com>

	M.
On 05/26, Peter Zijlstra wrote:
>
>  static void ttwu_queue(struct task_struct *p, int cpu)
>  {
> @@ -2631,17 +2650,17 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu) {
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>  		/*
> -		 * If called from interrupt context we could have landed in the
> -		 * middle of schedule(), in this case we should take care not
> -		 * to spin on ->on_cpu if p is current, since that would
> -		 * deadlock.
> +		 * In case the architecture enables interrupts in
> +		 * context_switch(), we cannot busy wait, since that
> +		 * would lead to live-locks when an interrupt hits and
> +		 * tries to wake up @prev. So bail and do a complete
> +		 * remote wakeup.
>  		 */
> -		if (p == current) {
> -			ttwu_queue(p, cpu);
> +		if (ttwu_activate_remote(p, wake_flags))

Stupid question.  Can't we fix this problem if we do

	- if (p == current)
	+ if (cpu == raw_smp_processor_id())

?

I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Oleg.
On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> Stupid question.  Can't we fix this problem if we do
>
> 	- if (p == current)
> 	+ if (cpu == raw_smp_processor_id())
>
> ?
>
> I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Easy enough to test.. brain gave out again.. hold on ;-)
* Catalin Marinas <catalin.marinas@arm.com> wrote:

> > How much time does that take on contemporary ARM hardware,
> > typically (and worst-case)?
>
> On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
> context switch as we got VIPT (or PIPT-like) caches.
>
> But modern ARM processors use something called ASID to tag the TLB
> entries and we are limited to 256. The switch_mm() code checks for
> whether we ran out of them to restart the counting. This ASID
> roll-over event needs to be broadcast to the other CPUs and issuing
> IPIs with the IRQs disabled isn't always safe. Of course, we could
> briefly re-enable them at the ASID roll-over time but I'm not sure
> what the expectations of the code calling switch_mm() are.

The expectations are to have irqs off (we are holding the runqueue
lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
suspect.

But in theory we could drop the rq lock and restart the scheduler
task-pick and balancing sequence when the ARM TLB tag rolls over.  So
instead of this fragile and asymmetric method we'd have a
straightforward retry-in-rare-cases method.

That means some modifications to switch_mm() but should be solvable.

That would make ARM special only in so far that it's one of the few
architectures that signal 'retry task pickup' via switch_mm() - it
would use the stock scheduler otherwise, and we could remove
__ARCH_WANT_INTERRUPTS_ON_CTXSW and perhaps even
__ARCH_WANT_UNLOCKED_CTXSW altogether.

I'd suggest doing this once modern ARM chips get so widespread that
you can realistically induce ~700 usecs irqs-off delays on old,
virtual-cache ARM chips.  Old chips would likely use old kernels
anyway, right?

Thanks,

	Ingo
On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> I'd suggest doing this once modern ARM chips get so widespread that
> you can realistically induce ~700 usecs irqs-off delays on old,
> virtual-cache ARM chips. Old chips would likely use old kernels
> anyway, right?

Not necessarily.  I have rather a lot of legacy hardware (it outweighs
the more modern stuff), and that legacy hardware is _loads_ more useful
than the modern stuff in that it can actually do stuff like run a
network (such as running Kerberos servers, httpd, MTAs, etc).  Modern
ARM machines typically don't have ways to attach mass storage to them,
which makes them hellishly limited for such applications.

I'm planning to continue using my old machines, and continue to upgrade
their kernels, especially in order to keep up to date with security
issues.
On Fri, 27 May 2011, Ingo Molnar wrote:
> I'd suggest doing this once modern ARM chips get so widespread that
> you can realistically induce ~700 usecs irqs-off delays on old,
> virtual-cache ARM chips. Old chips would likely use old kernels
> anyway, right?

Those "old" ARM chips are still being produced in large numbers and
fitted into new designs, running the latest kernels.  They are unlikely
to fade away for a couple of years yet.

Nicolas
On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> The expectations are to have irqs off (we are holding the runqueue
> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> suspect.

Just a thought, but we _might_ be able to avoid a lot of this hassle
if we had a new arch hook in finish_task_switch(), after
finish_lock_switch() returns but before the old MM is dropped.

For the new ASID-based switch_mm(), we currently do this:

1. check ASID validity
2. flush branch predictor
3. set reserved ASID value
4. set new page tables
5. set new ASID value

This will shortly be changed to:

1. check ASID validity
2. flush branch predictor
3. set swapper_pg_dir tables
4. set new ASID value
5. set new page tables

We could change switch_mm() to only do:

1. flush branch predictor
2. set swapper_pg_dir tables
3. check ASID validity
4. set new ASID value

At this point, we have no user mappings, and so nothing will be using
the ASID.  Then, in a new post-finish_lock_switch() arch hook:

5. check whether we need to do flushing as a result of the ASID change
6. set new page tables

I think this may simplify the ASID code.  It needs prototyping out,
reviewing and testing, but I think it may work.

And I think it may also be workable for the CPUs which need to flush
their caches on context switches - we can postpone their page table
switch to this new arch hook too, which would mean we wouldn't require
__ARCH_WANT_INTERRUPTS_ON_CTXSW on ARM at all.

Any thoughts (if you've followed what I'm going on about)?
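[The split Russell proposes can be illustrated with stubs that merely
record the sequence of steps.  All names here are hypothetical - the
post-lock-switch hook is being proposed in this thread, not describing
an existing kernel interface.]

```c
#include <string.h>

static char seq[128];

static void step(const char *s)
{
	strcat(seq, s);
	strcat(seq, ";");
}

/* Proposed switch_mm(): runs with IRQs off under the rq lock, but now
 * only does the cheap part.  No user mappings exist afterwards, so
 * nothing can be using the old ASID. */
static void proposed_switch_mm(void)
{
	step("flush_branch_predictor");
	step("set_swapper_pg_dir");
	step("check_asid");
	step("set_new_asid");
}

/* Proposed post-finish_lock_switch() arch hook: runs after the rq lock
 * is released, so any slow flushing and the real page table switch can
 * happen with IRQs enabled. */
static void proposed_post_lock_switch_hook(void)
{
	step("flush_if_asid_rolled_over");
	step("set_new_page_tables");
}

static const char *context_switch_order(void)
{
	seq[0] = '\0';
	proposed_switch_mm();
	/* ... finish_lock_switch() drops the rq lock here ... */
	proposed_post_lock_switch_hook();
	return seq;
}
```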
Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -2573,7 +2573,26 @@ static void ttwu_queue_remote(struct tas if (!next) smp_send_reschedule(cpu); } -#endif + +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW +static int ttwu_activate_remote(struct task_struct *p, int wake_flags) +{ + struct rq *rq; + int ret = 0; + + rq = __task_rq_lock(p); + if (p->on_cpu) { + ttwu_activate(rq, p, ENQUEUE_WAKEUP); + ttwu_do_wakeup(rq, p, wake_flags); + ret = 1; + } + __task_rq_unlock(rq); + + return ret; + +} +#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ +#endif /* CONFIG_SMP */ static void ttwu_queue(struct task_struct *p, int cpu) { @@ -2631,17 +2650,17 @@ try_to_wake_up(struct task_struct *p, un while (p->on_cpu) { #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW /* - * If called from interrupt context we could have landed in the - * middle of schedule(), in this case we should take care not - * to spin on ->on_cpu if p is current, since that would - * deadlock. + * In case the architecture enables interrupts in + * context_switch(), we cannot busy wait, since that + * would lead to live-locks when an interrupt hits and + * tries to wake up @prev. So bail and do a complete + * remote wakeup. */ - if (p == current) { - ttwu_queue(p, cpu); + if (ttwu_activate_remote(p, wake_flags)) goto stat; - } -#endif +#else cpu_relax(); +#endif } /* * Pairs with the smp_wmb() in finish_lock_switch().