
[6/7] sched: Clean up preempt_enable_no_resched() abuse

Message ID 20131120162736.691879744@infradead.org (mailing list archive)
State Not Applicable, archived

Commit Message

Peter Zijlstra Nov. 20, 2013, 4:04 p.m. UTC
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to more known preempt_count 'refs'.
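
For illustration, a minimal sketch of the one defensible shape (the helper
name is made up; this is not code from this series): the preemption check
that the no_resched variant skips is supplied by the schedule() call on
the very next line.

#include <linux/preempt.h>
#include <linux/sched.h>

/* Hypothetical helper, for illustration only -- not from this patch. */
static void example_valid_use(void)
{
        preempt_disable();
        /* ... section that must run without preemption ... */
        preempt_enable_no_resched();    /* skip the preemption check ...  */
        schedule();                     /* ... because we reschedule here */
}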

As to the busy_poll mess; that looks to be completely and utterly
broken, sched_clock() can return utter garbage with interrupts enabled
(rare but still), it can drift unbounded between CPUs, so if you get
preempted/migrated and your new CPU is years behind on the previous
CPU we get to busy spin for a _very_ long time. There is a _REASON_
sched_clock() warns about preemptability - papering over it with a
preempt_disable()/preempt_enable_no_resched() is just terminal brain
damage on so many levels.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/net/busy_poll.h |   20 ++++++++------------
 net/ipv4/tcp.c          |    4 ++--
 2 files changed, 10 insertions(+), 14 deletions(-)




Comments

Eliezer Tamir Nov. 20, 2013, 6:02 p.m. UTC | #1
On 20/11/2013 18:04, Peter Zijlstra wrote:
> The only valid use of preempt_enable_no_resched() is if the very next
> line is schedule() or if we know preemption cannot actually be enabled
> by that statement due to more known preempt_count 'refs'.

The reason I used the no resched version is that busy_poll_end_time()
is almost always called with rcu read lock held, so it seemed the more
correct option.

I have no issue with you changing this.

> As to the busy_poll mess; that looks to be completely and utterly
> broken, sched_clock() can return utter garbage with interrupts enabled
> (rare but still), it can drift unbounded between CPUs, so if you get
> preempted/migrated and your new CPU is years behind on the previous
> CPU we get to busy spin for a _very_ long time. There is a _REASON_
> sched_clock() warns about preemptability - papering over it with a
> preempt_disable()/preempt_enable_no_resched() is just terminal brain
> damage on so many levels.

IMHO This has been reviewed thoroughly.

When Ben Hutchings voiced concerns I rewrote the code to use time_after,
so even if you do get switched over to a CPU where the time is random
you will at most poll another full interval.
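
Roughly the idea, as a sketch only (the helper name and exact check are
assumptions; this is not the code that was merged): with both ends of the
window kept, a wild clock value after a migration falls outside
[start, end] and the loop stops, rather than spinning towards a bogus end
time.

/* Hypothetical sketch of the since-removed two-timestamp check. */
static inline bool busy_loop_in_window(u64 start, u64 end)
{
        u64 now = busy_loop_us_clock();

        return now >= start && now <= end;      /* time_in_range() style */
}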

Linus asked me to remove this since it makes us use two time values
instead of one. see https://lkml.org/lkml/2013/7/8/345.

Cheers,
Eliezer
Peter Zijlstra Nov. 20, 2013, 6:15 p.m. UTC | #2
On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> On 20/11/2013 18:04, Peter Zijlstra wrote:
> > The only valid use of preempt_enable_no_resched() is if the very next
> > line is schedule() or if we know preemption cannot actually be enabled
> > by that statement due to more known preempt_count 'refs'.
> 
> The reason I used the no resched version is that busy_poll_end_time()
> is almost always called with rcu read lock held, so it seemed the more
> correct option.
> 
> I have no issue with you changing this.

There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
holding rcu_read_lock().

Also, preempt_enable() only schedules when it's possible to schedule, so
calling it when you know you cannot schedule is no issue.
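
For reference, a simplified sketch of why that holds (this is not the
literal include/linux/preempt.h macro, which varies by config and adds
tracing): the call into the scheduler is only made when a reschedule is
actually pending, and preempt_schedule() itself returns early if
preemption is still impossible for any other reason.

#define preempt_enable_sketch()                                 \
do {                                                            \
        preempt_enable_no_resched();                            \
        barrier();                                              \
        if (unlikely(need_resched()))                           \
                preempt_schedule();   /* no-op unless preemptible */ \
} while (0)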

> > As to the busy_poll mess; that looks to be completely and utterly
> > broken, sched_clock() can return utter garbage with interrupts enabled
> > (rare but still), it can drift unbounded between CPUs, so if you get
> > preempted/migrated and your new CPU is years behind on the previous
> > CPU we get to busy spin for a _very_ long time. There is a _REASON_
> > sched_clock() warns about preemptability - papering over it with a
> > preempt_disable()/preempt_enable_no_resched() is just terminal brain
> > damage on so many levels.
> 
> IMHO This has been reviewed thoroughly.

At the very least you completely forgot to preserve any of that. The
changelog that introduced it is completely devoid of anything useful and
the code has a distinct lack of comments.

> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
> 
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

My brain is fried for today, I'll have a look tomorrow.

But note that with patch 7/7 in place modular code can no longer use
preempt_enable_no_resched(). I'm not sure net/ipv4/tcp.c can be built
modular -- but I seem to recall a time when it was.
Eliezer Tamir Nov. 20, 2013, 8:14 p.m. UTC | #3
On 20/11/2013 20:15, Peter Zijlstra wrote:
> There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
> holding rcu_read_lock().
> 
> Also, preempt_enable() only schedules when it's possible to schedule, so
> calling it when you know you cannot schedule is no issue.
> 

I have no issue with you changing busy_loop_us_clock() to use a regular
preempt_enable().

I think we still need to only do this when CONFIG_DEBUG_PREEMPT is on.
When it's off we should use the alternate implementation.
We are silencing a warning, but this is a performance-critical path,
and we think we know what we are doing.

I tried to explain this in the comments. If you think my comments are
not clear enough, I'm open to suggestions.

Cheers,
Eliezer
Peter Zijlstra Nov. 21, 2013, 10:10 a.m. UTC | #4
On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> IMHO This has been reviewed thoroughly.
> 
> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
> 
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

I'm not sure I see how this would be true.

So the do_select() code basically does:

  for (;;) {

    /* actual poll loop */

    if (!need_resched()) {
      if (!busy_end) {
	busy_end = now() + busypoll;
	continue;
      }
      if (!((long)(busy_end - now()) < 0))
	continue;
    }

    /* go sleep */

  }

So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:

  busy_end = now() + busypoll; /* CPU0: 60e9 + d */

but then we migrate to CPU1 and do:

  busy_end - now() /* CPU1: 60e9 + d' */

and find we're still a minute out; and in fact we'll keep spinning for
that entire minute barring a need_resched().

Surely that's not intended and desired?
Eliezer Tamir Nov. 21, 2013, 1:26 p.m. UTC | #5
On 21/11/2013 12:10, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
>> IMHO This has been reviewed thoroughly.
>>
>> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
>> so even if you do get switched over to a CPU where the time is random
>> you will at most poll another full interval.
>>
>> Linus asked me to remove this since it makes us use two time values
>> instead of one. see https://lkml.org/lkml/2013/7/8/345.
> 
> I'm not sure I see how this would be true.
> 
> So the do_select() code basically does:
> 
>   for (;;) {
> 
>     /* actual poll loop */
> 
>     if (!need_resched()) {
>       if (!busy_end) {
> 	busy_end = now() + busypoll;
> 	continue;
>       }
>       if (!((long)(busy_end - now()) < 0))
> 	continue;
>     }
> 
>     /* go sleep */
> 
>   }
> 
> So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:
> 
>   busy_end = now() + busypoll; /* CPU0: 60e9 + d */
> 
> but then we migrate to CPU1 and do:
> 
>   busy_end - now() /* CPU1: 60e9 + d' */
> 
> and find we're still a minute out; and in fact we'll keep spinning for
> that entire minute barring a need_resched().

Not exactly; poll will return if there are any events to report or if
a signal is pending.

> Surely that's not intended and desired?

This limit is an extra safety net: because busy polling is expensive,
we limit the time we are willing to do it.

We don't override any limit the user has put on the system call.
A signal or having events to report will also stop the looping.
So we are mostly capping the resources an _idle_ system will waste
on busy polling.

We want to globally cap the amount of time the system busy polls, on
average. Nothing catastrophic will happen in the extremely rare occasion
that we miss.

The alternative is to use one more int on every poll/select all the
time; this seems like a bigger cost.
Peter Zijlstra Nov. 21, 2013, 1:39 p.m. UTC | #6
On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> On 21/11/2013 12:10, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> >> IMHO This has been reviewed thoroughly.
> >>
> >> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> >> so even if you do get switched over to a CPU where the time is random
> >> you will at most poll another full interval.
> >>
> >> Linus asked me to remove this since it makes us use two time values
> >> instead of one. see https://lkml.org/lkml/2013/7/8/345.
> > 
> > I'm not sure I see how this would be true.
> > 
> > So the do_select() code basically does:
> > 
> >   for (;;) {
> > 
> >     /* actual poll loop */
> > 
> >     if (!need_resched()) {
> >       if (!busy_end) {
> > 	busy_end = now() + busypoll;
> > 	continue;
> >       }
> >       if (!((long)(busy_end - now()) < 0))
> > 	continue;
> >     }
> > 
> >     /* go sleep */
> > 
> >   }
> > 
> > So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:
> > 
> >   busy_end = now() + busypoll; /* CPU0: 60e9 + d */
> > 
> > but then we migrate to CPU1 and do:
> > 
> >   busy_end - now() /* CPU1: 60e9 + d' */
> > 
> > and find we're still a minute out; and in fact we'll keep spinning for
> > that entire minute barring a need_resched().
> 
> Not exactly; poll will return if there are any events to report or if
> a signal is pending.

Sure, but lacking any of those, you're now busy waiting for a minute.

> > Surely that's not intended and desired?
> 
> This limit is an extra safety net: because busy polling is expensive,
> we limit the time we are willing to do it.

I just said your limit 'sysctl_net_busy_poll' isn't meaningful in any
way, shape, or fashion.

> We don't override any limit the user has put on the system call.

You are, in fact: note how the normal select @endtime argument is only
set up _after_ you're done polling. So if the syscall had a timeout of 5
seconds, you just blew it by 55.
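
To make the ordering concrete, an abbreviated sketch of the relevant part
of fs/select.c:do_select() (trimmed from memory, not verbatim; error
handling and most locals dropped).  The busy-poll branch 'continue's
before the expiry for the user-supplied timeout is ever set up, so that
timeout does not bound the spinning:

for (;;) {
        /* ... scan the fd sets, break on events/signals/timed_out ... */

        if (can_busy_loop && !need_resched()) {
                if (!busy_end) {
                        busy_end = busy_loop_end_time();
                        continue;
                }
                if (!busy_loop_timeout(busy_end))
                        continue;       /* never reaches the code below */
        }

        if (end_time && !to) {
                expire = timespec_to_ktime(*end_time);  /* only set up here */
                to = &expire;
        }

        if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE, to, slack))
                timed_out = 1;
}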

> A signal or having events to report will also stop the looping.
> So we are mostly capping the resources an _idle_ system will waste
> on busy polling.

Repeat, you're not actually capping anything.

> We want to globally cap the amount of time the system busy polls, on
> average. Nothing catastrophic will happen in the extremely rare occasion
> that we miss.
> 
> The alternative is to use one more int on every poll/select all the
> time; this seems like a bigger cost.

No, 'int' has nothing to do with it; using a semi-sane timesource does.
Eliezer Tamir Nov. 22, 2013, 6:56 a.m. UTC | #7
On 21/11/2013 15:39, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:

>> We don't override any limit the user has put on the system call.
> 
> You are, in fact: note how the normal select @endtime argument is only
> set up _after_ you're done polling. So if the syscall had a timeout of 5
> seconds, you just blew it by 55.
> 

That's a bug, we will fix it.

Cheers,
Eliezer
Peter Zijlstra Nov. 22, 2013, 11:30 a.m. UTC | #8
On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> On 21/11/2013 15:39, Peter Zijlstra wrote:
> > On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> 
> >> We don't override any limit the user has put on the system call.
> > 
> > You are, in fact: note how the normal select @endtime argument is only
> > set up _after_ you're done polling. So if the syscall had a timeout of 5
> > seconds, you just blew it by 55.
> > 
> 
> That's a bug, we will fix it.

The entire thing is a bug, and if I could take away sched_clock() from
you I would, but sadly it's all inlines so I can't :-(

Please use local_clock(); yes, it's slightly more expensive, but I doubt
you can actually measure the effects on sane hardware.
Eliezer Tamir Nov. 26, 2013, 7:15 a.m. UTC | #9
On 22/11/2013 13:30, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
>> On 21/11/2013 15:39, Peter Zijlstra wrote:
>>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> 
> Please use local_clock(); yes, it's slightly more expensive, but I doubt
> you can actually measure the effects on sane hardware.

If we limit the discussion to sane hardware, I should mention that on
current Intel CPUs TSC is guaranteed to be monotonic for anything up to
8 sockets. Even on slightly older HW, TSC skew is very small and should
not be an issue for this use case.

So:
Modern sane HW does not have this issue.
The people that do busy polling typically pin tasks to cores anyway.
You need cap_net_admin to use this setting.
There is no real damage if the issue happens.
This is a fast, low-latency path, so we are very sensitive to adding even
a small cost.
Linus really didn't like adding to the cost of poll/select when busy
polling is not being used.

Having said that, since we need to fix the timeout issue you pointed
out, we will test the use of local_clock() and see if it matters or
not.

Again, I have no objection to changing the use of
preempt_enable_no_resched() to a plain preempt_enable().

Cheers,
Eliezer
Thomas Gleixner Nov. 26, 2013, 10:51 a.m. UTC | #10
On Tue, 26 Nov 2013, Eliezer Tamir wrote:

> On 22/11/2013 13:30, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> >> On 21/11/2013 15:39, Peter Zijlstra wrote:
> >>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> > 
> > Please use local_clock(); yes, it's slightly more expensive, but I doubt
> > you can actually measure the effects on sane hardware.
> 
> If we limit the discussion to sane hardware, I should mention that on
> current Intel CPUs TSC is guaranteed to be monotonic for anything up to
> 8 sockets. Even on slightly older HW, TSC skew is very small and should
> not be an issue for this use case.

> Modern sane HW does not have this issue.

That's wrong to begin with. There is no such thing that qualifies as
"sane hardware". Especially not if we are talking about timers.

> The people that do busy polling typically pin tasks to cores anyway.

This is completely irrelevant. If stuff falls apart when the task is not
pinned, then you lose nevertheless.

> You need cap_net_admin to use this setting.

And how is that relevant? cap_net_admin does not change the fact that
you violate your constraints.

> There is no real damage if the issue happens.

You're violating the constraints, which is not fatal, but not desired
either.

> This is a fast, low-latency path, so we are very sensitive to adding even
> a small cost.
> Linus really didn't like adding to the cost of poll/select when busy
> polling is not being used.
 
And that justifies exposing those who do not have access to "sane"
hardware and/or did not pin their tasks to constraint violation?

> Having said that, since we need to fix the timeout issue you pointed
> out, we will test the use of local_clock() and see if it matters or
> not.

If the hardware provides an indicator that the TSC is sane to use,
then sched_clock_stable is 1, so local_clock() will not do the slow
update dance at all. So for "sane" hardware the overhead is minimal
and on crappy hardware the correctness is still ensured with more
overhead.
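
Roughly the split being described, as a sketch of the behaviour rather
than the literal kernel/sched/clock.c code:

static inline u64 local_clock_sketch(void)
{
        if (sched_clock_stable)                 /* hardware says TSC is sane */
                return sched_clock();           /* cheap, raw path */

        /* otherwise take the per-cpu filtered path that stays monotonic */
        return sched_clock_cpu(raw_smp_processor_id());
}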

If you are really concerned about the minimal overhead in the
sched_clock_stable == 1 case, then you better fix that (it's doable
with some brain) instead of hacking broken crap, based on even more
broken assumptions, into the networking code.

It's not the kernel's fault that we need to deal with
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK at all. And we have to deal with it
no matter what, so we cannot make this go away with magic assumptions.

Complain to those who forced us to do this. Hint: It's only ONE CPU
vendor who thought that providing useless timestamps was a brilliant
idea.

Thanks,

	tglx

Patch

--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,23 @@  static inline bool net_busy_loop_on(void
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
 	u64 rc;
 
+	/*
+	 * XXX with interrupts enabled sched_clock() can return utter garbage
+	 * Futhermore, it can have unbounded drift between CPUs, so the below
+	 * Furthermore, it can have unbounded drift between CPUs, so the below
+	 * warning.
+	 */
+
 	preempt_disable_notrace();
 	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return rc >> 10;
 }
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
-}
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@  int tcp_recvmsg(struct kiocb *iocb, stru
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif