| Message ID | 20131120162736.691879744@infradead.org (mailing list archive) |
|---|---|
| State | Not Applicable, archived |
On 20/11/2013 18:04, Peter Zijlstra wrote:
> The only valid use of preempt_enable_no_resched() is if the very next
> line is schedule() or if we know preemption cannot actually be enabled
> by that statement due to known more preempt_count 'refs'.

The reason I used the no resched version is that busy_poll_end_time()
is almost always called with rcu read lock held, so it seemed the more
correct option.

I have no issue with you changing this.

> As to the busy_poll mess; that looks to be completely and utterly
> broken, sched_clock() can return utter garbage with interrupts enabled
> (rare but still), it can drift unbounded between CPUs, so if you get
> preempted/migrated and your new CPU is years behind on the previous
> CPU we get to busy spin for a _very_ long time. There is a _REASON_
> sched_clock() warns about preemptability - papering over it with a
> preempt_disable()/preempt_enable_no_resched() is just terminal brain
> damage on so many levels.

IMHO this has been reviewed thoroughly.

When Ben Hutchings voiced concerns I rewrote the code to use time_after,
so even if you do get switched over to a CPU where the time is random
you will at most poll another full interval.

Linus asked me to remove this since it makes us use two time values
instead of one; see https://lkml.org/lkml/2013/7/8/345.

Cheers,
Eliezer

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> On 20/11/2013 18:04, Peter Zijlstra wrote:
> > The only valid use of preempt_enable_no_resched() is if the very next
> > line is schedule() or if we know preemption cannot actually be enabled
> > by that statement due to known more preempt_count 'refs'.
>
> The reason I used the no resched version is that busy_poll_end_time()
> is almost always called with rcu read lock held, so it seemed the more
> correct option.
>
> I have no issue with you changing this.

There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
holding rcu_read_lock().

Also, preempt_enable() only schedules when it's possible to schedule, so
calling it when you know you cannot schedule is no issue.

> > As to the busy_poll mess; that looks to be completely and utterly
> > broken, sched_clock() can return utter garbage with interrupts enabled
> > (rare but still), it can drift unbounded between CPUs, so if you get
> > preempted/migrated and your new CPU is years behind on the previous
> > CPU we get to busy spin for a _very_ long time. There is a _REASON_
> > sched_clock() warns about preemptability - papering over it with a
> > preempt_disable()/preempt_enable_no_resched() is just terminal brain
> > damage on so many levels.
>
> IMHO This has been reviewed thoroughly.

At the very least you completely forgot to preserve any of that. The
changelog that introduced it is completely void of anything useful and
the code has a distinct lack of comments.

> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
>
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

My brain is fried for today, I'll have a look tomorrow. But note that
with patch 7/7 in place modular code can no longer use
preempt_enable_no_resched().
I'm not sure net/ipv4/tcp.c can be built modular -- but ISTR a time when
it was.
On 20/11/2013 20:15, Peter Zijlstra wrote:
> There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
> holding rcu_read_lock().
>
> Also, preempt_enable() only schedules when its possible to schedule, so
> calling it when you know you cannot schedule is no issue.

I have no issue with you changing busy_loop_us_clock() to use a regular
preempt enable.

I think that we still need to only do this if CONFIG_DEBUG_PREEMPT is
on. When it's off we should use the alternate implementation. We are
silencing a warning, but this is a performance-critical path, and we
think we know what we are doing.

I tried to explain this in the comments. If you think my comments are
not clear enough, I'm open to suggestions.

Cheers,
Eliezer
On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> IMHO This has been reviewed thoroughly.
>
> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
>
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

I'm not sure I see how this would be true.

So the do_select() code basically does:

	for (;;) {

		/* actual poll loop */

		if (!need_resched()) {
			if (!busy_end) {
				busy_end = now() + busypoll;
				continue;
			}
			if (!((long)(busy_end - now()) < 0))
				continue;
		}

		/* go sleep */

	}

So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and
we start by:

	busy_end = now() + busypoll;	/* CPU0: 60e9 + d */

but then we migrate to CPU1 and do:

	busy_end - now()		/* CPU1: 60e9 + d' */

and find we're still a minute out; and in fact we'll keep spinning for
that entire minute barring a need_resched().

Surely that's not intended and desired?
On 21/11/2013 12:10, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
>> IMHO This has been reviewed thoroughly.
>>
>> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
>> so even if you do get switched over to a CPU where the time is random
>> you will at most poll another full interval.
>>
>> Linus asked me to remove this since it makes us use two time values
>> instead of one. see https://lkml.org/lkml/2013/7/8/345.
>
> I'm not sure I see how this would be true.
>
> So the do_select() code basically does:
>
> 	for (;;) {
>
> 		/* actual poll loop */
>
> 		if (!need_resched()) {
> 			if (!busy_end) {
> 				busy_end = now() + busypoll;
> 				continue;
> 			}
> 			if (!((long)(busy_end - now()) < 0))
> 				continue;
> 		}
>
> 		/* go sleep */
>
> 	}
>
> So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and
> we start by:
>
> 	busy_end = now() + busypoll;	/* CPU0: 60e9 + d */
>
> but then we migrate to CPU1 and do:
>
> 	busy_end - now()		/* CPU1: 60e9 + d' */
>
> and find we're still a minute out; and in fact we'll keep spinning for
> that entire minute barring a need_resched().

Not exactly; poll will return if there are any events to report or if
a signal is pending.

> Surely that's not intended and desired?

This limit is an extra safety net; because busy polling is expensive,
we limit the time we are willing to do it. We don't override any limit
the user has put on the system call. A signal or having events to
report will also stop the looping. So we are mostly capping the
resources an _idle_ system will waste on busy polling.

We want to globally cap the amount of time the system busy polls, on
average. Nothing catastrophic will happen in the extremely rare
occasion that we miss.

The alternative is to use one more int on every poll/select all the
time; this seems like a bigger cost.
On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> On 21/11/2013 12:10, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> >> IMHO This has been reviewed thoroughly.
> >>
> >> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> >> so even if you do get switched over to a CPU where the time is random
> >> you will at most poll another full interval.
> >>
> >> Linus asked me to remove this since it makes us use two time values
> >> instead of one. see https://lkml.org/lkml/2013/7/8/345.
> >
> > I'm not sure I see how this would be true.
> >
> > So the do_select() code basically does:
> >
> > 	for (;;) {
> >
> > 		/* actual poll loop */
> >
> > 		if (!need_resched()) {
> > 			if (!busy_end) {
> > 				busy_end = now() + busypoll;
> > 				continue;
> > 			}
> > 			if (!((long)(busy_end - now()) < 0))
> > 				continue;
> > 		}
> >
> > 		/* go sleep */
> >
> > 	}
> >
> > So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and
> > we start by:
> >
> > 	busy_end = now() + busypoll;	/* CPU0: 60e9 + d */
> >
> > but then we migrate to CPU1 and do:
> >
> > 	busy_end - now()		/* CPU1: 60e9 + d' */
> >
> > and find we're still a minute out; and in fact we'll keep spinning for
> > that entire minute barring a need_resched().
>
> not exactly, poll will return if there are any events to report or if
> a signal is pending.

Sure, but lacking any of those, you're now busy waiting for a minute.

> > Surely that's not intended and desired?
>
> This limit is an extra safety net, because busy polling is expensive,
> we limit the time we are willing to do it.

I just said your limit 'sysctl_net_busy_poll' isn't meaningful in any
way, shape or fashion.

> We don't override any limit the user has put on the system call.

You are in fact; note how the normal select @endtime argument is only
set up _after_ you're done polling. So if the syscall had a timeout of
5 seconds, you just blew it by 55.

> A signal or having events to report will also stop the looping.
> So we are mostly capping the resources an _idle_ system will waste
> on busy polling.

I repeat: you're not actually capping anything.

> We want to globally cap the amount of time the system busy polls, on
> average. Nothing catastrophic will happen in the extremely rare occasion
> that we miss.
>
> The alternative is to use one more int on every poll/select all the
> time, this seems like a bigger cost.

No, 'int' has nothing to do with it; using a semi-sane timesource does.
On 21/11/2013 15:39, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
>> We don't override any limit the user has put on the system call.
>
> You are in fact, note how the normal select @endtime argument is only
> set up _after_ you're done polling. So if the syscall had a timeout of 5
> seconds, you just blew it by 55.

That's a bug, we will fix it.

Cheers,
Eliezer
On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> On 21/11/2013 15:39, Peter Zijlstra wrote:
> > On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
>
> >> We don't override any limit the user has put on the system call.
> >
> > You are in fact, note how the normal select @endtime argument is only
> > set up _after_ you're done polling. So if the syscall had a timeout of 5
> > seconds, you just blew it by 55.
>
> That's a bug, we will fix it.

The entire thing is a bug, and if I could take away sched_clock() from
you I would, but sadly it's all inlines so I can't :-(

Please use local_clock(); yes, it's slightly more expensive, but I doubt
you can actually measure the effects on sane hardware.
On 22/11/2013 13:30, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
>> On 21/11/2013 15:39, Peter Zijlstra wrote:
>>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
>
> Please use local_clock(), yes its slightly more expensive, but I doubt
> you can actually measure the effects on sane hardware.

If we limit the discussion to sane hardware, I should mention that on
current Intel CPUs the TSC is guaranteed to be monotonic for anything up
to 8 sockets. Even on slightly older HW the TSC skew is very small and
should not be an issue for this use case.

So:
Modern sane HW does not have this issue.
The people that do busy polling typically pin tasks to cores anyway.
You need cap_net_admin to use this setting.
There is no real damage if the issue happens.
This is a fast-low-latency path, so we are very sensitive to adding even
a small cost.
Linus really didn't like adding to the cost of poll/select when busy
polling is not being used.

Having said that, since we need to fix the timeout issue you pointed
out, we will test the use of local_clock() and see if it matters or not.

Again, I have no objection to changing the use of
preempt_enable_no_resched() to a plain preempt_enable().

Cheers,
Eliezer
On Tue, 26 Nov 2013, Eliezer Tamir wrote:
> On 22/11/2013 13:30, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> >> On 21/11/2013 15:39, Peter Zijlstra wrote:
> >>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> >
> > Please use local_clock(), yes its slightly more expensive, but I doubt
> > you can actually measure the effects on sane hardware.
>
> If we limit the discussion to sane hardware, I should mention that on
> current Intel CPUs TSC is guaranteed to be monotonic for anything up to
> 8 sockets. Even on slightly older HW TSC skew is very small and should
> not be an issue for this use case.

> Modern sane HW does not have this issue.

That's wrong to begin with. There is no such thing which qualifies as
"sane hardware". Especially not if we are talking about timers.

> The people that do busy polling typically pin tasks to cores anyway.

This is completely irrelevant. If stuff falls apart if the task is not
pinned, then you lost nevertheless.

> You need cap_net_admin to use this setting.

And how is that relevant? cap_net_admin does not change the fact that
you violate your constraints.

> There is no real damage if the issue happens.

You're violating the constraints, which is not fatal, but not desired
either.

> This is a fast-low-latency path, so we are very sensitive to adding
> even a small cost.
> Linus really didn't like adding to the cost of poll/select when busy
> polling is not being used.

And that justifies exposing those who do not have access to "sane"
hardware and/or did not pin their tasks to constraint violation?

> Having said that, since we need to fix the timeout issue you pointed
> out, we will test the use of local_clock() and see if it matters or
> not.

If the hardware provides an indicator that the TSC is sane to use, then
sched_clock_stable is 1, so local_clock() will not do the slow update
dance at all.

So for "sane" hardware the overhead is minimal, and on crappy hardware
correctness is still ensured, at more overhead.

If you are really concerned about the minimal overhead in the
sched_clock_stable == 1 case, then you better fix that (it's doable with
some brain) instead of hacking broken crap, based on even more broken
assumptions, into the networking code.

It's not the kernel's fault that we need to deal with
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK at all. And we have to deal with it no
matter what, so we cannot make this undone by magic assumptions.
Complain to those who forced us to do this. Hint: it's only ONE CPU
vendor who thought that providing useless timestamps is a brilliant
idea.

Thanks,
tglx
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,23 @@ static inline bool net_busy_loop_on(void
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
 	u64 rc;
 
+	/*
+	 * XXX with interrupts enabled sched_clock() can return utter garbage
+	 * Furthermore, it can have unbounded drift between CPUs, so the below
+	 * usage is terminally broken and only serves to shut up a valid debug
+	 * warning.
+	 */
+
 	preempt_disable_notrace();
 	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return rc >> 10;
 }
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
-}
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif