Message ID:   20220407075242.118253-2-asavkov@redhat.com
State:        Not Applicable
Delegated to: Netdev Maintainers
Series:       Upper bound kernel timers
On Thu, Apr 07 2022 at 09:52, Artem Savkov wrote:
> Current timer wheel implementation is optimized for performance and
> energy usage but lacks in precision. This, normally, is not a problem as
> most timers that use timer wheel are used for timeouts and thus rarely
> expire, instead they often get canceled or modified before expiration.
> Even when they don't, expiring a bit late is not an issue for timeout
> timers.
>
> TCP keepalive timer is a special case, its aim is to prevent timeouts,
> so triggering earlier rather than later is desired behavior. In a
> reported case the user had a 3600s keepalive timer for preventing firewall
> disconnects (on a 3650s interval). They observed keepalive timers coming
> in up to four minutes late, causing unexpected disconnects.
>
> This commit adds upper_bound_timeout() function that takes a relative

s/This commit adds/Add a new function to ..../

See Documentation/process/*

> timeout and adjusts it based on timer wheel granularity so that the
> supplied value effectively becomes an upper bound for the timer.
>
> -static int calc_wheel_index(unsigned long expires, unsigned long clk,
> -                            unsigned long *bucket_expiry)
> +static inline int get_wheel_lvl(unsigned long delta)
>  {
> -        unsigned long delta = expires - clk;
> -        unsigned int idx;
> -
>          if (delta < LVL_START(1)) {
> -                idx = calc_index(expires, 0, bucket_expiry);
> +                return 0;
>          } else if (delta < LVL_START(2)) {

Can you please get rid of all those brackets?

> +        return -1;
> +}
> +
> +static int calc_wheel_index(unsigned long expires, unsigned long clk,
> +                            unsigned long *bucket_expiry)
> +{
> +        unsigned long delta = expires - clk;
> +        unsigned int idx;
> +        int lvl = get_wheel_lvl(delta);
> +
> +        if (lvl >= 0) {
> +                idx = calc_index(expires, lvl, bucket_expiry);
>          } else if ((long) delta < 0) {
>                  idx = clk & LVL_MASK;
>                  *bucket_expiry = clk;
> @@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
>          return idx;
>  }

This generates horrible code on various compilers. I ran that through a
couple of perf test scenarios and came up with the following, which
still is a tad slower for the level 0 case depending on the branch
predictor state. But it at least prevents the compilers from doing
stupid things and on average it's on par.

Though the level 0 case matters because of *drumroll* networking.

static int calc_wheel_index(unsigned long expires, unsigned long clk,
                            unsigned long *bucket_expiry)
{
        unsigned long delta = expires - clk;
        int lvl = get_wheel_lvl(delta);

        if (likely(lvl >= 0))
                return __calc_index(expires, lvl, bucket_expiry);

        if ((long) delta < 0) {
                *bucket_expiry = clk;
                return clk & LVL_MASK;
        }

        /*
         * Force expire obscene large timeouts to expire at the capacity
         * limit of the wheel.
         */
        if (delta >= WHEEL_TIMEOUT_CUTOFF)
                expires = clk + WHEEL_TIMEOUT_MAX;

        return __calc_index(expires, LVL_DEPTH - 1, bucket_expiry);
}

Just for the record: I told you last time that your patch creates a
measurable overhead and I explained to you in depth why the performance
of this stupid thing matters. So why are you not providing a proper
analysis for that?

> +/**
> + * upper_bound_timeout - return granularity-adjusted timeout
> + * @timeout: timeout value in jiffies
> + *
> + * This function returns the supplied timeout adjusted based on timer wheel
> + * granularity, effectively making the supplied value an upper bound at which
> + * the timer will expire. Due to the way the timer wheel works, timeouts
> + * smaller than LVL_GRAN on their respective levels will be _at least_
> + * LVL_GRAN(lvl) - LVL_GRAN(lvl - 1) jiffies early.

Contrary to the simple "timeout - timeout/8" this gives better accuracy
as it does not converge to the early side for long timeouts.

With the quirk that this cuts timeout=1 to 0, which means it expires
immediately. The wonders of integer math avoid that with the simple
timeout -= timeout >> 3 approach for timeouts up to 8 ticks. :)

But that wants to be properly documented.

> +unsigned long upper_bound_timeout(unsigned long timeout)
> +{
> +        int lvl = get_wheel_lvl(timeout);

which is equivalent to:

        lvl = calc_wheel_index(timeout, 0, &dummy) >> LVL_BITS;

Sorry, could not resist. :)

The more interesting question is how frequently this upper bound
function is used. It's definitely not something which you want to
inflict onto a high frequency (re)arming timer.

Did you analyse that? And if so, then why is that analysis missing from
the change log of the keepalive timer patch?

Aside of that, it clearly lacks any argument why the simple, stupid, but
fast approach of shortening the timeout by 12.5% is not good enough and
why we need yet another function which is just going to be another
source of 'optimizations' for the wrong reasons.

Seriously, I appreciate that you want to make this 'perfect', but it's
never going to be perfect and the real question is whether there is any
reasonable difference between 'good' and almost 'perfect'.

And this clearly resonates in your changelog of the network patch:

 "Make sure TCP keepalive timer does not expire late. Switching to upper
  bound timers means it can fire off early but in case of keepalive
  tcp_keepalive_timer() handler checks elapsed time and resets the timer
  if it was triggered early. This results in timer "cascading" to a
  higher precision and being just a couple of milliseconds off its
  original mark."

Which reinvents the cascading effect of the original timer wheel just
with more overhead. Where is the justification for this?

Is this really true for all the reasons where the keep alive timers are
armed? I seriously doubt that. Why?

On the end which waits for the keep alive packet to arrive in time it
does not matter at all whether the cutoff is a bit later than defined.

So why do you want to let the timer fire early just to rearm it?

But it matters a lot on the sender side. If that is late and the other
end is strict about the timeout, then you lost. But does it matter
whether you send the packet too early? No, it does not matter at all
because the important point is that you send it _before_ the other side
decides to give up.

So why do you want to let the timer fire precisely?

You are solving the sender side problem by introducing a receiver side
problem, and both suffer from the overhead for no reason.

Aside of the theoretical issue of why this matters at all, I have yet to
see a reasonable argument what the practical problem is. If this were a
real problem in the wild, then why haven't we seen a reasonable bug
report within 6 years?

Thanks,

        tglx
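For readers following along, the trade-off described above can be
reproduced in a few lines of user-space C. This is a minimal sketch
under assumed constants (the LVL_* values below mirror the mainline
timer wheel defaults, and the catch-all level is simplified by
dropping the cutoff handling); it is not the kernel code:

        #include <stdio.h>

        /* Assumed constants mirroring the mainline timer wheel defaults. */
        #define LVL_CLK_SHIFT   3
        #define LVL_BITS        6
        #define LVL_SIZE        (1UL << LVL_BITS)
        #define LVL_GRAN(n)     (1UL << (LVL_CLK_SHIFT * (n)))
        #define LVL_START(n)    ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))

        /* Same level ladder as the patch's get_wheel_lvl(), as a loop. */
        static int wheel_lvl(unsigned long delta)
        {
                for (int lvl = 0; lvl < 8; lvl++) {
                        if (delta < LVL_START(lvl + 1))
                                return lvl;
                }
                return 8;       /* catch-all level, cutoff handling elided */
        }

        /* Level-aware adjustment, as in the proposed upper_bound_timeout(). */
        static unsigned long upper_bound(unsigned long timeout)
        {
                unsigned long gran = LVL_GRAN(wheel_lvl(timeout));

                return gran > timeout ? 0 : timeout - gran;
        }

        /* The simple "shorten by 12.5%" alternative. */
        static unsigned long simple_bound(unsigned long timeout)
        {
                return timeout - (timeout >> 3);
        }

        int main(void)
        {
                /* timeout=1: level math yields 0 (fires immediately),
                 * while the shift keeps it at 1 -- the quirk noted above. */
                for (unsigned long t = 1; t <= 10000000; t *= 10)
                        printf("%9lu -> level-aware %9lu, 12.5%% %9lu\n",
                               t, upper_bound(t), simple_bound(t));
                return 0;
        }

For timeout=1 the level-aware variant prints 0 (immediate expiry) while
the shift keeps 1; for timeout=10000000 the shift cuts 1.25M jiffies
where a single level-6 slot (262144 jiffies) would do, illustrating the
"converges to the early side" point.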
On Fri, Apr 08, 2022 at 02:37:25AM +0200, Thomas Gleixner wrote:
> "Make sure TCP keepalive timer does not expire late. Switching to upper
> bound timers means it can fire off early but in case of keepalive
> tcp_keepalive_timer() handler checks elapsed time and resets the timer
> if it was triggered early. This results in timer "cascading" to a
> higher precision and being just a couple of milliseconds off its
> original mark."
>
> Which reinvents the cascading effect of the original timer wheel just
> with more overhead. Where is the justification for this?
>
> Is this really true for all the reasons where the keep alive timers are
> armed? I seriously doubt that. Why?
>
> On the end which waits for the keep alive packet to arrive in time it
> does not matter at all whether the cutoff is a bit later than defined.
>
> So why do you want to let the timer fire early just to rearm it?
>
> But it matters a lot on the sender side. If that is late and the other
> end is strict about the timeout, then you lost. But does it matter
> whether you send the packet too early? No, it does not matter at all
> because the important point is that you send it _before_ the other side
> decides to give up.
>
> So why do you want to let the timer fire precisely?
>
> You are solving the sender side problem by introducing a receiver side
> problem, and both suffer from the overhead for no reason.

Here are my thoughts. Maybe some networking folks can chime in to keep
us honest.

I get most of what you're saying, though my understanding is that
keepalive is only involved in sending packets, not receiving them. I do
think there would be two opposing use cases:

1) Client sending packets to prevent server disconnects

2) Server sending packets to detect client disconnects

For #1, it's ok for the timer to pop early. For #2, it's ok for it to
pop late.

So my conclusion is about the same as your sender/receiver scenario:
there are two sides to the same coin.

If we assume both use cases are valid (which I'm not entirely convinced
of), doesn't that mean that the keepalive timer needs to be precise?
Otherwise we're going to have broken expectations in one direction or
the other, depending on the use case.

> Aside of the theoretical issue of why this matters at all, I have yet to
> see a reasonable argument what the practical problem is. If this were a
> real problem in the wild, then why haven't we seen a reasonable bug
> report within 6 years?

Good question. At least part of the answer *might* be that enterprise
kernels tend to be adopted very slowly. This issue was reported on RHEL
8.3, which is a 4.18-based kernel:

  The time that the 1st TCP keepalive probe is sent can be configured by
  the "net.ipv4.tcp_keepalive_time" sysctl or by setsockopt(). We
  observe that if that value is set to 300 seconds, the timer actually
  fires around 15-20 seconds later. So ~317 seconds. The larger the
  expiration time the greater the delay. So for the default of 2 hours
  it can be delayed by minutes.

  This is causing problems for some customers that rely on the TCP
  keepalive timer to keep entries active in firewalls and expect it to
  be accurate, as TCP keepalive values have to correspond to the
  firewall settings.
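The reported numbers line up with the wheel granularity arithmetic,
assuming HZ=1000 and the mainline level constants (neither is stated in
the report, so treat this as a back-of-the-envelope check):

  300 s = 300000 jiffies, which lands in level 5
  (LVL_START(5) = 63 << 12 = 258048 <= 300000 < LVL_START(6) = 2064384),
  where the slot granularity is LVL_GRAN(5) = 2^15 = 32768 jiffies,
  i.e. ~32.8 s.

So a 300 s timer can expire anywhere up to ~32 s late, which brackets
the observed 15-20 s. The 2-hour default (7200000 jiffies) lands in
level 6 with a 2^18-jiffy (~262 s) granularity, matching "delayed by
minutes".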
On Fri, Apr 08, 2022 at 02:37:25AM +0200, Thomas Gleixner wrote:
> On Thu, Apr 07 2022 at 09:52, Artem Savkov wrote:
> > +        return -1;
> > +}
> > +
> > +static int calc_wheel_index(unsigned long expires, unsigned long clk,
> > +                            unsigned long *bucket_expiry)
> > +{
> > +        unsigned long delta = expires - clk;
> > +        unsigned int idx;
> > +        int lvl = get_wheel_lvl(delta);
> > +
> > +        if (lvl >= 0) {
> > +                idx = calc_index(expires, lvl, bucket_expiry);
> >          } else if ((long) delta < 0) {
> >                  idx = clk & LVL_MASK;
> >                  *bucket_expiry = clk;
> > @@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
> >          return idx;
> >  }
>
> This generates horrible code on various compilers. I ran that through a
> couple of perf test scenarios and came up with the following, which
> still is a tad slower for the level 0 case depending on the branch
> predictor state. But it at least prevents the compilers from doing
> stupid things and on average it's on par.
>
> Though the level 0 case matters because of *drumroll* networking.
>
> Just for the record: I told you last time that your patch creates a
> measurable overhead and I explained to you in depth why the performance
> of this stupid thing matters. So why are you not providing a proper
> analysis for that?

I did do a simple check measuring the time it takes calc_wheel_index to
execute, and it turned out a tad lower after the patch. Your
measurements are sure to be much more refined, so would you give me any
pointers on what to measure and which cases to check to get a better
view of the impact of these patches?

> > +/**
> > + * upper_bound_timeout - return granularity-adjusted timeout
> > + * @timeout: timeout value in jiffies
> > + *
> > + * This function returns the supplied timeout adjusted based on timer wheel
> > + * granularity, effectively making the supplied value an upper bound at which
> > + * the timer will expire. Due to the way the timer wheel works, timeouts
> > + * smaller than LVL_GRAN on their respective levels will be _at least_
> > + * LVL_GRAN(lvl) - LVL_GRAN(lvl - 1) jiffies early.
>
> Contrary to the simple "timeout - timeout/8" this gives better accuracy
> as it does not converge to the early side for long timeouts.
>
> With the quirk that this cuts timeout=1 to 0, which means it expires
> immediately. The wonders of integer math avoid that with the simple
> timeout -= timeout >> 3 approach for timeouts up to 8 ticks. :)
>
> But that wants to be properly documented.
>
> > +unsigned long upper_bound_timeout(unsigned long timeout)
> > +{
> > +        int lvl = get_wheel_lvl(timeout);
>
> which is equivalent to:
>
>         lvl = calc_wheel_index(timeout, 0, &dummy) >> LVL_BITS;
>
> Sorry, could not resist. :)

Right, but that would mean significantly more overhead, right?

> The more interesting question is how frequently this upper bound
> function is used. It's definitely not something which you want to
> inflict onto a high frequency (re)arming timer.

As of now, not that frequently. Instead of re-arming the timer the
networking code allows it to expire and makes the decisions in the
handler, so it shouldn't be a problem because keepalives are usually
quite big.

> Did you analyse that? And if so, then why is that analysis missing from
> the change log of the keepalive timer patch?

I agree, this needs to be added.

> Aside of that, it clearly lacks any argument why the simple, stupid, but
> fast approach of shortening the timeout by 12.5% is not good enough and
> why we need yet another function which is just going to be another
> source of 'optimizations' for the wrong reasons.

I am just not used to this mode of thinking, I guess. That approach
would be more efficient at the cost of being less obvious, and it would
have to be remembered in case the 12.5% figure changes.

> Seriously, I appreciate that you want to make this 'perfect', but it's
> never going to be perfect and the real question is whether there is any
> reasonable difference between 'good' and almost 'perfect'.
>
> And this clearly resonates in your changelog of the network patch:
>
> "Make sure TCP keepalive timer does not expire late. Switching to upper
> bound timers means it can fire off early but in case of keepalive
> tcp_keepalive_timer() handler checks elapsed time and resets the timer
> if it was triggered early. This results in timer "cascading" to a
> higher precision and being just a couple of milliseconds off its
> original mark."
>
> Which reinvents the cascading effect of the original timer wheel just
> with more overhead. Where is the justification for this?
>
> Is this really true for all the reasons where the keep alive timers are
> armed? I seriously doubt that. Why?
>
> On the end which waits for the keep alive packet to arrive in time it
> does not matter at all whether the cutoff is a bit later than defined.
>
> So why do you want to let the timer fire early just to rearm it?
>
> But it matters a lot on the sender side. If that is late and the other
> end is strict about the timeout, then you lost. But does it matter
> whether you send the packet too early? No, it does not matter at all
> because the important point is that you send it _before_ the other side
> decides to give up.
>
> So why do you want to let the timer fire precisely?
>
> You are solving the sender side problem by introducing a receiver side
> problem, and both suffer from the overhead for no reason.

I was hoping to discuss this during the second patch review. The
keepalive timer handler's complexity makes me think it handles a lot of
cases I am not currently aware of, and that dropping the "cascading"
code would cause problems in places not obvious to me.

Josh's point also makes sense to me. For cases when keepalive is used
to check for dead clients I don't think we want to be early.

> Aside of the theoretical issue of why this matters at all, I have yet to
> see a reasonable argument what the practical problem is. If this were a
> real problem in the wild, then why haven't we seen a reasonable bug
> report within 6 years?

I think it is unusual for keepalive timers to be set so close to the
timeout, so it is not an easy problem to hit. But regardless of that,
from what I saw in related discussions nobody sees this keepalive timer
behavior as normal, let alone expected. If you think it is better to
keep it as is, this will need to be clearly described and justified
somewhere, but I am not sure how one would approach that.
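On the measurement question raised above, one first-order approach is a
tight loop around the function itself. This is a hypothetical sketch:
calc_wheel_index() is static to kernel/time/timer.c, so it assumes the
loop is temporarily compiled into that file behind a debug option.

        /* Hypothetical microbenchmark, dropped into kernel/time/timer.c. */
        static int __init wheel_index_bench(void)
        {
                unsigned long bucket_expiry, sink = 0;
                unsigned long clk = jiffies;
                u64 t0, t1;
                int i;

                t0 = ktime_get_ns();
                for (i = 0; i < 1000000; i++)
                        sink += calc_wheel_index(clk + i, clk, &bucket_expiry);
                t1 = ktime_get_ns();

                /* sink keeps the compiler from discarding the loop */
                pr_info("calc_wheel_index: %llu ns for 1M calls, sink %lu\n",
                        t1 - t0, sink);
                return 0;
        }
        late_initcall(wheel_index_bench);

Note the caveat implicit in the remark about branch predictor state: a
monotone delta sequence like this trains the predictor and shows close
to the best case. Exercising deltas that hop between levels, and
profiling a real network load with perf, gets closer to the cost that
actually matters.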
diff --git a/include/linux/timer.h b/include/linux/timer.h
index fda13c9d1256c..b209d31d543f0 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -168,6 +168,7 @@ static inline int timer_pending(const struct timer_list * timer)
 	return !hlist_unhashed_lockless(&timer->entry);
 }
 
+extern unsigned long upper_bound_timeout(unsigned long timeout);
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 85f1021ad4595..a645b62e257e2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -507,28 +507,38 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl,
 	return LVL_OFFS(lvl) + (expires & LVL_MASK);
 }
 
-static int calc_wheel_index(unsigned long expires, unsigned long clk,
-			    unsigned long *bucket_expiry)
+static inline int get_wheel_lvl(unsigned long delta)
 {
-	unsigned long delta = expires - clk;
-	unsigned int idx;
-
 	if (delta < LVL_START(1)) {
-		idx = calc_index(expires, 0, bucket_expiry);
+		return 0;
 	} else if (delta < LVL_START(2)) {
-		idx = calc_index(expires, 1, bucket_expiry);
+		return 1;
 	} else if (delta < LVL_START(3)) {
-		idx = calc_index(expires, 2, bucket_expiry);
+		return 2;
 	} else if (delta < LVL_START(4)) {
-		idx = calc_index(expires, 3, bucket_expiry);
+		return 3;
 	} else if (delta < LVL_START(5)) {
-		idx = calc_index(expires, 4, bucket_expiry);
+		return 4;
 	} else if (delta < LVL_START(6)) {
-		idx = calc_index(expires, 5, bucket_expiry);
+		return 5;
 	} else if (delta < LVL_START(7)) {
-		idx = calc_index(expires, 6, bucket_expiry);
+		return 6;
 	} else if (LVL_DEPTH > 8 && delta < LVL_START(8)) {
-		idx = calc_index(expires, 7, bucket_expiry);
+		return 7;
+	}
+
+	return -1;
+}
+
+static int calc_wheel_index(unsigned long expires, unsigned long clk,
+			    unsigned long *bucket_expiry)
+{
+	unsigned long delta = expires - clk;
+	unsigned int idx;
+	int lvl = get_wheel_lvl(delta);
+
+	if (lvl >= 0) {
+		idx = calc_index(expires, lvl, bucket_expiry);
 	} else if ((long) delta < 0) {
 		idx = clk & LVL_MASK;
 		*bucket_expiry = clk;
@@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
 	return idx;
 }
 
+/**
+ * upper_bound_timeout - return granularity-adjusted timeout
+ * @timeout: timeout value in jiffies
+ *
+ * This function returns the supplied timeout adjusted based on timer wheel
+ * granularity, effectively making the supplied value an upper bound at which
+ * the timer will expire. Due to the way the timer wheel works, timeouts
+ * smaller than LVL_GRAN on their respective levels will be _at least_
+ * LVL_GRAN(lvl) - LVL_GRAN(lvl - 1) jiffies early.
+ */
+unsigned long upper_bound_timeout(unsigned long timeout)
+{
+	int lvl = get_wheel_lvl(timeout);
+
+	if (lvl < 0) {
+		if ((long) timeout < 0) {
+			/*
+			 * This will expire immediately so no adjustment
+			 * needed.
+			 */
+			return timeout;
+		} else {
+			if (timeout > WHEEL_TIMEOUT_CUTOFF)
+				timeout = WHEEL_TIMEOUT_CUTOFF;
+			lvl = LVL_DEPTH - 1;
+		}
+	}
+
+	return LVL_GRAN(lvl) > timeout ? 0 : timeout - LVL_GRAN(lvl);
+}
+EXPORT_SYMBOL(upper_bound_timeout);
+
 static void trigger_dyntick_cpu(struct timer_base *base,
 				struct timer_list *timer)
 {
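The networking patch in this series is not shown on this page. As a
hedged sketch only, inferred from the changelog rather than quoted from
the patch, an upper-bound arm site for the keepalive timer would look
roughly like this (the wrapper function is hypothetical;
inet_csk_reset_keepalive_timer() is the existing helper):

        #include <net/inet_connection_sock.h>

        /*
         * Hypothetical helper: clamp the relative timeout so that wheel
         * granularity can only make the timer fire early, never late.
         */
        static void keepalive_arm_upper_bound(struct sock *sk,
                                              unsigned long timeout)
        {
                inet_csk_reset_keepalive_timer(sk, upper_bound_timeout(timeout));
        }

Whether such an arm-time clamp belongs at every keepalive site is
exactly the sender/receiver question debated in the thread above.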
Current timer wheel implementation is optimized for performance and
energy usage but lacks in precision. This, normally, is not a problem as
most timers that use timer wheel are used for timeouts and thus rarely
expire, instead they often get canceled or modified before expiration.
Even when they don't, expiring a bit late is not an issue for timeout
timers.

TCP keepalive timer is a special case, its aim is to prevent timeouts,
so triggering earlier rather than later is desired behavior. In a
reported case the user had a 3600s keepalive timer for preventing firewall
disconnects (on a 3650s interval). They observed keepalive timers coming
in up to four minutes late, causing unexpected disconnects.

This commit adds upper_bound_timeout() function that takes a relative
timeout and adjusts it based on timer wheel granularity so that the
supplied value effectively becomes an upper bound for the timer.

This was previously discussed here:
https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 include/linux/timer.h |  1 +
 kernel/time/timer.c   | 68 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 56 insertions(+), 13 deletions(-)
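The "up to four minutes late" figure is consistent with the wheel
arithmetic, again assuming HZ=1000 and the mainline level constants
(assumptions, as the changelog does not state them):

  3600 s = 3600000 jiffies lands in level 6
  (2064384 <= 3600000 < 16515072), whose granularity is
  LVL_GRAN(6) = 2^18 = 262144 jiffies ~= 262 s, i.e. ~4.4 minutes of
  worst-case lateness.

With upper_bound_timeout() the same timer is shortened by one level-6
slot, so it can only fire up to ~4.4 minutes early instead, which keeps
it under the 3650 s firewall interval from the report.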