
[v3,1/2] timer: add a function to adjust timeouts to be upper bound

Message ID 20220330082046.3512424-2-asavkov@redhat.com (mailing list archive)
State Superseded
Series Upper bound kernel timers

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 18116 this patch: 18116
netdev/cc_maintainers fail 2 maintainers not CCed: sboyd@kernel.org john.stultz@linaro.org
netdev/build_clang success Errors and warnings before: 3329 this patch: 3329
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 17356 this patch: 17356
netdev/checkpatch warning CHECK: No space is necessary after a cast CHECK: extern prototypes should be avoided in .h files WARNING: else is not generally useful after a break or return
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline fail Was 0 now: 1
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Artem Savkov March 30, 2022, 8:20 a.m. UTC
The current timer wheel implementation is optimized for performance and
energy usage but lacks precision. Normally this is not a problem, as
most timers that use the timer wheel are used for timeouts and thus rarely
expire; instead they often get canceled or modified before expiration.
Even when they don't, expiring a bit late is not an issue for timeout
timers.

The TCP keepalive timer is a special case: its aim is to prevent timeouts,
so triggering earlier rather than later is the desired behavior. In a
reported case the user had a 3600s keepalive timer for preventing firewall
disconnects (on a 3650s interval). They observed keepalive timers coming
in up to four minutes late, causing unexpected disconnects.

This commit adds an upper_bound_timeout() function that takes a relative
timeout and adjusts it based on timer wheel granularity so that the supplied
value effectively becomes an upper bound for the timer.
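
A minimal usage sketch (the caller and its name are hypothetical, not part of
this series): a subsystem that must not let its timer fire later than the
requested interval would adjust the relative timeout before arming the timer:

/* Hypothetical caller: fire no later than 'timeout' jiffies from now. */
static void arm_upper_bound_timer(struct timer_list *timer, unsigned long timeout)
{
	mod_timer(timer, jiffies + upper_bound_timeout(timeout));
}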

This was previously discussed here:
https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 include/linux/timer.h |  1 +
 kernel/time/timer.c   | 92 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 80 insertions(+), 13 deletions(-)

Comments

Anna-Maria Behnsen March 30, 2022, 1:40 p.m. UTC | #1
On Wed, 30 Mar 2022, Artem Savkov wrote:

> The current timer wheel implementation is optimized for performance and
> energy usage but lacks precision. Normally this is not a problem, as
> most timers that use the timer wheel are used for timeouts and thus rarely
> expire; instead they often get canceled or modified before expiration.
> Even when they don't, expiring a bit late is not an issue for timeout
> timers.
> 
> The TCP keepalive timer is a special case: its aim is to prevent timeouts,
> so triggering earlier rather than later is the desired behavior. In a
> reported case the user had a 3600s keepalive timer for preventing firewall
> disconnects (on a 3650s interval). They observed keepalive timers coming
> in up to four minutes late, causing unexpected disconnects.
> 
> This commit adds an upper_bound_timeout() function that takes a relative
> timeout and adjusts it based on timer wheel granularity so that the supplied
> value effectively becomes an upper bound for the timer.
> 

I think there is a problem with this approach. Please correct me if I'm
wrong. The timer wheel index and level calculation depends on
timer_base::clk. The timeout/delta which is used for this calculation is
relative to timer_base::clk (delta = expires - base::clk). timer_base::clk
is not updated in sync with jiffies. It is forwarded before a new timer is
queued. It is possible that timer_base::clk is behind jiffies after
forwarding because of a not yet expired timer.

When calculating the level/index with a relative timeout, there is no
guarantee that the result is the same when actually enqueueing the timer with
expiry = jiffies + timeout.
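
For concreteness, a small userspace sketch of that mismatch (illustration
only; the LVL_* values mirror the macros in kernel/time/timer.c and
wheel_lvl() stands in for the if/else ladder in calc_wheel_index()):

#include <stdio.h>

#define LVL_CLK_SHIFT	3
#define LVL_BITS	6
#define LVL_SIZE	(1UL << LVL_BITS)
#define LVL_SHIFT(n)	((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)	(1UL << LVL_SHIFT(n))
#define LVL_START(n)	((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))

/* Level selection as in the calc_wheel_index() ladder (deeper levels elided). */
static int wheel_lvl(unsigned long delta)
{
	int lvl;

	for (lvl = 0; lvl < 8; lvl++) {
		if (delta < LVL_START(lvl + 1))
			return lvl;
	}
	return 8;
}

int main(void)
{
	unsigned long timeout = 500;	/* relative timeout, just below LVL_START(2) == 504 */
	unsigned long lag = 8;		/* base->clk == jiffies - lag */

	/* Level picked if only the relative timeout is considered ... */
	printf("from timeout: level %d, granularity %lu\n",
	       wheel_lvl(timeout), LVL_GRAN(wheel_lvl(timeout)));
	/* ... vs. at enqueue time, where delta = (jiffies + timeout) - base->clk. */
	printf("at enqueue:   level %d, granularity %lu\n",
	       wheel_lvl(timeout + lag), LVL_GRAN(wheel_lvl(timeout + lag)));
	return 0;
}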

Thanks,

	Anna-Maria
Artem Savkov April 2, 2022, 6:55 a.m. UTC | #2
On Wed, Mar 30, 2022 at 03:40:55PM +0200, Anna-Maria Behnsen wrote:
> On Wed, 30 Mar 2022, Artem Savkov wrote:
> 
> > The current timer wheel implementation is optimized for performance and
> > energy usage but lacks precision. Normally this is not a problem, as
> > most timers that use the timer wheel are used for timeouts and thus rarely
> > expire; instead they often get canceled or modified before expiration.
> > Even when they don't, expiring a bit late is not an issue for timeout
> > timers.
> > 
> > The TCP keepalive timer is a special case: its aim is to prevent timeouts,
> > so triggering earlier rather than later is the desired behavior. In a
> > reported case the user had a 3600s keepalive timer for preventing firewall
> > disconnects (on a 3650s interval). They observed keepalive timers coming
> > in up to four minutes late, causing unexpected disconnects.
> > 
> > This commit adds an upper_bound_timeout() function that takes a relative
> > timeout and adjusts it based on timer wheel granularity so that the supplied
> > value effectively becomes an upper bound for the timer.
> > 
> 
> I think there is a problem with this approach. Please correct me if I'm
> wrong. The timer wheel index and level calculation depends on
> timer_base::clk. The timeout/delta which is used for this calculation is
> relative to timer_base::clk (delta = expires - base::clk). timer_base::clk
> is not updated in sync with jiffies. It is forwarded before a new timer is
> queued. It is possible that timer_base::clk is behind jiffies after
> forwarding because of a not yet expired timer.
> 
> When calculating the level/index with a relative timeout, there is no
> guarantee that the result is the same when actually enqueueing the timer with
> expiry = jiffies + timeout.

Yes, you are correct. This is especially a problem for timeouts placed
just before LVL_START(x), which is a good chunk of cases. I don't think
it is possible to get to the timer_base clock without meddling with the
hot path.

Is it possible to determine the upper limit of the error margin here? My
assumption is it shouldn't be very big, so maybe it would be enough to
account for this when adjusting the timeout at the edge of a level.
I know this doesn't sound good, but I am running out of ideas here.
Thomas Gleixner April 5, 2022, 3:33 p.m. UTC | #3
On Sat, Apr 02 2022 at 08:55, Artem Savkov wrote:
> On Wed, Mar 30, 2022 at 03:40:55PM +0200, Anna-Maria Behnsen wrote:
>> When calculating the level/index with a relative timeout, there is no
>> guarantee that the result is the same when actually enqueueing the timer with
>> expiry = jiffies + timeout.
>
> Yes, you are correct. This is especially a problem for timeouts placed
> just before LVL_START(x), which is a good chunk of cases. I don't
> think it is possible to get to the timer_base clock without meddling with
> the hot path.

Especially not when the timer might end up being migrated to a different
base, which would in the worst case require doing the calculation twice
if the base clocks differ.

The problem especially with network timers is that in the traffic case
the timers are often rearmed at high frequency. See the optimizations in
mod_timer() which try to avoid dequeue/enqueue sequences.

One of the main reasons for giving up on accuracy for the timer wheel
back then was to avoid two issues for networking:

    - Full rearming of a timer which ended up in the same expiry bucket
      due to software batching (incomplete, but with the same goal and
      the same 12.5% error margin).

    - Cascading bursts.

      We gathered statistics from various workloads which showed that
      99+% of all timers were canceled or rearmed before expiry, but
      under certain load scenarios a large portion of those timers were
      cascaded into a lower wheel level with smaller granularity before
      cancelation or rearming.

      As the wheel level granularity was strictly exponential, the event
      of cascading a large number (hint: thousands) of timers in one go
      was not uncommon. Cascading happened with interrupts disabled and
      it touched a usually cold cache line per cascaded timer...

> Is it possible to determine the upper limit of error margin here? My
> assumption is it shouldn't be very big, so maybe it would be enough to
> account for this when adjusting timeout at the edge of a level.
> I know this doesn't sound good but I am running out of ideas here.

Let's just take a step back.

So we know that the maximal error margin in the wheel is 12.5%, right?
That means, if you take your relative timeout and subtract 12.5% then
you are in the right ballpark and the earliest expiry will not be before
that point obviously, but it's also guaranteed not to expire later than
the original timeout. Obviously this will converge towards the early
expiry the longer the timeouts are, but it's bounded.
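
For concreteness, a minimal sketch of that suggestion (the helper name is
illustrative, not the interface proposed in this patch):

/*
 * The wheel's worst-case slack is 12.5% (1/8), so shaving timeout/8 off
 * the requested value up front keeps the effective expiry at or before
 * the originally requested timeout.
 */
static inline unsigned long shorten_for_wheel_slack(unsigned long timeout)
{
	return timeout - (timeout >> 3);
}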

Also due to the properties of the wheel, the lag of base::clk will
obviously only affect those levels where lag >= LVL_GRAN(level).

That's not perfect, but perfect is the enemy of good.

Thanks,

        tglx
Thomas Gleixner April 7, 2022, 11:26 p.m. UTC | #4
On Wed, Apr 06 2022 at 11:52, Artem Savkov wrote:

> On Tue, Apr 05, 2022 at 05:33:23PM +0200, Thomas Gleixner wrote:
>> On Sat, Apr 02 2022 at 08:55, Artem Savkov wrote:
>> > Is it possible to determine the upper limit of the error margin here? My
>> > assumption is it shouldn't be very big, so maybe it would be enough to
>> > account for this when adjusting the timeout at the edge of a level.
>> > I know this doesn't sound good, but I am running out of ideas here.
>> 
>> Let's just take a step back.
>> 
>> So we know that the maximal error margin in the wheel is 12.5%, right?
>> That means, if you take your relative timeout and subtract 12.5% then
>> you are in the right ballpark and the earliest expiry will not be before
>> that point obviously, but it's also guaranteed not to expire later than
>> the original timeout. Obviously this will converge towards the early
>> expiry the longer the timeouts are, but it's bounded.
>
> Ok, I was trying to avoid a "hole" where any timeout < LVL_GRAN(lvl)
> would always be substantially (LVL_GRAN(lvl) - LVL_GRAN(lvl - 1)) early,
> but it looks like this is unavoidable with this approach.

Right, but where is the problem you are trying to solve? Does it matter
whether the keepalive timer fires after 28 minutes or after 30 minutes?

Not really. All you care about is that it does not fire 2 minutes late. So
what?

>> Also due to the properties of the wheel, the lag of base::clk will
>> obviously only affect those levels where lag >= LVL_GRAN(level).
>
> Is this true? Won't it be enough for the lag to be just
> lag >= (LVL_START(lvl) - adjusted_timeout) for the cases when we cross
> a level boundary on adjustment?

The corner case is at the next boundary level. The resulting worst case
there is one jiffy, which is below noise level :)

Thanks,

        tglx

Patch

diff --git a/include/linux/timer.h b/include/linux/timer.h
index fda13c9d1256c..b209d31d543f0 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -168,6 +168,7 @@  static inline int timer_pending(const struct timer_list * timer)
 	return !hlist_unhashed_lockless(&timer->entry);
 }
 
+extern unsigned long upper_bound_timeout(unsigned long timeout);
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 85f1021ad4595..76d4f26c991be 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -507,28 +507,38 @@  static inline unsigned calc_index(unsigned long expires, unsigned lvl,
 	return LVL_OFFS(lvl) + (expires & LVL_MASK);
 }
 
-static int calc_wheel_index(unsigned long expires, unsigned long clk,
-			    unsigned long *bucket_expiry)
+static inline int get_wheel_lvl(unsigned long delta)
 {
-	unsigned long delta = expires - clk;
-	unsigned int idx;
-
 	if (delta < LVL_START(1)) {
-		idx = calc_index(expires, 0, bucket_expiry);
+		return 0;
 	} else if (delta < LVL_START(2)) {
-		idx = calc_index(expires, 1, bucket_expiry);
+		return 1;
 	} else if (delta < LVL_START(3)) {
-		idx = calc_index(expires, 2, bucket_expiry);
+		return 2;
 	} else if (delta < LVL_START(4)) {
-		idx = calc_index(expires, 3, bucket_expiry);
+		return 3;
 	} else if (delta < LVL_START(5)) {
-		idx = calc_index(expires, 4, bucket_expiry);
+		return 4;
 	} else if (delta < LVL_START(6)) {
-		idx = calc_index(expires, 5, bucket_expiry);
+		return 5;
 	} else if (delta < LVL_START(7)) {
-		idx = calc_index(expires, 6, bucket_expiry);
+		return 6;
 	} else if (LVL_DEPTH > 8 && delta < LVL_START(8)) {
-		idx = calc_index(expires, 7, bucket_expiry);
+		return 7;
+	}
+
+	return -1;
+}
+
+static int calc_wheel_index(unsigned long expires, unsigned long clk,
+			    unsigned long *bucket_expiry)
+{
+	unsigned long delta = expires - clk;
+	unsigned int idx;
+	int lvl = get_wheel_lvl(delta);
+
+	if (lvl >= 0) {
+		idx = calc_index(expires, lvl, bucket_expiry);
 	} else if ((long) delta < 0) {
 		idx = clk & LVL_MASK;
 		*bucket_expiry = clk;
@@ -545,6 +555,62 @@  static int calc_wheel_index(unsigned long expires, unsigned long clk,
 	return idx;
 }
 
+/**
+ * upper_bound_timeout - return granularity-adjusted timeout
+ * @timeout: timeout value in jiffies
+ *
+ * This function returns the supplied timeout adjusted based on timer wheel
+ * granularity, effectively making the supplied value an upper bound at which
+ * the timer will expire.
+ */
+unsigned long upper_bound_timeout(unsigned long timeout)
+{
+	int lvl = get_wheel_lvl(timeout);
+
+	if (lvl < 0) {
+		if ((long) timeout < 0) {
+			/*
+			 * This will expire immediately so no adjustment
+			 * needed.
+			 */
+			return timeout;
+		} else {
+			if (timeout > WHEEL_TIMEOUT_CUTOFF)
+				timeout = WHEEL_TIMEOUT_CUTOFF;
+			lvl = LVL_DEPTH - 1;
+		}
+	}
+
+	if (timeout - LVL_GRAN(lvl) < LVL_START(lvl)) {
+		/*
+		 * The beginning of each level is a special case: we can't just
+		 * subtract LVL_GRAN(lvl) because then the timeout ends up on a
+		 * previous level and is guaranteed to fire early. We can't
+		 * mitigate this completely, but we can try to minimize the
+		 * margin.
+		 */
+		if (timeout - LVL_GRAN(lvl - 1) < LVL_START(lvl)) {
+			/*
+			 * If timeout is within previous level's granularity
+			 * adjust timeout using that level's granularity.
+			 */
+			return timeout - LVL_GRAN(lvl - 1);
+		} else {
+			/*
+			 * If LVL_GRAN(lvl - 1) < timeout < LVL_GRAN(lvl)
+			 * the best we can do is to set timeout to the end of
+			 * previous level. This means that the timer will
+			 * trigger early. In the worst case it will be _at least_
+			 * (LVL_GRAN(lvl) - LVL_GRAN(lvl - 1)) jiffies early.
+			 */
+			return LVL_START(lvl) - 1;
+		}
+	} else {
+		return timeout - LVL_GRAN(lvl);
+	}
+}
+EXPORT_SYMBOL(upper_bound_timeout);
+
 static void
 trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 {