Message ID | 20220323111642.2517885-2-asavkov@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Upper bound mode for kernel timers | expand |
On Wed, Mar 23, 2022 at 12:16:41PM +0100, Artem Savkov wrote: > Add TIMER_UPPER_BOUND flag which allows creation of timers that would > expire at most at specified time or earlier. > > This was previously discussed here: > https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u > > Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> > Signed-off-by: Artem Savkov <asavkov@redhat.com> Thanks Artem, this approach looks good to me. Some kernel style nits... The commit log could use some more background and motivation for the change: a) describe the current timer bound functionality and why it works for most cases b) describe the type of situation where it doesn't work and why c) describe the solution > --- > include/linux/timer.h | 3 ++- > kernel/time/timer.c | 36 ++++++++++++++++++++++-------------- > 2 files changed, 24 insertions(+), 15 deletions(-) > > diff --git a/include/linux/timer.h b/include/linux/timer.h > index fda13c9d1256..9b0963f49528 100644 > --- a/include/linux/timer.h > +++ b/include/linux/timer.h > @@ -67,7 +67,8 @@ struct timer_list { > #define TIMER_DEFERRABLE 0x00080000 > #define TIMER_PINNED 0x00100000 > #define TIMER_IRQSAFE 0x00200000 > -#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE) > +#define TIMER_UPPER_BOUND 0x00400000 > +#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND) > #define TIMER_ARRAYSHIFT 22 > #define TIMER_ARRAYMASK 0xFFC00000 This new flag needs a comment above, where the other user flags are also commented. > } > return idx; > } > @@ -607,7 +613,8 @@ static void internal_add_timer(struct timer_base *base, struct timer_list *timer > unsigned long bucket_expiry; > unsigned int idx; > > - idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry); > + idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry, > + timer->flags & TIMER_UPPER_BOUND); Here and elsewhere, to follow kernel convention, the 2nd line needs spaces after the tabs to align it with the previous line. That makes it more obvious that the 2nd line is just another function argument. Like: idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry, timer->flags & TIMER_UPPER_BOUND);
Artem, On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote: > Add TIMER_UPPER_BOUND flag which allows creation of timers that would > expire at most at specified time or earlier. > > This was previously discussed here: > https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u please add the context to the changelog. A link is only supplemental information and does not replace content. > static inline unsigned calc_index(unsigned long expires, unsigned lvl, > - unsigned long *bucket_expiry) > + unsigned long *bucket_expiry, bool upper_bound) > { > > /* > @@ -501,34 +501,39 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl, > * - Truncation of the expiry time in the outer wheel levels > * > * Round up with level granularity to prevent this. > + * Do not perform round up in case of upper bound timer. > */ > - expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); > + if (upper_bound) > + expires = expires >> LVL_SHIFT(lvl); > + else > + expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); While this "works", I fundamentally hate this because it adds an extra conditional into the common case. That affects every user of the timer wheel. We went great length to optimize that code and I'm not really enthused to sacrifice that just because of _one_ use case. The resulting text bloat is +25% on x86/64 and a quick test case shows that this is well measurable overhead. The first ones who will complain about that are going to be - drumroll - the network people. There must be smarter ways to solve that. Let me think about it. Thanks, tglx
On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote: > On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote: >> Add TIMER_UPPER_BOUND flag which allows creation of timers that would >> expire at most at specified time or earlier. >> >> This was previously discussed here: >> https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u > > please add the context to the changelog. A link is only supplemental > information and does not replace content. > >> static inline unsigned calc_index(unsigned long expires, unsigned lvl, >> - unsigned long *bucket_expiry) >> + unsigned long *bucket_expiry, bool upper_bound) >> { >> >> /* >> @@ -501,34 +501,39 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl, >> * - Truncation of the expiry time in the outer wheel levels >> * >> * Round up with level granularity to prevent this. >> + * Do not perform round up in case of upper bound timer. >> */ >> - expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); >> + if (upper_bound) >> + expires = expires >> LVL_SHIFT(lvl); >> + else >> + expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); > > While this "works", I fundamentally hate this because it adds an extra > conditional into the common case. That affects every user of the timer > wheel. We went great length to optimize that code and I'm not really enthused > to sacrifice that just because of _one_ use case. Aside of that this is not mathematically correct. Why? The level selection makes the cutoff at: LEVEL_MAX(lvl) - 1. E.g. 62 instead of 63 for the first level. The reason is that this accomodates for the + LVL_GRAN(lvl). Now with surpressing the roundup this creates a gap. Not a horrible problem, but not correct either. Thanks, tglx
Greeting, FYI, we noticed the following commit (built with gcc-9): commit: d41e0719d576bc515e6de2112fff2c6b20f357e8 ("[PATCH 1/2] timer: introduce upper bound timers") url: https://github.com/intel-lab-lkp/linux/commits/Artem-Savkov/Upper-bound-mode-for-kernel-timers/20220323-191831 patch link: https://lore.kernel.org/netdev/20220323111642.2517885-2-asavkov@redhat.com in testcase: xfstests version: xfstests-x86_64-1de1db8-1_20220217 with following parameters: disk: 4HDD fs: btrfs test: generic-group-28 ucode: 0xec test-description: xfstests is a regression test suite for xfs and other files ystems. test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot <oliver.sang@intel.com> [ 42.401895][ C0] UBSAN: shift-out-of-bounds in lib/flex_proportions.c:80:20 [ 42.410963][ C0] shift exponent -1007885658 is negative [ 42.416462][ C0] CPU: 0 PID: 330 Comm: sed Tainted: G I 5.17.0-rc6-00027-gd41e0719d576 #1 [ 42.426240][ C0] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015 [ 42.434363][ C0] Call Trace: [ 42.437516][ C0] <TASK> [ 42.440319][ C0] dump_stack_lvl (lib/dump_stack.c:107) [ 42.444699][ C0] ubsan_epilogue (lib/ubsan.c:152) [ 42.448985][ C0] __ubsan_handle_shift_out_of_bounds.cold (lib/ubsan.c:330) [ 42.455618][ C0] ? cpumask_next (lib/cpumask.c:23) [ 42.459996][ C0] ? __percpu_counter_sum (lib/percpu_counter.c:138) [ 42.465248][ C0] fprop_new_period.cold (lib/flex_proportions.c:80 (discriminator 1)) [ 42.470224][ C0] writeout_period (mm/page-writeback.c:623) [ 42.474768][ C0] ? wb_position_ratio (mm/page-writeback.c:617) [ 42.479747][ C0] call_timer_fn (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/timer.h:125 kernel/time/timer.c:1430) [ 42.484115][ C0] run_timer_softirq (kernel/time/timer.c:1475 kernel/time/timer.c:1742 kernel/time/timer.c:1718 kernel/time/timer.c:1757) [ 42.489019][ C0] ? del_timer_sync (kernel/time/timer.c:1752) [ 42.493561][ C0] ? __next_base (kernel/time/hrtimer.c:506) [ 42.498014][ C0] ? sched_clock_cpu (kernel/sched/clock.c:371) [ 42.502732][ C0] ? setup_local_APIC (arch/x86/kernel/apic/apic.c:475) [ 42.507624][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/irq.h:142 kernel/softirq.c:559) [ 42.511990][ C0] irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637 kernel/softirq.c:649) [ 42.516359][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 13)) [ 42.521864][ C0] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:638) [ 42.527812][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:638) [ 42.533660][ C0] RIP: 0033:0x55e961adf002 [ 42.537942][ C0] Code: ff ff ff 41 b8 10 00 00 00 ba 01 00 00 00 48 89 c7 e8 a2 ed ff ff 49 8b 4d 08 eb 89 0f 1f 40 00 44 09 71 08 48 83 c4 08 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 3b 6a 10 7d 08 48 89 e9 48 89 c2 eb All code ======== 0: ff (bad) 1: ff (bad) 2: ff 41 b8 incl -0x48(%rcx) 5: 10 00 adc %al,(%rax) 7: 00 00 add %al,(%rax) 9: ba 01 00 00 00 mov $0x1,%edx e: 48 89 c7 mov %rax,%rdi 11: e8 a2 ed ff ff callq 0xffffffffffffedb8 16: 49 8b 4d 08 mov 0x8(%r13),%rcx 1a: eb 89 jmp 0xffffffffffffffa5 1c: 0f 1f 40 00 nopl 0x0(%rax) 20: 44 09 71 08 or %r14d,0x8(%rcx) 24: 48 83 c4 08 add $0x8,%rsp 28: 5b pop %rbx 29: 5d pop %rbp 2a:* 41 5c pop %r12 <-- trapping instruction 2c: 41 5d pop %r13 2e: 41 5e pop %r14 30: 41 5f pop %r15 32: c3 retq 33: 48 3b 6a 10 cmp 0x10(%rdx),%rbp 37: 7d 08 jge 0x41 39: 48 89 e9 mov %rbp,%rcx 3c: 48 89 c2 mov %rax,%rdx 3f: eb .byte 0xeb Code starting with the faulting instruction =========================================== 0: 41 5c pop %r12 2: 41 5d pop %r13 4: 41 5e pop %r14 6: 41 5f pop %r15 8: c3 retq 9: 48 3b 6a 10 cmp 0x10(%rdx),%rbp d: 7d 08 jge 0x17 f: 48 89 e9 mov %rbp,%rcx 12: 48 89 c2 mov %rax,%rdx 15: eb .byte 0xeb [ 42.557457][ C0] RSP: 002b:00007fff12170f68 EFLAGS: 00000202 [ 42.563393][ C0] RAX: 000055e966fd5de0 RBX: 0000000000002ac0 RCX: 000055e966fe1ad0 [ 42.571245][ C0] RDX: 0000000000000bcf RSI: 0000000000000bd1 RDI: 0000000000000bd0 [ 42.579093][ C0] RBP: 00000000000019a0 R08: 000055e966fd5dd0 R09: 00007f781695a430 [ 42.586954][ C0] R10: 000055e961b62010 R11: 00007f781695a430 R12: 0000000000000bd0 [ 42.594805][ C0] R13: 00007fff12170fe0 R14: 00000000000001ff R15: 000055e962f771e0 [ 42.602664][ C0] </TASK> [ 42.605562][ C0] ================================================================================ [ 42.619841][ T2170] BTRFS: device fsid c931cb40-7bb4-4aea-a481-3898abe22c7f devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (2170) [ 42.658545][ T2181] BTRFS info (device sda2): flagging fs with big metadata feature [ 42.666306][ T2181] BTRFS info (device sda2): disk space caching is enabled [ 42.673308][ T2181] BTRFS info (device sda2): has skinny extents [ 42.685475][ T2181] BTRFS info (device sda2): checking UUID tree [ 43.019857][ T2218] BTRFS: device fsid 196e6b2a-cd50-4fb7-9e22-b255906549fb devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (2218) [ 43.055315][ T2232] BTRFS info (device sda2): flagging fs with big metadata feature [ 43.063092][ T2232] BTRFS info (device sda2): disk space caching is enabled [ 43.070097][ T2232] BTRFS info (device sda2): has skinny extents [ 43.082188][ T2232] BTRFS info (device sda2): checking UUID tree [ 43.646926][ T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0121294 s, 21.6 MB/s [ 43.646936][ T330] [ 43.657040][ T330] 512+0 records in [ 43.657045][ T330] [ 43.663189][ T330] 512+0 records out [ 43.663194][ T330] [ 43.670327][ T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0340874 s, 7.7 MB/s [ 43.670333][ T330] [ 43.680303][ T330] 512+0 records in [ 43.680308][ T330] [ 43.686450][ T330] 512+0 records out [ 43.686470][ T330] [ 43.693603][ T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0436927 s, 6.0 MB/s [ 43.693609][ T330] [ 75.600365][ T2983] BTRFS info (device sda2): flagging fs with big metadata feature [ 75.608412][ T2983] BTRFS info (device sda2): disk space caching is enabled [ 75.615416][ T2983] BTRFS info (device sda2): has skinny extents [ 83.853890][ T3053] BTRFS info (device sda2): flagging fs with big metadata feature [ 83.861582][ T3053] BTRFS info (device sda2): disk space caching is enabled [ 83.868571][ T3053] BTRFS info (device sda2): has skinny extents [ 84.030348][ T328] generic/560 _check_dmesg: something found in dmesg (see /lkp/benchmarks/xfstests/results//generic/560.dmesg) [ 84.030358][ T328] [ 84.044262][ T328] [ 84.044269][ T328] [ 84.088818][ T1625] run fstests generic/561 at 2022-03-24 12:37:26 [ 84.816005][ T3246] BTRFS info (device sda1): flagging fs with big metadata feature [ 84.823700][ T3246] BTRFS info (device sda1): using free space tree [ 84.830014][ T3246] BTRFS info (device sda1): has skinny extents [ 85.074278][ T3295] BTRFS: device fsid b05a03cd-540a-4c2a-a470-7eb68b5ab847 devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (3295) [ 85.110629][ T3306] BTRFS info (device sda2): flagging fs with big metadata feature [ 85.118389][ T3306] BTRFS info (device sda2): disk space caching is enabled [ 85.125437][ T3306] BTRFS info (device sda2): has skinny extents [ 85.138213][ T3306] BTRFS info (device sda2): checking UUID tree [ 85.416006][ T3348] BTRFS: device fsid 0d6806aa-5e85-40dd-b571-5227cd8c2d80 devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (3348) [ 85.453695][ T3359] BTRFS info (device sda2): flagging fs with big metadata feature [ 85.461394][ T3359] BTRFS info (device sda2): disk space caching is enabled [ 85.468386][ T3359] BTRFS info (device sda2): has skinny extents [ 85.480643][ T3359] BTRFS info (device sda2): checking UUID tree [ 85.675145][ T3381] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 85.675448][ T3385] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 85.686755][ T3381] File: /fs/scratch/dir/p0/f1XX PID: 3381 Comm: fsstress [ 85.705365][ T3385] File: /fs/scratch/dir/p2/f7XXXXXXXXXXXXX PID: 3385 Comm: fsstress [ 86.009824][ T3383] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 86.021432][ T3383] File: /fs/scratch/dir/p1/f8XX PID: 3383 Comm: fsstress [ 87.014795][ T3381] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 87.026427][ T3381] File: /fs/scratch/dir/p0/f14XXXXXXXXXXXX PID: 3381 Comm: fsstress [ 98.693320][ T328] generic/561 IPMI BMC is not supported on this machine, skip bmc-watchdog setup! [ 98.693329][ T328] [ 103.829084][ T3381] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 103.840703][ T3381] File: /fs/scratch/dir/p0/d1eXXXXXXXXXXXX/d42XX/fabXXXXX PID: 3381 Comm: fsstress [ 142.694608][ T4737] BTRFS info (device sda2): flagging fs with big metadata feature [ 142.702316][ T4737] BTRFS info (device sda2): disk space caching is enabled [ 142.709324][ T4737] BTRFS info (device sda2): has skinny extents [ 159.500542][ T4838] BTRFS info (device sda2): flagging fs with big metadata feature [ 159.508232][ T4838] BTRFS info (device sda2): disk space caching is enabled [ 159.515223][ T4838] BTRFS info (device sda2): has skinny extents [ 159.704146][ T328] 76s [ 159.704155][ T328] [ 159.748026][ T1625] run fstests generic/562 at 2022-03-24 12:38:42 [ 160.507334][ T5033] BTRFS info (device sda1): flagging fs with big metadata feature [ 160.515029][ T5033] BTRFS info (device sda1): using free space tree [ 160.521330][ T5033] BTRFS info (device sda1): has skinny extents [ 160.777005][ T5095] BTRFS: device fsid 08444c95-7d09-450b-9e80-4897d2b3bfae devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (5095) [ 160.811850][ T5109] BTRFS info (device sda2): flagging fs with big metadata feature [ 160.819593][ T5109] BTRFS info (device sda2): disk space caching is enabled [ 160.827020][ T5109] BTRFS info (device sda2): has skinny extents To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests sudo bin/lkp install job.yaml # job file is attached in this email bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run sudo bin/lkp run generated-yaml-file # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
On Fri, Mar 25 2022 at 15:38, kernel test robot wrote: > [ 42.401895][ C0] UBSAN: shift-out-of-bounds in lib/flex_proportions.c:80:20 > [ 42.410963][ C0] shift exponent -1007885658 is negative Cute. > [ 42.416462][ C0] CPU: 0 PID: 330 Comm: sed Tainted: G I 5.17.0-rc6-00027-gd41e0719d576 #1 > [ 42.426240][ C0] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015 > [ 42.434363][ C0] Call Trace: > [ 42.437516][ C0] <TASK> > [ 42.440319][ C0] dump_stack_lvl (lib/dump_stack.c:107) > [ 42.444699][ C0] ubsan_epilogue (lib/ubsan.c:152) > [ 42.448985][ C0] __ubsan_handle_shift_out_of_bounds.cold (lib/ubsan.c:330) > [ 42.455618][ C0] ? cpumask_next (lib/cpumask.c:23) > [ 42.459996][ C0] ? __percpu_counter_sum (lib/percpu_counter.c:138) > [ 42.465248][ C0] fprop_new_period.cold (lib/flex_proportions.c:80 (discriminator 1)) > [ 42.470224][ C0] writeout_period (mm/page-writeback.c:623) So it seems a timer fired early. Which then makes writeout_period() go south: int miss_periods = (jiffies - dom->period_time) / VM_COMPLETIONS_PERIOD_LEN; If jiffies < dom->period_time the result is a very large negative number. This happens because of: > @@ -67,7 +67,8 @@ struct timer_list { > #define TIMER_DEFERRABLE 0x00080000 > #define TIMER_PINNED 0x00100000 > #define TIMER_IRQSAFE 0x00200000 > -#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE) > +#define TIMER_UPPER_BOUND 0x00400000 > +#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND) > #define TIMER_ARRAYSHIFT 22 > #define TIMER_ARRAYMASK 0xFFC00000 TIMER_UPPER_BOUND steals a bit from the ARRAYMASK. So if the timer is armed and the stored arraymask happens to have bit 22 set, then on the next arming of the timer it will be treated as upper bound timer, expires early and all hell breaks lose. The same can happen the other way round. So I really have to ask how this ever "worked". Thanks, tglx
Artem, On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote: > On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote: >> * Round up with level granularity to prevent this. >> + * Do not perform round up in case of upper bound timer. >> */ >> - expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); >> + if (upper_bound) >> + expires = expires >> LVL_SHIFT(lvl); >> + else >> + expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); > > While this "works", I fundamentally hate this because it adds an extra actually it cannot not work. At least not in a realiable and predictable way. The timer wheel is fundamentaly relying on moving the timer one bucket out. Let's look how this all works. The wheel has N levels of bucket arrays. Each level has 64 buckets and the granularity increases by a factor of 8 per level, i.e. the worst case deviation is 12.5% per level. The original timer wheel was able to fire the timer at expiry time + one tick for the price of cascading timers every so often from one level to the next lower level. Here are a few pointers: https://lwn.net/Articles/152436/ https://lwn.net/Articles/646950/ The accuracy of the original wheel implementation was weakened already in course of the NOHZ development where the concept of slack (algorithmic batching at enqueue time for the price of overhead) was introduced. It had the same 12.5% worst case deviation of the resulting granularity level, though the batching was not enforced and only worked most of the time. So in theory the same issue could have been seen back then already. The enqueue placement and the expiry is driven by base::clock, which is nothing else than jiffies. When a timer is queued then base::clock is the time on which the next tick and expiry check happens. Now let's look how the expiry mechanism works. The first level (0) is obviously expired every tick. On every eigth tick the second level (1) is expired, on every 64th tick the third level (2)... IOW, the expiry events of a level happen at 8^index(level) intervals. Let's assume that base::clk is 0. That means at the next tick (which could be imminent) in _all_ levels bucket[0] is due for expiry. Now let's enqueue a timer with expiry value of 64: delta = 64 - 0 = 64 That means the timer ends up in the second level in bucket[0], which makes it eligible for expiry at the next tick. The same is true for any expiry value of 8^N. Not what you are trying to achieve, right? You try to enforce an upper bound, but you expect that the timer does not fire earlier than 12.5% of the granularity level of that upper bound, right? IOW, you created a expiry lottery and there is no way to prevent that except with more conditionals and heuristics which are hardely welcomed. You've also seen the outcome of a timer firing unexpectedly due to the bit abuse, right? Now let's take a step back and look at the problem at hand. TCP alive timer, which is affected by the batching and the resulting inaccuracy, is (re)armed with a relative timeout, which is known upfront, right? The important part is *relative* timeout. Why? Because the timeout is relative it's trivial to calculate a relative timeout value from the given accurate relative timeout value, which is guaranteed to not expire late and within the timerwheel's error margin, and use that for the actual rearming. I'm pretty sure that you can come up with a conversion function for that and use this function in the TCP code at the point where the TCP keepalive timer is configured. Hint 1: The output of calc_wheel_index() should be good enough to figure that out. Hint 2: If you get frustrated, try git grep johnstul arch/x86/ | awk '{ $1=$2=$3=""; print $0 }' Hint 3: Feel free to ask questions anytime Thanks, tglx
diff --git a/include/linux/timer.h b/include/linux/timer.h index fda13c9d1256..9b0963f49528 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -67,7 +67,8 @@ struct timer_list { #define TIMER_DEFERRABLE 0x00080000 #define TIMER_PINNED 0x00100000 #define TIMER_IRQSAFE 0x00200000 -#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE) +#define TIMER_UPPER_BOUND 0x00400000 +#define TIMER_INIT_FLAGS (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND) #define TIMER_ARRAYSHIFT 22 #define TIMER_ARRAYMASK 0xFFC00000 diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 85f1021ad459..60253ad9ed86 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -491,7 +491,7 @@ static inline void timer_set_idx(struct timer_list *timer, unsigned int idx) * time. */ static inline unsigned calc_index(unsigned long expires, unsigned lvl, - unsigned long *bucket_expiry) + unsigned long *bucket_expiry, bool upper_bound) { /* @@ -501,34 +501,39 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl, * - Truncation of the expiry time in the outer wheel levels * * Round up with level granularity to prevent this. + * Do not perform round up in case of upper bound timer. */ - expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); + if (upper_bound) + expires = expires >> LVL_SHIFT(lvl); + else + expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl); + *bucket_expiry = expires << LVL_SHIFT(lvl); return LVL_OFFS(lvl) + (expires & LVL_MASK); } static int calc_wheel_index(unsigned long expires, unsigned long clk, - unsigned long *bucket_expiry) + unsigned long *bucket_expiry, bool upper_bound) { unsigned long delta = expires - clk; unsigned int idx; if (delta < LVL_START(1)) { - idx = calc_index(expires, 0, bucket_expiry); + idx = calc_index(expires, 0, bucket_expiry, upper_bound); } else if (delta < LVL_START(2)) { - idx = calc_index(expires, 1, bucket_expiry); + idx = calc_index(expires, 1, bucket_expiry, upper_bound); } else if (delta < LVL_START(3)) { - idx = calc_index(expires, 2, bucket_expiry); + idx = calc_index(expires, 2, bucket_expiry, upper_bound); } else if (delta < LVL_START(4)) { - idx = calc_index(expires, 3, bucket_expiry); + idx = calc_index(expires, 3, bucket_expiry, upper_bound); } else if (delta < LVL_START(5)) { - idx = calc_index(expires, 4, bucket_expiry); + idx = calc_index(expires, 4, bucket_expiry, upper_bound); } else if (delta < LVL_START(6)) { - idx = calc_index(expires, 5, bucket_expiry); + idx = calc_index(expires, 5, bucket_expiry, upper_bound); } else if (delta < LVL_START(7)) { - idx = calc_index(expires, 6, bucket_expiry); + idx = calc_index(expires, 6, bucket_expiry, upper_bound); } else if (LVL_DEPTH > 8 && delta < LVL_START(8)) { - idx = calc_index(expires, 7, bucket_expiry); + idx = calc_index(expires, 7, bucket_expiry, upper_bound); } else if ((long) delta < 0) { idx = clk & LVL_MASK; *bucket_expiry = clk; @@ -540,7 +545,8 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk, if (delta >= WHEEL_TIMEOUT_CUTOFF) expires = clk + WHEEL_TIMEOUT_MAX; - idx = calc_index(expires, LVL_DEPTH - 1, bucket_expiry); + idx = calc_index(expires, LVL_DEPTH - 1, bucket_expiry, + upper_bound); } return idx; } @@ -607,7 +613,8 @@ static void internal_add_timer(struct timer_base *base, struct timer_list *timer unsigned long bucket_expiry; unsigned int idx; - idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry); + idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry, + timer->flags & TIMER_UPPER_BOUND); enqueue_timer(base, timer, idx, bucket_expiry); } @@ -1000,7 +1007,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option } clk = base->clk; - idx = calc_wheel_index(expires, clk, &bucket_expiry); + idx = calc_wheel_index(expires, clk, &bucket_expiry, + timer->flags & TIMER_UPPER_BOUND); /* * Retrieve and compare the array index of the pending
Add TIMER_UPPER_BOUND flag which allows creation of timers that would expire at most at specified time or earlier. This was previously discussed here: https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Artem Savkov <asavkov@redhat.com> --- include/linux/timer.h | 3 ++- kernel/time/timer.c | 36 ++++++++++++++++++++++-------------- 2 files changed, 24 insertions(+), 15 deletions(-)