diff mbox series

[1/2] timer: introduce upper bound timers

Message ID 20220323111642.2517885-2-asavkov@redhat.com (mailing list archive)
State Superseded
Headers show
Series Upper bound mode for kernel timers | expand

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 17747 this patch: 17747
netdev/cc_maintainers fail 2 maintainers not CCed: sboyd@kernel.org john.stultz@linaro.org
netdev/build_clang success Errors and warnings before: 3889 this patch: 3889
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 17362 this patch: 17362
netdev/checkpatch warning CHECK: Alignment should match open parenthesis WARNING: line length of 101 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Artem Savkov March 23, 2022, 11:16 a.m. UTC
Add TIMER_UPPER_BOUND flag which allows creation of timers that would
expire at most at specified time or earlier.

This was previously discussed here:
https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 include/linux/timer.h |  3 ++-
 kernel/time/timer.c   | 36 ++++++++++++++++++++++--------------
 2 files changed, 24 insertions(+), 15 deletions(-)

Comments

Josh Poimboeuf March 23, 2022, 6:40 p.m. UTC | #1
On Wed, Mar 23, 2022 at 12:16:41PM +0100, Artem Savkov wrote:
> Add TIMER_UPPER_BOUND flag which allows creation of timers that would
> expire at most at specified time or earlier.
> 
> This was previously discussed here:
> https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u
> 
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Artem Savkov <asavkov@redhat.com>

Thanks Artem, this approach looks good to me.  Some kernel style nits...

The commit log could use some more background and motivation for the
change:

a) describe the current timer bound functionality and why it works for
   most cases

b) describe the type of situation where it doesn't work and why

c) describe the solution

> ---
>  include/linux/timer.h |  3 ++-
>  kernel/time/timer.c   | 36 ++++++++++++++++++++++--------------
>  2 files changed, 24 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index fda13c9d1256..9b0963f49528 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -67,7 +67,8 @@ struct timer_list {
>  #define TIMER_DEFERRABLE	0x00080000
>  #define TIMER_PINNED		0x00100000
>  #define TIMER_IRQSAFE		0x00200000
> -#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
> +#define TIMER_UPPER_BOUND	0x00400000
> +#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND)
>  #define TIMER_ARRAYSHIFT	22
>  #define TIMER_ARRAYMASK		0xFFC00000

This new flag needs a comment above, where the other user flags are also
commented.

>  	}
>  	return idx;
>  }
> @@ -607,7 +613,8 @@ static void internal_add_timer(struct timer_base *base, struct timer_list *timer
>  	unsigned long bucket_expiry;
>  	unsigned int idx;
>  
> -	idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry);
> +	idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry,
> +			timer->flags & TIMER_UPPER_BOUND);

Here and elsewhere, to follow kernel convention, the 2nd line needs
spaces after the tabs to align it with the previous line.  That makes it
more obvious that the 2nd line is just another function argument.  Like:

	idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry,
			       timer->flags & TIMER_UPPER_BOUND);
Thomas Gleixner March 24, 2022, 12:28 p.m. UTC | #2
Artem,

On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote:
> Add TIMER_UPPER_BOUND flag which allows creation of timers that would
> expire at most at specified time or earlier.
>
> This was previously discussed here:
> https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

please add the context to the changelog. A link is only supplemental
information and does not replace content.

>  static inline unsigned calc_index(unsigned long expires, unsigned lvl,
> -				  unsigned long *bucket_expiry)
> +				  unsigned long *bucket_expiry, bool upper_bound)
>  {
>  
>  	/*
> @@ -501,34 +501,39 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl,
>  	 * - Truncation of the expiry time in the outer wheel levels
>  	 *
>  	 * Round up with level granularity to prevent this.
> +	 * Do not perform round up in case of upper bound timer.
>  	 */
> -	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
> +	if (upper_bound)
> +		expires = expires >> LVL_SHIFT(lvl);
> +	else
> +		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);

While this "works", I fundamentally hate this because it adds an extra
conditional into the common case. That affects every user of the timer
wheel. We went great length to optimize that code and I'm not really enthused
to sacrifice that just because of _one_ use case.

The resulting text bloat is +25% on x86/64 and a quick test case shows
that this is well measurable overhead. The first ones who will complain
about that are going to be - drumroll - the network people.

There must be smarter ways to solve that. Let me think about it.

Thanks,

        tglx
Thomas Gleixner March 24, 2022, 1:54 p.m. UTC | #3
On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote:
> On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote:
>> Add TIMER_UPPER_BOUND flag which allows creation of timers that would
>> expire at most at specified time or earlier.
>>
>> This was previously discussed here:
>> https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u
>
> please add the context to the changelog. A link is only supplemental
> information and does not replace content.
>
>>  static inline unsigned calc_index(unsigned long expires, unsigned lvl,
>> -				  unsigned long *bucket_expiry)
>> +				  unsigned long *bucket_expiry, bool upper_bound)
>>  {
>>  
>>  	/*
>> @@ -501,34 +501,39 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl,
>>  	 * - Truncation of the expiry time in the outer wheel levels
>>  	 *
>>  	 * Round up with level granularity to prevent this.
>> +	 * Do not perform round up in case of upper bound timer.
>>  	 */
>> -	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>> +	if (upper_bound)
>> +		expires = expires >> LVL_SHIFT(lvl);
>> +	else
>> +		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>
> While this "works", I fundamentally hate this because it adds an extra
> conditional into the common case. That affects every user of the timer
> wheel. We went great length to optimize that code and I'm not really enthused
> to sacrifice that just because of _one_ use case.

Aside of that this is not mathematically correct. Why?

The level selection makes the cutoff at: LEVEL_MAX(lvl) - 1. E.g. 62
instead of 63 for the first level.

The reason is that this accomodates for the + LVL_GRAN(lvl). Now with
surpressing the roundup this creates a gap. Not a horrible problem, but
not correct either.

Thanks,

        tglx
kernel test robot March 25, 2022, 7:38 a.m. UTC | #4
Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: d41e0719d576bc515e6de2112fff2c6b20f357e8 ("[PATCH 1/2] timer: introduce upper bound timers")
url: https://github.com/intel-lab-lkp/linux/commits/Artem-Savkov/Upper-bound-mode-for-kernel-timers/20220323-191831
patch link: https://lore.kernel.org/netdev/20220323111642.2517885-2-asavkov@redhat.com

in testcase: xfstests
version: xfstests-x86_64-1de1db8-1_20220217
with following parameters:

	disk: 4HDD
	fs: btrfs
	test: generic-group-28
	ucode: 0xec

test-description: xfstests is a regression test suite for xfs and other files ystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[   42.401895][    C0] UBSAN: shift-out-of-bounds in lib/flex_proportions.c:80:20
[   42.410963][    C0] shift exponent -1007885658 is negative
[   42.416462][    C0] CPU: 0 PID: 330 Comm: sed Tainted: G          I       5.17.0-rc6-00027-gd41e0719d576 #1
[   42.426240][    C0] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
[   42.434363][    C0] Call Trace:
[   42.437516][    C0]  <TASK>
[ 42.440319][ C0] dump_stack_lvl (lib/dump_stack.c:107) 
[ 42.444699][ C0] ubsan_epilogue (lib/ubsan.c:152) 
[ 42.448985][ C0] __ubsan_handle_shift_out_of_bounds.cold (lib/ubsan.c:330) 
[ 42.455618][ C0] ? cpumask_next (lib/cpumask.c:23) 
[ 42.459996][ C0] ? __percpu_counter_sum (lib/percpu_counter.c:138) 
[ 42.465248][ C0] fprop_new_period.cold (lib/flex_proportions.c:80 (discriminator 1)) 
[ 42.470224][ C0] writeout_period (mm/page-writeback.c:623) 
[ 42.474768][ C0] ? wb_position_ratio (mm/page-writeback.c:617) 
[ 42.479747][ C0] call_timer_fn (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/timer.h:125 kernel/time/timer.c:1430) 
[ 42.484115][ C0] run_timer_softirq (kernel/time/timer.c:1475 kernel/time/timer.c:1742 kernel/time/timer.c:1718 kernel/time/timer.c:1757) 
[ 42.489019][ C0] ? del_timer_sync (kernel/time/timer.c:1752) 
[ 42.493561][ C0] ? __next_base (kernel/time/hrtimer.c:506) 
[ 42.498014][ C0] ? sched_clock_cpu (kernel/sched/clock.c:371) 
[ 42.502732][ C0] ? setup_local_APIC (arch/x86/kernel/apic/apic.c:475) 
[ 42.507624][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/irq.h:142 kernel/softirq.c:559) 
[ 42.511990][ C0] irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637 kernel/softirq.c:649) 
[ 42.516359][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 13)) 
[ 42.521864][ C0] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:638) 
[ 42.527812][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:638) 
[   42.533660][    C0] RIP: 0033:0x55e961adf002
[ 42.537942][ C0] Code: ff ff ff 41 b8 10 00 00 00 ba 01 00 00 00 48 89 c7 e8 a2 ed ff ff 49 8b 4d 08 eb 89 0f 1f 40 00 44 09 71 08 48 83 c4 08 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 3b 6a 10 7d 08 48 89 e9 48 89 c2 eb
All code
========
   0:	ff                   	(bad)  
   1:	ff                   	(bad)  
   2:	ff 41 b8             	incl   -0x48(%rcx)
   5:	10 00                	adc    %al,(%rax)
   7:	00 00                	add    %al,(%rax)
   9:	ba 01 00 00 00       	mov    $0x1,%edx
   e:	48 89 c7             	mov    %rax,%rdi
  11:	e8 a2 ed ff ff       	callq  0xffffffffffffedb8
  16:	49 8b 4d 08          	mov    0x8(%r13),%rcx
  1a:	eb 89                	jmp    0xffffffffffffffa5
  1c:	0f 1f 40 00          	nopl   0x0(%rax)
  20:	44 09 71 08          	or     %r14d,0x8(%rcx)
  24:	48 83 c4 08          	add    $0x8,%rsp
  28:	5b                   	pop    %rbx
  29:	5d                   	pop    %rbp
  2a:*	41 5c                	pop    %r12		<-- trapping instruction
  2c:	41 5d                	pop    %r13
  2e:	41 5e                	pop    %r14
  30:	41 5f                	pop    %r15
  32:	c3                   	retq   
  33:	48 3b 6a 10          	cmp    0x10(%rdx),%rbp
  37:	7d 08                	jge    0x41
  39:	48 89 e9             	mov    %rbp,%rcx
  3c:	48 89 c2             	mov    %rax,%rdx
  3f:	eb                   	.byte 0xeb

Code starting with the faulting instruction
===========================================
   0:	41 5c                	pop    %r12
   2:	41 5d                	pop    %r13
   4:	41 5e                	pop    %r14
   6:	41 5f                	pop    %r15
   8:	c3                   	retq   
   9:	48 3b 6a 10          	cmp    0x10(%rdx),%rbp
   d:	7d 08                	jge    0x17
   f:	48 89 e9             	mov    %rbp,%rcx
  12:	48 89 c2             	mov    %rax,%rdx
  15:	eb                   	.byte 0xeb
[   42.557457][    C0] RSP: 002b:00007fff12170f68 EFLAGS: 00000202
[   42.563393][    C0] RAX: 000055e966fd5de0 RBX: 0000000000002ac0 RCX: 000055e966fe1ad0
[   42.571245][    C0] RDX: 0000000000000bcf RSI: 0000000000000bd1 RDI: 0000000000000bd0
[   42.579093][    C0] RBP: 00000000000019a0 R08: 000055e966fd5dd0 R09: 00007f781695a430
[   42.586954][    C0] R10: 000055e961b62010 R11: 00007f781695a430 R12: 0000000000000bd0
[   42.594805][    C0] R13: 00007fff12170fe0 R14: 00000000000001ff R15: 000055e962f771e0
[   42.602664][    C0]  </TASK>
[   42.605562][    C0] ================================================================================
[   42.619841][ T2170] BTRFS: device fsid c931cb40-7bb4-4aea-a481-3898abe22c7f devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (2170)
[   42.658545][ T2181] BTRFS info (device sda2): flagging fs with big metadata feature
[   42.666306][ T2181] BTRFS info (device sda2): disk space caching is enabled
[   42.673308][ T2181] BTRFS info (device sda2): has skinny extents
[   42.685475][ T2181] BTRFS info (device sda2): checking UUID tree
[   43.019857][ T2218] BTRFS: device fsid 196e6b2a-cd50-4fb7-9e22-b255906549fb devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (2218)
[   43.055315][ T2232] BTRFS info (device sda2): flagging fs with big metadata feature
[   43.063092][ T2232] BTRFS info (device sda2): disk space caching is enabled
[   43.070097][ T2232] BTRFS info (device sda2): has skinny extents
[   43.082188][ T2232] BTRFS info (device sda2): checking UUID tree
[   43.646926][  T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0121294 s, 21.6 MB/s
[   43.646936][  T330]
[   43.657040][  T330] 512+0 records in
[   43.657045][  T330]
[   43.663189][  T330] 512+0 records out
[   43.663194][  T330]
[   43.670327][  T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0340874 s, 7.7 MB/s
[   43.670333][  T330]
[   43.680303][  T330] 512+0 records in
[   43.680308][  T330]
[   43.686450][  T330] 512+0 records out
[   43.686470][  T330]
[   43.693603][  T330] 262144 bytes (262 kB, 256 KiB) copied, 0.0436927 s, 6.0 MB/s
[   43.693609][  T330]
[   75.600365][ T2983] BTRFS info (device sda2): flagging fs with big metadata feature
[   75.608412][ T2983] BTRFS info (device sda2): disk space caching is enabled
[   75.615416][ T2983] BTRFS info (device sda2): has skinny extents
[   83.853890][ T3053] BTRFS info (device sda2): flagging fs with big metadata feature
[   83.861582][ T3053] BTRFS info (device sda2): disk space caching is enabled
[   83.868571][ T3053] BTRFS info (device sda2): has skinny extents
[   84.030348][  T328] generic/560	_check_dmesg: something found in dmesg (see /lkp/benchmarks/xfstests/results//generic/560.dmesg)
[   84.030358][  T328]
[   84.044262][  T328]
[   84.044269][  T328]
[   84.088818][ T1625] run fstests generic/561 at 2022-03-24 12:37:26
[   84.816005][ T3246] BTRFS info (device sda1): flagging fs with big metadata feature
[   84.823700][ T3246] BTRFS info (device sda1): using free space tree
[   84.830014][ T3246] BTRFS info (device sda1): has skinny extents
[   85.074278][ T3295] BTRFS: device fsid b05a03cd-540a-4c2a-a470-7eb68b5ab847 devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (3295)
[   85.110629][ T3306] BTRFS info (device sda2): flagging fs with big metadata feature
[   85.118389][ T3306] BTRFS info (device sda2): disk space caching is enabled
[   85.125437][ T3306] BTRFS info (device sda2): has skinny extents
[   85.138213][ T3306] BTRFS info (device sda2): checking UUID tree
[   85.416006][ T3348] BTRFS: device fsid 0d6806aa-5e85-40dd-b571-5227cd8c2d80 devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (3348)
[   85.453695][ T3359] BTRFS info (device sda2): flagging fs with big metadata feature
[   85.461394][ T3359] BTRFS info (device sda2): disk space caching is enabled
[   85.468386][ T3359] BTRFS info (device sda2): has skinny extents
[   85.480643][ T3359] BTRFS info (device sda2): checking UUID tree
[   85.675145][ T3381] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   85.675448][ T3385] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   85.686755][ T3381] File: /fs/scratch/dir/p0/f1XX PID: 3381 Comm: fsstress
[   85.705365][ T3385] File: /fs/scratch/dir/p2/f7XXXXXXXXXXXXX PID: 3385 Comm: fsstress
[   86.009824][ T3383] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   86.021432][ T3383] File: /fs/scratch/dir/p1/f8XX PID: 3383 Comm: fsstress
[   87.014795][ T3381] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   87.026427][ T3381] File: /fs/scratch/dir/p0/f14XXXXXXXXXXXX PID: 3381 Comm: fsstress
[   98.693320][  T328] generic/561	IPMI BMC is not supported on this machine, skip bmc-watchdog setup!
[   98.693329][  T328]
[  103.829084][ T3381] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[  103.840703][ T3381] File: /fs/scratch/dir/p0/d1eXXXXXXXXXXXX/d42XX/fabXXXXX PID: 3381 Comm: fsstress
[  142.694608][ T4737] BTRFS info (device sda2): flagging fs with big metadata feature
[  142.702316][ T4737] BTRFS info (device sda2): disk space caching is enabled
[  142.709324][ T4737] BTRFS info (device sda2): has skinny extents
[  159.500542][ T4838] BTRFS info (device sda2): flagging fs with big metadata feature
[  159.508232][ T4838] BTRFS info (device sda2): disk space caching is enabled
[  159.515223][ T4838] BTRFS info (device sda2): has skinny extents
[  159.704146][  T328]  76s
[  159.704155][  T328]
[  159.748026][ T1625] run fstests generic/562 at 2022-03-24 12:38:42
[  160.507334][ T5033] BTRFS info (device sda1): flagging fs with big metadata feature
[  160.515029][ T5033] BTRFS info (device sda1): using free space tree
[  160.521330][ T5033] BTRFS info (device sda1): has skinny extents
[  160.777005][ T5095] BTRFS: device fsid 08444c95-7d09-450b-9e80-4897d2b3bfae devid 1 transid 5 /dev/sda2 scanned by mkfs.btrfs (5095)
[  160.811850][ T5109] BTRFS info (device sda2): flagging fs with big metadata feature
[  160.819593][ T5109] BTRFS info (device sda2): disk space caching is enabled
[  160.827020][ T5109] BTRFS info (device sda2): has skinny extents


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Thomas Gleixner March 25, 2022, 7:14 p.m. UTC | #5
On Fri, Mar 25 2022 at 15:38, kernel test robot wrote:
> [   42.401895][    C0] UBSAN: shift-out-of-bounds in lib/flex_proportions.c:80:20
> [   42.410963][    C0] shift exponent -1007885658 is negative

Cute.

> [   42.416462][    C0] CPU: 0 PID: 330 Comm: sed Tainted: G          I       5.17.0-rc6-00027-gd41e0719d576 #1
> [   42.426240][    C0] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
> [   42.434363][    C0] Call Trace:
> [   42.437516][    C0]  <TASK>
> [ 42.440319][ C0] dump_stack_lvl (lib/dump_stack.c:107) 
> [ 42.444699][ C0] ubsan_epilogue (lib/ubsan.c:152) 
> [ 42.448985][ C0] __ubsan_handle_shift_out_of_bounds.cold (lib/ubsan.c:330) 
> [ 42.455618][ C0] ? cpumask_next (lib/cpumask.c:23) 
> [ 42.459996][ C0] ? __percpu_counter_sum (lib/percpu_counter.c:138) 
> [ 42.465248][ C0] fprop_new_period.cold (lib/flex_proportions.c:80 (discriminator 1)) 
> [ 42.470224][ C0] writeout_period (mm/page-writeback.c:623) 

So it seems a timer fired early. Which then makes writeout_period() go south:

	int miss_periods = (jiffies - dom->period_time) / VM_COMPLETIONS_PERIOD_LEN;

If jiffies < dom->period_time the result is a very large negative
number.

This happens because of:

> @@ -67,7 +67,8 @@ struct timer_list {
>  #define TIMER_DEFERRABLE	0x00080000
>  #define TIMER_PINNED		0x00100000
>  #define TIMER_IRQSAFE		0x00200000
> -#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
> +#define TIMER_UPPER_BOUND	0x00400000
> +#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND)
> #define TIMER_ARRAYSHIFT	22
> #define TIMER_ARRAYMASK		0xFFC00000

TIMER_UPPER_BOUND steals a bit from the ARRAYMASK. So if the timer is
armed and the stored arraymask happens to have bit 22 set, then on the
next arming of the timer it will be treated as upper bound timer,
expires early and all hell breaks lose. The same can happen the other
way round. So I really have to ask how this ever "worked".

Thanks,

        tglx
Thomas Gleixner March 26, 2022, 9:13 p.m. UTC | #6
Artem,

On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote:
> On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote:
>>  	 * Round up with level granularity to prevent this.
>> +	 * Do not perform round up in case of upper bound timer.
>>  	 */
>> -	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>> +	if (upper_bound)
>> +		expires = expires >> LVL_SHIFT(lvl);
>> +	else
>> +		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>
> While this "works", I fundamentally hate this because it adds an extra

actually it cannot not work. At least not in a realiable and predictable
way.

The timer wheel is fundamentaly relying on moving the timer one bucket
out. Let's look how this all works.

The wheel has N levels of bucket arrays. Each level has 64 buckets and
the granularity increases by a factor of 8 per level, i.e. the worst
case deviation is 12.5% per level.

The original timer wheel was able to fire the timer at expiry time + one
tick for the price of cascading timers every so often from one level to
the next lower level. Here are a few pointers:

    https://lwn.net/Articles/152436/
    https://lwn.net/Articles/646950/

The accuracy of the original wheel implementation was weakened already
in course of the NOHZ development where the concept of slack
(algorithmic batching at enqueue time for the price of overhead) was
introduced. It had the same 12.5% worst case deviation of the resulting
granularity level, though the batching was not enforced and only worked
most of the time. So in theory the same issue could have been seen back
then already.

The enqueue placement and the expiry is driven by base::clock, which
is nothing else than jiffies. When a timer is queued then base::clock is
the time on which the next tick and expiry check happens.

Now let's look how the expiry mechanism works. The first level (0) is
obviously expired every tick. On every eigth tick the second level (1) is
expired, on every 64th tick the third level (2)...

IOW, the expiry events of a level happen at 8^index(level) intervals.

Let's assume that base::clk is 0. That means at the next tick (which
could be imminent) in _all_ levels bucket[0] is due for expiry.

Now let's enqueue a timer with expiry value of 64:

    delta = 64 - 0 = 64

That means the timer ends up in the second level in bucket[0], which
makes it eligible for expiry at the next tick. The same is true for any
expiry value of 8^N.

Not what you are trying to achieve, right? You try to enforce an upper
bound, but you expect that the timer does not fire earlier than 12.5% of
the granularity level of that upper bound, right?

IOW, you created a expiry lottery and there is no way to prevent that
except with more conditionals and heuristics which are hardely welcomed.

You've also seen the outcome of a timer firing unexpectedly due to the
bit abuse, right?

Now let's take a step back and look at the problem at hand.

TCP alive timer, which is affected by the batching and the resulting
inaccuracy, is (re)armed with a relative timeout, which is known
upfront, right?

The important part is *relative* timeout. Why?

Because the timeout is relative it's trivial to calculate a relative
timeout value from the given accurate relative timeout value, which is
guaranteed to not expire late and within the timerwheel's error margin,
and use that for the actual rearming.

I'm pretty sure that you can come up with a conversion function for that
and use this function in the TCP code at the point where the TCP
keepalive timer is configured.

Hint 1: The output of calc_wheel_index() should be good enough to figure
        that out.

Hint 2: If you get frustrated, try

        git grep johnstul arch/x86/ | awk '{ $1=$2=$3=""; print $0 }'

Hint 3: Feel free to ask questions anytime

Thanks,

        tglx
diff mbox series

Patch

diff --git a/include/linux/timer.h b/include/linux/timer.h
index fda13c9d1256..9b0963f49528 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,7 +67,8 @@  struct timer_list {
 #define TIMER_DEFERRABLE	0x00080000
 #define TIMER_PINNED		0x00100000
 #define TIMER_IRQSAFE		0x00200000
-#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
+#define TIMER_UPPER_BOUND	0x00400000
+#define TIMER_INIT_FLAGS	(TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | TIMER_UPPER_BOUND)
 #define TIMER_ARRAYSHIFT	22
 #define TIMER_ARRAYMASK		0xFFC00000
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 85f1021ad459..60253ad9ed86 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -491,7 +491,7 @@  static inline void timer_set_idx(struct timer_list *timer, unsigned int idx)
  * time.
  */
 static inline unsigned calc_index(unsigned long expires, unsigned lvl,
-				  unsigned long *bucket_expiry)
+				  unsigned long *bucket_expiry, bool upper_bound)
 {
 
 	/*
@@ -501,34 +501,39 @@  static inline unsigned calc_index(unsigned long expires, unsigned lvl,
 	 * - Truncation of the expiry time in the outer wheel levels
 	 *
 	 * Round up with level granularity to prevent this.
+	 * Do not perform round up in case of upper bound timer.
 	 */
-	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+	if (upper_bound)
+		expires = expires >> LVL_SHIFT(lvl);
+	else
+		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+
 	*bucket_expiry = expires << LVL_SHIFT(lvl);
 	return LVL_OFFS(lvl) + (expires & LVL_MASK);
 }
 
 static int calc_wheel_index(unsigned long expires, unsigned long clk,
-			    unsigned long *bucket_expiry)
+			    unsigned long *bucket_expiry, bool upper_bound)
 {
 	unsigned long delta = expires - clk;
 	unsigned int idx;
 
 	if (delta < LVL_START(1)) {
-		idx = calc_index(expires, 0, bucket_expiry);
+		idx = calc_index(expires, 0, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(2)) {
-		idx = calc_index(expires, 1, bucket_expiry);
+		idx = calc_index(expires, 1, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(3)) {
-		idx = calc_index(expires, 2, bucket_expiry);
+		idx = calc_index(expires, 2, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(4)) {
-		idx = calc_index(expires, 3, bucket_expiry);
+		idx = calc_index(expires, 3, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(5)) {
-		idx = calc_index(expires, 4, bucket_expiry);
+		idx = calc_index(expires, 4, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(6)) {
-		idx = calc_index(expires, 5, bucket_expiry);
+		idx = calc_index(expires, 5, bucket_expiry, upper_bound);
 	} else if (delta < LVL_START(7)) {
-		idx = calc_index(expires, 6, bucket_expiry);
+		idx = calc_index(expires, 6, bucket_expiry, upper_bound);
 	} else if (LVL_DEPTH > 8 && delta < LVL_START(8)) {
-		idx = calc_index(expires, 7, bucket_expiry);
+		idx = calc_index(expires, 7, bucket_expiry, upper_bound);
 	} else if ((long) delta < 0) {
 		idx = clk & LVL_MASK;
 		*bucket_expiry = clk;
@@ -540,7 +545,8 @@  static int calc_wheel_index(unsigned long expires, unsigned long clk,
 		if (delta >= WHEEL_TIMEOUT_CUTOFF)
 			expires = clk + WHEEL_TIMEOUT_MAX;
 
-		idx = calc_index(expires, LVL_DEPTH - 1, bucket_expiry);
+		idx = calc_index(expires, LVL_DEPTH - 1, bucket_expiry,
+				upper_bound);
 	}
 	return idx;
 }
@@ -607,7 +613,8 @@  static void internal_add_timer(struct timer_base *base, struct timer_list *timer
 	unsigned long bucket_expiry;
 	unsigned int idx;
 
-	idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry);
+	idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry,
+			timer->flags & TIMER_UPPER_BOUND);
 	enqueue_timer(base, timer, idx, bucket_expiry);
 }
 
@@ -1000,7 +1007,8 @@  __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option
 		}
 
 		clk = base->clk;
-		idx = calc_wheel_index(expires, clk, &bucket_expiry);
+		idx = calc_wheel_index(expires, clk, &bucket_expiry,
+				timer->flags & TIMER_UPPER_BOUND);
 
 		/*
 		 * Retrieve and compare the array index of the pending