diff mbox series

RFC: arm64: arch_timer: Fix timer inconsistency test for A64

Message ID 20200929111347.1967438-1-r.stratiienko@gmail.com (mailing list archive)
State New, archived
Headers show
Series RFC: arm64: arch_timer: Fix timer inconsistency test for A64 | expand

Commit Message

Roman Stratiienko Sept. 29, 2020, 11:13 a.m. UTC
Fixes linux_kselftest:timers_inconsistency-check_arm_64

Test logs without the fix:
'''
binary returned non-zero. Exit code: 1, stderr: , stdout:
Consistent CLOCK_REALTIME
1601335525:467086804
1601335525:467087554
1601335525:467088345
1601335525:467089095
1601335525:467089887
1601335525:467090637
1601335525:467091429
1601335525:467092179
1601335525:467092929
1601335525:467093720
1601335525:467094470
1601335525:467095262
1601335525:467096012
1601335525:467096804
--------------------
1601335525:467097554
1601335525:467077012
--------------------
1601335525:467099095
1601335525:467099845
1601335525:467100637
1601335525:467101387
1601335525:467102179
1601335525:467102929
'''

Signed-off-by: Roman Stratiienko <r.stratiienko@gmail.com>
CC: linux-arm-kernel@lists.infradead.org
CC: linux-kernel@vger.kernel.org
CC: linux-sunxi@googlegroups.com
CC: megous@megous.com
---
 drivers/clocksource/arm_arch_timer.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Ondřej Jirman Sept. 29, 2020, 11:35 a.m. UTC | #1
Hello Roman,

On Tue, Sep 29, 2020 at 02:13:47PM +0300, Roman Stratiienko wrote:
> Fixes linux_kselftest:timers_inconsistency-check_arm_64
> 
> Test logs without the fix:
> '''
> binary returned non-zero. Exit code: 1, stderr: , stdout:
> Consistent CLOCK_REALTIME
> 1601335525:467086804
> 1601335525:467087554
> 1601335525:467088345
> 1601335525:467089095
> 1601335525:467089887
> 1601335525:467090637
> 1601335525:467091429
> 1601335525:467092179
> 1601335525:467092929
> 1601335525:467093720
> 1601335525:467094470
> 1601335525:467095262
> 1601335525:467096012
> 1601335525:467096804
> --------------------
> 1601335525:467097554
> 1601335525:467077012
> --------------------
> 1601335525:467099095
> 1601335525:467099845
> 1601335525:467100637
> 1601335525:467101387
> 1601335525:467102179
> 1601335525:467102929
> '''

Can you reproduce the issue with a fixed CPU frequency. I suspect the root
cause is around CPU frequency scaling code on A64, and timer jumps happen when
the kernel is changing CPU frequency.

I fixed a similar issue on H3 SoC just by changing the CPU frequency scaling
code, without having to touch the timer readout code.

https://megous.com/git/linux/commit/?h=ths-5.9&id=51ff1a6d80126f678efca42555f93efa611f50c4

regards,
	o.

> Signed-off-by: Roman Stratiienko <r.stratiienko@gmail.com>
> CC: linux-arm-kernel@lists.infradead.org
> CC: linux-kernel@vger.kernel.org
> CC: linux-sunxi@googlegroups.com
> CC: megous@megous.com
> ---
>  drivers/clocksource/arm_arch_timer.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 6c3e841801461..d50aa43cb654b 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -346,16 +346,17 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>   * number of CPU cycles in 3 consecutive 24 MHz counter periods.
>   */
>  #define __sun50i_a64_read_reg(reg) ({					\
> -	u64 _val;							\
> +	u64 _val1, _val2;						\
>  	int _retries = 150;						\
>  									\
>  	do {								\
> -		_val = read_sysreg(reg);				\
> +		_val1 = read_sysreg(reg);				\
> +		_val2 = read_sysreg(reg);				\
>  		_retries--;						\
> -	} while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);	\
> +	} while (((_val2 - _val1) > 0x10) && _retries);			\
>  									\
>  	WARN_ON_ONCE(!_retries);					\
> -	_val;								\
> +	_val2;								\
>  })
>  
>  static u64 notrace sun50i_a64_read_cntpct_el0(void)
> -- 
> 2.25.1
>
Mark Rutland Sept. 29, 2020, 3:40 p.m. UTC | #2
Hi,

Please Cc maintainers for drivers -- Marc and I maintain the arch timer
driver.

On Tue, Sep 29, 2020 at 02:13:47PM +0300, Roman Stratiienko wrote:
> Fixes linux_kselftest:timers_inconsistency-check_arm_64
> 
> Test logs without the fix:
> '''
> binary returned non-zero. Exit code: 1, stderr: , stdout:
> Consistent CLOCK_REALTIME
> 1601335525:467086804
> 1601335525:467087554
> 1601335525:467088345
> 1601335525:467089095
> 1601335525:467089887
> 1601335525:467090637
> 1601335525:467091429
> 1601335525:467092179
> 1601335525:467092929
> 1601335525:467093720
> 1601335525:467094470
> 1601335525:467095262
> 1601335525:467096012
> 1601335525:467096804
> --------------------
> 1601335525:467097554
> 1601335525:467077012

That's 0x1BD757D2 followed by 0x1BD70794. The rollback is somewhere in
bits 15:12 to go from 0x1BD75xxx to 0x1BD70xxx, which suggests the
analysis in the existing comment is incomplete.

> --------------------
> 1601335525:467099095
> 1601335525:467099845
> 1601335525:467100637
> 1601335525:467101387
> 1601335525:467102179
> 1601335525:467102929
> '''

It would be very helpful if the commit message could explain the rough
idea behind the change, because the rationale is not clear to me.

> Signed-off-by: Roman Stratiienko <r.stratiienko@gmail.com>
> CC: linux-arm-kernel@lists.infradead.org
> CC: linux-kernel@vger.kernel.org
> CC: linux-sunxi@googlegroups.com
> CC: megous@megous.com
> ---
>  drivers/clocksource/arm_arch_timer.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 6c3e841801461..d50aa43cb654b 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -346,16 +346,17 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>   * number of CPU cycles in 3 consecutive 24 MHz counter periods.
>   */
>  #define __sun50i_a64_read_reg(reg) ({					\
> -	u64 _val;							\
> +	u64 _val1, _val2;						\
>  	int _retries = 150;						\
>  									\
>  	do {								\
> -		_val = read_sysreg(reg);				\
> +		_val1 = read_sysreg(reg);				\
> +		_val2 = read_sysreg(reg);				\
>  		_retries--;						\
> -	} while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);	\
> +	} while (((_val2 - _val1) > 0x10) && _retries);			\

This is going to fail quite often at low CPU frequencies, and it's not
clear to me that this solves the problem any more generally. DO we know
what the underlying erratum is here?

Thanks,
Mark.

>  									\
>  	WARN_ON_ONCE(!_retries);					\
> -	_val;								\
> +	_val2;								\
>  })
>  
>  static u64 notrace sun50i_a64_read_cntpct_el0(void)
> -- 
> 2.25.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Samuel Holland Sept. 30, 2020, 1 a.m. UTC | #3
On 9/29/20 10:40 AM, Mark Rutland wrote:
> Hi,
> 
> Please Cc maintainers for drivers -- Marc and I maintain the arch timer
> driver.
> 
> On Tue, Sep 29, 2020 at 02:13:47PM +0300, Roman Stratiienko wrote:
>> Fixes linux_kselftest:timers_inconsistency-check_arm_64
>>
>> Test logs without the fix:
>> '''
>> binary returned non-zero. Exit code: 1, stderr: , stdout:
>> Consistent CLOCK_REALTIME
>> 1601335525:467086804
>> 1601335525:467087554
>> 1601335525:467088345
>> 1601335525:467089095
>> 1601335525:467089887
>> 1601335525:467090637
>> 1601335525:467091429
>> 1601335525:467092179
>> 1601335525:467092929
>> 1601335525:467093720
>> 1601335525:467094470
>> 1601335525:467095262
>> 1601335525:467096012
>> 1601335525:467096804
>> --------------------
>> 1601335525:467097554
>> 1601335525:467077012
> 
> That's 0x1BD757D2 followed by 0x1BD70794. The rollback is somewhere in
> bits 15:12 to go from 0x1BD75xxx to 0x1BD70xxx, which suggests the
> analysis in the existing comment is incomplete.

My analysis only looked at consecutive timer reads on a single core. Apparently,
from the vendor comment that they needed CLOCKSOURCE_VALIDATE_LAST_CYCLE (linked
below), there is another inconsistency with reads across cores.

>> --------------------
>> 1601335525:467099095
>> 1601335525:467099845
>> 1601335525:467100637
>> 1601335525:467101387
>> 1601335525:467102179
>> 1601335525:467102929
>> '''
> 
> It would be very helpful if the commit message could explain the rough
> idea behind the change, because the rationale is not clear to me.
> 
>> Signed-off-by: Roman Stratiienko <r.stratiienko@gmail.com>
>> CC: linux-arm-kernel@lists.infradead.org
>> CC: linux-kernel@vger.kernel.org
>> CC: linux-sunxi@googlegroups.com
>> CC: megous@megous.com
>> ---
>>  drivers/clocksource/arm_arch_timer.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
>> index 6c3e841801461..d50aa43cb654b 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -346,16 +346,17 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>>   * number of CPU cycles in 3 consecutive 24 MHz counter periods.
>>   */
>>  #define __sun50i_a64_read_reg(reg) ({					\
>> -	u64 _val;							\
>> +	u64 _val1, _val2;						\
>>  	int _retries = 150;						\
>>  									\
>>  	do {								\
>> -		_val = read_sysreg(reg);				\
>> +		_val1 = read_sysreg(reg);				\
>> +		_val2 = read_sysreg(reg);				\
>>  		_retries--;						\
>> -	} while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);	\
>> +	} while (((_val2 - _val1) > 0x10) && _retries);			\
> 
> This is going to fail quite often at low CPU frequencies, and it's not
> clear to me that this solves the problem any more generally. DO we know
> what the underlying erratum is here?

This is what we have from the vendor:

https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/arch/arm64/include/asm/arch_timer.h#L155
https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/drivers/clocksource/arm_arch_timer.c#L272

and they select CLOCKSOURCE_VALIDATE_LAST_CYCLE.

Everything else we know (e.g. the tables and explanation in my original commit
log) is from testing by users.

I had not seen the vendor changes to arch_timer.h when I wrote the existing
workaround. Their retry loop is similar to this patch, but as you mention, it
won't work if the CPU frequency is too low. That may not be a problem for the
vendor kernel, but it breaks badly on mainline.

As Ondrej referenced, the mainline DVFS implementation bypasses the PLL during
frequency changes. At that point, both the CPU and the timer are running at 24
MHz, so the chance of hitting the retry limit is high... and there's a udelay()
call in the PLL bypass code (ccu_mux_notifier_cb()) that will need to read the
timer.

> Thanks,
> Mark.

Cheers,
Samuel

>>  									\
>>  	WARN_ON_ONCE(!_retries);					\
>> -	_val;								\
>> +	_val2;								\
>>  })
>>  
>>  static u64 notrace sun50i_a64_read_cntpct_el0(void)
>> -- 
>> 2.25.1
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
diff mbox series

Patch

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 6c3e841801461..d50aa43cb654b 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -346,16 +346,17 @@  static u64 notrace arm64_858921_read_cntvct_el0(void)
  * number of CPU cycles in 3 consecutive 24 MHz counter periods.
  */
 #define __sun50i_a64_read_reg(reg) ({					\
-	u64 _val;							\
+	u64 _val1, _val2;						\
 	int _retries = 150;						\
 									\
 	do {								\
-		_val = read_sysreg(reg);				\
+		_val1 = read_sysreg(reg);				\
+		_val2 = read_sysreg(reg);				\
 		_retries--;						\
-	} while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);	\
+	} while (((_val2 - _val1) > 0x10) && _retries);			\
 									\
 	WARN_ON_ONCE(!_retries);					\
-	_val;								\
+	_val2;								\
 })
 
 static u64 notrace sun50i_a64_read_cntpct_el0(void)