diff mbox series

[net-next] selftests/net: force synchronized GC for a test.

Message ID 20240223081346.2052267-1-thinker.li@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next] selftests/net: force synchronized GC for a test. | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 4 maintainers not CCed: shuah@kernel.org pabeni@redhat.com linux-kselftest@vger.kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success net selftest script(s) already in Makefile
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-23--18-00 (tests: 1457)

Commit Message

Kui-Feng Lee Feb. 23, 2024, 8:13 a.m. UTC
Due to the slowness of the test environment, always set off a synchronized
GC after waiting for GC. This can fix the problem that Fib6 garbage
collection test fails occasionally.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 tools/testing/selftests/net/fib_tests.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Jakub Kicinski Feb. 24, 2024, 2:21 a.m. UTC | #1
On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
> Due to the slowness of the test environment

Would be interesting if it's slowness, because it failed 2 times
on the debug runner but 5 times on the non-debug one. We'll see.
David Ahern Feb. 24, 2024, 3:15 a.m. UTC | #2
On 2/23/24 7:21 PM, Jakub Kicinski wrote:
> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>> Due to the slowness of the test environment
> 
> Would be interesting if it's slowness, because it failed 2 times
> on the debug runner but 5 times on the non-debug one. We'll see.

hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
and still failing suggests something else is at play
Kui-Feng Lee Feb. 24, 2024, 3:54 a.m. UTC | #3
On 2/23/24 19:15, David Ahern wrote:
> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>> Due to the slowness of the test environment
>>
>> Would be interesting if it's slowness, because it failed 2 times
>> on the debug runner but 5 times on the non-debug one. We'll see.
> 
> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
> and still failing suggests something else is at play

Do you mean it is still happening even with this patch?
Kui-Feng Lee Feb. 24, 2024, 4:40 a.m. UTC | #4
On 2/23/24 18:21, Jakub Kicinski wrote:
> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>> Due to the slowness of the test environment
> 
> Would be interesting if it's slowness, because it failed 2 times
> on the debug runner but 5 times on the non-debug one. We'll see.

The code was landed on Feb 12, 2024.
It is actually 1 time with debug runner but 3 times on the non-debug 
one. 1 time with debug runner and 2 times with the non-debug runner 
happened in a 12 hours period.
Paolo Abeni Feb. 27, 2024, 8:53 a.m. UTC | #5
On Fri, 2024-02-23 at 20:40 -0800, Kui-Feng Lee wrote:
> 
> On 2/23/24 18:21, Jakub Kicinski wrote:
> > On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
> > > Due to the slowness of the test environment
> > 
> > Would be interesting if it's slowness, because it failed 2 times
> > on the debug runner but 5 times on the non-debug one. We'll see.
> 
> The code was landed on Feb 12, 2024.
> It is actually 1 time with debug runner but 3 times on the non-debug 
> one. 1 time with debug runner and 2 times with the non-debug runner 
> happened in a 12 hours period.

The only failure in a debug run was causes by an unrelated issue:

https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/467362/7-fib-tests-sh/stdout

(grep for UBSAN)

Basically the problems happens quite sporadically only in "fast" env. I
think it's worthy a deeper investigation (it sounds like a race
somewhere).

I guess/hope then raising the number of temporary and permanent routes
created by the relevant test-case could help making the issue more
easily reproducible.

Thanks,

Paolo
Kui-Feng Lee Feb. 29, 2024, 5:39 p.m. UTC | #6
On 2/23/24 19:15, David Ahern wrote:
> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>> Due to the slowness of the test environment
>>
>> Would be interesting if it's slowness, because it failed 2 times
>> on the debug runner but 5 times on the non-debug one. We'll see.
> 
> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
> and still failing suggests something else is at play

I did some tests, and found fib6_run_gc() do round_jiffies()
for the gc timer. So, gc_interval can increase 0.75 seconds in
some case. I am doing more investigation on this.
Kui-Feng Lee March 2, 2024, 12:45 a.m. UTC | #7
On 2/29/24 09:39, Kui-Feng Lee wrote:
> 
> 
> On 2/23/24 19:15, David Ahern wrote:
>> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>>> Due to the slowness of the test environment
>>>
>>> Would be interesting if it's slowness, because it failed 2 times
>>> on the debug runner but 5 times on the non-debug one. We'll see.
>>
>> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
>> and still failing suggests something else is at play
> 
> I did some tests, and found fib6_run_gc() do round_jiffies()
> for the gc timer. So, gc_interval can increase 0.75 seconds in
> some case. I am doing more investigation on this.

My conclusion is routes going to expire in N seconds can stay
for 2*N + 1 seconds.

Adding a new route going to expire in N seconds, it will starts
gc timer for N seconds. In our case, we add several routes going to
expire in N seconds consecutively. The later routes may expires in
different ticks than the first route added. So, a route may wait nearly
2*N seconds. The test case waits for 2*N + 1 seconds, that should be
enough.

However, some extra waiting may be added to it.
There are two possible extra waiting. The first one is calling
round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
one is the granularity of waiting for 5 seconds (in our case) is 512ms
for HZ 1000 according to the comment at the very begin of timer.c.
In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
contribute up to 1144ms.

Does that make sense?

Debug build is slower. So, the test scripts will be slower than normal
build. That means the script is actually waiting longer with a debug build.
Jakub Kicinski March 4, 2024, 3:44 p.m. UTC | #8
On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
> However, some extra waiting may be added to it.
> There are two possible extra waiting. The first one is calling
> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
> one is the granularity of waiting for 5 seconds (in our case) is 512ms
> for HZ 1000 according to the comment at the very begin of timer.c.
> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
> contribute up to 1144ms.
> 
> Does that make sense?
> 
> Debug build is slower. So, the test scripts will be slower than normal
> build. That means the script is actually waiting longer with a debug build.

Meaning bumping the wait to $((($EXPIRE + 1) * 2))
should be enough for the non-debug runner?
Kui-Feng Lee March 4, 2024, 5:27 p.m. UTC | #9
On 3/4/24 07:44, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>> However, some extra waiting may be added to it.
>> There are two possible extra waiting. The first one is calling
>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>> for HZ 1000 according to the comment at the very begin of timer.c.
>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>> contribute up to 1144ms.
>>
>> Does that make sense?
>>
>> Debug build is slower. So, the test scripts will be slower than normal
>> build. That means the script is actually waiting longer with a debug build.
> 
> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
> should be enough for the non-debug runner?

Yes, it should be enough. I will send out another patch base on this.
David Ahern March 5, 2024, 2:41 a.m. UTC | #10
On 3/4/24 8:44 AM, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>> However, some extra waiting may be added to it.
>> There are two possible extra waiting. The first one is calling
>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>> for HZ 1000 according to the comment at the very begin of timer.c.
>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>> contribute up to 1144ms.
>>
>> Does that make sense?
>>
>> Debug build is slower. So, the test scripts will be slower than normal
>> build. That means the script is actually waiting longer with a debug build.
> 
> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
> should be enough for the non-debug runner?

I have not had time to do a deep a dive on the timing, but it seems odd
to me that a 1 second timer can turn into 11 sec. That means for 10
seconds (10x the time the user requested) the route survived.
David Ahern March 5, 2024, 2:49 a.m. UTC | #11
On 3/4/24 7:41 PM, David Ahern wrote:
> On 3/4/24 8:44 AM, Jakub Kicinski wrote:
>> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>>> However, some extra waiting may be added to it.
>>> There are two possible extra waiting. The first one is calling
>>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>>> for HZ 1000 according to the comment at the very begin of timer.c.
>>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>>> contribute up to 1144ms.
>>>
>>> Does that make sense?
>>>
>>> Debug build is slower. So, the test scripts will be slower than normal
>>> build. That means the script is actually waiting longer with a debug build.
>>
>> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
>> should be enough for the non-debug runner?
> 
> I have not had time to do a deep a dive on the timing, but it seems odd
> to me that a 1 second timer can turn into 11 sec. That means for 10
> seconds (10x the time the user requested) the route survived.

Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a
flush which is going to try to kick off gc at that moment. Is the delay
kicking in?

delay = net->ipv6.sysctl.flush_delay;
Kui-Feng Lee March 5, 2024, 3:20 a.m. UTC | #12
On 3/4/24 18:49, David Ahern wrote:
> On 3/4/24 7:41 PM, David Ahern wrote:
>> On 3/4/24 8:44 AM, Jakub Kicinski wrote:
>>> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>>>> However, some extra waiting may be added to it.
>>>> There are two possible extra waiting. The first one is calling
>>>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>>>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>>>> for HZ 1000 according to the comment at the very begin of timer.c.
>>>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>>>> contribute up to 1144ms.
>>>>
>>>> Does that make sense?
>>>>
>>>> Debug build is slower. So, the test scripts will be slower than normal
>>>> build. That means the script is actually waiting longer with a debug build.
>>>
>>> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
>>> should be enough for the non-debug runner?
>>
>> I have not had time to do a deep a dive on the timing, but it seems odd
>> to me that a 1 second timer can turn into 11 sec. That means for 10
>> seconds (10x the time the user requested) the route survived.
> 
> Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a
> flush which is going to try to kick off gc at that moment. Is the delay
> kicking in?
> 
> delay = net->ipv6.sysctl.flush_delay;

The delay doesn't cause any different here. The delay affects only 
exceptions.

It doesn't 10x. It does 2 times.
In fib_tests.sh

   sysctl -wq net.ipv6.route.gc_interval= $EXPIRE

The test sets the gc interval to 5 seconds.

The test install several routes in a round. So, some later routes may
miss the first GC that emitted after the first 5 seconds. These routes
have to wait for 5 more seconds for the next time GC. So, it is 10
seconds. However, due to calling round_jiffies() in fib6_run_gc(), the
waiting time of the second GC period can be longer. It could increase
750ms at most.  And, due to the granularity of the timer, it can adds
another 512ms at the range 4s~32s.
diff mbox series

Patch

diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 3ec1050e47a2..0a82c9bc07bb 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -823,7 +823,9 @@  fib6_gc_test()
 	    $IP -6 route add 2001:20::$i \
 		via 2001:10::2 dev dummy_10 expires $EXPIRE
 	done
+	# Wait for GC
 	sleep $(($EXPIRE * 2 + 1))
+	$NS_EXEC sysctl -wq net.ipv6.route.flush=1
 	check_rt_num 0 $($IP -6 route list |grep expires|wc -l)
 	log_test $ret 0 "ipv6 route garbage collection (with permanent routes)"
 
@@ -864,7 +866,7 @@  fib6_gc_test()
 
 	# Wait for GC
 	sleep $(($EXPIRE * 2 + 1))
-
+	$NS_EXEC sysctl -wq net.ipv6.route.flush=1
 	check_rt_num 5 $($IP -6 route list |grep -v expires|grep 2001:20::|wc -l)
 	log_test $ret 0 "ipv6 route garbage collection (replace with permanent)"