[net-next] selftests/net: ignore timing errors in so_txtime if KSFT_MACHINE_SLOW

Message ID 20240201162130.2278240-1-willemdebruijn.kernel@gmail.com (mailing list archive)
State Accepted
Commit c41dfb0dfbece824143ff51829d42cba4cb3c277

Commit Message

Willem de Bruijn Feb. 1, 2024, 4:21 p.m. UTC
From: Willem de Bruijn <willemb@google.com>

This test is time sensitive. It may fail on virtual machines and for
debug builds.

Continue to run in these environments to get code coverage. But
optionally suppress failure for timing errors (only). This is
controlled with environment variable KSFT_MACHINE_SLOW.

The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
as previously discussed. Returning KSFT_XFAIL from so_txtime.c would
require so_txtime.sh to distinguish those runs from KSFT_FAIL and
pass the code on. That added a bunch of (fragile bash) boilerplate,
while the result is interpreted the same as KSFT_PASS anyway.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/so_txtime.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Jakub Kicinski Feb. 2, 2024, 11:39 p.m. UTC | #1
On Thu,  1 Feb 2024 11:21:19 -0500 Willem de Bruijn wrote:
> This test is time sensitive. It may fail on virtual machines and for
> debug builds.
> 
> Continue to run in these environments to get code coverage. But
> optionally suppress failure for timing errors (only). This is
> controlled with environment variable KSFT_MACHINE_SLOW.
> 
> The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
> as previously discussed. Returning KSFT_XFAIL from so_txtime.c would
> require so_txtime.sh to distinguish those runs from KSFT_FAIL and
> pass the code on. That added a bunch of (fragile bash) boilerplate,
> while the result is interpreted the same as KSFT_PASS anyway.

FWIW another idea that came up when talking to Matthieu -
isolate the VMs which run time-sensitive tests to dedicated
CPUs. Right now we kick off around 70 4-CPU VMs and let them
battle for 72 cores. The machines don't look overloaded but
there can be some latency spikes (CPU use diagram attached).

So the idea would be to have a handful of special VMs running 
on dedicated CPUs without any CPU time competition. That could help 
with latency spikes. But we'd probably need to annotate the tests
which need some special treatment.

Probably too much work both to annotate tests and set up env,
but I thought I'd bring it up here in case you had an opinion.
Willem de Bruijn Feb. 3, 2024, 12:31 a.m. UTC | #2
Jakub Kicinski wrote:
> FWIW another idea that came up when talking to Matthieu -
> isolate the VMs which run time-sensitive tests to dedicated
> CPUs. Right now we kick off around 70 4-CPU VMs and let them
> battle for 72 cores. The machines don't look overloaded but
> there can be some latency spikes (CPU use diagram attached).
> 
> So the idea would be to have a handful of special VMs running 
> on dedicated CPUs without any CPU time competition. That could help 
> with latency spikes. But we'd probably need to annotate the tests
> which need some special treatment.
> 
> Probably too much work both to annotate tests and set up env,
> but I thought I'd bring it up here in case you had an opinion.

I'm not sure whether the issue with timing in VMs is CPU affinity.
Variance may just come from expensive hypercalls, even with a
dedicated CPU. Though tests can tell.

There's still the debug builds, as well.
Paolo Abeni Feb. 6, 2024, 9:18 a.m. UTC | #3
On Fri, 2024-02-02 at 19:31 -0500, Willem de Bruijn wrote:
> Jakub Kicinski wrote:
> > FWIW another idea that came up when talking to Matthieu -
> > isolate the VMs which run time-sensitive tests to dedicated
> > CPUs. Right now we kick off around 70 4-CPU VMs and let them
> > battle for 72 cores. The machines don't look overloaded but
> > there can be some latency spikes (CPU use diagram attached).
> > 
> > So the idea would be to have a handful of special VMs running 
> > on dedicated CPUs without any CPU time competition. That could help 
> > with latency spikes. But we'd probably need to annotate the tests
> > which need some special treatment.
> > 
> > Probably too much work both to annotate tests and set up env,
> > but I thought I'd bring it up here in case you had an opinion.
> 
> I'm not sure whether the issue with timing in VMs is CPU affinity.
> Variance may just come from expensive hypercalls, even with a
> dedicated CPU. Though tests can tell.

FTR, I think the CPU affinity setup is a bit too complex, and hard to
reproduce for 3rd parties willing to investigate possible future CI
failures. I think the current env-variable-based approach would help
with reproducibility.

> There's still the debug builds, as well.

I understand/hope you are investigating it? 

Cheers,

Paolo
patchwork-bot+netdevbpf@kernel.org Feb. 6, 2024, 9:30 a.m. UTC | #4
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Thu,  1 Feb 2024 11:21:19 -0500 you wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> This test is time sensitive. It may fail on virtual machines and for
> debug builds.
> 
> Continue to run in these environments to get code coverage. But
> optionally suppress failure for timing errors (only). This is
> controlled with environment variable KSFT_MACHINE_SLOW.
> 
> [...]

Here is the summary with links:
  - [net-next] selftests/net: ignore timing errors in so_txtime if KSFT_MACHINE_SLOW
    https://git.kernel.org/netdev/net-next/c/c41dfb0dfbec

You are awesome, thank you!
Matthieu Baerts Feb. 6, 2024, 10:38 a.m. UTC | #5
Hi Paolo, Willem, Jakub,

On 06/02/2024 10:18, Paolo Abeni wrote:
> On Fri, 2024-02-02 at 19:31 -0500, Willem de Bruijn wrote:
>> Jakub Kicinski wrote:
>>> FWIW another idea that came up when talking to Matthieu -
>>> isolate the VMs which run time-sensitive tests to dedicated
>>> CPUs. Right now we kick off around 70 4-CPU VMs and let them
>>> battle for 72 cores. The machines don't look overloaded but
>>> there can be some latency spikes (CPU use diagram attached).
>>>
>>> So the idea would be to have a handful of special VMs running 
>>> on dedicated CPUs without any CPU time competition. That could help 
>>> with latency spikes. But we'd probably need to annotate the tests
>>> which need some special treatment.
>>>
>>> Probably too much work both to annotate tests and set up env,
>>> but I thought I'd bring it up here in case you had an opinion.
>>
>> I'm not sure whether the issue with timing in VMs is CPU affinity.
>> Variance may just come from expensive hypercalls, even with a
>> dedicated CPU. Though tests can tell.
> 
> FTR, I think the CPU affinity setup is a bit too complex, and hard to
> reproduce for 3rd parties willing to investigate possible future CI
> failures. I think the current env-variable-based approach would help
> with reproducibility.

I agree with you. Initially, with 70 VMs with 4 CPU cores, I thought it
would have taken more CPU resources, especially when KVM is not used.

Looking at the screenshot provided by Jakub, the host doesn't seem
overloaded, and the VM isolation is probably enough. Maybe only the
first test(s) can be impacted?

In the end, now that the runner without KVM is no longer there, the
situation should be improved :)

>> There's still the debug builds, as well.

For one MPTCP selftest checking the time to transfer some data, we
increase the tolerance by looking at kallsyms:

  grep -q ' kmemleak_init$\| lockdep_init$\| kasan_init$\| prove_locking$' /proc/kallsyms

We can also look at KSFT_MACHINE_SLOW if it is the new standard.

Cheers,
Matt
Patch

diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 2672ac0b6d1f..8457b7ccbc09 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -134,8 +134,11 @@  static void do_recv_one(int fdr, struct timed_send *ts)
 	if (rbuf[0] != ts->data)
 		error(1, 0, "payload mismatch. expected %c", ts->data);
 
-	if (llabs(tstop - texpect) > cfg_variance_us)
-		error(1, 0, "exceeds variance (%d us)", cfg_variance_us);
+	if (llabs(tstop - texpect) > cfg_variance_us) {
+		fprintf(stderr, "exceeds variance (%d us)\n", cfg_variance_us);
+		if (!getenv("KSFT_MACHINE_SLOW"))
+			exit(1);
+	}
 }
 
 static void do_recv_verify_empty(int fdr)