[net-next,v2] selftests: openvswitch: retry instead of sleep

Message ID 20240710090500.1655212-1-amorenoz@redhat.com (mailing list archive)
State Accepted
Commit 5e724cb688a207ae7a348d57f9ea77f475998883
Series [net-next,v2] selftests: openvswitch: retry instead of sleep

Commit Message

Adrián Moreno July 10, 2024, 9:04 a.m. UTC
There are a couple of places where the test script "sleep"s to wait for
some external condition to be met.

This is error-prone, especially on slow systems (identified in CI by
"KSFT_MACHINE_SLOW=yes").

To fix this, add an "ovs_wait" function that tries to execute a command
a few times until it succeeds. The timeout used is set to 5s for
"normal" systems and doubled if a slow CI machine is detected.

This should make the following work:

$ vng --build  \
    --config tools/testing/selftests/net/config \
    --config kernel/configs/debug.config

$ vng --run . --user root -- "make -C tools/testing/selftests/ \
    KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 .../selftests/net/openvswitch/openvswitch.sh  | 45 +++++++++++++++----
 .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
 2 files changed, 38 insertions(+), 8 deletions(-)
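
The retry idea described above can be tried outside the test suite. The
snippet below is only a minimal sketch, not the code added by this patch:
"retry_until" and "RETRY_TIMEOUT" are hypothetical names chosen for
illustration. Only the 5 s default, the doubling when KSFT_MACHINE_SLOW=yes
is set, and the one-second pause between attempts are taken from the patch.

```shell
#!/bin/bash
# Hypothetical stand-alone sketch of a retry helper (illustrative names).

# Default budget in seconds; doubled on machines flagged as slow.
RETRY_TIMEOUT=${RETRY_TIMEOUT:-5}
if [ "${KSFT_MACHINE_SLOW:-no}" = "yes" ]; then
	RETRY_TIMEOUT=$((RETRY_TIMEOUT * 2))
fi

# retry_until CMD [ARGS...]
# Run CMD roughly once per second until it succeeds or the budget is spent.
# Returns 0 on the first success, non-zero if CMD never succeeded.
retry_until() {
	local i=0

	while [ "$i" -lt "$RETRY_TIMEOUT" ]; do
		# Success on any attempt ends the wait early.
		"$@" && return 0
		sleep 1
		i=$((i + 1))
	done

	# One last attempt after the final pause; its status is the result.
	"$@"
}
```

A call such as retry_until grep -q "some marker" "$logfile" || return 1
would then stand in for a fixed "sleep 1" followed by a one-shot grep.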

Comments

Adrián Moreno July 11, 2024, 9:41 a.m. UTC | #1
On Wed, Jul 10, 2024 at 11:04:59AM GMT, Adrian Moreno wrote:
> There are a couple of places where the test script "sleep"s to wait for
> some external condition to be met.
>
> This is error prone, specially in slow systems (identified in CI by
> "KSFT_MACHINE_SLOW=yes").
>
> To fix this, add a "ovs_wait" function that tries to execute a command
> a few times until it succeeds. The timeout used is set to 5s for
> "normal" systems and doubled if a slow CI machine is detected.
>
> This should make the following work:
>
> $ vng --build  \
>     --config tools/testing/selftests/net/config \
>     --config kernel/configs/debug.config
>
> $ vng --run . --user root -- "make -C tools/testing/selftests/ \
>     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
>
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  .../selftests/net/openvswitch/openvswitch.sh  | 45 +++++++++++++++----
>  .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
>  2 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> index bc71dbc18b21..cc0bfae2bafa 100755
> --- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
> +++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> @@ -11,6 +11,11 @@ ksft_skip=4
>  PAUSE_ON_FAIL=no
>  VERBOSE=0
>  TRACING=0
> +WAIT_TIMEOUT=5
> +
> +if test "X$KSFT_MACHINE_SLOW" == "Xyes"; then
> +	WAIT_TIMEOUT=10
> +fi
>
>  tests="
>  	arp_ping				eth-arp: Basic arp ping between two NS
> @@ -29,6 +34,30 @@ info() {
>  	[ $VERBOSE = 0 ] || echo $*
>  }
>
> +ovs_wait() {
> +	info "waiting $WAIT_TIMEOUT s for: $@"
> +
> +	if "$@" ; then
> +		info "wait succeeded immediately"
> +		return 0
> +	fi
> +
> +	# A quick re-check helps speed up small races in fast systems.
> +	# However, fractional sleeps might not necessarily work.
> +	local start=0
> +	sleep 0.1 || { sleep 1; start=1; }
> +
> +	for (( i=start; i<WAIT_TIMEOUT; i++ )); do
> +		if "$@" ; then
> +			info "wait succeeded after $i seconds"
> +			return 0
> +		fi
> +		sleep 1
> +	done
> +	info "wait failed after $i seconds"
> +	return 1
> +}
> +
>  ovs_base=`pwd`
>  sbxs=
>  sbx_add () {
> @@ -278,20 +307,19 @@ test_psample() {
>
>  	# Record psample data.
>  	ovs_spawn_daemon "test_psample" python3 $ovs_base/ovs-dpctl.py psample-events
> +	ovs_wait grep -q "listening for psample events" ${ovs_dir}/stdout
>
>  	# Send a single ping.
> -	sleep 1
>  	ovs_sbx "test_psample" ip netns exec client ping -I c1 172.31.110.20 -c 1 || return 1
> -	sleep 1
>
>  	# We should have received one userspace action upcall and 2 psample packets.
> -	grep -E "userspace action command" $ovs_dir/s0.out >/dev/null 2>&1 || return 1
> +	ovs_wait grep -q "userspace action command" $ovs_dir/s0.out || return 1
>
>  	# client -> server samples should only contain the first 14 bytes of the packet.
> -	grep -E "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
> -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> -	grep -E "rate:4294967295,group:2,cookie:eeff0c" \
> -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> +	ovs_wait grep -qE "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
> +		$ovs_dir/stdout || return 1
> +
> +	ovs_wait grep -q "rate:4294967295,group:2,cookie:eeff0c" $ovs_dir/stdout || return 1
>
>  	return 0
>  }
> @@ -711,7 +739,8 @@ test_upcall_interfaces() {
>  	ovs_add_netns_and_veths "test_upcall_interfaces" ui0 upc left0 l0 \
>  	    172.31.110.1/24 -u || return 1
>
> -	sleep 1
> +	ovs_wait grep -q "listening on upcall packet handler" ${ovs_dir}/left0.out
> +
>  	info "sending arping"
>  	ip netns exec upc arping -I l0 172.31.110.20 -c 1 \
>  	    >$ovs_dir/arping.stdout 2>$ovs_dir/arping.stderr
> diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> index 1e15b0818074..8a0396bfaf99 100644
> --- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> +++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> @@ -2520,6 +2520,7 @@ class PsampleEvent(EventSocket):
>      marshal_class = psample_msg
>
>      def read_samples(self):
> +        print("listening for psample events", flush=True)
>          while True:
>              try:
>                  for msg in self.get():
> --
> 2.45.2
>


This patch is supposed to fix openvswitch selftests on "-dbg" machines.
However, as Simon points out, all recent rounds are failing [1]. I don't
see this patch being included in the batches and I was wondering why.

Also, I see a (presumably unrelated) build error in netdev/build_32bit.
Is there anything I can do?

[1]
https://netdev.bots.linux.dev/contest.html?executor=vmksft-net-dbg&test=openvswitch-sh

Thanks.
Adrián

Ilya Maximets July 11, 2024, 11:28 a.m. UTC | #2
On 7/10/24 11:04, Adrian Moreno wrote:
> There are a couple of places where the test script "sleep"s to wait for
> some external condition to be met.
> 
> This is error prone, specially in slow systems (identified in CI by
> "KSFT_MACHINE_SLOW=yes").
> 
> To fix this, add a "ovs_wait" function that tries to execute a command
> a few times until it succeeds. The timeout used is set to 5s for
> "normal" systems and doubled if a slow CI machine is detected.
> 
> This should make the following work:
> 
> $ vng --build  \
>     --config tools/testing/selftests/net/config \
>     --config kernel/configs/debug.config
> 
> $ vng --run . --user root -- "make -C tools/testing/selftests/ \
>     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  .../selftests/net/openvswitch/openvswitch.sh  | 45 +++++++++++++++----
>  .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
>  2 files changed, 38 insertions(+), 8 deletions(-)

Seems like we don't have a signal from CI yet for some reason,
but I tested this locally and it seems to work fine.  Either
way, it's a better way of doing things than sleep'n'hope.

Reviewed-by: Ilya Maximets <i.maximets@ovn.org>

Jakub Kicinski July 11, 2024, 3:44 p.m. UTC | #3
On Thu, 11 Jul 2024 09:41:08 +0000 Adrián Moreno wrote:
> This patch is supposed to fix openvswitch selftests on "-dbg" machines.
> However, as Simon points out, all recent rounds are failing [1]. I don't
> see this patch being included in the batches and I was wondering why.
> 
> Also I see a (presumably unrelated) build error netdev/build_32bit.
> Is there anything I can do?

Hopefully fixed now, we'll see if it gets into the next run.

patchwork-bot+netdevbpf@kernel.org July 12, 2024, 1:20 a.m. UTC | #4
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 10 Jul 2024 11:04:59 +0200 you wrote:
> There are a couple of places where the test script "sleep"s to wait for
> some external condition to be met.
> 
> This is error prone, specially in slow systems (identified in CI by
> "KSFT_MACHINE_SLOW=yes").
> 
> To fix this, add a "ovs_wait" function that tries to execute a command
> a few times until it succeeds. The timeout used is set to 5s for
> "normal" systems and doubled if a slow CI machine is detected.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] selftests: openvswitch: retry instead of sleep
    https://git.kernel.org/netdev/net-next/c/5e724cb688a2

You are awesome, thank you!

Patch

diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index bc71dbc18b21..cc0bfae2bafa 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -11,6 +11,11 @@  ksft_skip=4
 PAUSE_ON_FAIL=no
 VERBOSE=0
 TRACING=0
+WAIT_TIMEOUT=5
+
+if test "X$KSFT_MACHINE_SLOW" == "Xyes"; then
+	WAIT_TIMEOUT=10
+fi
 
 tests="
 	arp_ping				eth-arp: Basic arp ping between two NS
@@ -29,6 +34,30 @@  info() {
 	[ $VERBOSE = 0 ] || echo $*
 }
 
+ovs_wait() {
+	info "waiting $WAIT_TIMEOUT s for: $@"
+
+	if "$@" ; then
+		info "wait succeeded immediately"
+		return 0
+	fi
+
+	# A quick re-check helps speed up small races in fast systems.
+	# However, fractional sleeps might not necessarily work.
+	local start=0
+	sleep 0.1 || { sleep 1; start=1; }
+
+	for (( i=start; i<WAIT_TIMEOUT; i++ )); do
+		if "$@" ; then
+			info "wait succeeded after $i seconds"
+			return 0
+		fi
+		sleep 1
+	done
+	info "wait failed after $i seconds"
+	return 1
+}
+
 ovs_base=`pwd`
 sbxs=
 sbx_add () {
@@ -278,20 +307,19 @@  test_psample() {
 
 	# Record psample data.
 	ovs_spawn_daemon "test_psample" python3 $ovs_base/ovs-dpctl.py psample-events
+	ovs_wait grep -q "listening for psample events" ${ovs_dir}/stdout
 
 	# Send a single ping.
-	sleep 1
 	ovs_sbx "test_psample" ip netns exec client ping -I c1 172.31.110.20 -c 1 || return 1
-	sleep 1
 
 	# We should have received one userspace action upcall and 2 psample packets.
-	grep -E "userspace action command" $ovs_dir/s0.out >/dev/null 2>&1 || return 1
+	ovs_wait grep -q "userspace action command" $ovs_dir/s0.out || return 1
 
 	# client -> server samples should only contain the first 14 bytes of the packet.
-	grep -E "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
-			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
-	grep -E "rate:4294967295,group:2,cookie:eeff0c" \
-			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
+	ovs_wait grep -qE "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
+		$ovs_dir/stdout || return 1
+
+	ovs_wait grep -q "rate:4294967295,group:2,cookie:eeff0c" $ovs_dir/stdout || return 1
 
 	return 0
 }
@@ -711,7 +739,8 @@  test_upcall_interfaces() {
 	ovs_add_netns_and_veths "test_upcall_interfaces" ui0 upc left0 l0 \
 	    172.31.110.1/24 -u || return 1
 
-	sleep 1
+	ovs_wait grep -q "listening on upcall packet handler" ${ovs_dir}/left0.out
+
 	info "sending arping"
 	ip netns exec upc arping -I l0 172.31.110.20 -c 1 \
 	    >$ovs_dir/arping.stdout 2>$ovs_dir/arping.stderr
diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index 1e15b0818074..8a0396bfaf99 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -2520,6 +2520,7 @@  class PsampleEvent(EventSocket):
     marshal_class = psample_msg
 
     def read_samples(self):
+        print("listening for psample events", flush=True)
         while True:
             try:
                 for msg in self.get():
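
The one-line change to ovs-dpctl.py gives the shell side something to wait
for: the daemon announces that it is listening, and the test polls its
output for that line instead of sleeping. By default Python block-buffers
stdout when it is redirected to a file, which is why the marker is printed
with flush=True. The handshake can be sketched in shell as below. This is an
illustration under stated assumptions, not part of the patch: "fake_daemon",
the marker text and the five-attempt limit are all made up.

```shell
#!/bin/bash
# Hypothetical demo: a background "daemon" announces readiness in its log,
# and the caller polls the log instead of sleeping for a fixed time.

log=$(mktemp)

# Stand-in for a real daemon: slow to start, then prints a readiness marker.
fake_daemon() {
	sleep 1
	echo "listening for events"
	sleep 5
}

fake_daemon >"$log" 2>&1 &
daemon_pid=$!

# Poll for the marker, at most five attempts one second apart.
ready=no
attempt=0
while [ "$attempt" -lt 5 ]; do
	if grep -q "listening for events" "$log"; then
		ready=yes
		break
	fi
	sleep 1
	attempt=$((attempt + 1))
done

echo "daemon ready: $ready"

# Clean up the background job and the temporary log.
kill "$daemon_pid" 2>/dev/null || true
wait "$daemon_pid" 2>/dev/null || true
rm -f "$log"
```

The caller proceeds as soon as the marker shows up, so a fast machine does
not pay for the worst-case delay and a slow one is not cut off early.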