diff mbox series

[net,v2] selftests: forwarding: Avoid false MDB delete/flush failures

Message ID c92569919307749f879b9482b0f3e125b7d9d2e3.1726480066.git.jamie.bainbridge@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] selftests: forwarding: Avoid false MDB delete/flush failures | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 7 this patch: 7
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: liuhangbin@gmail.com
netdev/build_clang success Errors and warnings before: 7 this patch: 7
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success net selftest script(s) already in Makefile
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 7 this patch: 7
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 177 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-09-16--18-00 (tests: 764)

Commit Message

Jamie Bainbridge Sept. 16, 2024, 9:49 a.m. UTC
Running this test on a small system produces different failures every
test checking deletions, and some flushes. From different test runs:

TEST: Common host entries configuration tests (L2)                [FAIL]
  Failed to delete L2 host entry

TEST: Common port group entries configuration tests (IPv4 (S, G)) [FAIL]
  IPv4 (S, G) entry with VLAN 10 not deleted when VLAN was not specified

TEST: Common port group entries configuration tests (IPv6 (*, G)) [FAIL]
  IPv6 (*, G) entry with VLAN 10 not deleted when VLAN was not specified

TEST: Flush tests                                                 [FAIL]
  Entry not flushed by specified VLAN ID

TEST: Flush tests                                                 [FAIL]
  IPv6 host entry not flushed by "nopermanent" state

Add a short sleep after deletion and flush to resolve this.

Create a delay variable just for this test to allow short sleep, the
lib.sh WAIT_TIME of 5 seconds makes the test far longer than necessary.

Tested on several weak systems with 0.1s delay:
- Ivy Bridge Celeron netbook (2014 x86_64)
- Raspberry Pi 3B (2016 aarch64)
- Small KVM VM on Intel 10th gen (2020 x86_64)
All these systems ran 25 test runs in a row with 100% pass OK.

Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
---
v2: Avoid false check failures as seen by Jakub Kicinski.
---
 .../selftests/net/forwarding/bridge_mdb.sh    | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Simon Horman Sept. 19, 2024, 1:41 p.m. UTC | #1
On Mon, Sep 16, 2024 at 07:49:05PM +1000, Jamie Bainbridge wrote:
> Running this test on a small system produces different failures every
> test checking deletions, and some flushes. From different test runs:
> 
> TEST: Common host entries configuration tests (L2)                [FAIL]
>   Failed to delete L2 host entry
> 
> TEST: Common port group entries configuration tests (IPv4 (S, G)) [FAIL]
>   IPv4 (S, G) entry with VLAN 10 not deleted when VLAN was not specified
> 
> TEST: Common port group entries configuration tests (IPv6 (*, G)) [FAIL]
>   IPv6 (*, G) entry with VLAN 10 not deleted when VLAN was not specified
> 
> TEST: Flush tests                                                 [FAIL]
>   Entry not flushed by specified VLAN ID
> 
> TEST: Flush tests                                                 [FAIL]
>   IPv6 host entry not flushed by "nopermanent" state
> 
> Add a short sleep after deletion and flush to resolve this.
> 
> Create a delay variable just for this test to allow short sleep, the
> lib.sh WAIT_TIME of 5 seconds makes the test far longer than necessary.
> 
> Tested on several weak systems with 0.1s delay:
> - Ivy Bridge Celeron netbook (2014 x86_64)
> - Raspberry Pi 3B (2016 aarch64)
> - Small KVM VM on Intel 10th gen (2020 x86_64)
> All these systems ran 25 test runs in a row with 100% pass OK.
> 
> Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test")
> Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
> ---
> v2: Avoid false check failures as seen by Jakub Kicinski.
> ---

Reviewed-by: Simon Horman <horms@kernel.org>
Ido Schimmel Sept. 19, 2024, 2:55 p.m. UTC | #2
Hi,

Thanks for the patch and sorry for the late reply (was OOO).

On Mon, Sep 16, 2024 at 07:49:05PM +1000, Jamie Bainbridge wrote:
> Running this test on a small system produces different failures every
> test checking deletions, and some flushes. From different test runs:
> 
> TEST: Common host entries configuration tests (L2)                [FAIL]
>   Failed to delete L2 host entry
> 
> TEST: Common port group entries configuration tests (IPv4 (S, G)) [FAIL]
>   IPv4 (S, G) entry with VLAN 10 not deleted when VLAN was not specified
> 
> TEST: Common port group entries configuration tests (IPv6 (*, G)) [FAIL]
>   IPv6 (*, G) entry with VLAN 10 not deleted when VLAN was not specified
> 
> TEST: Flush tests                                                 [FAIL]
>   Entry not flushed by specified VLAN ID
> 
> TEST: Flush tests                                                 [FAIL]
>   IPv6 host entry not flushed by "nopermanent" state
> 
> Add a short sleep after deletion and flush to resolve this.

The port group entry is removed from MDB entry's list synchronously, but
the MDB entry itself is removed from the hash table asynchronously and
the MDB get query will only return an error if an entry was not found
there.

IOW, I think that when you do get a response after deletion, the entry
you get is empty.

Can you please test the following patch [1] (w/o yours, obviously)?

[1]
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index bc37e47ad829..1a52a0bca086 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1674,7 +1674,7 @@ int br_mdb_get(struct net_device *dev, struct nlattr *tb[], u32 portid, u32 seq,
        spin_lock_bh(&br->multicast_lock);
 
        mp = br_mdb_ip_get(br, &group);
-       if (!mp) {
+       if (!mp || (!mp->ports && !mp->host_joined)) {
                NL_SET_ERR_MSG_MOD(extack, "MDB entry not found");
                err = -ENOENT;
                goto unlock;
Jamie Bainbridge Sept. 20, 2024, 12:54 p.m. UTC | #3
On Fri, 20 Sept 2024 at 00:56, Ido Schimmel <idosch@nvidia.com> wrote:
>
> Hi,
>
> Thanks for the patch and sorry for the late reply (was OOO).
>
> On Mon, Sep 16, 2024 at 07:49:05PM +1000, Jamie Bainbridge wrote:
> > Running this test on a small system produces different failures every
> > test checking deletions, and some flushes. From different test runs:
> >
> > TEST: Common host entries configuration tests (L2)                [FAIL]
> >   Failed to delete L2 host entry
> >
> > TEST: Common port group entries configuration tests (IPv4 (S, G)) [FAIL]
> >   IPv4 (S, G) entry with VLAN 10 not deleted when VLAN was not specified
> >
> > TEST: Common port group entries configuration tests (IPv6 (*, G)) [FAIL]
> >   IPv6 (*, G) entry with VLAN 10 not deleted when VLAN was not specified
> >
> > TEST: Flush tests                                                 [FAIL]
> >   Entry not flushed by specified VLAN ID
> >
> > TEST: Flush tests                                                 [FAIL]
> >   IPv6 host entry not flushed by "nopermanent" state
> >
> > Add a short sleep after deletion and flush to resolve this.
>
> The port group entry is removed from MDB entry's list synchronously, but
> the MDB entry itself is removed from the hash table asynchronously and
> the MDB get query will only return an error if an entry was not found
> there.
>
> IOW, I think that when you do get a response after deletion, the entry
> you get is empty.
>
> Can you please test the following patch [1] (w/o yours, obviously)?
>
> [1]
> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
> index bc37e47ad829..1a52a0bca086 100644
> --- a/net/bridge/br_mdb.c
> +++ b/net/bridge/br_mdb.c
> @@ -1674,7 +1674,7 @@ int br_mdb_get(struct net_device *dev, struct nlattr *tb[], u32 portid, u32 seq,
>         spin_lock_bh(&br->multicast_lock);
>
>         mp = br_mdb_ip_get(br, &group);
> -       if (!mp) {
> +       if (!mp || (!mp->ports && !mp->host_joined)) {
>                 NL_SET_ERR_MSG_MOD(extack, "MDB entry not found");
>                 err = -ENOENT;
>                 goto unlock;

This works perfectly for me. Previously I would get at least 2
failures in 10. Without my patch and with the above patch, 100 tests
pass without any failure.

Many thanks for looking at this!

Jamie
diff mbox series

Patch

diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index d9d587454d207931a539f59be15cbc63d471888f..49136279973d05d0e6b14237228ab58455554bb0 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -30,6 +30,9 @@  ALL_TESTS="
 	ctrl_test
 "
 
+# time to wait for delete and flush to complete
+: "${SETTLE_DELAY:=0.1}"
+
 NUM_NETIFS=4
 source lib.sh
 source tc_common.sh
@@ -152,6 +155,7 @@  cfg_test_host_common()
 	check_fail $? "Managed to replace $name host entry"
 
 	bridge mdb del dev br0 port br0 grp $grp $state vid 10
+	sleep "$SETTLE_DELAY"
 	bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
 	check_fail $? "Failed to delete $name host entry"
 
@@ -208,6 +212,7 @@  cfg_test_port_common()
 	check_err $? "Failed to replace $name entry"
 
 	bridge mdb del dev br0 port $swp1 $grp_key permanent vid 10
+	sleep "$SETTLE_DELAY"
 	bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
 	check_fail $? "Failed to delete $name entry"
 
@@ -230,6 +235,7 @@  cfg_test_port_common()
 	check_err $? "$name entry with VLAN 20 not added when VLAN was not specified"
 
 	bridge mdb del dev br0 port $swp1 $grp_key permanent
+	sleep "$SETTLE_DELAY"
 	bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
 	check_fail $? "$name entry with VLAN 10 not deleted when VLAN was not specified"
 	bridge mdb get dev br0 $grp_key vid 20 &> /dev/null
@@ -310,6 +316,7 @@  __cfg_test_port_ip_star_g()
 	bridge -d mdb get dev br0 grp $grp src $src1 vid 10 &> /dev/null
 	check_err $? "(S, G) entry not created"
 	bridge mdb del dev br0 port $swp1 grp $grp vid 10
+	sleep "$SETTLE_DELAY"
 	bridge -d mdb get dev br0 grp $grp vid 10 &> /dev/null
 	check_fail $? "(*, G) entry not deleted"
 	bridge -d mdb get dev br0 grp $grp src $src1 vid 10 &> /dev/null
@@ -828,6 +835,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port $swp1 grp 239.1.1.8 vid 10 temp
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 	num_entries=$(bridge mdb show dev br0 | wc -l)
 	[[ $num_entries -eq 0 ]]
 	check_err $? 0 "Not all entries flushed after flush all"
@@ -840,6 +848,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port br0 grp 239.1.1.1 vid 10
 
 	bridge mdb flush dev br0 port $swp1
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
 	check_fail $? "Entry not flushed by specified port"
@@ -849,11 +858,13 @@  cfg_test_flush()
 	check_err $? "Host entry flushed by wrong port"
 
 	bridge mdb flush dev br0 port br0
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port br0"
 	check_fail $? "Host entry not flushed by specified port"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that when flushing by VLAN ID only entries programmed with the
 	# specified VLAN ID are flushed and the rest are not.
@@ -864,6 +875,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port $swp2 grp 239.1.1.1 vid 20
 
 	bridge mdb flush dev br0 vid 10
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 &> /dev/null
 	check_fail $? "Entry not flushed by specified VLAN ID"
@@ -871,6 +883,7 @@  cfg_test_flush()
 	check_err $? "Entry flushed by wrong VLAN ID"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that all permanent entries are flushed when "permanent" is
 	# specified and that temporary entries are not.
@@ -879,6 +892,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port $swp2 grp 239.1.1.1 temp vid 10
 
 	bridge mdb flush dev br0 permanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
 	check_fail $? "Entry not flushed by \"permanent\" state"
@@ -886,6 +900,7 @@  cfg_test_flush()
 	check_err $? "Entry flushed by wrong state (\"permanent\")"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that all temporary entries are flushed when "nopermanent" is
 	# specified and that permanent entries are not.
@@ -894,6 +909,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port $swp2 grp 239.1.1.1 temp vid 10
 
 	bridge mdb flush dev br0 nopermanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
 	check_err $? "Entry flushed by wrong state (\"nopermanent\")"
@@ -901,6 +917,7 @@  cfg_test_flush()
 	check_fail $? "Entry not flushed by \"nopermanent\" state"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that L2 host entries are not flushed when "nopermanent" is
 	# specified, but flushed when "permanent" is specified.
@@ -908,16 +925,19 @@  cfg_test_flush()
 	bridge mdb add dev br0 port br0 grp 01:02:03:04:05:06 permanent vid 10
 
 	bridge mdb flush dev br0 nopermanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 01:02:03:04:05:06 vid 10 &> /dev/null
 	check_err $? "L2 host entry flushed by wrong state (\"nopermanent\")"
 
 	bridge mdb flush dev br0 permanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 01:02:03:04:05:06 vid 10 &> /dev/null
 	check_fail $? "L2 host entry not flushed by \"permanent\" state"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that IPv4 host entries are not flushed when "permanent" is
 	# specified, but flushed when "nopermanent" is specified.
@@ -925,16 +945,19 @@  cfg_test_flush()
 	bridge mdb add dev br0 port br0 grp 239.1.1.1 temp vid 10
 
 	bridge mdb flush dev br0 permanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 &> /dev/null
 	check_err $? "IPv4 host entry flushed by wrong state (\"permanent\")"
 
 	bridge mdb flush dev br0 nopermanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 &> /dev/null
 	check_fail $? "IPv4 host entry not flushed by \"nopermanent\" state"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that IPv6 host entries are not flushed when "permanent" is
 	# specified, but flushed when "nopermanent" is specified.
@@ -942,16 +965,19 @@  cfg_test_flush()
 	bridge mdb add dev br0 port br0 grp ff0e::1 temp vid 10
 
 	bridge mdb flush dev br0 permanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp ff0e::1 vid 10 &> /dev/null
 	check_err $? "IPv6 host entry flushed by wrong state (\"permanent\")"
 
 	bridge mdb flush dev br0 nopermanent
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp ff0e::1 vid 10 &> /dev/null
 	check_fail $? "IPv6 host entry not flushed by \"nopermanent\" state"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Check that when flushing by routing protocol only entries programmed
 	# with the specified routing protocol are flushed and the rest are not.
@@ -961,6 +987,7 @@  cfg_test_flush()
 	bridge mdb add dev br0 port br0 grp 239.1.1.1 vid 10
 
 	bridge mdb flush dev br0 proto bgp
+	sleep "$SETTLE_DELAY"
 
 	bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
 	check_fail $? "Entry not flushed by specified routing protocol"
@@ -970,6 +997,7 @@  cfg_test_flush()
 	check_err $? "Host entry flushed by wrong routing protocol"
 
 	bridge mdb flush dev br0
+	sleep "$SETTLE_DELAY"
 
 	# Test that an error is returned when trying to flush using unsupported
 	# parameters.