Message ID | 20241215163334.615427-1-robert.hodaszi@digi.com (mailing list archive) |
---|---|
Headers | show |
Series | net: dsa: felix: fix VLAN-unaware reception | expand |
On Sun, Dec 15, 2024 at 05:33:32PM +0100, Robert Hodaszi wrote: > Hello, > > This patchset supposed to fix the currently non-working ocelot-8021q > mode of the Felix driver if bridge is VLAN-unaware. > > As can see in the commit messages, the driver enables > 'untag_vlan_aware_bridge_pvid' to software VLAN untag all packets, but > tagging is only enabled if VLAN-filtering is enabled > (push_inner_tag=1). > > Untagging packets from VLAN-unaware bridge ports is wrong, and corrupts > the packets. > > It was tempting to simply restore dsa_software_vlan_untag()'s checking > as it was before: > > /* Move VLAN tag from data to hwaccel ** > if (!skb_vlan_tag_present(skb) && skb->protocol == htons(proto)) { > skb = skb_vlan_untag(skb); > if (!skb) > return NULL; > } > > And so untagging only VLAN packets, but that's not really a solution, > VLAN-tagged packets may arrive on VLAN-unaware ports, and those would > get untagged incorrectly. > > So I added a way to mark ports as untagged when untagging is enabled, > and return without altering the packet. Give me an example traffic pattern, Linux configuration and corruption, please. I spent a lot of time trying to make sure I am not introducing regressions, and I have no idea what you are seeing that is wrong. Please don't try to make assumptions, just let me see what you see.
On Sunday, 15.12.2024 at 18:09 +0100, Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > Give me an example traffic pattern, Linux configuration and corruption, > please. I spent a lot of time trying to make sure I am not introducing > regressions, and I have no idea what you are seeing that is wrong. > Please don't try to make assumptions, just let me see what you see. The config I'm using: - Using the 2.5Gbps as CPU port in 'ocelot-8021q' mode, Linux interface name is 'eth0' - Using 2 downstream ports as external Ethernet ports: 'eth1' and 'eth2' - 'eth1' port of the device is directly connected with my PC (Ethernet interface #1, 192.168.1.1) - 'eth2' port of the device is directly connected with my PC (Ethernet interface #2, 192.168.2.1) DTS: &mscc_felix_port0 { label = "eth1"; managed = "in-band-status"; phy-handle = <&qsgmii_phy0>; phy-mode = "qsgmii"; status = "okay"; }; &mscc_felix_port1 { label = "eth2"; managed = "in-band-status"; phy-handle = <&qsgmii_phy1>; phy-mode = "qsgmii"; status = "okay"; }; &mscc_felix_port4 { ethernet = <&enetc_port2>; status = "okay"; dsa-tag-protocol = "ocelot-8021q"; }; LS1028 unit's Linux config: # Static IP to 'eth1' $ ifconfig eth1 192.168.1.2 up # Create a VLAN-unaware bridge, and add 'eth2' to that $ brctl addbr br0 $ brctl addif br0 eth2 # Set static IP to the bridge $ ifconfig br0 192.168.2.2 up $ ifconfig eth2 up Now at this point: 1. I can ping perfectly fine the eth1 interface from my PC ("ping 192.168.1.2"), and vice-versa 2. Pinging 'br0' from my PC is not working ("ping 192.168.2.2"). I can see the ARP requests, but there are not ARP replies at all. If I enable VLAN-filtering on 'br0', it starts working: $ echo 1 > /sys/class/net/br0/bridge/vlan_filtering So basically: 1. Raw interface -> working 2. VLAN-aware bridge -> working 3. VLAN-unaware bridge -> NOT working I traced what is happening. When VLAN-filtering is not enabled on the bridge, LS1028's switch is configured with 'push_inner_tag = OCELOT_NO_ES0_TAG'. But ds->untag_vlan_aware_bridge_pvid is always set to true at switch setup, in felix_tag_8021q_setup(). That makes dsa_switch_rcv() call dsa_software_vlan_untag() for each packets. Now in dsa_software_vlan_untag(), if the port is not part of the bridge (case #1), it returns with the skb early. That's OK. static inline struct sk_buff *dsa_software_vlan_untag(struct sk_buff *skb) { struct dsa_port *dp = dsa_user_to_port(skb->dev); struct net_device *br = dsa_port_bridge_dev_get(dp); u16 vid; /* software untagging for standalone ports not yet necessary */ if (!br) return skb; But if port is part of a bridge, no matter "push_inner_tag" is set as OCELOT_ES0_TAG or OCELOT_NO_ES0_TAG, it always untags it: /* Move VLAN tag from data to hwaccel */ if (!skb_vlan_tag_present(skb)) { skb = skb_vlan_untag(skb); if (!skb) return NULL; } As the "untag_vlan_aware_bridge_pvid" is a switch-specific thing, not port-specific, I cannot change it to false/true depending on the port is added to a VLAN-unaware/aware bridge, as the other port may be added to another bridge (eth1 -> VLAN-aware (tags enabled), eth2 -> VLAN-unaware (tags disabled)). Also, in the past this code part looked like this: /* Move VLAN tag from data to hwaccel */ if (!skb_vlan_tag_present(skb) && skb->protocol == htons(proto)) { skb = skb_vlan_untag(skb); if (!skb) return NULL; } So we had a protocol check. This wouldn't work 100% neither, because what if a VLAN packet arrives from the outer world into a VLAN-unaware bridge? I assume, that shouldn't be untagged, still, it would do that. I'm not that happy with my patch though, as I had to add another flag for each ports. But that seems to be the "cleanest" solution. That's why as marked it as RFC. Thanks, Robert
On Mon, Dec 16, 2024 at 11:10:05AM +0100, Robert Hodaszi wrote: > On Sunday, 15.12.2024 at 18:09 +0100, Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > > > Give me an example traffic pattern, Linux configuration and corruption, > > please. I spent a lot of time trying to make sure I am not introducing > > regressions, and I have no idea what you are seeing that is wrong. > > Please don't try to make assumptions, just let me see what you see. > > The config I'm using: > - Using the 2.5Gbps as CPU port in 'ocelot-8021q' mode, Linux interface name is 'eth0' > - Using 2 downstream ports as external Ethernet ports: 'eth1' and 'eth2' > - 'eth1' port of the device is directly connected with my PC (Ethernet interface #1, 192.168.1.1) > - 'eth2' port of the device is directly connected with my PC (Ethernet interface #2, 192.168.2.1) > > DTS: > > &mscc_felix_port0 { > label = "eth1"; > managed = "in-band-status"; > phy-handle = <&qsgmii_phy0>; > phy-mode = "qsgmii"; > status = "okay"; > }; > > &mscc_felix_port1 { > label = "eth2"; > managed = "in-band-status"; > phy-handle = <&qsgmii_phy1>; > phy-mode = "qsgmii"; > status = "okay"; > }; > > &mscc_felix_port4 { > ethernet = <&enetc_port2>; > status = "okay"; > dsa-tag-protocol = "ocelot-8021q"; > }; > > LS1028 unit's Linux config: > > # Static IP to 'eth1' > $ ifconfig eth1 192.168.1.2 up > > # Create a VLAN-unaware bridge, and add 'eth2' to that > $ brctl addbr br0 > $ brctl addif br0 eth2 > > # Set static IP to the bridge > $ ifconfig br0 192.168.2.2 up > $ ifconfig eth2 up > > Now at this point: > > 1. I can ping perfectly fine the eth1 interface from my PC ("ping 192.168.1.2"), and vice-versa > 2. Pinging 'br0' from my PC is not working ("ping 192.168.2.2"). I can see the ARP requests, but there are not ARP replies at all. > > If I enable VLAN-filtering on 'br0', it starts working: > > $ echo 1 > /sys/class/net/br0/bridge/vlan_filtering > > > So basically: > > 1. Raw interface -> working > 2. VLAN-aware bridge -> working > 3. VLAN-unaware bridge -> NOT working > > I traced what is happening. When VLAN-filtering is not enabled on the bridge, LS1028's switch is configured with 'push_inner_tag = OCELOT_NO_ES0_TAG'. But ds->untag_vlan_aware_bridge_pvid is always set to true at switch setup, in felix_tag_8021q_setup(). That makes dsa_switch_rcv() call dsa_software_vlan_untag() for each packets. > > > Now in dsa_software_vlan_untag(), if the port is not part of the bridge (case #1), it returns with the skb early. That's OK. > > > static inline struct sk_buff *dsa_software_vlan_untag(struct sk_buff *skb) > { > struct dsa_port *dp = dsa_user_to_port(skb->dev); > struct net_device *br = dsa_port_bridge_dev_get(dp); > u16 vid; > > /* software untagging for standalone ports not yet necessary */ > if (!br) > return skb; > > > But if port is part of a bridge, no matter "push_inner_tag" is set as OCELOT_ES0_TAG or OCELOT_NO_ES0_TAG, it always untags it: > > /* Move VLAN tag from data to hwaccel */ > if (!skb_vlan_tag_present(skb)) { > skb = skb_vlan_untag(skb); > if (!skb) > return NULL; > } > > As the "untag_vlan_aware_bridge_pvid" is a switch-specific thing, not port-specific, I cannot change it to false/true depending on the port is added to a VLAN-unaware/aware bridge, as the other port may be added to another bridge (eth1 -> VLAN-aware (tags enabled), eth2 -> VLAN-unaware (tags disabled)). > > Also, in the past this code part looked like this: > > /* Move VLAN tag from data to hwaccel */ > if (!skb_vlan_tag_present(skb) && skb->protocol == htons(proto)) { > skb = skb_vlan_untag(skb); > if (!skb) > return NULL; > } > > So we had a protocol check. This wouldn't work 100% neither, because what if a VLAN packet arrives from the outer world into a VLAN-unaware bridge? I assume, that shouldn't be untagged, still, it would do that. > > > I'm not that happy with my patch though, as I had to add another flag for each ports. But that seems to be the "cleanest" solution. That's why as marked it as RFC. > > Thanks, > Robert The memory is starting to come back :-| Ok, so the good news is that you aren't seeing things, I can reproduce with tools/testing/selftests/drivers/net/dsa/local_termination.sh. Another good thing is that the fix is easier than your posted attempt. You've correctly identified the previous VLAN stripping logic, and that is what we should go forward with. I don't agree with your analysis that it wouldn't work, because if you look at the implementation of skb_vlan_untag(), it strips the VLAN header from the skb head, but still keeps it in the hwaccel area, so packets are still VLAN-tagged. This does not have a functional impact upon reception, it is just done to have unified handling later on in the function: skb_vlan_tag_present() and skb_vlan_tag_get_id(). This side effect is also mentioned as a comment on dsa_software_vlan_untag(). The stripping itself will only take place in dsa_software_untag_vlan_unaware_bridge() if the switch driver sets dp->ds->untag_bridge_pvid. The felix driver does not set this. What is not so good is that I'm seriously starting to doubt my sanity. You'd think that I ran the selftests that I had posted together with the patch introducing the bug, but somehow they fail :-| And not only that, but thoughts about this problem itself have since passed through my head, and I failed to correctly identify where the problem applies and where it does not. I'm sorry for that. I've just posted a fix to this bug, which I would like you to double-check and respond with review and test tags, or let me know if it doesn't work. https://lore.kernel.org/netdev/20241216135059.1258266-1-vladimir.oltean@nxp.com/ I posted it myself because I don't expect you to have the full context (it's a bug that I introduced), and with yours there are still a lot of unanswered "why"s, as well as not the simplest solution.
On Monday, 16.12.2024 at 14:51 +0100, Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > The memory is starting to come back :-| > > Ok, so the good news is that you aren't seeing things, I can reproduce > with tools/testing/selftests/drivers/net/dsa/https://linkprotect.cudasvc.com/url?a=https%3a%2f%2flocal_termination.sh&c=E,1,Ez0FDT_USqvD0083KxZU7x7ffGuJDoeCC6xMtetczJrwErBfCEyO1pnImOnY_ifhHDMKhhPtJGv8MpKk1zKoqa6Gm1JP-zTkotP2AOShxr3N&typo=1 > > Another good thing is that the fix is easier than your posted attempt. > You've correctly identified the previous VLAN stripping logic, and that > is what we should go forward with. I don't agree with your analysis that > it wouldn't work, because if you look at the implementation of > skb_vlan_untag(), it strips the VLAN header from the skb head, but still > keeps it in the hwaccel area, so packets are still VLAN-tagged. > > This does not have a functional impact upon reception, it is just done > to have unified handling later on in the function: > skb_vlan_tag_present() and skb_vlan_tag_get_id(). This side effect is > also mentioned as a comment on dsa_software_vlan_untag(). > > The stripping itself will only take place in dsa_software_untag_vlan_unaware_bridge() > if the switch driver sets dp->ds->untag_bridge_pvid. The felix driver > does not set this. > > What is not so good is that I'm seriously starting to doubt my sanity. > You'd think that I ran the selftests that I had posted together with the > patch introducing the bug, but somehow they fail :-| And not only that, > but thoughts about this problem itself have since passed through my head, > and I failed to correctly identify where the problem applies and where > it does not. I'm sorry for that. > > I've just posted a fix to this bug, which I would like you to double-check > and respond with review and test tags, or let me know if it doesn't work. > https://linkprotect.cudasvc.com/url?a=https%3a%2f%2flore.kernel.org%2fnetdev%2f20241216135059.1258266-1-vladimir.oltean%40nxp.com%2f&c=E,1,iMsl_DfLMdZXF3FfFIT1CISQcjOL417WIsr7z01GodEy-1vyX9d_6X-8hFJih2CA2zAax4kx2mFdtftzn-ELRkGDBCa9lxIWU_wEN8dtO2aVO7NS7ck,&typo=1 > I posted it myself because I don't expect you to have the full context > (it's a bug that I introduced), and with yours there are still a lot of > unanswered "why"s, as well as not the simplest solution. Actually, what you did is exactly what I did first to fix the issue, but it broke my setup when I sent VLAN-tagged messages to the device. Now I tested again, and it is working fine. That made me think it's happening because it is stripping incorrectly the VLAN tag. Probably it was just an incorrect setup, maybe something remained set either on my PC or on the unit from the previous test. One thing is different to my change though: you're calling the br_vlan_get_proto() twice. You can tweak performance a bit probably, if you rather pass 'proto' to both dsa_software_untag_vlan_aware_bridge and dsa_software_untag_vlan_unaware_bridge instead. So something like this: diff --git a/net/dsa/tag.h b/net/dsa/tag.h index d5707870906b..3d790d8e16cd 100644 --- a/net/dsa/tag.h +++ b/net/dsa/tag.h @@ -57,15 +57,11 @@ static inline struct net_device *dsa_conduit_find_user(struct net_device *dev, */ static inline void dsa_software_untag_vlan_aware_bridge(struct sk_buff *skb, struct net_device *br, - u16 vid) + u16 vid, u16 proto) { - u16 pvid, proto; + u16 pvid; int err; - err = br_vlan_get_proto(br, &proto); - if (err) - return; - err = br_vlan_get_pvid_rcu(skb->dev, &pvid); if (err) return; @@ -103,16 +99,12 @@ static inline void dsa_software_untag_vlan_aware_bridge(struct sk_buff *skb, */ static inline void dsa_software_untag_vlan_unaware_bridge(struct sk_buff *skb, struct net_device *br, - u16 vid) + u16 vid, u16 proto) { struct net_device *upper_dev; - u16 pvid, proto; + u16 pvid; int err; - err = br_vlan_get_proto(br, &proto); - if (err) - return; - err = br_vlan_get_pvid_rcu(skb->dev, &pvid); if (err) return; @@ -149,14 +141,19 @@ static inline struct sk_buff *dsa_software_vlan_untag(struct sk_buff *skb) { struct dsa_port *dp = dsa_user_to_port(skb->dev); struct net_device *br = dsa_port_bridge_dev_get(dp); - u16 vid; + u16 vid, proto; + int err; /* software untagging for standalone ports not yet necessary */ if (!br) return skb; + err = br_vlan_get_proto(br, &proto); + if (err) + return skb; + /* Move VLAN tag from data to hwaccel */ - if (!skb_vlan_tag_present(skb)) { + if (!skb_vlan_tag_present(skb) && skb->protocol == htons(proto)) { skb = skb_vlan_untag(skb); if (!skb) return NULL; @@ -169,10 +166,12 @@ static inline struct sk_buff *dsa_software_vlan_untag(struct sk_buff *skb) if (br_vlan_enabled(br)) { if (dp->ds->untag_vlan_aware_bridge_pvid) - dsa_software_untag_vlan_aware_bridge(skb, br, vid); + dsa_software_untag_vlan_aware_bridge(skb, br, vid, + proto); } else { if (dp->ds->untag_bridge_pvid) - dsa_software_untag_vlan_unaware_bridge(skb, br, vid); + dsa_software_untag_vlan_unaware_bridge(skb, br, vid, + proto); } return skb; Robert
On Mon, Dec 16, 2024 at 03:39:17PM +0100, Robert Hodaszi wrote: > Actually, what you did is exactly what I did first to fix the issue, but it broke my setup when I sent VLAN-tagged messages to the device. Now I tested again, and it is working fine. That made me think it's happening because it is stripping incorrectly the VLAN tag. Probably it was just an incorrect setup, maybe something remained set either on my PC or on the unit from the previous test. > > One thing is different to my change though: you're calling the br_vlan_get_proto() twice. You can tweak performance a bit probably, if you rather pass 'proto' to both dsa_software_untag_vlan_aware_bridge and dsa_software_untag_vlan_unaware_bridge instead. So something like this: The patch is going to become stuffy, but ok. We also have to update the kernel-doc of the 2 untagging functions to document the new argument. I will send a v2 tomorrow after the expiry of the 24 hour timeout for other review comments.
On Mon, Dec 16, 2024 at 04:48:31PM +0200, Vladimir Oltean wrote: > On Mon, Dec 16, 2024 at 03:39:17PM +0100, Robert Hodaszi wrote: > > Actually, what you did is exactly what I did first to fix the issue, but it broke my setup when I sent VLAN-tagged messages to the device. Now I tested again, and it is working fine. That made me think it's happening because it is stripping incorrectly the VLAN tag. Probably it was just an incorrect setup, maybe something remained set either on my PC or on the unit from the previous test. > > > > One thing is different to my change though: you're calling the br_vlan_get_proto() twice. You can tweak performance a bit probably, if you rather pass 'proto' to both dsa_software_untag_vlan_aware_bridge and dsa_software_untag_vlan_unaware_bridge instead. So something like this: > > The patch is going to become stuffy, but ok. We also have to update the > kernel-doc of the 2 untagging functions to document the new argument. > I will send a v2 tomorrow after the expiry of the 24 hour timeout for > other review comments. Hi Vladimir, As you are touching both Kernel doc and dsa_software_vlan_untag, could you consider also adding a "Returns: " section to the Kernel doc of that function?