Message ID | 20210913144300.1265143-6-linus.walleij@linaro.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | RTL8366(RB) cleanups part 1 | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 8 of 8 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 20 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | Link |
On Mon, Sep 13, 2021 at 04:42:57PM +0200, Linus Walleij wrote: > I have to disable this feature to have working VLANs on the > RTL8366RB at least, probably on all of them. > > It appears that the very custom VLAN set-up was using this > feature by setting up one VLAN per port for a reason: when > using "4K" VLAN, every frame transmitted by the switch > MUST have a VLAN tag. > > This is the reason that every port had its own VLAN, > including the CPU port, and all of them had PVID turned on: > this way every frame going in or out of the switch will > indeed have a VLAN tag. > > However the way Linux userspace like to use VLANs such as > by default assigning all ports on a bridge to the same VLAN > this does not work at all because PVID is not set for these, > and all packets get lost. > > Therefore we have to do with 16 VLAN for now, the "4K" > 4096 VLAN feature is clearly only for switches in > environments where everything is a VLAN. > > This was discovered when testing with OpenWrt that join > the LAN ports lan0 ... lan3 into a bridge and then assign > each of them into VLAN 1 with PVID set on each port: without > this patch this will not work and the bridge goes numb. It is important to explain _why_ the switch will go "numb" and not pass packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It is certainly not expected for that to happen. The purpose of the PVID feature is specifically to classify untagged packets to a port-based VLAN ID. So "everything is a VLAN" even for Linux user space, not sure what you're talking about. When the Linux bridge has the vlan_filtering attribute set to 1, the hardware should follow suit by making untagged packets get classified to the VLAN ID that the software bridge wants to see, on the ports that are members of that bridge. When the Linux bridge has the vlan_filtering attribute set to 0, the software bridge very much ignores any VLAN tags from packets, and does not perform any VLAN-based ingress admission checks. If the hardware classifies all packets to a VLAN even when VLAN "filtering" (i.e. ingress dropping on mismatch) is disabled, that is perfectly fine too, although the software bridge doesn't care. You need to set up a private VLAN ID for your VLAN-unaware ports, and make it the pvid on those ports, and somehow force the hardware to classify any packet towards that pvid on those VLAN-unaware ports, regardless of whether the packets are untagged or 802.1Q-tagged or 802.1ad-tagged or whatever. That is simply the way things are supposed to work. VLAN ID 0 and 4095 are good candidates to use privately within your driver as the pvid on VLAN-unaware ports, and you can/must manually bring up these VLANs, since the bridge will refuse to install these VLANs in its database. Other VLAN IDs like the range 4000-4094 are also potentially ok as long as you document the fact that your driver crops that range out of the usable range of the bridge, and you make sure that no packet leaks inside or outside of those private VLANs are possible ("attackers" could still try to send a packet tagged with VLAN ID 4094 towards a port that is under a VLAN-aware bridge. Since that port is VLAN-aware, it will recognize the VLAN ID as 4094, so unless you configure that port to drop VLAN ID 4094, it might well leak into the VLAN domain 4094 which is privately used by your driver to ensure VLAN-unaware forwarding between the ports of a nearby VLAN-unaware bridge. I know there are lots of things to think about, but this patch is way too simplistic and does not really offer solid explanations.
Hi Vladimir, first, thanks for your help and patience. I learned a lot the recent weeks, much thanks to your questions and explanations! On Mon, Sep 13, 2021 at 5:34 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > This was discovered when testing with OpenWrt that join > > the LAN ports lan0 ... lan3 into a bridge and then assign > > each of them into VLAN 1 with PVID set on each port: without > > this patch this will not work and the bridge goes numb. > > It is important to explain _why_ the switch will go "numb" and not pass > packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It > is certainly not expected for that to happen. Yeah it is pretty weird. What happens now is that this is a regression when using OpenWrt userspace as it sets up the VLANs like this, but if I boot a clean system and just manually do e.g. ifconfig lan0 169.254.1.2 netmask 255.255.255.0 up it works fine because the default VLANs that were set up by the driver (removed by patch 2/8) will tag all packets using PVID and send packets on 5 ingress and 1 egress VLANs. > The purpose of the PVID feature is specifically to classify untagged > packets to a port-based VLAN ID. So "everything is a VLAN" even for > Linux user space, not sure what you're talking about. I think what happens is that OpenWrts userspace sets VLAN 1 for all ingress ports with PVID, so all packets from ingress ports get tagged nicely with VID 1. But as the CPU port is hidden inside the bridge it can't join the CPU port into that VLAN (userspace does not know it exist I think?) and thus no packets can go into or out of the CPU port. But you can still pass packets between the lan ports. > When the Linux bridge has the vlan_filtering attribute set to 1, the > hardware should follow suit by making untagged packets get classified to > the VLAN ID that the software bridge wants to see, on the ports that are > members of that bridge. This is what it does, I think. But the "4K" VLAN feature is so strict that it will restrict also the CPU port from this (in hardware) with no way to turn it off. It seems the "4K" mode is a "VLAN with filtering only mode" so no matter whether we turned on filtering or not, the CPU port will not see any packets from any other ports unless we add also that port (port 5) into the VLAN. One solution I could try would be to just add the CPU port to all VLANs by default, but .. is that right? I suppose this would work as software will add the right VID to the packets so they will only propagate to the right ports anyway. It could test it. > When the Linux bridge has the vlan_filtering attribute set to 0, the > software bridge very much ignores any VLAN tags from packets, and does > not perform any VLAN-based ingress admission checks. If the hardware > classifies all packets to a VLAN even when VLAN "filtering" (i.e. > ingress dropping on mismatch) is disabled, that is perfectly fine too, I think this is what happens in this hardware. > although the software bridge doesn't care. You need to set up a private > VLAN ID for your VLAN-unaware ports, and make it the pvid on those ports, Would the CPU port be a VLAN-unaware port? My problem is that in the "4K" mode, the CPU port will not see packets from any VLAN it is not a member of. > and somehow force the hardware to classify any packet towards that pvid > on those VLAN-unaware ports, regardless of whether the packets are > untagged or 802.1Q-tagged or 802.1ad-tagged or whatever. That is simply > the way things are supposed to work. > > VLAN ID 0 and 4095 are good candidates to use privately within your > driver as the pvid on VLAN-unaware ports, and you can/must manually > bring up these VLANs, since the bridge will refuse to install these > VLANs in its database. > > Other VLAN IDs like the range 4000-4094 are also potentially ok as long > as you document the fact that your driver crops that range out of the > usable range of the bridge, and you make sure that no packet leaks > inside or outside of those private VLANs are possible ("attackers" could > still try to send a packet tagged with VLAN ID 4094 towards a port that > is under a VLAN-aware bridge. Since that port is VLAN-aware, it will > recognize the VLAN ID as 4094, so unless you configure that port to drop > VLAN ID 4094, it might well leak into the VLAN domain 4094 which is > privately used by your driver to ensure VLAN-unaware forwarding between > the ports of a nearby VLAN-unaware bridge. I don't have any VLAN-unaware ports other than the CPU port. What happened before patch 2/8 was that it was given its own VLAN with PVID and all other ports were assigned members of that VLAN as well, and thus egress traffic from the CPU port could go out. For ingress traffic, the CPU port was member of all ingress VLANs. (Also removed by patch 2/8) > I know there are lots of things to think about, but this patch is way > too simplistic and does not really offer solid explanations. I'm trying my best and learning along the way :) It's fine, no hurry and let's get this right so that the other RTL switches can look at this a good example. Yours, Linus Walleij
Hi Linus, On Tue, Sep 14, 2021 at 01:20:14AM +0200, Linus Walleij wrote: > Hi Vladimir, > > first, thanks for your help and patience. I learned a lot the recent > weeks, much thanks to your questions and explanations! > > On Mon, Sep 13, 2021 at 5:34 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > This was discovered when testing with OpenWrt that join > > > the LAN ports lan0 ... lan3 into a bridge and then assign > > > each of them into VLAN 1 with PVID set on each port: without > > > this patch this will not work and the bridge goes numb. > > > > It is important to explain _why_ the switch will go "numb" and not pass > > packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It > > is certainly not expected for that to happen. > > Yeah it is pretty weird. What happens now is that this is a regression > when using OpenWrt userspace as it sets up the VLANs like this, Were you running net-next kernel? There have been major changes to DSA since 5.10, so you'd better test your driver on net-next. > but if I boot a clean system and just manually do e.g. > ifconfig lan0 169.254.1.2 netmask 255.255.255.0 up > it works fine because the default VLANs that were set up by the > driver (removed by patch 2/8) will tag all packets using PVID and > send packets on 5 ingress and 1 egress VLANs. > > > The purpose of the PVID feature is specifically to classify untagged > > packets to a port-based VLAN ID. So "everything is a VLAN" even for > > Linux user space, not sure what you're talking about. > > I think what happens is that OpenWrts userspace sets VLAN 1 > for all ingress ports with PVID, so all packets from ingress ports > get tagged nicely with VID 1. > > But as the CPU port is hidden inside the bridge > it can't join the CPU port into that VLAN (userspace does not > know it exist I think?) and thus no packets > can go into or out of the CPU port. But you can still pass packets > between the lan ports. > > > When the Linux bridge has the vlan_filtering attribute set to 1, the > > hardware should follow suit by making untagged packets get classified to > > the VLAN ID that the software bridge wants to see, on the ports that are > > members of that bridge. > > This is what it does, I think. > > But the "4K" VLAN feature is so strict that it will restrict also the CPU > port from this (in hardware) with no way to turn it off. > > It seems the "4K" mode is a "VLAN with filtering only mode" so no > matter whether we turned on filtering or not, the CPU port > will not see any packets from any other ports unless we add also > that port (port 5) into the VLAN. > > One solution I could try would be to just add the CPU port to all > VLANs by default, but .. is that right? The DSA core already adds the CPU port to VLAN members for you. In file net/dsa/slave.c, function dsa_slave_vlan_add: ... err = dsa_port_vlan_add(dp, &vlan, extack); if (err) return err; /* We need the dedicated CPU port to be a member of the VLAN as well. * Even though drivers often handle CPU membership in special ways, * it doesn't make sense to program a PVID, so clear this flag. */ vlan.flags &= ~BRIDGE_VLAN_INFO_PVID; err = dsa_port_vlan_add(dp->cpu_dp, &vlan, extack); if (err) return err; ... If it does not work, you may have done something wrong.
On Tue, Sep 14, 2021 at 8:29 AM DENG Qingfang <dqfext@gmail.com> wrote: > > Yeah it is pretty weird. What happens now is that this is a regression > > when using OpenWrt userspace as it sets up the VLANs like this, > > Were you running net-next kernel? Yep just the userspace is from OpenWrt, the kernel is latest mainline. > The DSA core already adds the CPU port to VLAN members for you. > In file net/dsa/slave.c, function dsa_slave_vlan_add: Hmmmmm I will look into this. Put in some debug prints etc. Certainly the callbacks to the driver for doing this aren't getting called even with v5.15-rc1 AFAICT. Yours, Linus Walleij
diff --git a/drivers/net/dsa/rtl8366.c b/drivers/net/dsa/rtl8366.c index fae14c448fe4..9652323167c2 100644 --- a/drivers/net/dsa/rtl8366.c +++ b/drivers/net/dsa/rtl8366.c @@ -313,13 +313,15 @@ int rtl8366_vlan_add(struct dsa_switch *ds, int port, untagged = true; - /* Enable VLAN in the hardware - * FIXME: what's with this 4k business? - * Just rtl8366_enable_vlan() seems inconclusive. + /* Enable VLAN in the hardware, do NOT enable VLAN4K, because the + * 4K VLAN will activate a 4096 entries VID table, but has the side + * effect that every processed frame MUST have a VID, meaning non-VLAN + * traffic will now work at all. So we will let the 16 VLAN entries + * suffice. */ - ret = rtl8366_enable_vlan4k(smi, true); + ret = rtl8366_enable_vlan(smi, true); if (ret) { - NL_SET_ERR_MSG_MOD(extack, "Failed to enable VLAN 4K"); + NL_SET_ERR_MSG_MOD(extack, "Failed to enable VLAN"); return ret; }
I have to disable this feature to have working VLANs on the RTL8366RB at least, probably on all of them. It appears that the very custom VLAN set-up was using this feature by setting up one VLAN per port for a reason: when using "4K" VLAN, every frame transmitted by the switch MUST have a VLAN tag. This is the reason that every port had its own VLAN, including the CPU port, and all of them had PVID turned on: this way every frame going in or out of the switch will indeed have a VLAN tag. However the way Linux userspace like to use VLANs such as by default assigning all ports on a bridge to the same VLAN this does not work at all because PVID is not set for these, and all packets get lost. Therefore we have to do with 16 VLAN for now, the "4K" 4096 VLAN feature is clearly only for switches in environments where everything is a VLAN. This was discovered when testing with OpenWrt that join the LAN ports lan0 ... lan3 into a bridge and then assign each of them into VLAN 1 with PVID set on each port: without this patch this will not work and the bridge goes numb. Cc: Vladimir Oltean <olteanv@gmail.com> Cc: Mauri Sandberg <sandberg@mailfence.com> Cc: Alvin Šipraga <alsi@bang-olufsen.dk> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: DENG Qingfang <dqfext@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> --- ChangeLog v1->v4: - New patch after discovering that the VLAN configuration in OpenWrt was not working. --- drivers/net/dsa/rtl8366.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)