diff mbox series

[net-next,5/8] net: dsa: rtl8366: Disable "4K" VLANs

Message ID 20210913144300.1265143-6-linus.walleij@linaro.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series RTL8366(RB) cleanups part 1 | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link

Commit Message

Linus Walleij Sept. 13, 2021, 2:42 p.m. UTC
I have to disable this feature to have working VLANs on the
RTL8366RB at least, probably on all of them.

It appears that the very custom VLAN set-up was using this
feature by setting up one VLAN per port for a reason: when
using "4K" VLAN, every frame transmitted by the switch
MUST have a VLAN tag.

This is the reason that every port had its own VLAN,
including the CPU port, and all of them had PVID turned on:
this way every frame going in or out of the switch will
indeed have a VLAN tag.

However the way Linux userspace like to use VLANs such as
by default assigning all ports on a bridge to the same VLAN
this does not work at all because PVID is not set for these,
and all packets get lost.

Therefore we have to do with 16 VLAN for now, the "4K"
4096 VLAN feature is clearly only for switches in
environments where everything is a VLAN.

This was discovered when testing with OpenWrt that join
the LAN ports lan0 ... lan3 into a bridge and then assign
each of them into VLAN 1 with PVID set on each port: without
this patch this will not work and the bridge goes numb.

Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Mauri Sandberg <sandberg@mailfence.com>
Cc: Alvin Šipraga <alsi@bang-olufsen.dk>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v1->v4:
- New patch after discovering that the VLAN configuration in
  OpenWrt was not working.
---
 drivers/net/dsa/rtl8366.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Vladimir Oltean Sept. 13, 2021, 3:34 p.m. UTC | #1
On Mon, Sep 13, 2021 at 04:42:57PM +0200, Linus Walleij wrote:
> I have to disable this feature to have working VLANs on the
> RTL8366RB at least, probably on all of them.
> 
> It appears that the very custom VLAN set-up was using this
> feature by setting up one VLAN per port for a reason: when
> using "4K" VLAN, every frame transmitted by the switch
> MUST have a VLAN tag.
> 
> This is the reason that every port had its own VLAN,
> including the CPU port, and all of them had PVID turned on:
> this way every frame going in or out of the switch will
> indeed have a VLAN tag.
> 
> However the way Linux userspace like to use VLANs such as
> by default assigning all ports on a bridge to the same VLAN
> this does not work at all because PVID is not set for these,
> and all packets get lost.
> 
> Therefore we have to do with 16 VLAN for now, the "4K"
> 4096 VLAN feature is clearly only for switches in
> environments where everything is a VLAN.
> 
> This was discovered when testing with OpenWrt that join
> the LAN ports lan0 ... lan3 into a bridge and then assign
> each of them into VLAN 1 with PVID set on each port: without
> this patch this will not work and the bridge goes numb.

It is important to explain _why_ the switch will go "numb" and not pass
packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It
is certainly not expected for that to happen.

The purpose of the PVID feature is specifically to classify untagged
packets to a port-based VLAN ID. So "everything is a VLAN" even for
Linux user space, not sure what you're talking about.

When the Linux bridge has the vlan_filtering attribute set to 1, the
hardware should follow suit by making untagged packets get classified to
the VLAN ID that the software bridge wants to see, on the ports that are
members of that bridge.

When the Linux bridge has the vlan_filtering attribute set to 0, the
software bridge very much ignores any VLAN tags from packets, and does
not perform any VLAN-based ingress admission checks. If the hardware
classifies all packets to a VLAN even when VLAN "filtering" (i.e.
ingress dropping on mismatch) is disabled, that is perfectly fine too,
although the software bridge doesn't care. You need to set up a private
VLAN ID for your VLAN-unaware ports, and make it the pvid on those ports,
and somehow force the hardware to classify any packet towards that pvid
on those VLAN-unaware ports, regardless of whether the packets are
untagged or 802.1Q-tagged or 802.1ad-tagged or whatever. That is simply
the way things are supposed to work.

VLAN ID 0 and 4095 are good candidates to use privately within your
driver as the pvid on VLAN-unaware ports, and you can/must manually
bring up these VLANs, since the bridge will refuse to install these
VLANs in its database.

Other VLAN IDs like the range 4000-4094 are also potentially ok as long
as you document the fact that your driver crops that range out of the
usable range of the bridge, and you make sure that no packet leaks
inside or outside of those private VLANs are possible ("attackers" could
still try to send a packet tagged with VLAN ID 4094 towards a port that
is under a VLAN-aware bridge. Since that port is VLAN-aware, it will
recognize the VLAN ID as 4094, so unless you configure that port to drop
VLAN ID 4094, it might well leak into the VLAN domain 4094 which is
privately used by your driver to ensure VLAN-unaware forwarding between
the ports of a nearby VLAN-unaware bridge.

I know there are lots of things to think about, but this patch is way
too simplistic and does not really offer solid explanations.
Linus Walleij Sept. 13, 2021, 11:20 p.m. UTC | #2
Hi Vladimir,

first, thanks for your help and patience. I learned a lot the recent
weeks, much thanks to your questions and explanations!

On Mon, Sep 13, 2021 at 5:34 PM Vladimir Oltean <olteanv@gmail.com> wrote:

> > This was discovered when testing with OpenWrt that join
> > the LAN ports lan0 ... lan3 into a bridge and then assign
> > each of them into VLAN 1 with PVID set on each port: without
> > this patch this will not work and the bridge goes numb.
>
> It is important to explain _why_ the switch will go "numb" and not pass
> packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It
> is certainly not expected for that to happen.

Yeah it is pretty weird. What happens now is that this is a regression
when using OpenWrt userspace as it sets up the VLANs like this,
but if I boot a clean system and just manually do e.g.
ifconfig lan0 169.254.1.2 netmask 255.255.255.0 up
it works fine because the default VLANs that were set up by the
driver (removed by patch 2/8) will tag all packets using PVID and
send packets on 5 ingress and 1 egress VLANs.

> The purpose of the PVID feature is specifically to classify untagged
> packets to a port-based VLAN ID. So "everything is a VLAN" even for
> Linux user space, not sure what you're talking about.

I think what happens is that OpenWrts userspace sets VLAN 1
for all ingress ports with PVID, so all packets from ingress ports
get tagged nicely with VID 1.

But as the CPU port is hidden inside the bridge
it can't join the CPU port into that VLAN (userspace does not
know it exist I think?) and thus no packets
can go into or out of the CPU port. But you can still pass packets
between the lan ports.

> When the Linux bridge has the vlan_filtering attribute set to 1, the
> hardware should follow suit by making untagged packets get classified to
> the VLAN ID that the software bridge wants to see, on the ports that are
> members of that bridge.

This is what it does, I think.

But the "4K" VLAN feature is so strict that it will restrict also the CPU
port from this (in hardware) with no way to turn it off.

It seems the "4K" mode is a "VLAN with filtering only mode" so no
matter whether we turned on filtering or not, the CPU port
will not see any packets from any other ports unless we add also
that port (port 5) into the VLAN.

One solution I could try would be to just add the CPU port to all
VLANs by default, but .. is that right?

I suppose this would work as software will add the right
VID to the packets so they will only propagate to the right
ports anyway. It could test it.

> When the Linux bridge has the vlan_filtering attribute set to 0, the
> software bridge very much ignores any VLAN tags from packets, and does
> not perform any VLAN-based ingress admission checks. If the hardware
> classifies all packets to a VLAN even when VLAN "filtering" (i.e.
> ingress dropping on mismatch) is disabled, that is perfectly fine too,

I think this is what happens in this hardware.

> although the software bridge doesn't care. You need to set up a private
> VLAN ID for your VLAN-unaware ports, and make it the pvid on those ports,

Would the CPU port be a VLAN-unaware port?

My problem is that in the "4K" mode, the CPU port will not see packets
from any VLAN it is not a member of.

> and somehow force the hardware to classify any packet towards that pvid
> on those VLAN-unaware ports, regardless of whether the packets are
> untagged or 802.1Q-tagged or 802.1ad-tagged or whatever. That is simply
> the way things are supposed to work.
>
> VLAN ID 0 and 4095 are good candidates to use privately within your
> driver as the pvid on VLAN-unaware ports, and you can/must manually
> bring up these VLANs, since the bridge will refuse to install these
> VLANs in its database.
>
> Other VLAN IDs like the range 4000-4094 are also potentially ok as long
> as you document the fact that your driver crops that range out of the
> usable range of the bridge, and you make sure that no packet leaks
> inside or outside of those private VLANs are possible ("attackers" could
> still try to send a packet tagged with VLAN ID 4094 towards a port that
> is under a VLAN-aware bridge. Since that port is VLAN-aware, it will
> recognize the VLAN ID as 4094, so unless you configure that port to drop
> VLAN ID 4094, it might well leak into the VLAN domain 4094 which is
> privately used by your driver to ensure VLAN-unaware forwarding between
> the ports of a nearby VLAN-unaware bridge.

I don't have any VLAN-unaware ports other than the CPU
port.

What happened before patch 2/8 was that it was given its own VLAN
with PVID and all other ports were assigned members of that VLAN as
well, and thus egress traffic from the CPU port could go out.

For ingress traffic, the CPU port was member of all ingress
VLANs. (Also removed by patch 2/8)

> I know there are lots of things to think about, but this patch is way
> too simplistic and does not really offer solid explanations.

I'm trying my best and learning along the way :)
It's fine, no hurry and let's get this right so that the other
RTL switches can look at this a good example.

Yours,
Linus Walleij
Qingfang Deng Sept. 14, 2021, 6:29 a.m. UTC | #3
Hi Linus,

On Tue, Sep 14, 2021 at 01:20:14AM +0200, Linus Walleij wrote:
> Hi Vladimir,
> 
> first, thanks for your help and patience. I learned a lot the recent
> weeks, much thanks to your questions and explanations!
> 
> On Mon, Sep 13, 2021 at 5:34 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> 
> > > This was discovered when testing with OpenWrt that join
> > > the LAN ports lan0 ... lan3 into a bridge and then assign
> > > each of them into VLAN 1 with PVID set on each port: without
> > > this patch this will not work and the bridge goes numb.
> >
> > It is important to explain _why_ the switch will go "numb" and not pass
> > packets if the Linux bridge assigns all ports to VLAN ID 1 as pvid. It
> > is certainly not expected for that to happen.
> 
> Yeah it is pretty weird. What happens now is that this is a regression
> when using OpenWrt userspace as it sets up the VLANs like this,

Were you running net-next kernel?

There have been major changes to DSA since 5.10, so you'd better test your
driver on net-next.

> but if I boot a clean system and just manually do e.g.
> ifconfig lan0 169.254.1.2 netmask 255.255.255.0 up
> it works fine because the default VLANs that were set up by the
> driver (removed by patch 2/8) will tag all packets using PVID and
> send packets on 5 ingress and 1 egress VLANs.
> 
> > The purpose of the PVID feature is specifically to classify untagged
> > packets to a port-based VLAN ID. So "everything is a VLAN" even for
> > Linux user space, not sure what you're talking about.
> 
> I think what happens is that OpenWrts userspace sets VLAN 1
> for all ingress ports with PVID, so all packets from ingress ports
> get tagged nicely with VID 1.
> 
> But as the CPU port is hidden inside the bridge
> it can't join the CPU port into that VLAN (userspace does not
> know it exist I think?) and thus no packets
> can go into or out of the CPU port. But you can still pass packets
> between the lan ports.
> 
> > When the Linux bridge has the vlan_filtering attribute set to 1, the
> > hardware should follow suit by making untagged packets get classified to
> > the VLAN ID that the software bridge wants to see, on the ports that are
> > members of that bridge.
> 
> This is what it does, I think.
> 
> But the "4K" VLAN feature is so strict that it will restrict also the CPU
> port from this (in hardware) with no way to turn it off.
> 
> It seems the "4K" mode is a "VLAN with filtering only mode" so no
> matter whether we turned on filtering or not, the CPU port
> will not see any packets from any other ports unless we add also
> that port (port 5) into the VLAN.
> 
> One solution I could try would be to just add the CPU port to all
> VLANs by default, but .. is that right?

The DSA core already adds the CPU port to VLAN members for you.
In file net/dsa/slave.c, function dsa_slave_vlan_add:
...
	err = dsa_port_vlan_add(dp, &vlan, extack);
	if (err)
		return err;

	/* We need the dedicated CPU port to be a member of the VLAN as well.
	 * Even though drivers often handle CPU membership in special ways,
	 * it doesn't make sense to program a PVID, so clear this flag.
	 */
	vlan.flags &= ~BRIDGE_VLAN_INFO_PVID;

	err = dsa_port_vlan_add(dp->cpu_dp, &vlan, extack);
	if (err)
		return err;
...

If it does not work, you may have done something wrong.
Linus Walleij Sept. 14, 2021, 1:12 p.m. UTC | #4
On Tue, Sep 14, 2021 at 8:29 AM DENG Qingfang <dqfext@gmail.com> wrote:

> > Yeah it is pretty weird. What happens now is that this is a regression
> > when using OpenWrt userspace as it sets up the VLANs like this,
>
> Were you running net-next kernel?

Yep just the userspace is from OpenWrt, the kernel is latest mainline.

> The DSA core already adds the CPU port to VLAN members for you.
> In file net/dsa/slave.c, function dsa_slave_vlan_add:

Hmmmmm I will look into this. Put in some debug prints etc.
Certainly the callbacks to the driver for doing this aren't getting
called even with v5.15-rc1 AFAICT.

Yours,
Linus Walleij
diff mbox series

Patch

diff --git a/drivers/net/dsa/rtl8366.c b/drivers/net/dsa/rtl8366.c
index fae14c448fe4..9652323167c2 100644
--- a/drivers/net/dsa/rtl8366.c
+++ b/drivers/net/dsa/rtl8366.c
@@ -313,13 +313,15 @@  int rtl8366_vlan_add(struct dsa_switch *ds, int port,
 		untagged = true;
 
 
-	/* Enable VLAN in the hardware
-	 * FIXME: what's with this 4k business?
-	 * Just rtl8366_enable_vlan() seems inconclusive.
+	/* Enable VLAN in the hardware, do NOT enable VLAN4K, because the
+	 * 4K VLAN will activate a 4096 entries VID table, but has the side
+	 * effect that every processed frame MUST have a VID, meaning non-VLAN
+	 * traffic will now work at all. So we will let the 16 VLAN entries
+	 * suffice.
 	 */
-	ret = rtl8366_enable_vlan4k(smi, true);
+	ret = rtl8366_enable_vlan(smi, true);
 	if (ret) {
-		NL_SET_ERR_MSG_MOD(extack, "Failed to enable VLAN 4K");
+		NL_SET_ERR_MSG_MOD(extack, "Failed to enable VLAN");
 		return ret;
 	}