
[v2,14/14] net: ethernet: mtk_eth_soc: support creating mac address based offload entries

Message ID 20220405195755.10817-15-nbd@nbd.name (mailing list archive)
State New, archived
Series MediaTek SoC flow offload improvements + wireless support

Commit Message

Felix Fietkau April 5, 2022, 7:57 p.m. UTC
This will be used to implement a limited form of bridge offloading.
Since the hardware does not support flow table entries with just source
and destination MAC address, the driver has to emulate it.

The hardware automatically creates entries for incoming flows, even
when they are bridged instead of routed, and reports when packets for these
flows have reached the minimum PPS rate for offloading.

After this happens, we look up the L2 flow offload entry based on the MAC
header and fill in the output routing information in the flow table.
The dynamically created per-flow entries are automatically removed when
the hardware flowtable entry expires, when it is replaced, or when the
offload rule they belong to is removed.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
 drivers/net/ethernet/mediatek/mtk_ppe.c       | 241 ++++++++++++++++--
 drivers/net/ethernet/mediatek/mtk_ppe.h       |  44 +++-
 .../net/ethernet/mediatek/mtk_ppe_offload.c   |  60 ++++-
 3 files changed, 299 insertions(+), 46 deletions(-)

Comments

Andrew Lunn April 7, 2022, 6:10 p.m. UTC | #1
On Tue, Apr 05, 2022 at 09:57:55PM +0200, Felix Fietkau wrote:
> This will be used to implement a limited form of bridge offloading.
> Since the hardware does not support flow table entries with just source
> and destination MAC address, the driver has to emulate it.
> 
> The hardware automatically creates entries entries for incoming flows, even
> when they are bridged instead of routed, and reports when packets for these
> flows have reached the minimum PPS rate for offloading.
> 
> After this happens, we look up the L2 flow offload entry based on the MAC
> header and fill in the output routing information in the flow table.
> The dynamically created per-flow entries are automatically removed when
> either the hardware flowtable entry expires, is replaced, or if the offload
> rule they belong to is removed

> +
> +	if (found)
> +		goto out;
> +
> +	eh = eth_hdr(skb);
> +	ether_addr_copy(key.dest_mac, eh->h_dest);
> +	ether_addr_copy(key.src_mac, eh->h_source);
> +	tag = skb->data - 2;
> +	key.vlan = 0;
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_NET_DSA)
> +	case htons(ETH_P_XDSA):
> +		if (!netdev_uses_dsa(skb->dev) ||
> +		    skb->dev->dsa_ptr->tag_ops->proto != DSA_TAG_PROTO_MTK)
> +			goto out;
> +
> +		tag += 4;
> +		if (get_unaligned_be16(tag) != ETH_P_8021Q)
> +			break;
> +
> +		fallthrough;
> +#endif
> +	case htons(ETH_P_8021Q):
> +		key.vlan = get_unaligned_be16(tag + 2) & VLAN_VID_MASK;
> +		break;
> +	default:
> +		break;
> +	}

I'm trying to understand the architecture here.

We have an Ethernet interface and a Wireless interface. The slow path
is that frames ingress from one of these interfaces, Linux decides
what to do with them, either L2 or L3, and they then egress probably
out the other interface.

The hardware will look at the frames and try to spot flows? It will
then report any it finds. You can then add an offload, telling it for
a flow it needs to perform L2 or L3 processing, and egress out a
specific port? Linux then no longer sees the frame, the hardware
handles it, until the flow times out?

So I'm wondering what is going on here. Is this a frame which has
ingressed, either from the WiFi or another switch port, gone to the
software bridge, been bridged to a DSA slave interface, where the DSA
tagger has added a tag, and is now in the master interface? Can you
accelerate such frames? What is adding the DSA tag on the fast path? And in the
opposite direction, frames which egress the switch which have a DSA
tag and are heading to the WiFi, what is removing the tag? Does the
accelerator also understand the tag and know what to do with it?

	Andrew
Felix Fietkau April 7, 2022, 6:21 p.m. UTC | #2
On 07.04.22 20:10, Andrew Lunn wrote:
> On Tue, Apr 05, 2022 at 09:57:55PM +0200, Felix Fietkau wrote:
>> This will be used to implement a limited form of bridge offloading.
>> Since the hardware does not support flow table entries with just source
>> and destination MAC address, the driver has to emulate it.
>> 
>> The hardware automatically creates entries entries for incoming flows, even
>> when they are bridged instead of routed, and reports when packets for these
>> flows have reached the minimum PPS rate for offloading.
>> 
>> After this happens, we look up the L2 flow offload entry based on the MAC
>> header and fill in the output routing information in the flow table.
>> The dynamically created per-flow entries are automatically removed when
>> either the hardware flowtable entry expires, is replaced, or if the offload
>> rule they belong to is removed
> 
>> +
>> +	if (found)
>> +		goto out;
>> +
>> +	eh = eth_hdr(skb);
>> +	ether_addr_copy(key.dest_mac, eh->h_dest);
>> +	ether_addr_copy(key.src_mac, eh->h_source);
>> +	tag = skb->data - 2;
>> +	key.vlan = 0;
>> +	switch (skb->protocol) {
>> +#if IS_ENABLED(CONFIG_NET_DSA)
>> +	case htons(ETH_P_XDSA):
>> +		if (!netdev_uses_dsa(skb->dev) ||
>> +		    skb->dev->dsa_ptr->tag_ops->proto != DSA_TAG_PROTO_MTK)
>> +			goto out;
>> +
>> +		tag += 4;
>> +		if (get_unaligned_be16(tag) != ETH_P_8021Q)
>> +			break;
>> +
>> +		fallthrough;
>> +#endif
>> +	case htons(ETH_P_8021Q):
>> +		key.vlan = get_unaligned_be16(tag + 2) & VLAN_VID_MASK;
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> I'm trying to understand the architecture here.
> 
> We have an Ethernet interface and a Wireless interface. The slow path
> is that frames ingress from one of these interfaces, Linux decides
> what to do with them, either L2 or L3, and they then egress probably
> out the other interface.
> 
> The hardware will look at the frames and try to spot flows? It will
> then report any it finds. You can then add an offload, telling it for
> a flow it needs to perform L2 or L3 processing, and egress out a
> specific port? Linux then no longer sees the frame, the hardware
> handles it, until the flow times out?
Yes, the hw handles it until either the flow times out, or the 
corresponding offload entry is removed.

For OpenWrt I also wrote a daemon that uses tc classifier BPF to 
accelerate the software bridge and create hardware offload entries as 
well via hardware TC flower rules: https://github.com/nbd168/bridger
It works in combination with these changes.

> So i'm wondering what is going on here. So is this a frame which has
> ingressed, either from the WiFi, or another switch port, gone to the
> software bridge, bridges to a DSA slave interface, the DSA tagger has
> added a tag and now it is in the master interface? Can you accelerate
> such frames? What is adding the DSA tag on the fast path? And in the
> opposite direction, frames which egress the switch which have a DSA
> tag and are heading to the WiFi, what is removing the tag? Does the
> accelerator also understand the tag and know what to do with it?

WiFi -> Ethernet is not supported by MT7622, but will be added for newer 
SoCs like MT7986. The PPE supports both parsing and inserting MT7530 
compatible DSA tags. For Ethernet->WiFi flows, the PPE will also add 
required metadata that is parsed by the MT7915 WiFi Firmware in order to 
figure out what vif/station the packets were meant for.

- Felix
Andrew Lunn April 11, 2022, 1 p.m. UTC | #3
On Thu, Apr 07, 2022 at 08:21:43PM +0200, Felix Fietkau wrote:
> 
> On 07.04.22 20:10, Andrew Lunn wrote:
> > On Tue, Apr 05, 2022 at 09:57:55PM +0200, Felix Fietkau wrote:
> > > This will be used to implement a limited form of bridge offloading.
> > > Since the hardware does not support flow table entries with just source
> > > and destination MAC address, the driver has to emulate it.
> > > 
> > > The hardware automatically creates entries entries for incoming flows, even
> > > when they are bridged instead of routed, and reports when packets for these
> > > flows have reached the minimum PPS rate for offloading.
> > > 
> > > After this happens, we look up the L2 flow offload entry based on the MAC
> > > header and fill in the output routing information in the flow table.
> > > The dynamically created per-flow entries are automatically removed when
> > > either the hardware flowtable entry expires, is replaced, or if the offload
> > > rule they belong to is removed
> > 
> > > +
> > > +	if (found)
> > > +		goto out;
> > > +
> > > +	eh = eth_hdr(skb);
> > > +	ether_addr_copy(key.dest_mac, eh->h_dest);
> > > +	ether_addr_copy(key.src_mac, eh->h_source);
> > > +	tag = skb->data - 2;
> > > +	key.vlan = 0;
> > > +	switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_NET_DSA)
> > > +	case htons(ETH_P_XDSA):
> > > +		if (!netdev_uses_dsa(skb->dev) ||
> > > +		    skb->dev->dsa_ptr->tag_ops->proto != DSA_TAG_PROTO_MTK)
> > > +			goto out;
> > > +
> > > +		tag += 4;
> > > +		if (get_unaligned_be16(tag) != ETH_P_8021Q)
> > > +			break;
> > > +
> > > +		fallthrough;
> > > +#endif
> > > +	case htons(ETH_P_8021Q):
> > > +		key.vlan = get_unaligned_be16(tag + 2) & VLAN_VID_MASK;
> > > +		break;
> > > +	default:
> > > +		break;
> > > +	}
> > 
> > I'm trying to understand the architecture here.
> > 
> > We have an Ethernet interface and a Wireless interface. The slow path
> > is that frames ingress from one of these interfaces, Linux decides
> > what to do with them, either L2 or L3, and they then egress probably
> > out the other interface.
> > 
> > The hardware will look at the frames and try to spot flows? It will
> > then report any it finds. You can then add an offload, telling it for
> > a flow it needs to perform L2 or L3 processing, and egress out a
> > specific port? Linux then no longer sees the frame, the hardware
> > handles it, until the flow times out?
> Yes, the hw handles it until either the flow times out, or the corresponding
> offload entry is removed.
> 
> For OpenWrt I also wrote a daemon that uses tc classifier BPF to accelerate
> the software bridge and create hardware offload entries as well via hardware
> TC flower rules: https://github.com/nbd168/bridger
> It works in combination with these changes.

What about the bridge? In Linux, it is the software bridge which
controls all this at L2, and it should be offloading the flows, via
switchdev. The egress port you derive here is from the software bridge
FDB?

> > So i'm wondering what is going on here. So is this a frame which has
> > ingressed, either from the WiFi, or another switch port, gone to the
> > software bridge, bridges to a DSA slave interface, the DSA tagger has
> > added a tag and now it is in the master interface? Can you accelerate
> > such frames? What is adding the DSA tag on the fast path? And in the
> > opposite direction, frames which egress the switch which have a DSA
> > tag and are heading to the WiFi, what is removing the tag? Does the
> > accelerator also understand the tag and know what to do with it?
> 
> WiFi -> Ethernet is not supported by MT7622, but will be added for newer
> SoCs like MT7986. The PPE supports both parsing and inserting MT7530
> compatible DSA tags. For Ethernet->WiFi flows, the PPE will also add
> required metadata that is parsed by the MT7915 WiFi Firmware in order to
> figure out what vif/station the packets were meant for.

O.K. What about IGMP and multicast? Does the accelerator match on IGMP
and forward it to the CPU, rather than follow the flow rules? Can you
set multiple egress destinations for multicast so that it can go both
to the switch and the host, when the host has a local interest in the
traffic?

	Andrew
Felix Fietkau April 12, 2022, 7:13 a.m. UTC | #4
On 11.04.22 15:00, Andrew Lunn wrote:
> On Thu, Apr 07, 2022 at 08:21:43PM +0200, Felix Fietkau wrote:
>> 
>> On 07.04.22 20:10, Andrew Lunn wrote:
>> > On Tue, Apr 05, 2022 at 09:57:55PM +0200, Felix Fietkau wrote:
>> > > This will be used to implement a limited form of bridge offloading.
>> > > Since the hardware does not support flow table entries with just source
>> > > and destination MAC address, the driver has to emulate it.
>> > > 
>> > > The hardware automatically creates entries entries for incoming flows, even
>> > > when they are bridged instead of routed, and reports when packets for these
>> > > flows have reached the minimum PPS rate for offloading.
>> > > 
>> > > After this happens, we look up the L2 flow offload entry based on the MAC
>> > > header and fill in the output routing information in the flow table.
>> > > The dynamically created per-flow entries are automatically removed when
>> > > either the hardware flowtable entry expires, is replaced, or if the offload
>> > > rule they belong to is removed
>> > 
>> > > +
>> > > +	if (found)
>> > > +		goto out;
>> > > +
>> > > +	eh = eth_hdr(skb);
>> > > +	ether_addr_copy(key.dest_mac, eh->h_dest);
>> > > +	ether_addr_copy(key.src_mac, eh->h_source);
>> > > +	tag = skb->data - 2;
>> > > +	key.vlan = 0;
>> > > +	switch (skb->protocol) {
>> > > +#if IS_ENABLED(CONFIG_NET_DSA)
>> > > +	case htons(ETH_P_XDSA):
>> > > +		if (!netdev_uses_dsa(skb->dev) ||
>> > > +		    skb->dev->dsa_ptr->tag_ops->proto != DSA_TAG_PROTO_MTK)
>> > > +			goto out;
>> > > +
>> > > +		tag += 4;
>> > > +		if (get_unaligned_be16(tag) != ETH_P_8021Q)
>> > > +			break;
>> > > +
>> > > +		fallthrough;
>> > > +#endif
>> > > +	case htons(ETH_P_8021Q):
>> > > +		key.vlan = get_unaligned_be16(tag + 2) & VLAN_VID_MASK;
>> > > +		break;
>> > > +	default:
>> > > +		break;
>> > > +	}
>> > 
>> > I'm trying to understand the architecture here.
>> > 
>> > We have an Ethernet interface and a Wireless interface. The slow path
>> > is that frames ingress from one of these interfaces, Linux decides
>> > what to do with them, either L2 or L3, and they then egress probably
>> > out the other interface.
>> > 
>> > The hardware will look at the frames and try to spot flows? It will
>> > then report any it finds. You can then add an offload, telling it for
>> > a flow it needs to perform L2 or L3 processing, and egress out a
>> > specific port? Linux then no longer sees the frame, the hardware
>> > handles it, until the flow times out?
>> Yes, the hw handles it until either the flow times out, or the corresponding
>> offload entry is removed.
>> 
>> For OpenWrt I also wrote a daemon that uses tc classifier BPF to accelerate
>> the software bridge and create hardware offload entries as well via hardware
>> TC flower rules: https://github.com/nbd168/bridger
>> It works in combination with these changes.
> 
> What about the bridge? In Linux, it is the software bridge which
> controls all this at L2, and it should be offloading the flows, via
> switchdev. The egress port you derive here is from the software bridge
> FDB?
My code uses netlink to fetch and monitor the bridge configuration, 
including fdb, port state, vlans, etc., and it uses that for the offload 
path - no extra configuration needed.

>> > So i'm wondering what is going on here. So is this a frame which has
>> > ingressed, either from the WiFi, or another switch port, gone to the
>> > software bridge, bridges to a DSA slave interface, the DSA tagger has
>> > added a tag and now it is in the master interface? Can you accelerate
>> > such frames? What is adding the DSA tag on the fast path? And in the
>> > opposite direction, frames which egress the switch which have a DSA
>> > tag and are heading to the WiFi, what is removing the tag? Does the
>> > accelerator also understand the tag and know what to do with it?
>> 
>> WiFi -> Ethernet is not supported by MT7622, but will be added for newer
>> SoCs like MT7986. The PPE supports both parsing and inserting MT7530
>> compatible DSA tags. For Ethernet->WiFi flows, the PPE will also add
>> required metadata that is parsed by the MT7915 WiFi Firmware in order to
>> figure out what vif/station the packets were meant for.
> 
> O.K. What about IGMP and multicast? Does the accelerate match on IGMP
> and forwards it to the CPU, rather than follow the flow rules? Can you
> set multiple egress destinations for multicast so that it can go both
> to the switch and the host, when the host has a local interest in the
> traffic?
IGMP/multicast isn't handled at the moment. I still need to do some 
research on what can be offloaded and how. The offload only handles 
unicast; everything else goes through the CPU.

- Felix
Andrew Lunn April 12, 2022, 1:07 p.m. UTC | #5
> > > > > I'm trying to understand the architecture here.
> > > > > We have an Ethernet interface and a Wireless interface. The slow
> > > path
> > > > is that frames ingress from one of these interfaces, Linux decides
> > > > what to do with them, either L2 or L3, and they then egress probably
> > > > out the other interface.
> > > > > The hardware will look at the frames and try to spot flows? It
> > > will
> > > > then report any it finds. You can then add an offload, telling it for
> > > > a flow it needs to perform L2 or L3 processing, and egress out a
> > > > specific port? Linux then no longer sees the frame, the hardware
> > > > handles it, until the flow times out?
> > > Yes, the hw handles it until either the flow times out, or the corresponding
> > > offload entry is removed.
> > > 
> > > For OpenWrt I also wrote a daemon that uses tc classifier BPF to accelerate
> > > the software bridge and create hardware offload entries as well via hardware
> > > TC flower rules: https://github.com/nbd168/bridger
> > > It works in combination with these changes.
> > 
> > What about the bridge? In Linux, it is the software bridge which
> > controls all this at L2, and it should be offloading the flows, via
> > switchdev. The egress port you derive here is from the software bridge
> > FDB?

> My code uses netlink to fetch and monitor the bridge configuration,
> including fdb, port state, vlans, etc. and it uses that for the offload path
> - no extra configuration needed.

So this is where we get into architecture issues. Do we really want
Linux to have two ways for setting up L2 networking? It was decided
that users should not need to know about how to use an accelerator,
they should not use additional tools, it should just look like
linux. The user should just add the WiFi netdev to the bridge and
switchdev will do the rest to offload L2 switching to the hardware.

You appear to be saying you need a daemon in userspace. That is not
how any other accelerator works in Linux networking.

We, the Linux network community, need to decide whether we want this.

    Andrew
Felix Fietkau April 12, 2022, 1:49 p.m. UTC | #6
On 12.04.22 15:07, Andrew Lunn wrote:
>> > > > > I'm trying to understand the architecture here.
>> > > > > We have an Ethernet interface and a Wireless interface. The slow
>> > > path
>> > > > is that frames ingress from one of these interfaces, Linux decides
>> > > > what to do with them, either L2 or L3, and they then egress probably
>> > > > out the other interface.
>> > > > > The hardware will look at the frames and try to spot flows? It
>> > > will
>> > > > then report any it finds. You can then add an offload, telling it for
>> > > > a flow it needs to perform L2 or L3 processing, and egress out a
>> > > > specific port? Linux then no longer sees the frame, the hardware
>> > > > handles it, until the flow times out?
>> > > Yes, the hw handles it until either the flow times out, or the corresponding
>> > > offload entry is removed.
>> > > 
>> > > For OpenWrt I also wrote a daemon that uses tc classifier BPF to accelerate
>> > > the software bridge and create hardware offload entries as well via hardware
>> > > TC flower rules: https://github.com/nbd168/bridger
>> > > It works in combination with these changes.
>> > 
>> > What about the bridge? In Linux, it is the software bridge which
>> > controls all this at L2, and it should be offloading the flows, via
>> > switchdev. The egress port you derive here is from the software bridge
>> > FDB?
> 
>> My code uses netlink to fetch and monitor the bridge configuration,
>> including fdb, port state, vlans, etc. and it uses that for the offload path
>> - no extra configuration needed.
> 
> So this is where we get into architecture issues. Do we really want
> Linux to have two ways for setting up L2 networking? It was decided
> that users should not need to know about how to use an accelerator,
> they should not use additional tools, it should just look like
> linux. The user should just add the WiFi netdev to the bridge and
> switchdev will do the rest to offload L2 switching to the hardware.
> 
> You appear to be saying you need a daemon in userspace. That is not
> how every other accelerate works in Linux networking.
> 
> We the Linux network community need to decided if we want this?
The problem here is that it can't be fully transparent. Enabling 
hardware offload for LAN -> WiFi comes at the cost of bypassing airtime 
fairness and mac80211's bufferbloat mitigation.
Some people want this anyway (often but not always for 
benchmark/marketing purposes), but it's not something that I would want 
enabled by default simply by adding a wifi netdev to a bridge.

Initially, I wanted to put more of the state tracking code in the 
kernel. I made the first implementation of my acceleration code as a 
patch to the network bridge - speeding up bridge unicast forwarding 
significantly for any device regardless of hardware support. I wanted to 
build on that to avoid putting a lot of FDB/VLAN related tracking 
directly into the driver.

That approach was immediately rejected and I was told to use BPF instead.

That said, I really don't think it's a good idea to put all the code for 
tracking the bridge state and all possible forwarding destinations into 
the driver directly.

I believe the combination of doing the bridge state tracking in user 
space + using the standard TC API for programming offloading rules into 
the hardware is a reasonable compromise.

- Felix
Andrew Lunn April 12, 2022, 2:21 p.m. UTC | #7
On Tue, Apr 12, 2022 at 03:49:51PM +0200, Felix Fietkau wrote:
> 
> On 12.04.22 15:07, Andrew Lunn wrote:
> > > > > > > I'm trying to understand the architecture here.
> > > > > > > We have an Ethernet interface and a Wireless interface. The slow
> > > > > path
> > > > > > is that frames ingress from one of these interfaces, Linux decides
> > > > > > what to do with them, either L2 or L3, and they then egress probably
> > > > > > out the other interface.
> > > > > > > The hardware will look at the frames and try to spot flows? It
> > > > > will
> > > > > > then report any it finds. You can then add an offload, telling it for
> > > > > > a flow it needs to perform L2 or L3 processing, and egress out a
> > > > > > specific port? Linux then no longer sees the frame, the hardware
> > > > > > handles it, until the flow times out?
> > > > > Yes, the hw handles it until either the flow times out, or the corresponding
> > > > > offload entry is removed.
> > > > > > > For OpenWrt I also wrote a daemon that uses tc classifier
> > > BPF to accelerate
> > > > > the software bridge and create hardware offload entries as well via hardware
> > > > > TC flower rules: https://github.com/nbd168/bridger
> > > > > It works in combination with these changes.
> > > > > What about the bridge? In Linux, it is the software bridge which
> > > > controls all this at L2, and it should be offloading the flows, via
> > > > switchdev. The egress port you derive here is from the software bridge
> > > > FDB?
> > 
> > > My code uses netlink to fetch and monitor the bridge configuration,
> > > including fdb, port state, vlans, etc. and it uses that for the offload path
> > > - no extra configuration needed.
> > 
> > So this is where we get into architecture issues. Do we really want
> > Linux to have two ways for setting up L2 networking? It was decided
> > that users should not need to know about how to use an accelerator,
> > they should not use additional tools, it should just look like
> > linux. The user should just add the WiFi netdev to the bridge and
> > switchdev will do the rest to offload L2 switching to the hardware.
> > 
> > You appear to be saying you need a daemon in userspace. That is not
> > how every other accelerate works in Linux networking.
> > 
> > We the Linux network community need to decided if we want this?

> The problem here is that it can't be fully transparent. Enabling hardware
> offload for LAN -> WiFi comes at a cost of bypassing airtime fairness and
> mac80211's bufferbloat mitigation.
> Some people want this anyway (often but not always for benchmark/marketing
> purposes), but it's not something that I would want to have enabled by
> default simply by a wifi netdev to a bridge.

So this sounds like a generic issue. How does IPA handle this? Looping
in Alex Elder.

There is already something partially in this direction in the
bridge. You can add a static entry with or without self. This
controls whether a specific static entry in the FDB is offloaded to the
accelerator or not. Maybe you can add an attribute to a port which
determines whether dynamic entries are self or not, so you can decide
whether frames egressing out a specific interface are accelerated,
depending on user choice. Since such a change should not touch the
fast path, it has a better chance of being merged.

> Initially, I wanted to put more of the state tracking code in the kernel. I
> made the first implementation of my acceleration code as a patch to the
> network bridge - speeding up bridge unicast forwarding significantly for any
> device regardless of hardware support. I wanted to build on that to avoid
> putting a lot of FDB/VLAN related tracking directly into the driver.

But the driver is the correct place for this. How generic is the state
tracking? Do you expect any other hardware to need the same state
tracking? IPA? Some other accelerator you know of?

	  Andrew
Felix Fietkau April 12, 2022, 3:51 p.m. UTC | #8
On 12.04.22 16:21, Andrew Lunn wrote:
> On Tue, Apr 12, 2022 at 03:49:51PM +0200, Felix Fietkau wrote:
>> 
>> On 12.04.22 15:07, Andrew Lunn wrote:
>> > > > > > > I'm trying to understand the architecture here.
>> > > > > > > We have an Ethernet interface and a Wireless interface. The slow
>> > > > > path
>> > > > > > is that frames ingress from one of these interfaces, Linux decides
>> > > > > > what to do with them, either L2 or L3, and they then egress probably
>> > > > > > out the other interface.
>> > > > > > > The hardware will look at the frames and try to spot flows? It
>> > > > > will
>> > > > > > then report any it finds. You can then add an offload, telling it for
>> > > > > > a flow it needs to perform L2 or L3 processing, and egress out a
>> > > > > > specific port? Linux then no longer sees the frame, the hardware
>> > > > > > handles it, until the flow times out?
>> > > > > Yes, the hw handles it until either the flow times out, or the corresponding
>> > > > > offload entry is removed.
>> > > > > > > For OpenWrt I also wrote a daemon that uses tc classifier
>> > > BPF to accelerate
>> > > > > the software bridge and create hardware offload entries as well via hardware
>> > > > > TC flower rules: https://github.com/nbd168/bridger
>> > > > > It works in combination with these changes.
>> > > > > What about the bridge? In Linux, it is the software bridge which
>> > > > controls all this at L2, and it should be offloading the flows, via
>> > > > switchdev. The egress port you derive here is from the software bridge
>> > > > FDB?
>> > 
>> > > My code uses netlink to fetch and monitor the bridge configuration,
>> > > including fdb, port state, vlans, etc. and it uses that for the offload path
>> > > - no extra configuration needed.
>> > 
>> > So this is where we get into architecture issues. Do we really want
>> > Linux to have two ways for setting up L2 networking? It was decided
>> > that users should not need to know about how to use an accelerator,
>> > they should not use additional tools, it should just look like
>> > linux. The user should just add the WiFi netdev to the bridge and
>> > switchdev will do the rest to offload L2 switching to the hardware.
>> > 
>> > You appear to be saying you need a daemon in userspace. That is not
>> > how every other accelerate works in Linux networking.
>> > 
>> > We the Linux network community need to decided if we want this?
> 
>> The problem here is that it can't be fully transparent. Enabling hardware
>> offload for LAN -> WiFi comes at a cost of bypassing airtime fairness and
>> mac80211's bufferbloat mitigation.
>> Some people want this anyway (often but not always for benchmark/marketing
>> purposes), but it's not something that I would want to have enabled by
>> default simply by a wifi netdev to a bridge.
> 
> So this sounds like a generic issue. How does IPA handle this? Looping
> in Alex Elder.
> 
> There is already something partially in this direction in the
> bridge. You can add a static entry with our without self. This
> controls if a specific static entry in the FDB is offloaded to the
> accelerate or not. Maybe you can add an attribute to a port which
> determines if dynamic entries are self or not, so you can decide if
> frames egressing out a specific interface are accelerated or not,
> depending on user choice. Since such a change should not touch the
> fast path, it has a better chance of being merged.
Sounds interesting. If there is some overlap and if we can get some 
common code in place, it might be worth looking into for the MTK offload 
as well. I do think this will take more time though.

>> Initially, I wanted to put more of the state tracking code in the kernel. I
>> made the first implementation of my acceleration code as a patch to the
>> network bridge - speeding up bridge unicast forwarding significantly for any
>> device regardless of hardware support. I wanted to build on that to avoid
>> putting a lot of FDB/VLAN related tracking directly into the driver.
> 
> But the driver is the correct place for this. How generic is the state
> tracking? Do you expect any other hardware to need the same state
> tracking? IPA? Some other accelerate your know of?
It basically has to keep track of all possible destination ports, their 
STP state, all their fdb entries, member VLANs of all ports. It has to 
quickly react to changes in any of these.
In order to implement this properly, I would also need to make more 
changes to mac80211. Right now, mac80211 drivers do not have access to 
the net_device pointer of virtual interfaces. So mac80211 itself would 
likely need to implement the switchdev ops and handle some of this.

There are also some other issues where I don't know how this is supposed 
to be solved properly:
On MT7622 most of the bridge ports are connected to a MT7531 switch 
using DSA. Offloading (lan->wlan bridging or L3/L4 NAT/routing) is not 
handled by the switch itself, it is handled by a packet processing 
engine in the SoC, which knows how to handle the DSA tags of the MT7531 
switch.

So if I were to handle this through switchdev implemented on the wlan 
and ethernet devices, it would technically not be part of the same 
switch, since it's behind a different component with a different driver.

Also, is switchdev able to handle the situation where only part of the 
traffic is offloaded and the rest (e.g. multicast) is handled through 
the regular software path?

In my opinion, handling it through the TC offload has a number of 
advantages:
- It's a lot simpler
- It uses the same kind of offloading rules that my software fastpath 
already uses
- It allows more fine-grained control over which traffic should be 
offloaded (source MAC -> destination MAC tuple)

I also plan on extending my software fast path code to support emulating 
bridging of WiFi client mode interfaces. This involves doing some MAC 
address translation with some IP address tracking. I want that to 
support hardware offload as well.

I really don't think that desire for supporting switchdev based offload 
should be a blocker for accepting this code now, especially since my 
implementation relies on existing Linux network APIs without inventing 
any new ones, and there are valid use cases for using it, even with 
switchdev support in place.

- Felix
Andrew Lunn April 12, 2022, 5:37 p.m. UTC | #9
> It basically has to keep track of all possible destination ports, their STP
> state, all their fdb entries, member VLANs of all ports. It has to quickly
> react to changes in any of these.

switchdev gives you all of those, I think. DSA does not make use of
them all, in particular the fdb entries, because of the low
bandwidth management link to the switch. But look at the Mellanox
switch: it keeps its hardware fdb entries in sync with the software
fdb.

And you get very quick access to these, sometimes too quick, in that
it is holding a spinlock when it calls the switchdev functions, and
you need to defer the handling in your driver if you want to use a
mutex, perform blocking IO, etc.

> In order to implement this properly, I would also need to make more changes
> to mac80211. Right now, mac80211 drivers do not have access to the
> net_device pointer of virtual interfaces. So mac80211 itself would likely
> need to implement the switchdev ops and handle some of this.

So this again sounds like something which would be shared by IPA, and
any other hardware which can accelerate forwarding between WiFi and
some other sort of interface.

> There are also some other issues where I don't know how this is supposed to
> be solved properly:
> On MT7622 most of the bridge ports are connected to a MT7531 switch using
> DSA. Offloading (lan->wlan bridging or L3/L4 NAT/routing) is not handled by
> the switch itself, it is handled by a packet processing engine in the SoC,
> which knows how to handle the DSA tags of the MT7531 switch.
> 
> So if I were to handle this through switchdev implemented on the wlan and
> ethernet devices, it would technically not be part of the same switch, since
> it's behind a different component with a different driver.

What is important here is the user experience. The user is not
expected to know there is an accelerator being used. You set up the
bridge just as normal, using iproute2. You add routes in the normal
way, either by iproute2, or frr can add routes from OSPF, BGP, RIP or
whatever, via zebra. I'm not sure anybody has yet accelerated NAT, but
the same principle should be used: use iptables in the normal way,
and the accelerator is then informed and should accelerate it if
possible.

switchdev gives you notification of when anything changes. You can
have multiple receivers of these notifications, so the packet
processor can act on them as well as the DSA switch.
 
> Also, is switchdev able to handle the situation where only part of the
> traffic is offloaded and the rest (e.g. multicast) is handled through the
> regular software path?

Yes, that is not a problem. I deliberately use the term
accelerator. We accelerate what Linux can already do. If the
accelerator hardware is not capable of something, Linux still is, so
just pass it the frames and it will do the right thing. Multicast is a
good example of this, many of the DSA switch drivers don't accelerate
it.

> In my opinion, handling it through the TC offload has a number of
> advantages:
> - It's a lot simpler
> - It uses the same kind of offloading rules that my software fastpath
> already uses
> - It allows more fine grained control over which traffic should be offloaded
> (src mac -> destination MAC tuple)
> 
> I also plan on extending my software fast path code to support emulating
> bridging of WiFi client mode interfaces. This involves doing some MAC
> address translation with some IP address tracking. I want that to support
> hardware offload as well.
> 
> I really don't think that desire for supporting switchdev based offload
> should be a blocker for accepting this code now, especially since my
> implementation relies on existing Linux network APIs without inventing any
> new ones, and there are valid use cases for using it, even with switchdev
> support in place.

What we need to avoid is fragmentation of the way we do things. It has
been decided that switchdev is how we use accelerators, and the user
should not really know anything about the accelerator. No other in
kernel network accelerator needs a user space component listening to
netlink notifications and programming the accelerator from user space.
Do we really want two ways to do this?

   Andrew
Felix Fietkau April 12, 2022, 5:51 p.m. UTC | #10
On 12.04.22 19:37, Andrew Lunn wrote:
>> It basically has to keep track of all possible destination ports, their STP
>> state, all their fdb entries, member VLANs of all ports. It has to quickly
>> react to changes in any of these.
> 
> switchdev gives you all of those, I think. DSA does not make use of
> them all, in particular the fdb entries, because of the low
> bandwidth management link to the switch. But look at the Mellanox
> switch, it keeps its hardware fdb entries in sync with the software
> fdb.
> 
> And you get very quick access to these, sometimes too quick, in that
> it is holding a spinlock when it calls the switchdev functions, and
> you need to defer the handling in your driver if you want to use a
> mutex, perform blocking IO, etc.
> 
>> In order to implement this properly, I would also need to make more changes
>> to mac80211. Right now, mac80211 drivers do not have access to the
>> net_device pointer of virtual interfaces. So mac80211 itself would likely
>> need to implement the switchdev ops and handle some of this.
> 
> So this again sounds like something which would be shared by IPA, and
> any other hardware which can accelerate forwarding between WiFi and
> some other sort of interface.
I would really like to see an example of how this should be done.
Is there a work-in-progress tree for IPA with offloading? Because the 
code that I see upstream doesn't seem to have any of that - or did I 
look in the wrong place?

>> There are also some other issues where I don't know how this is supposed to
>> be solved properly:
>> On MT7622 most of the bridge ports are connected to a MT7531 switch using
>> DSA. Offloading (lan->wlan bridging or L3/L4 NAT/routing) is not handled by
>> the switch itself, it is handled by a packet processing engine in the SoC,
>> which knows how to handle the DSA tags of the MT7531 switch.
>> 
>> So if I were to handle this through switchdev implemented on the wlan and
>> ethernet devices, it would technically not be part of the same switch, since
>> it's behind a different component with a different driver.
> 
> What is important here is the user experience. The user is not
> expected to know there is an accelerator being used. You set up the
> bridge just as normal, using iproute2. You add routes in the normal
> way, either by iproute2, or frr can add routes from OSPF, BGP, RIP or
> whatever, via zebra. I'm not sure anybody has yet accelerated NAT, but
> the same principle should be used: use iptables in the normal way,
> and the accelerator is then informed and should accelerate it if
> possible.
Accelerated NAT on MT7622 has been present in the upstream code for a 
while. It's there for ethernet, and with my patches it also works for 
ethernet -> wlan.

> switchdev gives you notification of when anything changes. You can
> have multiple receivers of these notifications, so the packet
> processor can act on them as well as the DSA switch.
>   
>> Also, is switchdev able to handle the situation where only part of the
>> traffic is offloaded and the rest (e.g. multicast) is handled through the
>> regular software path?
> 
> Yes, that is not a problem. I deliberately use the term
> accelerator. We accelerate what Linux can already do. If the
> accelerator hardware is not capable of something, Linux still is, so
> just pass it the frames and it will do the right thing. Multicast is a
> good example of this, many of the DSA switch drivers don't accelerate
> it.
Don't get me wrong, I'm not against switchdev support at all. I just 
don't know how to do it yet, and the code that I put in place is useful 
for non-switchdev use cases as well.

>> In my opinion, handling it through the TC offload has a number of
>> advantages:
>> - It's a lot simpler
>> - It uses the same kind of offloading rules that my software fastpath
>> already uses
>> - It allows more fine grained control over which traffic should be offloaded
>> (src mac -> destination MAC tuple)
>> 
>> I also plan on extending my software fast path code to support emulating
>> bridging of WiFi client mode interfaces. This involves doing some MAC
>> address translation with some IP address tracking. I want that to support
>> hardware offload as well.
>> 
>> I really don't think that desire for supporting switchdev based offload
>> should be a blocker for accepting this code now, especially since my
>> implementation relies on existing Linux network APIs without inventing any
>> new ones, and there are valid use cases for using it, even with switchdev
>> support in place.
> 
> What we need to avoid is fragmentation of the way we do things. It has
> been decided that switchdev is how we use accelerators, and the user
> should not really know anything about the accelerator. No other in
> kernel network accelerator needs a user space component listening to
> netlink notifications and programming the accelerator from user space.
> Do we really want two ways to do this?
There's always some overlap in what the APIs can do. And when it comes 
to the "client mode bridge" use case that I mentioned, I would also need 
exactly the same API that I put in place here. And this is not something 
that can (or even should) be done using switchdev. mac80211 prevents 
adding client mode interfaces to bridges for a reason.

- Felix

Patch

diff --git a/drivers/net/ethernet/mediatek/mtk_ppe.c b/drivers/net/ethernet/mediatek/mtk_ppe.c
index aa0d190db65d..282a1f34a88a 100644
--- a/drivers/net/ethernet/mediatek/mtk_ppe.c
+++ b/drivers/net/ethernet/mediatek/mtk_ppe.c
@@ -6,12 +6,22 @@ 
 #include <linux/iopoll.h>
 #include <linux/etherdevice.h>
 #include <linux/platform_device.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <net/dsa.h>
 #include "mtk_eth_soc.h"
 #include "mtk_ppe.h"
 #include "mtk_ppe_regs.h"
 
 static DEFINE_SPINLOCK(ppe_lock);
 
+static const struct rhashtable_params mtk_flow_l2_ht_params = {
+	.head_offset = offsetof(struct mtk_flow_entry, l2_node),
+	.key_offset = offsetof(struct mtk_flow_entry, data.bridge),
+	.key_len = offsetof(struct mtk_foe_bridge, key_end),
+	.automatic_shrinking = true,
+};
+
 static void ppe_w32(struct mtk_ppe *ppe, u32 reg, u32 val)
 {
 	writel(val, ppe->base + reg);
@@ -123,6 +133,9 @@  mtk_foe_entry_l2(struct mtk_foe_entry *entry)
 {
 	int type = FIELD_GET(MTK_FOE_IB1_PACKET_TYPE, entry->ib1);
 
+	if (type == MTK_PPE_PKT_TYPE_BRIDGE)
+		return &entry->bridge.l2;
+
 	if (type >= MTK_PPE_PKT_TYPE_IPV4_DSLITE)
 		return &entry->ipv6.l2;
 
@@ -134,6 +147,9 @@  mtk_foe_entry_ib2(struct mtk_foe_entry *entry)
 {
 	int type = FIELD_GET(MTK_FOE_IB1_PACKET_TYPE, entry->ib1);
 
+	if (type == MTK_PPE_PKT_TYPE_BRIDGE)
+		return &entry->bridge.ib2;
+
 	if (type >= MTK_PPE_PKT_TYPE_IPV4_DSLITE)
 		return &entry->ipv6.ib2;
 
@@ -168,7 +184,12 @@  int mtk_foe_entry_prepare(struct mtk_foe_entry *entry, int type, int l4proto,
 	if (type == MTK_PPE_PKT_TYPE_IPV6_ROUTE_3T)
 		entry->ipv6.ports = ports_pad;
 
-	if (type >= MTK_PPE_PKT_TYPE_IPV4_DSLITE) {
+	if (type == MTK_PPE_PKT_TYPE_BRIDGE) {
+		ether_addr_copy(entry->bridge.src_mac, src_mac);
+		ether_addr_copy(entry->bridge.dest_mac, dest_mac);
+		entry->bridge.ib2 = val;
+		l2 = &entry->bridge.l2;
+	} else if (type >= MTK_PPE_PKT_TYPE_IPV4_DSLITE) {
 		entry->ipv6.ib2 = val;
 		l2 = &entry->ipv6.l2;
 	} else {
@@ -371,6 +392,84 @@  mtk_flow_entry_match(struct mtk_flow_entry *entry, struct mtk_foe_entry *data)
 	return !memcmp(&entry->data.data, &data->data, len - 4);
 }
 
+static void
+__mtk_foe_entry_clear(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
+{
+	struct hlist_head *head;
+	struct hlist_node *tmp;
+
+	if (entry->type == MTK_FLOW_TYPE_L2) {
+		rhashtable_remove_fast(&ppe->l2_flows, &entry->l2_node,
+				       mtk_flow_l2_ht_params);
+
+		head = &entry->l2_flows;
+		hlist_for_each_entry_safe(entry, tmp, head, l2_data.list)
+			__mtk_foe_entry_clear(ppe, entry);
+		return;
+	}
+
+	hlist_del_init(&entry->list);
+	if (entry->hash != 0xffff) {
+		ppe->foe_table[entry->hash].ib1 &= ~MTK_FOE_IB1_STATE;
+		ppe->foe_table[entry->hash].ib1 |= FIELD_PREP(MTK_FOE_IB1_STATE,
+							      MTK_FOE_STATE_BIND);
+		dma_wmb();
+	}
+	entry->hash = 0xffff;
+
+	if (entry->type != MTK_FLOW_TYPE_L2_SUBFLOW)
+		return;
+
+	hlist_del_init(&entry->l2_data.list);
+	kfree(entry);
+}
+
+static int __mtk_foe_entry_idle_time(struct mtk_ppe *ppe, u32 ib1)
+{
+	u16 timestamp;
+	u16 now;
+
+	now = mtk_eth_timestamp(ppe->eth) & MTK_FOE_IB1_BIND_TIMESTAMP;
+	timestamp = ib1 & MTK_FOE_IB1_BIND_TIMESTAMP;
+
+	if (timestamp > now)
+		return MTK_FOE_IB1_BIND_TIMESTAMP + 1 - timestamp + now;
+	else
+		return now - timestamp;
+}
+
+static void
+mtk_flow_entry_update_l2(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
+{
+	struct mtk_flow_entry *cur;
+	struct mtk_foe_entry *hwe;
+	struct hlist_node *tmp;
+	int idle;
+
+	idle = __mtk_foe_entry_idle_time(ppe, entry->data.ib1);
+	hlist_for_each_entry_safe(cur, tmp, &entry->l2_flows, l2_data.list) {
+		int cur_idle;
+		u32 ib1;
+
+		hwe = &ppe->foe_table[cur->hash];
+		ib1 = READ_ONCE(hwe->ib1);
+
+		if (FIELD_GET(MTK_FOE_IB1_STATE, ib1) != MTK_FOE_STATE_BIND) {
+			cur->hash = 0xffff;
+			__mtk_foe_entry_clear(ppe, cur);
+			continue;
+		}
+
+		cur_idle = __mtk_foe_entry_idle_time(ppe, ib1);
+		if (cur_idle >= idle)
+			continue;
+
+		idle = cur_idle;
+		entry->data.ib1 &= ~MTK_FOE_IB1_BIND_TIMESTAMP;
+		entry->data.ib1 |= hwe->ib1 & MTK_FOE_IB1_BIND_TIMESTAMP;
+	}
+}
+
 static void
 mtk_flow_entry_update(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 {
@@ -378,6 +477,12 @@  mtk_flow_entry_update(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 	struct mtk_foe_entry foe;
 
 	spin_lock_bh(&ppe_lock);
+
+	if (entry->type == MTK_FLOW_TYPE_L2) {
+		mtk_flow_entry_update_l2(ppe, entry);
+		goto out;
+	}
+
 	if (entry->hash == 0xffff)
 		goto out;
 
@@ -419,21 +524,28 @@  __mtk_foe_entry_commit(struct mtk_ppe *ppe, struct mtk_foe_entry *entry,
 void mtk_foe_entry_clear(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 {
 	spin_lock_bh(&ppe_lock);
-	hlist_del_init(&entry->list);
-	if (entry->hash != 0xffff) {
-		ppe->foe_table[entry->hash].ib1 &= ~MTK_FOE_IB1_STATE;
-		ppe->foe_table[entry->hash].ib1 |= FIELD_PREP(MTK_FOE_IB1_STATE,
-							      MTK_FOE_STATE_BIND);
-		dma_wmb();
-	}
-	entry->hash = 0xffff;
+	__mtk_foe_entry_clear(ppe, entry);
 	spin_unlock_bh(&ppe_lock);
 }
 
+static int
+mtk_foe_entry_commit_l2(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
+{
+	entry->type = MTK_FLOW_TYPE_L2;
+
+	return rhashtable_insert_fast(&ppe->l2_flows, &entry->l2_node,
+				      mtk_flow_l2_ht_params);
+}
+
 int mtk_foe_entry_commit(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 {
-	u32 hash = mtk_ppe_hash_entry(&entry->data);
+	int type = FIELD_GET(MTK_FOE_IB1_PACKET_TYPE, entry->data.ib1);
+	u32 hash;
+
+	if (type == MTK_PPE_PKT_TYPE_BRIDGE)
+		return mtk_foe_entry_commit_l2(ppe, entry);
 
+	hash = mtk_ppe_hash_entry(&entry->data);
 	entry->hash = 0xffff;
 	spin_lock_bh(&ppe_lock);
 	hlist_add_head(&entry->list, &ppe->foe_flow[hash / 2]);
@@ -442,18 +554,72 @@  int mtk_foe_entry_commit(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 	return 0;
 }
 
+static void
+mtk_foe_entry_commit_subflow(struct mtk_ppe *ppe, struct mtk_flow_entry *entry,
+			     u16 hash)
+{
+	struct mtk_flow_entry *flow_info;
+	struct mtk_foe_entry foe, *hwe;
+	struct mtk_foe_mac_info *l2;
+	u32 ib1_mask = MTK_FOE_IB1_PACKET_TYPE | MTK_FOE_IB1_UDP;
+	int type;
+
+	flow_info = kzalloc(offsetof(struct mtk_flow_entry, l2_data.end),
+			    GFP_ATOMIC);
+	if (!flow_info)
+		return;
+
+	flow_info->l2_data.base_flow = entry;
+	flow_info->type = MTK_FLOW_TYPE_L2_SUBFLOW;
+	flow_info->hash = hash;
+	hlist_add_head(&flow_info->list, &ppe->foe_flow[hash / 2]);
+	hlist_add_head(&flow_info->l2_data.list, &entry->l2_flows);
+
+	hwe = &ppe->foe_table[hash];
+	memcpy(&foe, hwe, sizeof(foe));
+	foe.ib1 &= ib1_mask;
+	foe.ib1 |= entry->data.ib1 & ~ib1_mask;
+
+	l2 = mtk_foe_entry_l2(&foe);
+	memcpy(l2, &entry->data.bridge.l2, sizeof(*l2));
+
+	type = FIELD_GET(MTK_FOE_IB1_PACKET_TYPE, foe.ib1);
+	if (type == MTK_PPE_PKT_TYPE_IPV4_HNAPT)
+		memcpy(&foe.ipv4.new, &foe.ipv4.orig, sizeof(foe.ipv4.new));
+	else if (type >= MTK_PPE_PKT_TYPE_IPV6_ROUTE_3T && l2->etype == ETH_P_IP)
+		l2->etype = ETH_P_IPV6;
+
+	*mtk_foe_entry_ib2(&foe) = entry->data.bridge.ib2;
+
+	__mtk_foe_entry_commit(ppe, &foe, hash);
+}
+
 void __mtk_ppe_check_skb(struct mtk_ppe *ppe, struct sk_buff *skb, u16 hash)
 {
 	struct hlist_head *head = &ppe->foe_flow[hash / 2];
-	struct mtk_flow_entry *entry;
 	struct mtk_foe_entry *hwe = &ppe->foe_table[hash];
+	struct mtk_flow_entry *entry;
+	struct mtk_foe_bridge key = {};
+	struct ethhdr *eh;
 	bool found = false;
-
-	if (hlist_empty(head))
-		return;
+	u8 *tag;
 
 	spin_lock_bh(&ppe_lock);
+
+	if (FIELD_GET(MTK_FOE_IB1_STATE, hwe->ib1) == MTK_FOE_STATE_BIND)
+		goto out;
+
 	hlist_for_each_entry(entry, head, list) {
+		if (entry->type == MTK_FLOW_TYPE_L2_SUBFLOW) {
+			if (unlikely(FIELD_GET(MTK_FOE_IB1_STATE, hwe->ib1) ==
+				     MTK_FOE_STATE_BIND))
+				continue;
+
+			entry->hash = 0xffff;
+			__mtk_foe_entry_clear(ppe, entry);
+			continue;
+		}
+
 		if (found || !mtk_flow_entry_match(entry, hwe)) {
 			if (entry->hash != 0xffff)
 				entry->hash = 0xffff;
@@ -464,21 +630,50 @@  void __mtk_ppe_check_skb(struct mtk_ppe *ppe, struct sk_buff *skb, u16 hash)
 		__mtk_foe_entry_commit(ppe, &entry->data, hash);
 		found = true;
 	}
+
+	if (found)
+		goto out;
+
+	eh = eth_hdr(skb);
+	ether_addr_copy(key.dest_mac, eh->h_dest);
+	ether_addr_copy(key.src_mac, eh->h_source);
+	tag = skb->data - 2;
+	key.vlan = 0;
+	switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_NET_DSA)
+	case htons(ETH_P_XDSA):
+		if (!netdev_uses_dsa(skb->dev) ||
+		    skb->dev->dsa_ptr->tag_ops->proto != DSA_TAG_PROTO_MTK)
+			goto out;
+
+		tag += 4;
+		if (get_unaligned_be16(tag) != ETH_P_8021Q)
+			break;
+
+		fallthrough;
+#endif
+	case htons(ETH_P_8021Q):
+		key.vlan = get_unaligned_be16(tag + 2) & VLAN_VID_MASK;
+		break;
+	default:
+		break;
+	}
+
+	entry = rhashtable_lookup_fast(&ppe->l2_flows, &key, mtk_flow_l2_ht_params);
+	if (!entry)
+		goto out;
+
+	mtk_foe_entry_commit_subflow(ppe, entry, hash);
+
+out:
 	spin_unlock_bh(&ppe_lock);
 }
 
 int mtk_foe_entry_idle_time(struct mtk_ppe *ppe, struct mtk_flow_entry *entry)
 {
-	u16 now = mtk_eth_timestamp(ppe->eth) & MTK_FOE_IB1_BIND_TIMESTAMP;
-	u16 timestamp;
-
 	mtk_flow_entry_update(ppe, entry);
-	timestamp = entry->data.ib1 & MTK_FOE_IB1_BIND_TIMESTAMP;
 
-	if (timestamp > now)
-		return MTK_FOE_IB1_BIND_TIMESTAMP + 1 - timestamp + now;
-	else
-		return now - timestamp;
+	return __mtk_foe_entry_idle_time(ppe, entry->data.ib1);
 }
 
 struct mtk_ppe *mtk_ppe_init(struct mtk_eth *eth, void __iomem *base,
@@ -492,6 +687,8 @@  struct mtk_ppe *mtk_ppe_init(struct mtk_eth *eth, void __iomem *base,
 	if (!ppe)
 		return NULL;
 
+	rhashtable_init(&ppe->l2_flows, &mtk_flow_l2_ht_params);
+
 	/* need to allocate a separate device, since it PPE DMA access is
 	 * not coherent.
 	 */
diff --git a/drivers/net/ethernet/mediatek/mtk_ppe.h b/drivers/net/ethernet/mediatek/mtk_ppe.h
index 217b44bed8b6..1f5cf1c9a947 100644
--- a/drivers/net/ethernet/mediatek/mtk_ppe.h
+++ b/drivers/net/ethernet/mediatek/mtk_ppe.h
@@ -6,6 +6,7 @@ 
 
 #include <linux/kernel.h>
 #include <linux/bitfield.h>
+#include <linux/rhashtable.h>
 
 #define MTK_ETH_PPE_BASE		0xc00
 
@@ -84,19 +85,16 @@  struct mtk_foe_mac_info {
 	u16 src_mac_lo;
 };
 
+/* software-only entry type */
 struct mtk_foe_bridge {
-	u32 dest_mac_hi;
-
-	u16 src_mac_lo;
-	u16 dest_mac_lo;
+	u8 dest_mac[ETH_ALEN];
+	u8 src_mac[ETH_ALEN];
+	u16 vlan;
 
-	u32 src_mac_hi;
+	struct {} key_end;
 
 	u32 ib2;
 
-	u32 _rsv[5];
-
-	u32 udf_tsid;
 	struct mtk_foe_mac_info l2;
 };
 
@@ -235,13 +233,33 @@  enum {
 	MTK_PPE_CPU_REASON_INVALID			= 0x1f,
 };
 
+enum {
+	MTK_FLOW_TYPE_L4,
+	MTK_FLOW_TYPE_L2,
+	MTK_FLOW_TYPE_L2_SUBFLOW,
+};
+
 struct mtk_flow_entry {
+	union {
+		struct hlist_node list;
+		struct {
+			struct rhash_head l2_node;
+			struct hlist_head l2_flows;
+		};
+	};
+	u8 type;
+	s8 wed_index;
+	u16 hash;
+	union {
+		struct mtk_foe_entry data;
+		struct {
+			struct mtk_flow_entry *base_flow;
+			struct hlist_node list;
+			struct {} end;
+		} l2_data;
+	};
 	struct rhash_head node;
-	struct hlist_node list;
 	unsigned long cookie;
-	struct mtk_foe_entry data;
-	u16 hash;
-	s8 wed_index;
 };
 
 struct mtk_ppe {
@@ -256,6 +274,8 @@  struct mtk_ppe {
 	u16 foe_check_time[MTK_PPE_ENTRIES];
 	struct hlist_head foe_flow[MTK_PPE_ENTRIES / 2];
 
+	struct rhashtable l2_flows;
+
 	void *acct_table;
 };
 
diff --git a/drivers/net/ethernet/mediatek/mtk_ppe_offload.c b/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
index 378685fe92b6..1fe31058b0f2 100644
--- a/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
+++ b/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
@@ -31,6 +31,8 @@  struct mtk_flow_data {
 	__be16 src_port;
 	__be16 dst_port;
 
+	u16 vlan_in;
+
 	struct {
 		u16 id;
 		__be16 proto;
@@ -257,9 +259,45 @@  mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f)
 		return -EOPNOTSUPP;
 	}
 
+	switch (addr_type) {
+	case 0:
+		offload_type = MTK_PPE_PKT_TYPE_BRIDGE;
+		if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+			struct flow_match_eth_addrs match;
+
+			flow_rule_match_eth_addrs(rule, &match);
+			memcpy(data.eth.h_dest, match.key->dst, ETH_ALEN);
+			memcpy(data.eth.h_source, match.key->src, ETH_ALEN);
+		} else {
+			return -EOPNOTSUPP;
+		}
+
+		if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_VLAN)) {
+			struct flow_match_vlan match;
+
+			flow_rule_match_vlan(rule, &match);
+
+			if (match.key->vlan_tpid != cpu_to_be16(ETH_P_8021Q))
+				return -EOPNOTSUPP;
+
+			data.vlan_in = match.key->vlan_id;
+		}
+		break;
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		offload_type = MTK_PPE_PKT_TYPE_IPV4_HNAPT;
+		break;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		offload_type = MTK_PPE_PKT_TYPE_IPV6_ROUTE_5T;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
 	flow_action_for_each(i, act, &rule->action) {
 		switch (act->id) {
 		case FLOW_ACTION_MANGLE:
+			if (offload_type == MTK_PPE_PKT_TYPE_BRIDGE)
+				return -EOPNOTSUPP;
 			if (act->mangle.htype == FLOW_ACT_MANGLE_HDR_TYPE_ETH)
 				mtk_flow_offload_mangle_eth(act, &data.eth);
 			break;
@@ -291,17 +329,6 @@  mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f)
 		}
 	}
 
-	switch (addr_type) {
-	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
-		offload_type = MTK_PPE_PKT_TYPE_IPV4_HNAPT;
-		break;
-	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-		offload_type = MTK_PPE_PKT_TYPE_IPV6_ROUTE_5T;
-		break;
-	default:
-		return -EOPNOTSUPP;
-	}
-
 	if (!is_valid_ether_addr(data.eth.h_source) ||
 	    !is_valid_ether_addr(data.eth.h_dest))
 		return -EINVAL;
@@ -315,10 +342,13 @@  mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f)
 	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS)) {
 		struct flow_match_ports ports;
 
+		if (offload_type == MTK_PPE_PKT_TYPE_BRIDGE)
+			return -EOPNOTSUPP;
+
 		flow_rule_match_ports(rule, &ports);
 		data.src_port = ports.key->src;
 		data.dst_port = ports.key->dst;
-	} else {
+	} else if (offload_type != MTK_PPE_PKT_TYPE_BRIDGE) {
 		return -EOPNOTSUPP;
 	}
 
@@ -348,6 +378,9 @@  mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f)
 		if (act->id != FLOW_ACTION_MANGLE)
 			continue;
 
+		if (offload_type == MTK_PPE_PKT_TYPE_BRIDGE)
+			return -EOPNOTSUPP;
+
 		switch (act->mangle.htype) {
 		case FLOW_ACT_MANGLE_HDR_TYPE_TCP:
 		case FLOW_ACT_MANGLE_HDR_TYPE_UDP:
@@ -373,6 +406,9 @@  mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f)
 			return err;
 	}
 
+	if (offload_type == MTK_PPE_PKT_TYPE_BRIDGE)
+		foe.bridge.vlan = data.vlan_in;
+
 	if (data.vlan.num == 1) {
 		if (data.vlan.proto != htons(ETH_P_8021Q))
 			return -EOPNOTSUPP;