[net-next,0/5] mlxsw: Add support for RED qevent "mark"

Message ID	20210117080223.2107288-1-idosch@idosch.org (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Ido Schimmel <idosch@idosch.org> To: netdev@vger.kernel.org Cc: davem@davemloft.net, kuba@kernel.org, petrm@nvidia.com, jiri@nvidia.com, amcohen@nvidia.com, mlxsw@nvidia.com, Ido Schimmel <idosch@nvidia.com> Subject: [PATCH net-next 0/5] mlxsw: Add support for RED qevent "mark" Date: Sun, 17 Jan 2021 10:02:18 +0200 Message-Id: <20210117080223.2107288-1-idosch@idosch.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	mlxsw: Add support for RED qevent "mark" \| expand [net-next,0/5] mlxsw: Add support for RED qevent "mark" [net-next,1/5] devlink: Add ecn_mark trap [net-next,2/5] mlxsw: spectrum_trap: Add ecn_mark trap [net-next,3/5] mlxsw: spectrum_qdisc: Offload RED qevent mark [net-next,4/5] selftests: mlxsw: sch_red_core: Drop two unused variables [net-next,5/5] selftests: mlxsw: RED: Add selftests for the mark qevent

Ido Schimmel Jan. 17, 2021, 8:02 a.m. UTC

From: Ido Schimmel <idosch@nvidia.com>

Petr says:

The RED qdisc currently supports two qevents: "early_drop" and "mark". The
filters added to the block bound to the "early_drop" qevent are executed on
packets for which the RED algorithm decides that they should be
early-dropped. The "mark" filters are similarly executed on ECT packets
that are marked as ECN-CE (Congestion Encountered).

A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
later, provided that the classifier used is "matchall", that the action
used is either "trap" or "mirred", and a handful or further limitations.

This patchset similarly offloads "mark" filters.

Patch set overview:

Patches #1 and #2 add the trap, under which packets will be reported to the
CPU, if the qevent filter uses the action "trap".

Patch #3 then recognizes FLOW_BLOCK_BINDER_TYPE_RED_MARK as a binder type,
and offloads the attached filters similarly to _EARLY_DROP.

Patch #4 cleans up some unused variables in a selftest, and patch #5 adds a
new selftest for the RED "mark" qevent offload.

Petr Machata (5):
  devlink: Add ecn_mark trap
  mlxsw: spectrum_trap: Add ecn_mark trap
  mlxsw: spectrum_qdisc: Offload RED qevent mark
  selftests: mlxsw: sch_red_core: Drop two unused variables
  selftests: mlxsw: RED: Add selftests for the mark qevent

 .../networking/devlink/devlink-trap.rst       |  4 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  2 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  2 +
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 14 +++-
 .../ethernet/mellanox/mlxsw/spectrum_span.c   | 16 ++++
 .../ethernet/mellanox/mlxsw/spectrum_span.h   |  1 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |  9 ++
 include/net/devlink.h                         |  3 +
 net/core/devlink.c                            |  1 +
 .../drivers/net/mlxsw/sch_red_core.sh         | 84 ++++++++++++++++++-
 .../drivers/net/mlxsw/sch_red_ets.sh          | 74 ++++++++++++++--
 11 files changed, 200 insertions(+), 10 deletions(-)

Jakub Kicinski Jan. 19, 2021, 10:22 p.m. UTC | #1

On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> filters added to the block bound to the "early_drop" qevent are executed on
> packets for which the RED algorithm decides that they should be
> early-dropped. The "mark" filters are similarly executed on ECT packets
> that are marked as ECN-CE (Congestion Encountered).
> 
> A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> later, provided that the classifier used is "matchall", that the action
> used is either "trap" or "mirred", and a handful or further limitations.

For early_drop trap or mirred makes obvious sense, no explanation
needed.

But for marked as a user I'd like to see a _copy_ of the packet, 
while the original continues on its marry way to the destination.
I'd venture to say that e.g. for a DCTCP deployment mark+trap is
unusable, at least for tracing, because it distorts the operation 
by effectively dropping instead of marking.

Am I reading this right?

If that is the case and you really want to keep the mark+trap
functionality - I feel like at least better documentation is needed.
The current two liner should also be rewritten, quoting from patch 1:

> * - ``ecn_mark``
>   - ``drop``
>   - Traps ECN-capable packets that were marked with CE (Congestion
>     Encountered) code point by RED algorithm instead of being dropped

That needs to say that the trap is for datagrams trapped by a qevent.
Otherwise "Traps ... instead of being dropped" is too much of a
thought-shortcut, marked packets are not dropped.

(I'd also think that trap is better documented next to early_drop,
let's look at it from the reader's perspective)

Ido Schimmel Jan. 20, 2021, 9:14 a.m. UTC | #2

On Tue, Jan 19, 2021 at 02:22:55PM -0800, Jakub Kicinski wrote:
> On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@nvidia.com>
> > 
> > The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> > filters added to the block bound to the "early_drop" qevent are executed on
> > packets for which the RED algorithm decides that they should be
> > early-dropped. The "mark" filters are similarly executed on ECT packets
> > that are marked as ECN-CE (Congestion Encountered).
> > 
> > A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> > later, provided that the classifier used is "matchall", that the action
> > used is either "trap" or "mirred", and a handful or further limitations.
> 
> For early_drop trap or mirred makes obvious sense, no explanation
> needed.
> 
> But for marked as a user I'd like to see a _copy_ of the packet, 
> while the original continues on its marry way to the destination.
> I'd venture to say that e.g. for a DCTCP deployment mark+trap is
> unusable, at least for tracing, because it distorts the operation 
> by effectively dropping instead of marking.
> 
> Am I reading this right?

You get a copy of the packet as otherwise it will create a lot of
problems (like you wrote).

> 
> If that is the case and you really want to keep the mark+trap
> functionality - I feel like at least better documentation is needed.
> The current two liner should also be rewritten, quoting from patch 1:
> 
> > * - ``ecn_mark``
> >   - ``drop``
> >   - Traps ECN-capable packets that were marked with CE (Congestion
> >     Encountered) code point by RED algorithm instead of being dropped
> 
> That needs to say that the trap is for datagrams trapped by a qevent.
> Otherwise "Traps ... instead of being dropped" is too much of a
> thought-shortcut, marked packets are not dropped.
> 
> (I'd also think that trap is better documented next to early_drop,
> let's look at it from the reader's perspective)

How about:

"Traps a copy of ECN-capable packets that were marked with CE
(Congestion Encountered) code point by RED algorithm instead of being
dropped. The trap is enabled by attaching a filter with action 'trap' to
the 'mark' qevent of the RED qdisc."

In addition, this output:

$ devlink trap show pci/0000:06:00.0 trap ecn_mark 
pci/0000:06:00.0:
  name ecn_mark type drop generic true action trap group buffer_drops

Can be converted to:

$ devlink trap show pci/0000:06:00.0 trap ecn_mark 
pci/0000:06:00.0:
  name ecn_mark type drop generic true action mirror group buffer_drops

"mirror: The packet is forwarded by the underlying device and a copy is sent to
the CPU."

In this case the action is static and you cannot change it.

Jakub Kicinski Jan. 21, 2021, 12:45 a.m. UTC | #3

On Wed, 20 Jan 2021 11:14:37 +0200 Ido Schimmel wrote:
> On Tue, Jan 19, 2021 at 02:22:55PM -0800, Jakub Kicinski wrote:
> > On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:  
> > > From: Ido Schimmel <idosch@nvidia.com>
> > > 
> > > The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> > > filters added to the block bound to the "early_drop" qevent are executed on
> > > packets for which the RED algorithm decides that they should be
> > > early-dropped. The "mark" filters are similarly executed on ECT packets
> > > that are marked as ECN-CE (Congestion Encountered).
> > > 
> > > A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> > > later, provided that the classifier used is "matchall", that the action
> > > used is either "trap" or "mirred", and a handful or further limitations.  
> > 
> > For early_drop trap or mirred makes obvious sense, no explanation
> > needed.
> > 
> > But for marked as a user I'd like to see a _copy_ of the packet, 
> > while the original continues on its marry way to the destination.
> > I'd venture to say that e.g. for a DCTCP deployment mark+trap is
> > unusable, at least for tracing, because it distorts the operation 
> > by effectively dropping instead of marking.
> > 
> > Am I reading this right?  
> 
> You get a copy of the packet as otherwise it will create a lot of
> problems (like you wrote).

Hm, so am I missing some background on semantics on TC_ACT_TRAP?
Or perhaps you use a different action code?

AFAICT the code in the kernel is:

struct sk_buff *tcf_qevent_handle(...

	case TC_ACT_STOLEN:
	case TC_ACT_QUEUED:
	case TC_ACT_TRAP:
		__qdisc_drop(skb, to_free);
		*ret = __NET_XMIT_STOLEN;
		return NULL;

Having TRAP mean DROP makes sense for filters, but in case of qevents
shouldn't they be a no-op?

Looking at sch_red looks like TRAP being a no-op would actually give us
the expected behavior.

> > If that is the case and you really want to keep the mark+trap
> > functionality - I feel like at least better documentation is needed.
> > The current two liner should also be rewritten, quoting from patch 1:
> >   
> > > * - ``ecn_mark``
> > >   - ``drop``
> > >   - Traps ECN-capable packets that were marked with CE (Congestion
> > >     Encountered) code point by RED algorithm instead of being dropped  
> > 
> > That needs to say that the trap is for datagrams trapped by a qevent.
> > Otherwise "Traps ... instead of being dropped" is too much of a
> > thought-shortcut, marked packets are not dropped.
> > 
> > (I'd also think that trap is better documented next to early_drop,
> > let's look at it from the reader's perspective)  
> 
> How about:
> 
> "Traps a copy of ECN-capable packets that were marked with CE

I think "Traps copies" or "Traps the copy of .. packet"?
I'm not a native speaker but there seems to be a grammatical mix here.

> (Congestion Encountered) code point by RED algorithm instead of being
> dropped. The trap is enabled by attaching a filter with action 'trap' to

... instead of those copies being dropped.

> the 'mark' qevent of the RED qdisc."
>
> In addition, this output:
> 
> $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> pci/0000:06:00.0:
>   name ecn_mark type drop generic true action trap group buffer_drops
> 
> Can be converted to:
> 
> $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> pci/0000:06:00.0:
>   name ecn_mark type drop generic true action mirror group buffer_drops
> 
> "mirror: The packet is forwarded by the underlying device and a copy is sent to
> the CPU."
> 
> In this case the action is static and you cannot change it.

Oh yes, that's nice, I thought mirror in traps means mirror to another
port. Are there already traps which implement the mirroring / trapping
a clone? Quick grep yields nothing of substance.

Ido Schimmel Jan. 21, 2021, 10:23 a.m. UTC | #4

On Wed, Jan 20, 2021 at 04:45:08PM -0800, Jakub Kicinski wrote:
> On Wed, 20 Jan 2021 11:14:37 +0200 Ido Schimmel wrote:
> > On Tue, Jan 19, 2021 at 02:22:55PM -0800, Jakub Kicinski wrote:
> > > On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:  
> > > > From: Ido Schimmel <idosch@nvidia.com>
> > > > 
> > > > The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> > > > filters added to the block bound to the "early_drop" qevent are executed on
> > > > packets for which the RED algorithm decides that they should be
> > > > early-dropped. The "mark" filters are similarly executed on ECT packets
> > > > that are marked as ECN-CE (Congestion Encountered).
> > > > 
> > > > A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> > > > later, provided that the classifier used is "matchall", that the action
> > > > used is either "trap" or "mirred", and a handful or further limitations.  
> > > 
> > > For early_drop trap or mirred makes obvious sense, no explanation
> > > needed.
> > > 
> > > But for marked as a user I'd like to see a _copy_ of the packet, 
> > > while the original continues on its marry way to the destination.
> > > I'd venture to say that e.g. for a DCTCP deployment mark+trap is
> > > unusable, at least for tracing, because it distorts the operation 
> > > by effectively dropping instead of marking.
> > > 
> > > Am I reading this right?  
> > 
> > You get a copy of the packet as otherwise it will create a lot of
> > problems (like you wrote).
> 
> Hm, so am I missing some background on semantics on TC_ACT_TRAP?
> Or perhaps you use a different action code?

Well, to make it really clear, we can add TC_ACT_TRAP_MIRROR.

TC_ACT_TRAP: Sole copy goes to the CPU
TC_ACT_TRAP_MIRROR: The packet is forwarded by the underlying device and
a copy is sent to the CPU

And only allow (in mlxsw) attaching filters with TC_ACT_TRAP_MIRROR to
the "mark" qevent.

> 
> AFAICT the code in the kernel is:
> 
> struct sk_buff *tcf_qevent_handle(...
> 
> 	case TC_ACT_STOLEN:
> 	case TC_ACT_QUEUED:
> 	case TC_ACT_TRAP:
> 		__qdisc_drop(skb, to_free);
> 		*ret = __NET_XMIT_STOLEN;
> 		return NULL;
> 
> Having TRAP mean DROP makes sense for filters, but in case of qevents
> shouldn't they be a no-op?
> 
> Looking at sch_red looks like TRAP being a no-op would actually give us
> the expected behavior.

I'm not sure it makes sense to try to interpret these actions in
software (I expect they will be used with "skip_sw" filters), but
TC_ACT_TRAP_MIRROR can be a no-op like you suggested.

> 
> > > If that is the case and you really want to keep the mark+trap
> > > functionality - I feel like at least better documentation is needed.
> > > The current two liner should also be rewritten, quoting from patch 1:
> > >   
> > > > * - ``ecn_mark``
> > > >   - ``drop``
> > > >   - Traps ECN-capable packets that were marked with CE (Congestion
> > > >     Encountered) code point by RED algorithm instead of being dropped  
> > > 
> > > That needs to say that the trap is for datagrams trapped by a qevent.
> > > Otherwise "Traps ... instead of being dropped" is too much of a
> > > thought-shortcut, marked packets are not dropped.
> > > 
> > > (I'd also think that trap is better documented next to early_drop,
> > > let's look at it from the reader's perspective)  
> > 
> > How about:
> > 
> > "Traps a copy of ECN-capable packets that were marked with CE
> 
> I think "Traps copies" or "Traps the copy of .. packet"?
> I'm not a native speaker but there seems to be a grammatical mix here.
> 
> > (Congestion Encountered) code point by RED algorithm instead of being
> > dropped. The trap is enabled by attaching a filter with action 'trap' to
> 
> ... instead of those copies being dropped.

Will reword

> 
> > the 'mark' qevent of the RED qdisc."
> >
> > In addition, this output:
> > 
> > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > pci/0000:06:00.0:
> >   name ecn_mark type drop generic true action trap group buffer_drops
> > 
> > Can be converted to:
> > 
> > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > pci/0000:06:00.0:
> >   name ecn_mark type drop generic true action mirror group buffer_drops
> > 
> > "mirror: The packet is forwarded by the underlying device and a copy is sent to
> > the CPU."
> > 
> > In this case the action is static and you cannot change it.
> 
> Oh yes, that's nice, I thought mirror in traps means mirror to another
> port. Are there already traps which implement the mirroring / trapping
> a clone? Quick grep yields nothing of substance.

Yes. That's why we have the 'offload_fwd_mark' and 'offload_l3_fwd_mark'
bits in the skb. For example, we let the hardware flood ARP requests
('arp_request'), but also send a copy to the CPU in case it needs to
update its neighbour table. The trapping happens at L2, so we only set
the 'offload_fwd_mark' bit. It will tell the bridge driver to not flood
the packet again.

The 'offload_l3_fwd_mark' bit is mainly used to support one-armed router
use cases where a packet is forwarded through the same interface through
which it was received ('uc_loopback'). We do the forwarding in hardware,
but also send a copy to the CPU to give the kernel the chance to
generate an ICMP redirect if it was not disabled by the user. See more
info in commit 55827458e058 ("Merge branch
'mlxsw-Add-one-armed-router-support'").

I also want to explain how the qevent stuff works in hardware to make
sure it is all clear. We have the ability to bind different triggers to
a mirroring (SPAN) agent. The agent can point to a physical port /
virtual interface (e.g., gretap for ERSPAN) or to the CPU port. The
first is programmed via the mirred action and the second using the trap
action.

The triggers can be simple such as Rx/Tx packet (matchall + mirred) or
policy engine (flower + mirred). The more advanced triggers are various
buffer events such as early drops ('early_drop' qevent) and ECN marking
('mark' qevent). Currently, it is only possible to bind these triggers
to a mirroring agent which is why we only support (in mlxsw) attaching
matchall filters to these qevents. In the future we might be able to
bind ACLs to these triggers in which case we will allow attaching flower
filters. devlink-trap is really only a read-only interface in this case,
meant to tell you why you go the packet from the hardware datapath. The
enablement / disablement is done by tc which gives us feature parity
with the software datapath.

Jakub Kicinski Jan. 21, 2021, 5:19 p.m. UTC | #5

On Thu, 21 Jan 2021 12:23:18 +0200 Ido Schimmel wrote:
> On Wed, Jan 20, 2021 at 04:45:08PM -0800, Jakub Kicinski wrote:
> > On Wed, 20 Jan 2021 11:14:37 +0200 Ido Schimmel wrote:  
> > > On Tue, Jan 19, 2021 at 02:22:55PM -0800, Jakub Kicinski wrote:  
> > > > On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:    
> > > > > From: Ido Schimmel <idosch@nvidia.com>
> > > > > 
> > > > > The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> > > > > filters added to the block bound to the "early_drop" qevent are executed on
> > > > > packets for which the RED algorithm decides that they should be
> > > > > early-dropped. The "mark" filters are similarly executed on ECT packets
> > > > > that are marked as ECN-CE (Congestion Encountered).
> > > > > 
> > > > > A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> > > > > later, provided that the classifier used is "matchall", that the action
> > > > > used is either "trap" or "mirred", and a handful or further limitations.    
> > > > 
> > > > For early_drop trap or mirred makes obvious sense, no explanation
> > > > needed.
> > > > 
> > > > But for marked as a user I'd like to see a _copy_ of the packet, 
> > > > while the original continues on its marry way to the destination.
> > > > I'd venture to say that e.g. for a DCTCP deployment mark+trap is
> > > > unusable, at least for tracing, because it distorts the operation 
> > > > by effectively dropping instead of marking.
> > > > 
> > > > Am I reading this right?    
> > > 
> > > You get a copy of the packet as otherwise it will create a lot of
> > > problems (like you wrote).  
> > 
> > Hm, so am I missing some background on semantics on TC_ACT_TRAP?
> > Or perhaps you use a different action code?  
> 
> Well, to make it really clear, we can add TC_ACT_TRAP_MIRROR.
> 
> TC_ACT_TRAP: Sole copy goes to the CPU
> TC_ACT_TRAP_MIRROR: The packet is forwarded by the underlying device and
> a copy is sent to the CPU
> 
> And only allow (in mlxsw) attaching filters with TC_ACT_TRAP_MIRROR to
> the "mark" qevent.
> 
> > 
> > AFAICT the code in the kernel is:
> > 
> > struct sk_buff *tcf_qevent_handle(...
> > 
> > 	case TC_ACT_STOLEN:
> > 	case TC_ACT_QUEUED:
> > 	case TC_ACT_TRAP:
> > 		__qdisc_drop(skb, to_free);
> > 		*ret = __NET_XMIT_STOLEN;
> > 		return NULL;
> > 
> > Having TRAP mean DROP makes sense for filters, but in case of qevents
> > shouldn't they be a no-op?
> > 
> > Looking at sch_red looks like TRAP being a no-op would actually give us
> > the expected behavior.  
> 
> I'm not sure it makes sense to try to interpret these actions in
> software (I expect they will be used with "skip_sw" filters), but
> TC_ACT_TRAP_MIRROR can be a no-op like you suggested.

Well our paradigm is SW defines the behavior, we can't have HW forward
and copy, while the SW drops the frame. Some engineer will try to
implement this some day in their switch driver, look at the SW behavior
and scratch their head.

> > > the 'mark' qevent of the RED qdisc."
> > >
> > > In addition, this output:
> > > 
> > > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > > pci/0000:06:00.0:
> > >   name ecn_mark type drop generic true action trap group buffer_drops
> > > 
> > > Can be converted to:
> > > 
> > > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > > pci/0000:06:00.0:
> > >   name ecn_mark type drop generic true action mirror group buffer_drops
> > > 
> > > "mirror: The packet is forwarded by the underlying device and a copy is sent to
> > > the CPU."
> > > 
> > > In this case the action is static and you cannot change it.  
> > 
> > Oh yes, that's nice, I thought mirror in traps means mirror to another
> > port. Are there already traps which implement the mirroring / trapping
> > a clone? Quick grep yields nothing of substance.  
> 
> Yes. That's why we have the 'offload_fwd_mark' and 'offload_l3_fwd_mark'
> bits in the skb. For example, we let the hardware flood ARP requests
> ('arp_request'), but also send a copy to the CPU in case it needs to
> update its neighbour table. The trapping happens at L2, so we only set
> the 'offload_fwd_mark' bit. It will tell the bridge driver to not flood
> the packet again.
> 
> The 'offload_l3_fwd_mark' bit is mainly used to support one-armed router
> use cases where a packet is forwarded through the same interface through
> which it was received ('uc_loopback'). We do the forwarding in hardware,
> but also send a copy to the CPU to give the kernel the chance to
> generate an ICMP redirect if it was not disabled by the user. See more
> info in commit 55827458e058 ("Merge branch
> 'mlxsw-Add-one-armed-router-support'").

I see, thanks for the example, but just to be clear those are "internal
traps", they don't have any impact on the devlink trap uAPI (in case we
want to change the definition of MIRRED since nothing is using it).

> I also want to explain how the qevent stuff works in hardware to make
> sure it is all clear. We have the ability to bind different triggers to
> a mirroring (SPAN) agent. The agent can point to a physical port /
> virtual interface (e.g., gretap for ERSPAN) or to the CPU port. The
> first is programmed via the mirred action and the second using the trap
> action.
> 
> The triggers can be simple such as Rx/Tx packet (matchall + mirred) or
> policy engine (flower + mirred). The more advanced triggers are various
> buffer events such as early drops ('early_drop' qevent) and ECN marking
> ('mark' qevent). Currently, it is only possible to bind these triggers
> to a mirroring agent which is why we only support (in mlxsw) attaching
> matchall filters to these qevents. In the future we might be able to
> bind ACLs to these triggers in which case we will allow attaching flower
> filters. devlink-trap is really only a read-only interface in this case,
> meant to tell you why you go the packet from the hardware datapath. The
> enablement / disablement is done by tc which gives us feature parity
> with the software datapath.

Thanks for the explanation. I feel more and more convinced now that
we should have TC_ACT_TRAP_MIRROR and the devlink trap should only 
be on/off :S Current model of "if ACT_TRAP consult devlink for trap
configuration" is impossible to model in SW since it doesn't have a
equivalent of devlink traps. Or we need that equivalent..

Ido Schimmel Jan. 23, 2021, 3:28 p.m. UTC | #6

On Thu, Jan 21, 2021 at 09:19:40AM -0800, Jakub Kicinski wrote:
> On Thu, 21 Jan 2021 12:23:18 +0200 Ido Schimmel wrote:
> > On Wed, Jan 20, 2021 at 04:45:08PM -0800, Jakub Kicinski wrote:
> > > On Wed, 20 Jan 2021 11:14:37 +0200 Ido Schimmel wrote:  
> > > > On Tue, Jan 19, 2021 at 02:22:55PM -0800, Jakub Kicinski wrote:  
> > > > > On Sun, 17 Jan 2021 10:02:18 +0200 Ido Schimmel wrote:    
> > > > > > From: Ido Schimmel <idosch@nvidia.com>
> > > > > > 
> > > > > > The RED qdisc currently supports two qevents: "early_drop" and "mark". The
> > > > > > filters added to the block bound to the "early_drop" qevent are executed on
> > > > > > packets for which the RED algorithm decides that they should be
> > > > > > early-dropped. The "mark" filters are similarly executed on ECT packets
> > > > > > that are marked as ECN-CE (Congestion Encountered).
> > > > > > 
> > > > > > A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
> > > > > > later, provided that the classifier used is "matchall", that the action
> > > > > > used is either "trap" or "mirred", and a handful or further limitations.    
> > > > > 
> > > > > For early_drop trap or mirred makes obvious sense, no explanation
> > > > > needed.
> > > > > 
> > > > > But for marked as a user I'd like to see a _copy_ of the packet, 
> > > > > while the original continues on its marry way to the destination.
> > > > > I'd venture to say that e.g. for a DCTCP deployment mark+trap is
> > > > > unusable, at least for tracing, because it distorts the operation 
> > > > > by effectively dropping instead of marking.
> > > > > 
> > > > > Am I reading this right?    
> > > > 
> > > > You get a copy of the packet as otherwise it will create a lot of
> > > > problems (like you wrote).  
> > > 
> > > Hm, so am I missing some background on semantics on TC_ACT_TRAP?
> > > Or perhaps you use a different action code?  
> > 
> > Well, to make it really clear, we can add TC_ACT_TRAP_MIRROR.
> > 
> > TC_ACT_TRAP: Sole copy goes to the CPU
> > TC_ACT_TRAP_MIRROR: The packet is forwarded by the underlying device and
> > a copy is sent to the CPU
> > 
> > And only allow (in mlxsw) attaching filters with TC_ACT_TRAP_MIRROR to
> > the "mark" qevent.
> > 
> > > 
> > > AFAICT the code in the kernel is:
> > > 
> > > struct sk_buff *tcf_qevent_handle(...
> > > 
> > > 	case TC_ACT_STOLEN:
> > > 	case TC_ACT_QUEUED:
> > > 	case TC_ACT_TRAP:
> > > 		__qdisc_drop(skb, to_free);
> > > 		*ret = __NET_XMIT_STOLEN;
> > > 		return NULL;
> > > 
> > > Having TRAP mean DROP makes sense for filters, but in case of qevents
> > > shouldn't they be a no-op?
> > > 
> > > Looking at sch_red looks like TRAP being a no-op would actually give us
> > > the expected behavior.  
> > 
> > I'm not sure it makes sense to try to interpret these actions in
> > software (I expect they will be used with "skip_sw" filters), but
> > TC_ACT_TRAP_MIRROR can be a no-op like you suggested.
> 
> Well our paradigm is SW defines the behavior, we can't have HW forward
> and copy, while the SW drops the frame. Some engineer will try to
> implement this some day in their switch driver, look at the SW behavior
> and scratch their head.

OK, TC_ACT_TRAP_MIRROR will be a no-op in software

> 
> > > > the 'mark' qevent of the RED qdisc."
> > > >
> > > > In addition, this output:
> > > > 
> > > > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > > > pci/0000:06:00.0:
> > > >   name ecn_mark type drop generic true action trap group buffer_drops
> > > > 
> > > > Can be converted to:
> > > > 
> > > > $ devlink trap show pci/0000:06:00.0 trap ecn_mark 
> > > > pci/0000:06:00.0:
> > > >   name ecn_mark type drop generic true action mirror group buffer_drops
> > > > 
> > > > "mirror: The packet is forwarded by the underlying device and a copy is sent to
> > > > the CPU."
> > > > 
> > > > In this case the action is static and you cannot change it.  
> > > 
> > > Oh yes, that's nice, I thought mirror in traps means mirror to another
> > > port. Are there already traps which implement the mirroring / trapping
> > > a clone? Quick grep yields nothing of substance.  
> > 
> > Yes. That's why we have the 'offload_fwd_mark' and 'offload_l3_fwd_mark'
> > bits in the skb. For example, we let the hardware flood ARP requests
> > ('arp_request'), but also send a copy to the CPU in case it needs to
> > update its neighbour table. The trapping happens at L2, so we only set
> > the 'offload_fwd_mark' bit. It will tell the bridge driver to not flood
> > the packet again.
> > 
> > The 'offload_l3_fwd_mark' bit is mainly used to support one-armed router
> > use cases where a packet is forwarded through the same interface through
> > which it was received ('uc_loopback'). We do the forwarding in hardware,
> > but also send a copy to the CPU to give the kernel the chance to
> > generate an ICMP redirect if it was not disabled by the user. See more
> > info in commit 55827458e058 ("Merge branch
> > 'mlxsw-Add-one-armed-router-support'").
> 
> I see, thanks for the example, but just to be clear those are "internal
> traps", they don't have any impact on the devlink trap uAPI (in case we
> want to change the definition of MIRRED since nothing is using it).

It's not MIRRED, but MIRROR. Anyway, these are not internal traps:

$ devlink trap show pci/0000:01:00.0 trap arp_request 
pci/0000:01:00.0:
  name arp_request type control generic true action mirror group neigh_discovery

> 
> > I also want to explain how the qevent stuff works in hardware to make
> > sure it is all clear. We have the ability to bind different triggers to
> > a mirroring (SPAN) agent. The agent can point to a physical port /
> > virtual interface (e.g., gretap for ERSPAN) or to the CPU port. The
> > first is programmed via the mirred action and the second using the trap
> > action.
> > 
> > The triggers can be simple such as Rx/Tx packet (matchall + mirred) or
> > policy engine (flower + mirred). The more advanced triggers are various
> > buffer events such as early drops ('early_drop' qevent) and ECN marking
> > ('mark' qevent). Currently, it is only possible to bind these triggers
> > to a mirroring agent which is why we only support (in mlxsw) attaching
> > matchall filters to these qevents. In the future we might be able to
> > bind ACLs to these triggers in which case we will allow attaching flower
> > filters. devlink-trap is really only a read-only interface in this case,
> > meant to tell you why you go the packet from the hardware datapath. The
> > enablement / disablement is done by tc which gives us feature parity
> > with the software datapath.
> 
> Thanks for the explanation. I feel more and more convinced now that
> we should have TC_ACT_TRAP_MIRROR and the devlink trap should only 
> be on/off :S Current model of "if ACT_TRAP consult devlink for trap
> configuration" is impossible to model in SW since it doesn't have a
> equivalent of devlink traps. Or we need that equivalent..

Wait, the current model is not "if ACT_TRAP consult devlink for trap
configuration". 'ecn_mark' action is always 'trap' ('mirror' in v2) and
can't be changed. Such packets can always be sent to the CPU, but the
decision of whether to send them or not is based on the presence of tc
filters attached to RED's 'mark' qevent with TC_ACT_TRAP
(TC_ACT_TRAP_MIRROR in v2).

I believe that with the proposed changes in v2 it should be perfectly
clear that ECN marked packets are forwarded in hardware and a copy is
sent to the CPU.

Jakub Kicinski Jan. 23, 2021, 7:55 p.m. UTC | #7

On Sat, 23 Jan 2021 17:28:02 +0200 Ido Schimmel wrote:
> > Thanks for the explanation. I feel more and more convinced now that
> > we should have TC_ACT_TRAP_MIRROR and the devlink trap should only 
> > be on/off :S Current model of "if ACT_TRAP consult devlink for trap
> > configuration" is impossible to model in SW since it doesn't have a
> > equivalent of devlink traps. Or we need that equivalent..  
> 
> Wait, the current model is not "if ACT_TRAP consult devlink for trap
> configuration". 'ecn_mark' action is always 'trap' ('mirror' in v2) and
> can't be changed. Such packets can always be sent to the CPU, but the
> decision of whether to send them or not is based on the presence of tc
> filters attached to RED's 'mark' qevent with TC_ACT_TRAP
> (TC_ACT_TRAP_MIRROR in v2).

I see, missed that, but I think my point conceptually stands, right?
Part of forwarding behavior was (in v1) only expressed in control 
plane (devlink) not dataplane (tc).
 
> I believe that with the proposed changes in v2 it should be perfectly
> clear that ECN marked packets are forwarded in hardware and a copy is
> sent to the CPU.

Yup, sounds good.

Ido Schimmel Jan. 24, 2021, 8:42 a.m. UTC | #8

On Sat, Jan 23, 2021 at 11:55:27AM -0800, Jakub Kicinski wrote:
> On Sat, 23 Jan 2021 17:28:02 +0200 Ido Schimmel wrote:
> > > Thanks for the explanation. I feel more and more convinced now that
> > > we should have TC_ACT_TRAP_MIRROR and the devlink trap should only 
> > > be on/off :S Current model of "if ACT_TRAP consult devlink for trap
> > > configuration" is impossible to model in SW since it doesn't have a
> > > equivalent of devlink traps. Or we need that equivalent..  
> > 
> > Wait, the current model is not "if ACT_TRAP consult devlink for trap
> > configuration". 'ecn_mark' action is always 'trap' ('mirror' in v2) and
> > can't be changed. Such packets can always be sent to the CPU, but the
> > decision of whether to send them or not is based on the presence of tc
> > filters attached to RED's 'mark' qevent with TC_ACT_TRAP
> > (TC_ACT_TRAP_MIRROR in v2).
> 
> I see, missed that, but I think my point conceptually stands, right?
> Part of forwarding behavior was (in v1) only expressed in control 
> plane (devlink) not dataplane (tc).

I don't think so? The action was set to 'trap' in both devlink and tc.

>  
> > I believe that with the proposed changes in v2 it should be perfectly
> > clear that ECN marked packets are forwarded in hardware and a copy is
> > sent to the CPU.
> 
> Yup, sounds good.

Thanks!

[net-next,0/5] mlxsw: Add support for RED qevent "mark"

Message

Comments