
[RFC,bpf-next,v3,09/14] net/mlx5e: Implement devtx kfuncs

Message ID 20230707193006.1309662-10-sdf@google.com (mailing list archive)
State RFC
Delegated to: BPF
Series bpf: Netdev TX metadata

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers warning 9 maintainers not CCed: saeedm@nvidia.com tariqt@nvidia.com davem@davemloft.net pabeni@redhat.com leon@kernel.org maxtram95@gmail.com dtatulea@nvidia.com edumazet@google.com linux-rdma@vger.kernel.org
netdev/build_clang fail Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch warning WARNING: line length exceeds 80 columns (lengths 81, 82, 83, 84, 85, 86, 87, 88, 90, 91, 93, 95)
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Stanislav Fomichev July 7, 2023, 7:30 p.m. UTC
TX timestamp:
- requires passing a clock; not sure I'm passing the correct one (from
  cq->mdev), but the timestamp value looks convincing

TX checksum:
- looks like device does packet parsing (and doesn't accept custom
  start/offset), so I'm using the same approach we do for skb (support
  specific header offsets)
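For reference, the skb-style contract this bullet refers to can be sketched as a small software model (plain userspace C, not driver code): the device is told where checksumming starts (csum_start) and where, relative to that start, to store the 16-bit ones'-complement result (csum_offset).

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement checksum over a byte range. */
static uint16_t csum16(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)data[i] << 8 | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* The skb contract: checksum everything from csum_start to the end of
 * the packet and store the result at csum_start + csum_offset (the
 * checksum field itself is zero, or pre-seeded with the pseudo-header
 * sum, before the device runs). */
static void sw_csum_fixup(uint8_t *pkt, size_t len,
			  uint16_t csum_start, uint16_t csum_offset)
{
	uint16_t c = csum16(pkt + csum_start, len - csum_start);

	pkt[csum_start + csum_offset] = c >> 8;
	pkt[csum_start + csum_offset + 1] = c & 0xff;
}
```

A correctly fixed-up region then checksums to zero, which is exactly the property the hardware is being asked to provide.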

Some quirks that might need some more thought:
- AF_XDP requires creating an on-stack xdp_frame
- AF_XDP passes only the head to the completion hook

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  15 ++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 155 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   4 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  10 ++
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  24 +++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  15 +-
 6 files changed, 217 insertions(+), 6 deletions(-)

Comments

Alexei Starovoitov July 11, 2023, 10:56 p.m. UTC | #1
On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> +
> +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> +					   u16 csum_start, u16 csum_offset)
> +{
> +	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> +	struct mlx5_wqe_eth_seg *eseg;
> +
> +	if (unlikely(!ctx->wqe))
> +		return -ENODATA;
> +
> +	eseg = &ctx->wqe->eth;
> +
> +	switch (csum_offset) {
> +	case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> +	case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> +		/* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> +		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}

I think this proves my point: csum is not generalizable even across veth and mlx5.
Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
into HW that has different ideas about csum-ing.

Here is what mlx5 does:
mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
                            struct mlx5e_accel_tx_state *accel,
                            struct mlx5_wqe_eth_seg *eseg)
{
        if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
                return;

        if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
                eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
                if (skb->encapsulation) {
                        eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
                                          MLX5_ETH_WQE_L4_INNER_CSUM;
                        sq->stats->csum_partial_inner++;
                } else {
                        eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
                        sq->stats->csum_partial++;
                }

How would you generalize that into csum api that will work across NICs ?

My answer stands: you cannot.

My proposal again:
add driver specific kfuncs and hooks for things like csum.

Kuba,
since you nacked driver specific stuff please suggest a way to unblock this stalemate.
Stanislav Fomichev July 11, 2023, 11:24 p.m. UTC | #2
On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > +
> > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > +                                        u16 csum_start, u16 csum_offset)
> > +{
> > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > +     struct mlx5_wqe_eth_seg *eseg;
> > +
> > +     if (unlikely(!ctx->wqe))
> > +             return -ENODATA;
> > +
> > +     eseg = &ctx->wqe->eth;
> > +
> > +     switch (csum_offset) {
> > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > +             break;
> > +     default:
> > +             return -EINVAL;
> > +     }
>
> I think this proves my point: csum is not generalizable even across veth and mlx5.
> Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> into HW that has different ideas about csum-ing.
>
> Here is what mlx5 does:
> mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                             struct mlx5e_accel_tx_state *accel,
>                             struct mlx5_wqe_eth_seg *eseg)
> {
>         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
>                 return;
>
>         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>                 if (skb->encapsulation) {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>                                           MLX5_ETH_WQE_L4_INNER_CSUM;
>                         sq->stats->csum_partial_inner++;
>                 } else {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>                         sq->stats->csum_partial++;
>                 }
>
> How would you generalize that into csum api that will work across NICs ?
>
> My answer stands: you cannot.
>
> My proposal again:
> add driver specific kfuncs and hooks for things like csum.

I do see your point, but to also give you my perspective: I have no
clue what those _CSUM tx bits do (as a non-mlx employee). And what
kind of packets they support (initial patch doesn't give any info).
We can definitely expose mlx5 specific request_l4_checksum(bool encap)
which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
what does it _actually_ do? It obviously can't checksum arbitrary
packet formats (because it has this inner/outer selection bit), so
there is really no way for me to provide a per-driver kfunc api. Maybe
the vendors can?

So having csum_start/csum_offset abstraction which works with fixed
offsets seems like at least it correctly sets the expectation for BPF
program writers.
The vendors are already supposed to conform to this start/offset API for skb.
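Concretely, the two csum_offset values the mlx5 hook in this patch matches on are compile-time constants. With the standard Linux header layouts (no VLAN tag, no IPv4 options, no IPv6 extension headers) they work out as below; the struct definitions here are minimal stand-ins with the same sizes and field offsets as the kernel ones:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins matching the kernel struct layouts. */
struct ethhdr  { uint8_t h_dest[6], h_source[6]; uint16_t h_proto; };
struct iphdr   { uint8_t vihl, tos; uint16_t tot_len, id, frag_off;
		 uint8_t ttl, protocol; uint16_t check;
		 uint32_t saddr, daddr; };
struct ipv6hdr { uint32_t vtc_flow; uint16_t payload_len;
		 uint8_t nexthdr, hop_limit;
		 uint8_t saddr[16], daddr[16]; };
struct udphdr  { uint16_t source, dest, len, check; };

/* The only csum_offset values mlx5e_devtx_request_l4_checksum() accepts. */
enum {
	UDP4_CSUM_OFFSET = sizeof(struct ethhdr) + sizeof(struct iphdr) +
			   offsetof(struct udphdr, check), /* 14 + 20 + 6 */
	UDP6_CSUM_OFFSET = sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
			   offsetof(struct udphdr, check), /* 14 + 40 + 6 */
};
```

Anything else (VLAN, IP options, extension headers, TCP) falls through to the -EINVAL branch, which is the portability limit being debated here.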

But back to your point: should we maybe try to meet somewhere in the middle?
1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
this mlx5e_devtx_request_l4_checksum which works for fixed offsets
2. We also let vendors do device-specific "extensions" where devices
deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
This can give BPF authors opportunity to write somewhat portable
programs and also use vendor specific apis when/if needed.

I think we had a similar idea for rx side: have generic kfuncs, but
also let vendors experiment with custom kfuncs if they want to
differentiate.
WDYT? Can it give us the best things from both sides?

> Kuba,
> since you nacked driver specific stuff please suggest a way to unblock this stalemate.
Alexei Starovoitov July 11, 2023, 11:45 p.m. UTC | #3
On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > +
> > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > +                                        u16 csum_start, u16 csum_offset)
> > > +{
> > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > +     struct mlx5_wqe_eth_seg *eseg;
> > > +
> > > +     if (unlikely(!ctx->wqe))
> > > +             return -ENODATA;
> > > +
> > > +     eseg = &ctx->wqe->eth;
> > > +
> > > +     switch (csum_offset) {
> > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > +             break;
> > > +     default:
> > > +             return -EINVAL;
> > > +     }
> >
> > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > into HW that has different ideas about csum-ing.
> >
> > Here is what mlx5 does:
> > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >                             struct mlx5e_accel_tx_state *accel,
> >                             struct mlx5_wqe_eth_seg *eseg)
> > {
> >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> >                 return;
> >
> >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> >                 if (skb->encapsulation) {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> >                         sq->stats->csum_partial_inner++;
> >                 } else {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> >                         sq->stats->csum_partial++;
> >                 }
> >
> > How would you generalize that into csum api that will work across NICs ?
> >
> > My answer stands: you cannot.
> >
> > My proposal again:
> > add driver specific kfuncs and hooks for things like csum.
>
> I do see your point, but to also give you my perspective: I have no
> clue what those _CSUM tx bits do (as a non-mlx employee). And what
> kind of packets they support (initial patch doesn't give any info).
> We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> what does it _actually_ do? It obviously can't checksum arbitrary
> packet formats (because it has this inner/outer selection bit), so
> there is really no way for me to provide a per-driver kfunc api. Maybe
> the vendors can?
>
> So having csum_start/csum_offset abstraction which works with fixed
> offsets seems like at least it correctly sets the expectation for BPF
> program writers.
> The vendors are already supposed to conform to this start/offset API for skb.
>
> But back to your point: should we maybe try to meet somewhere in the middle?
> 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> this mlx5e_devtx_request_l4_checksum which works for fixed offsets

But it doesn't.
Even if you add a check for csum_start (that's missing in the patch)
there needs to be a way to somehow figure out
whether skb->encapsulation is true and set appropriate flags.
Otherwise this request csum will do "something" that only the HW vendor knows.
That would be even harder to debug for bpf prog writers.

So instead of helping bpf prog devs it will actively hurt them.

Another example. If bpf prog was developed and tested on veth
it will work for some values of csum_offset on real HW and will -EINVAL
for the other values.
Just horrible user experience compared to the case where
the user knows that each netdev is potentially different and
_has_ to develop and test their prog on the given HW NIC and
not assume that the kernel can "do the right thing".
This csum exercise is clear example that kernel is not in a position
to do so.
For timestamp it's arguable, but for csum there is no generic api that
kernel can apply universally to NICs.

> 2. We also let vendors do device-specific "extensions" where devices
> deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
> This can give BPF authors opportunity to write somewhat portable
> programs and also use vendor specific apis when/if needed.
>
> I think we had a similar idea for rx side: have generic kfuncs, but
> also let vendors experiment with custom kfuncs if they want to
> differentiate.
> WDYT? Can it give us the best things from both sides?
>
> > Kuba,
> > since you nacked driver specific stuff please suggest a way to unblock this stalemate.
Stanislav Fomichev July 12, 2023, 12:14 a.m. UTC | #4
On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > +
> > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > +                                        u16 csum_start, u16 csum_offset)
> > > > +{
> > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > +
> > > > +     if (unlikely(!ctx->wqe))
> > > > +             return -ENODATA;
> > > > +
> > > > +     eseg = &ctx->wqe->eth;
> > > > +
> > > > +     switch (csum_offset) {
> > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > +             break;
> > > > +     default:
> > > > +             return -EINVAL;
> > > > +     }
> > >
> > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > into HW that has different ideas about csum-ing.
> > >
> > > Here is what mlx5 does:
> > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > >                             struct mlx5e_accel_tx_state *accel,
> > >                             struct mlx5_wqe_eth_seg *eseg)
> > > {
> > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > >                 return;
> > >
> > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > >                 if (skb->encapsulation) {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > >                         sq->stats->csum_partial_inner++;
> > >                 } else {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > >                         sq->stats->csum_partial++;
> > >                 }
> > >
> > > How would you generalize that into csum api that will work across NICs ?
> > >
> > > My answer stands: you cannot.
> > >
> > > My proposal again:
> > > add driver specific kfuncs and hooks for things like csum.
> >
> > I do see your point, but to also give you my perspective: I have no
> > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > kind of packets they support (initial patch doesn't give any info).
> > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > what does it _actually_ do? It obviously can't checksum arbitrary
> > packet formats (because it has this inner/outer selection bit), so
> > there is really no way for me to provide a per-driver kfunc api. Maybe
> > the vendors can?
> >
> > So having csum_start/csum_offset abstraction which works with fixed
> > offsets seems like at least it correctly sets the expectation for BPF
> > program writers.
> > The vendors are already supposed to conform to this start/offset API for skb.
> >
> > But back to your point: should we maybe try to meet somewhere in the middle?
> > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
>
> But it doesn't.
> Even if you add a check for csum_start (that's missing in the patch)
> there needs to be a way to somehow figure out
> whether skb->encapsulation is true and set appropriate flags.
> Otherwise this request csum will do "something" that only the HW vendor knows.
> That would be even harder to debug for bpf prog writers.
>
> So instead of helping bpf prog devs it will actively hurt them.

Can we make it more robust? The device can look at the payload (via
descriptors or extra payload pointer via devtx_ctx) and verify
h_proto/nexthdr.
It won't be perfect, I agree, but we can get it working for the common
cases (and have device-specific kfuncs for the rest).

> Another example. If bpf prog was developed and tested on veth
> it will work for some values of csum_offset on real HW and will -EINVAL
> for the other values.
> Just horrible user experience compared to the case where
> the user knows that each netdev is potentially different and
> _has_ to develop and test their prog on the given HW NIC and
> not assume that the kernel can "do the right thing".

For this, I was actually thinking that we need to provide some
SW-based fallback mechanism.
Because if I have a program and a nic that doesn't have an offload
implemented at all, having a fallback might be useful:

if (bpf_devtx_request_l4_csum(...)) {
  /* oops, hw bailed on us, fallback to sw and expose a counter */
  bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
  pkt_sw_csum++;
}

This is probably needed regardless of which way we do it?
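Modeled in plain C, the fallback flow above would look roughly like this (the kfunc names in the snippet are hypothetical, and try_hw_csum() here is a stub standing in for the device hook returning an error):

```c
#include <stddef.h>
#include <stdint.h>

static unsigned long pkt_sw_csum;	/* exposed fallback counter */

/* Stub for the device hook: pretend the HW rejects the request,
 * as the mlx5 hook does for unsupported csum_offset values. */
static int try_hw_csum(uint16_t csum_start, uint16_t csum_offset)
{
	(void)csum_start;
	(void)csum_offset;
	return -1;		/* e.g. -EINVAL from the driver */
}

/* RFC 1071 checksum, used as the software slow path. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

static void tx_csum(uint8_t *pkt, size_t len,
		    uint16_t csum_start, uint16_t csum_offset)
{
	if (try_hw_csum(csum_start, csum_offset)) {
		/* HW bailed on us: checksum in software and count it. */
		uint16_t c = csum16(pkt + csum_start, len - csum_start);

		pkt[csum_start + csum_offset] = c >> 8;
		pkt[csum_start + csum_offset + 1] = c & 0xff;
		pkt_sw_csum++;
	}
}
```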

Regarding veth vs non-veth: we already have similar issues with
generic xdp vs non-generic.
I'm not sure we can completely avoid having surprises when switching
from sw to hw paths.
It's whether the users will have to debug 10-20% of their program or
they'd have to start completely from scratch for every nic.

> This csum exercise is clear example that kernel is not in a position
> to do so.
> For timestamp it's arguable, but for csum there is no generic api that
> kernel can apply universally to NICs.

Sure, I agree, it's a mix of both. For some offloads, we can have
something common, for some we can't.
But I'm not sure why we have to pick one or another. We can try to
have common apis (maybe not ideal, yes) and we can expose vendor
specific ones if there is need.
If the generic ones get unused - we kill them in the future. If none
of the vendors comes up with non-generic ones - the generic ones are
good enough.

I'm assuming you favor non-generic ones because it's easier to implement?
But most of the netdev-bound infra is already there for rx, so there
is really not a lot of extra code to reuse it at tx. (it's two lines
to allow tracing progs to be dev-bound and to check whether devices
match at attach).
Or is there some other reason I'm missing?

> > 2. We also let vendors do device-specific "extensions" where devices
> > deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
> > This can give BPF authors opportunity to write somewhat portable
> > programs and also use vendor specific apis when/if needed.
> >
> > I think we had a similar idea for rx side: have generic kfuncs, but
> > also let vendors experiment with custom kfuncs if they want to
> > differentiate.
> > WDYT? Can it give us the best things from both sides?
> >
> > > Kuba,
> > > since you nacked driver specific stuff please suggest a way to unblock this stalemate.
Jakub Kicinski July 12, 2023, 12:32 a.m. UTC | #5
On Tue, 11 Jul 2023 15:56:57 -0700 Alexei Starovoitov wrote:
> I think this proves my point: csum is not generalizable even across veth and mlx5.
> Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> into HW that has different ideas about csum-ing.
> 
> Here is what mlx5 does:
> mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                             struct mlx5e_accel_tx_state *accel,
>                             struct mlx5_wqe_eth_seg *eseg)
> {
>         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
>                 return;
> 
>         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>                 if (skb->encapsulation) {

This should be irrelevant today, as LCO exists?

>                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>                                           MLX5_ETH_WQE_L4_INNER_CSUM;
>                         sq->stats->csum_partial_inner++;
>                 } else {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>                         sq->stats->csum_partial++;
>                 }
> 
> How would you generalize that into csum api that will work across NICs ?
> 
> My answer stands: you cannot.
> 
> My proposal again:
> add driver specific kfuncs and hooks for things like csum.
> 
> Kuba,
> since you nacked driver specific stuff please suggest a way to unblock this stalemate.

I hope I'm not misremembering but I think I suggested at the beginning
to create a structure describing packet geometry and requested offloads,
and for the prog fill that in.

All operating systems I know end up doing that, we'll end up doing
that as well. The question is whether we're willing to learn from
experience or prefer to go on a wild ride first...
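One hypothetical shape of such a structure (names and fields here are purely illustrative, not from any posted patch): the program describes the packet geometry and requested offloads once, and each driver translates that into its own descriptor format, rejecting requests it cannot honor.

```c
#include <stdint.h>

/* Illustrative sketch: a per-packet TX descriptor the BPF program
 * would fill in, leaving translation to the driver. */
struct xdp_tx_offload_sketch {
	uint16_t csum_start;	/* where checksumming begins */
	uint16_t csum_offset;	/* result location, relative to csum_start */
	uint16_t mss;		/* 0: no segmentation requested */
	uint16_t flags;		/* XDP_TXO_* requests below */
};

#define XDP_TXO_CSUM	(1 << 0)	/* request L4 checksum */
#define XDP_TXO_TSTAMP	(1 << 1)	/* request TX timestamp */
#define XDP_TXO_TSO	(1 << 2)	/* request segmentation */
```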
Alexei Starovoitov July 12, 2023, 2:37 a.m. UTC | #6
On Tue, Jul 11, 2023 at 5:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jul 2023 15:56:57 -0700 Alexei Starovoitov wrote:
> > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > into HW that has different ideas about csum-ing.
> >
> > Here is what mlx5 does:
> > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >                             struct mlx5e_accel_tx_state *accel,
> >                             struct mlx5_wqe_eth_seg *eseg)
> > {
> >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> >                 return;
> >
> >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> >                 if (skb->encapsulation) {
>
> This should be irrelevant today, as LCO exists?

Hmm. Maybe. But LCO is an example of something prog devs have to be
aware of and use properly.
Meaning for certain protocols they compute the outer csum the LCO way
and let the inner one go through HW csuming.
In this case I have no idea what these mlx5 flags do.
I hope this part of the code was tested with udp tunnels.
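For reference, the LCO identity being discussed (see Documentation/networking/checksum-offloads.rst) is that once the inner checksum field is correctly filled in, the region it covers sums (ones'-complement) to the complement of the inner pseudo-header sum, so the outer checksum needs only the outer header bytes plus that complement, never the payload. A minimal numeric check of that identity:

```c
#include <stdint.h>

/* Fold a 32-bit accumulator to a 16-bit ones'-complement sum. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum of 16-bit words. */
static uint16_t sum16(const uint16_t *w, int n)
{
	uint32_t s = 0;

	while (n--)
		s += *w++;
	return fold(s);
}
```

With pseudo-header sum p and payload sum d, the device writes c = ~fold(p + d) into the checksum field; the covered region then sums to fold(d + c) == ~p, independent of the payload, which is what lets software finish the outer checksum cheaply.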

> >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> >                         sq->stats->csum_partial_inner++;
> >                 } else {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> >                         sq->stats->csum_partial++;
> >                 }
> >
> > How would you generalize that into csum api that will work across NICs ?
> >
> > My answer stands: you cannot.
> >
> > My proposal again:
> > add driver specific kfuncs and hooks for things like csum.
> >
> > Kuba,
> > since you nacked driver specific stuff please suggest a way to unblock this stalemate.
>
> I hope I'm not misremembering but I think I suggested at the beginning
> to create a structure describing packet geometry and requested offloads,
> and for the prog fill that in.

hmm. but that's what skb is for. skb == packet geometry ==
layout of headers, payload, inner vs outer, csum partial, gso params.

bpf at tc layer supposed to interact with that correctly.
If the packet is modified skb geometry should be adjusted accordingly.
Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().

>
> All operating systems I know end up doing that, we'll end up doing
> that as well. The question is whether we're willing to learn from
> experience or prefer to go on a wild ride first...

I don't follow. This thread was aimed at adding XDP layer knobs.
To me XDP is a driver level. 'struct xdp_md' along with
BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
dma-buffers (page and multi-page) that drivers operate on.
Everything else at driver level is too unique to generalize.
skb layer is already doing its job.

In that sense "generic XDP" is a feature for testing only.
Trying to make "generic XDP" fast is missing the point of XDP.

AF_XDP is a different concept. Exposing timestamp,
csum, TSO to AF_XDP users is a different design challenge.
I'm all for doing that, but trying to combine "timestamp in xdp tx"
and "timestamp in AF_XDP" will lead to bad trade-off-s for both.
Which I think this patchset demonstrates.
Alexei Starovoitov July 12, 2023, 2:50 a.m. UTC | #7
On Tue, Jul 11, 2023 at 5:15 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > > +
> > > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > > +                                        u16 csum_start, u16 csum_offset)
> > > > > +{
> > > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > > +
> > > > > +     if (unlikely(!ctx->wqe))
> > > > > +             return -ENODATA;
> > > > > +
> > > > > +     eseg = &ctx->wqe->eth;
> > > > > +
> > > > > +     switch (csum_offset) {
> > > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > > +             break;
> > > > > +     default:
> > > > > +             return -EINVAL;
> > > > > +     }
> > > >
> > > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > > into HW that has different ideas about csum-ing.
> > > >
> > > > Here is what mlx5 does:
> > > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > > >                             struct mlx5e_accel_tx_state *accel,
> > > >                             struct mlx5_wqe_eth_seg *eseg)
> > > > {
> > > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > > >                 return;
> > > >
> > > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > > >                 if (skb->encapsulation) {
> > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > > >                         sq->stats->csum_partial_inner++;
> > > >                 } else {
> > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > > >                         sq->stats->csum_partial++;
> > > >                 }
> > > >
> > > > How would you generalize that into csum api that will work across NICs ?
> > > >
> > > > My answer stands: you cannot.
> > > >
> > > > My proposal again:
> > > > add driver specific kfuncs and hooks for things like csum.
> > >
> > > I do see your point, but to also give you my perspective: I have no
> > > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > > kind of packets they support (initial patch doesn't give any info).
> > > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > > what does it _actually_ do? It obviously can't checksum arbitrary
> > > packet formats (because it has this inner/outer selection bit), so
> > > there is really no way for me to provide a per-driver kfunc api. Maybe
> > > the vendors can?
> > >
> > > So having csum_start/csum_offset abstraction which works with fixed
> > > offsets seems like at least it correctly sets the expectation for BPF
> > > program writers.
> > > The vendors are already supposed to conform to this start/offset API for skb.
> > >
> > > But back to your point: should we maybe try to meet somewhere in the middle?
> > > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
> >
> > But it doesn't.
> > Even if you add a check for csum_start (that's missing in the patch)
> > there needs to be a way to somehow figure out
> > whether skb->encapsulation is true and set appropriate flags.
> > Otherwise this request csum will do "something" that only the HW vendor knows.
> > That would be even harder to debug for bpf prog writers.
> >
> > So instead of helping bpf prog devs it will actively hurt them.
>
> Can we make it more robust? The device can look at the payload (via
> descriptors or extra payload pointer via devtx_ctx) and verify
> h_proto/nexthdr.
> It won't be perfect, I agree, but we can get it working for the common
> cases (and have device-specific kfuncs for the rest).

More robust with more checks?
That will slow things down, and the main advantage of XDP over the skb
layer will be lost.
It's best to stay at the skb layer then, where csum and timestamping
are available.

> > Another example. If bpf prog was developed and tested on veth
> > it will work for some values of csum_offset on real HW and will -EINVAL
> > for the other values.
> > Just a horrible user experience compared to the case where
> > the user knows that each netdev is potentially different and
> > _has_ to develop and test their prog on the given HW NIC and
> > not assume that the kernel can "do the right thing".
>
> For this, I was actually thinking that we need to provide some
> SW-based fallback mechanism.
> Because if I have a program and a nic that doesn't have an offload
> implemented at all, having a fallback might be useful:
>
> if (bpf_devtx_request_l4_csum(...)) {
>   /* oops, hw bailed on us, fallback to sw and expose a counter */
>   bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
>   pkt_sw_csum++;
> }
>
> This is probably needed regardless of which way we do it?

sw fallback? We already have 'generic XDP' and people misuse it
thinking it's a layer they should be using.
It's a nice feeling to say that my XDP prog was developed
and tested on mlx5, but I can move it to a different server
with a brand new NIC that doesn't support XDP yet and my prog will
still work because of "generic XDP".
I think such devs are playing with fire and will be burned
when a "generic XDP" NIC gets DDoSed.
Same thing here. If we do HW offload of csum, it had better be in HW.
Devs have to be 100% certain that HW is offloading it.

>
> Regarding veth vs non-veth: we already have similar issues with
> generic xdp vs non-generic.
> I'm not sure we can completely avoid having surprises when switching
> from sw to hw paths.
> It's whether the users will have to debug 10-20% of their program or
> they'd have to start completely from scratch for every nic.

If a rewrite per NIC is not acceptable, then they should be using the skb layer.

>
> > This csum exercise is clear example that kernel is not in a position
> > to do so.
> > For timestamp it's arguable, but for csum there is no generic api that
> > kernel can apply universally to NICs.
>
> Sure, I agree, it's a mix of both. For some offloads, we can have
> something common, for some we can't.
> But I'm not sure why we have to pick one or another. We can try to
> have common apis (maybe not ideal, yes) and we can expose vendor
> specific ones if there is need.
> If the generic ones get unused - we kill them in the future. If none
> of the vendors comes up with non-generic ones - the generic ones are
> good enough.
>
> I'm assuming you favor non-generic ones because it's easier to implement?

Not only.
yes, it's easier to implement, but the expectations are also clear.
The kernel won't be trying to fall back to the slow path.
XDP prog will tell HW 'do csum' and HW will do it.
For generality we have an skb layer.
Jakub Kicinski July 12, 2023, 3:07 a.m. UTC | #8
On Tue, 11 Jul 2023 19:37:23 -0700 Alexei Starovoitov wrote:
> > I hope I'm not misremembering but I think I suggested at the beginning
> > to create a structure describing packet geometry and requested offloads,
> > and for the prog fill that in.  
> 
> hmm. but that's what skb is for. skb == packet geometry ==
> layout of headers, payload, inner vs outer, csum partial, gso params.
> 
> bpf at the tc layer is supposed to interact with that correctly.
> If the packet is modified skb geometry should be adjusted accordingly.
> Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().
> 
> > All operating systems I know end up doing that, we'll end up doing
> > that as well. The question is whether we're willing to learn from
> > experience or prefer to go on a wild ride first...  
> 
> I don't follow. This thread was aimed to add xdp layer knobs.
> To me XDP is a driver level.

The driver is not a layer of the networking stack; I don't think it's
a useful or meaningful anchor point for the mental model.

We're talking about a set of functionality, we can look at how that
functionality was exposed in existing code.

> 'struct xdp_md' along with
> BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
> dma-buffers (page and multi-page) that drivers operate on.

I think you're working against your own claim.
xdp frags explicitly reuse struct skb_shared_info.

> Everything else at driver level is too unique to generalize.
> skb layer is already doing its job.

How can you say that drivers are impossible to generalize and then
that the skb layer "is doing its job", i.e. generalizes them?

> In that sense "generic XDP" is a feature for testing only.
> Trying to make "generic XDP" fast is missing the point of XDP.

That's a topic on its own.

> AF_XDP is a different concept. Exposing timestamp,
> csum, TSO to AF_XDP users is a different design challenge.
> I'm all for doing that, but trying to combine "timestamp in xdp tx"
> and "timestamp in AF_XDP" will lead to bad trade-off-s for both.
> Which I think this patchset demonstrates.

Too vague to agree or disagree with.
Alexei Starovoitov July 12, 2023, 3:23 a.m. UTC | #9
On Tue, Jul 11, 2023 at 8:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jul 2023 19:37:23 -0700 Alexei Starovoitov wrote:
> > > I hope I'm not misremembering but I think I suggested at the beginning
> > > to create a structure describing packet geometry and requested offloads,
> > > and for the prog fill that in.
> >
> > hmm. but that's what skb is for. skb == packet geometry ==
> > layout of headers, payload, inner vs outer, csum partial, gso params.
> >
> > bpf at the tc layer is supposed to interact with that correctly.
> > If the packet is modified skb geometry should be adjusted accordingly.
> > Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().
> >
> > > All operating systems I know end up doing that, we'll end up doing
> > > that as well. The question is whether we're willing to learn from
> > > experience or prefer to go on a wild ride first...
> >
> > I don't follow. This thread was aimed to add xdp layer knobs.
> > To me XDP is a driver level.
>
> The driver is not a layer of the networking stack; I don't think it's
> a useful or meaningful anchor point for the mental model.
>
> We're talking about a set of functionality, we can look at how that
> functionality was exposed in existing code.
>
> > 'struct xdp_md' along with
> > BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
> > dma-buffers (page and multi-page) that drivers operate on.
>
> I think you're working against your own claim.
> xdp frags explicitly reuse struct skb_shared_info.

Yes they do and it's an implementation detail.
skb_shared_info is convenient for drivers to fill in.
xdp prog isn't aware of the exact mechanism.
skb_shared_info is not exposed to a program.
Not sure what point you're trying to make.

>
> > Everything else at driver level is too unique to generalize.
> > skb layer is already doing its job.
>
> How can you say that drivers are impossible to generalize and then
> that the skb layer "is doing its job", i.e. generalizes them?

Yeah, 'generalize' is the wrong word to describe a feature
that should have the same look and feel across drivers.
What I meant by 'csum impossible to generalize across drivers'
is that there is no HW-only path across them.
The skb layer is doing it through a sw fallback and a combination
of csum_complete/unnecessary.

I'm saying that your proposal of
"create a structure describing packet geometry and requested offloads"
already exists and it's called skb.
I see no point in doing this exercise again at the XDP layer.
Stanislav Fomichev July 12, 2023, 3:29 a.m. UTC | #10
On 07/11, Alexei Starovoitov wrote:
> On Tue, Jul 11, 2023 at 5:15 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > > > +
> > > > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > > > +                                        u16 csum_start, u16 csum_offset)
> > > > > > +{
> > > > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > > > +
> > > > > > +     if (unlikely(!ctx->wqe))
> > > > > > +             return -ENODATA;
> > > > > > +
> > > > > > +     eseg = &ctx->wqe->eth;
> > > > > > +
> > > > > > +     switch (csum_offset) {
> > > > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > > > +             break;
> > > > > > +     default:
> > > > > > +             return -EINVAL;
> > > > > > +     }
> > > > >
> > > > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > > > into HW that has different ideas about csum-ing.
> > > > >
> > > > > Here is what mlx5 does:
> > > > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > > > >                             struct mlx5e_accel_tx_state *accel,
> > > > >                             struct mlx5_wqe_eth_seg *eseg)
> > > > > {
> > > > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > > > >                 return;
> > > > >
> > > > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > > > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > > > >                 if (skb->encapsulation) {
> > > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > > > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > > > >                         sq->stats->csum_partial_inner++;
> > > > >                 } else {
> > > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > > > >                         sq->stats->csum_partial++;
> > > > >                 }
> > > > >
> > > > > How would you generalize that into a csum api that will work across NICs?
> > > > >
> > > > > My answer stands: you cannot.
> > > > >
> > > > > My proposal again:
> > > > > add driver specific kfuncs and hooks for things like csum.
> > > >
> > > > I do see your point, but to also give you my perspective: I have no
> > > > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > > > kind of packets they support (initial patch doesn't give any info).
> > > > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > > > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > > > what does it _actually_ do? It obviously can't checksum arbitrary
> > > > packet formats (because it has this inner/outer selection bit), so
> > > > there is really no way for me to provide a per-driver kfunc api. Maybe
> > > > the vendors can?
> > > >
> > > > So having a csum_start/csum_offset abstraction which works with fixed
> > > > offsets at least seems to correctly set the expectations for BPF
> > > > program writers.
> > > > The vendors are already supposed to conform to this start/offset API for skb.
> > > >
> > > > But back to your point: should we maybe try to meet somewhere in the middle?
> > > > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > > > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
> > >
> > > But it doesn't.
> > > Even if you add a check for csum_start (that's missing in the patch)
> > > there needs to be a way to somehow figure out
> > > whether skb->encapsulation is true and set appropriate flags.
> > > Otherwise this request csum will do "something" that only the HW vendor knows.
> > > That would be even harder to debug for bpf prog writers.
> > >
> > > So instead of helping bpf prog devs it will actively hurt them.
> >
> > Can we make it more robust? The device can look at the payload (via
> > descriptors or extra payload pointer via devtx_ctx) and verify
> > h_proto/nexthdr.
> > It won't be perfect, I agree, but we can get it working for the common
> > cases (and have device-specific kfuncs for the rest).
> 
> More robust with more checks?
> That will slow things down, and the main advantage of XDP over the skb
> layer will be lost.
> It's best to stay at the skb layer then, where csum and timestamping
> are available.

This will slow things down, but not to the point where it's on par
with doing sw checksum. At least in theory.
We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
the offloads.

> > > Another example. If bpf prog was developed and tested on veth
> > > it will work for some values of csum_offset on real HW and will -EINVAL
> > > for the other values.
> > > Just a horrible user experience compared to the case where
> > > the user knows that each netdev is potentially different and
> > > _has_ to develop and test their prog on the given HW NIC and
> > > not assume that the kernel can "do the right thing".
> >
> > For this, I was actually thinking that we need to provide some
> > SW-based fallback mechanism.
> > Because if I have a program and a nic that doesn't have an offload
> > implemented at all, having a fallback might be useful:
> >
> > if (bpf_devtx_request_l4_csum(...)) {
> >   /* oops, hw bailed on us, fallback to sw and expose a counter */
> >   bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
> >   pkt_sw_csum++;
> > }
> >
> > This is probably needed regardless of which way we do it?
> 
> sw fallback? We already have 'generic XDP' and people misuse it
> thinking it's a layer they should be using.
> It's a nice feeling to say that my XDP prog was developed
> and tested on mlx5, but I can move it to a different server
> with a brand new NIC that doesn't support XDP yet and my prog will
> still work because of "generic XDP".
> I think such devs are playing with fire and will be burned
> when a "generic XDP" NIC gets DDoSed.

This is user controlled. The users might as well choose not to
fall back to sw, by not calling another helper, and fail hard.

I agree with the generic xdp concerns and the veth path being deceptive,
but I also believe they serve their purpose of getting users' feet
wet before they switch to experimenting with real devices:
you get most of your program working, then polish it up on the real HW.

Moving to a different NIC will obviously require testing. But the
difference with generic APIs is that the programs will mostly work
and won't have to be completely rewritten.

> Same thing here. If we do HW offload of csum, it had better be in HW.
> Devs have to be 100% certain that HW is offloading it.

I hope we can both agree that with an api like
mlx5_l4_csum_offload(bool encap) we can't be 100% certain that the
hw is gonna handle any packet layout? So how is that different
from a generic api that also can't work in all cases?

> > Regarding veth vs non-veth: we already have similar issues with
> > generic xdp vs non-generic.
> > I'm not sure we can completely avoid having surprises when switching
> > from sw to hw paths.
> > It's whether the users will have to debug 10-20% of their program or
> > they'd have to start completely from scratch for every nic.
> 
> If a rewrite per NIC is not acceptable, then they should be using the skb layer.

AF_XDP is a generic layer for low-level access and it provides a generic
descriptor format, so why do we suddenly have this requirement that we
have to do a prog rewrite for every new NIC?

Current AF_XDP programs are pretty portable (obviously depend on
a bunch of nic features), it seems like a good idea to try to preserve
this property? (again, with an asterisk, where we should allow some
differentiation, etc, etc)

> > > This csum exercise is clear example that kernel is not in a position
> > > to do so.
> > > For timestamp it's arguable, but for csum there is no generic api that
> > > kernel can apply universally to NICs.
> >
> > Sure, I agree, it's a mix of both. For some offloads, we can have
> > something common, for some we can't.
> > But I'm not sure why we have to pick one or another. We can try to
> > have common apis (maybe not ideal, yes) and we can expose vendor
> > specific ones if there is need.
> > If the generic ones get unused - we kill them in the future. If none
> > of the vendors comes up with non-generic ones - the generic ones are
> > good enough.
> >
> > I'm assuming you favor non-generic ones because it's easier to implement?
> 
> Not only.
> yes, it's easier to implement, but the expectations are also clear.
> The kernel won't be trying to fall back to the slow path.
> XDP prog will tell HW 'do csum' and HW will do it.
> For generality we have an skb layer.

(seems like I've addressed these points above, I don't see anything
extra here)
Alexei Starovoitov July 12, 2023, 4:59 a.m. UTC | #11
On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
>
>
> This will slow things down, but not to the point where it's on par
> with doing sw checksum. At least in theory.
> We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> the offloads.

To clarify: yes, AF_XDP needs generalized HW offloads.
I just don't see how xdp tx offloads are moving a needle in that direction.

> I hope we can both agree that with an api like
> mlx5_l4_csum_offload(bool encap) we can't be 100% certain that the
> hw is gonna handle any packet layout? So how is that different
> from a generic api that also can't work in all cases?

If it's hw specific then yes.
Will [mlx5|...]_l4_csum_offload apply to other nics? I doubt it.

> AF_XDP is a generic layer for low-level access and it provides generic
> descriptor format, so why suddenly we have this requirement where we have
> to do prog rewrite for every new nic?
>
> Current AF_XDP programs are pretty portable (obviously depend on
> a bunch of nic features), it seems like a good idea to try to preserve
> this property? (again, with an asterisk, where we should allow some
> differentiation, etc, etc)

Agree. AF_XDP needs a generic api that will allow user space
to request HW to do TSO, csum offload, etc.
xdp tx and af_xdp are too different to pull under the same framework.
xdp progs will interact with the kernel via kfuncs.
af_xdp needs a different api to express packet geometry and offload requests.
The user space cannot do it with bpf programs.
In case of AF_XDP the bpf prog in the kernel is a filter only.
For the majority of the cases bpf prog is not necessary and shouldn't be
required to express HW offloads.
Stanislav Fomichev July 12, 2023, 5:36 a.m. UTC | #12
On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> >
> > This will slow things down, but not to the point where it's on par
> > with doing sw checksum. At least in theory.
> > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > the offloads.
>
> To clarify: yes, AF_XDP needs generalized HW offloads.

Great! To reiterate, I'm mostly interested in af_xdp wrt tx
timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
I'm fine with switching to adding some fixed af_xdp descriptor format
to enable offloads on tx.

> I just don't see how xdp tx offloads are moving a needle in that direction.

Let me try to explain how both might be similar, maybe I wasn't clear
enough on that.
For af_xdp tx packet, the userspace puts something in the af_xdp frame
metadata area (headroom) which then gets executed/interpreted by the
bpf program at devtx (which calls kfuncs to enable particular
offloads).
IOW, instead of defining some fixed layout for the tx offloads, the
userspace and bpf program have some agreement on the layout (and bpf
program "applies" the offloads by calling the kfuncs).
Also (in theory) the same hooks can be used for xdp-tx.
Does it make sense? But, again, happy to scratch that whole idea if
we're fine with a fixed layout for af_xdp.
Willem de Bruijn July 12, 2023, 3:16 p.m. UTC | #13
On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > >
> > > This will slow things down, but not to the point where it's on par
> > > with doing sw checksum. At least in theory.
> > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > the offloads.
> >
> > To clarify: yes, AF_XDP needs generalized HW offloads.
>
> Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> I'm fine with switching to adding some fixed af_xdp descriptor format
> to enable offloads on tx.
>
> > I just don't see how xdp tx offloads are moving a needle in that direction.
>
> Let me try to explain how both might be similar, maybe I wasn't clear
> enough on that.
> For af_xdp tx packet, the userspace puts something in the af_xdp frame
> metadata area (headroom) which then gets executed/interpreted by the
> bpf program at devtx (which calls kfuncs to enable particular
> offloads).
> IOW, instead of defining some fixed layout for the tx offloads, the
> userspace and bpf program have some agreement on the layout (and bpf
> program "applies" the offloads by calling the kfuncs).
> Also (in theory) the same hooks can be used for xdp-tx.
> Does it make sense? But, again, happy to scratch that whole idea if
> we're fine with a fixed layout for af_xdp.

Checksum offload is an important demonstrator too.

It is admittedly a non-trivial one. Checksum offload has often been
discussed as a pain point ("protocol ossification").

In general, drivers can accept every CHECKSUM_COMPLETE skb that
matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
see why this would be different for kfuncs for packets coming from
userspace.

The problematic drivers are the ones that do not implement
CHECKSUM_COMPLETE as intended, but ignore this simple
protocol-independent hint in favor of parsing from scratch, possibly
zeroing the field, computing multiple layers, etc.

All of which is unnecessary with LCO. An AF_XDP user can be expected
to apply LCO and only request checksum insertion for the innermost
checksum.

The biggest problem with these devices that parse in hardware (and
possibly also in the driver, to identify and fix up hardware
limitations) is that they will fail when encountering an unknown
protocol. Which brings us to advertising limited typed support:
NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.

The fact that some devices that deviate from industry best practices
cannot support more advanced packet formats is unfortunate, but not a
reason to hold others back. No different from current kernel path. The
BPF program can fallback onto software checksumming on these devices,
like the kernel path. Perhaps we do need to pass along with csum_start
and csum_off a csum_type that matches the existing
NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
quickly for the generic case.

For implementation in essence it is just reordering driver code that
already exists for the skb case. I think the ice patch series to
support rx timestamping is a good indication of what it takes to
support XDP kfuncs: not so much new code, but reordering the driver
logic.

Which also indicates to me that the driver *is* the right place to
implement this logic, rather than reimplement it in a BPF library. It
avoids both code duplication and dependency hell, if the library ships
independently of the driver.
Willem de Bruijn July 12, 2023, 4:28 p.m. UTC | #14
On Wed, Jul 12, 2023 at 11:16 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > >
> > > > This will slow things down, but not to the point where it's on par
> > > > with doing sw checksum. At least in theory.
> > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > the offloads.
> > >
> > > To clarify: yes, AF_XDP needs generalized HW offloads.
> >
> > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > I'm fine with switching to adding some fixed af_xdp descriptor format
> > to enable offloads on tx.
> >
> > > I just don't see how xdp tx offloads are moving a needle in that direction.
> >
> > Let me try to explain how both might be similar, maybe I wasn't clear
> > enough on that.
> > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > metadata area (headroom) which then gets executed/interpreted by the
> > bpf program at devtx (which calls kfuncs to enable particular
> > offloads).
> > IOW, instead of defining some fixed layout for the tx offloads, the
> > userspace and bpf program have some agreement on the layout (and bpf
> > program "applies" the offloads by calling the kfuncs).
> > Also (in theory) the same hooks can be used for xdp-tx.
> > Does it make sense? But, again, happy to scratch that whole idea if
> > we're fine with a fixed layout for af_xdp.
>
> Checksum offload is an important demonstrator too.
>
> It is admittedly a non-trivial one. Checksum offload has often been
> discussed as a pain point ("protocol ossification").
>
> In general, drivers can accept every CHECKSUM_COMPLETE skb that

Erm.. CHECKSUM_PARTIAL
Alexei Starovoitov July 12, 2023, 7:03 p.m. UTC | #15
On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > >
> > > > This will slow things down, but not to the point where it's on par
> > > > with doing sw checksum. At least in theory.
> > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > the offloads.
> > >
> > > To clarify: yes, AF_XDP needs generalized HW offloads.
> >
> > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > I'm fine with switching to adding some fixed af_xdp descriptor format
> > to enable offloads on tx.

Since af_xdp is the primary user, let's figure out what the best api for that is.
If any code can be salvaged for xdp tx, great, but let's not start with xdp tx
as a prerequisite.

> >
> > > I just don't see how xdp tx offloads are moving a needle in that direction.
> >
> > Let me try to explain how both might be similar, maybe I wasn't clear
> > enough on that.
> > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > metadata area (headroom) which then gets executed/interpreted by the
> > bpf program at devtx (which calls kfuncs to enable particular
> > offloads).
> > IOW, instead of defining some fixed layout for the tx offloads, the
> > userspace and bpf program have some agreement on the layout (and bpf
> > program "applies" the offloads by calling the kfuncs).
> > Also (in theory) the same hooks can be used for xdp-tx.
> > Does it make sense? But, again, happy to scratch that whole idea if
> > we're fine with a fixed layout for af_xdp.

So instead of defining a csum offload format in xsk metadata we'll be
defining it as a set of arguments to a kfunc, and the tx-side xsk prog
will just copy the args from the metadata into kfunc args?
Seems like an unnecessary step. Such an xsk prog won't be doing
anything useful. Just copying from one place to another.
It seems the only purpose of such a bpf prog is to sidestep uapi exposure.
bpf is not used to program anything. There won't be any control flow.
Just an odd intermediate copy step.
Instead we can define a metadata struct for csum nic offload
outside of uapi/linux/if_xdp.h with a big 'this is not an uapi' warning.
User space can request it via setsockopt,
and probably feature-query the nic via getsockopt.

Error handling is critical here. With an xsk tx prog the errors
are messy. What to do when a kfunc returns an error? Store it back into
the packet metadata? And then user space needs to check every single
packet for errors? Not practical imo.

A feature query via getsockopt would be done once instead;
user space will fill in the "csum offload struct" in the packet metadata
and won't check per-packet errors. If the driver said the csum feature
is available, it had better work for every packet.
Notice mlx5e_txwqe_build_eseg_csum() returns void.

> 
> Checksum offload is an important demonstrator too.
> 
> It is admittedly a non-trivial one. Checksum offload has often been
> discussed as a pain point ("protocol ossification").
> 
> In general, drivers can accept every CHECKSUM_COMPLETE skb that
> matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> see why this would be different for kfuncs for packets coming from
> userspace.
> 
> The problematic drivers are the ones that do not implement
> CHECKSUM_COMPLETE as intended, but ignore this simple
> protocol-independent hint in favor of parsing from scratch, possibly
> zeroing the field, computing multiple layers, etc.
> 
> All of which is unnecessary with LCO. An AF_XDP user can be expected
> to apply LCO and only request checksum insertion for the innermost
> checksum.
> 
> The biggest problem with these devices that parse in hardware (and
> possibly also in the driver, to identify and fix up hardware
> limitations) is that they will fail when encountering an unknown
> protocol. Which brings us to advertising limited typed support:
> NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> 
> The fact that some devices that deviate from industry best practices
> cannot support more advanced packet formats is unfortunate, but not a
> reason to hold others back. No different from current kernel path. The
> BPF program can fallback onto software checksumming on these devices,
> like the kernel path. Perhaps we do need to pass along with csum_start
> and csum_off a csum_type that matches the existing
> NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
> quickly for the generic case.
> 
> For implementation in essence it is just reordering driver code that
> already exists for the skb case. I think the ice patch series to
> support rx timestamping is a good indication of what it takes to
> support XDP kfuncs: not so much new code, but reordering the driver
> logic.
> 
> Which also indicates to me that the driver *is* the right place to
> implement this logic, rather than reimplement it in a BPF library. It
> avoids both code duplication and dependency hell, if the library ships
> independent from the driver.

Agree with all of the above.
I think defining a CHECKSUM_PARTIAL struct request for af_xdp is doable and
won't require many changes in the drivers.
If we do it for more than one driver from the start, there is a chance it
will work for other drivers too. imo ice+gve+mlx5 would be enough.
Willem de Bruijn July 12, 2023, 7:11 p.m. UTC | #16
On Wed, Jul 12, 2023 at 3:03 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> > On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > >
> > > > > This will slow things down, but not to the point where it's on par
> > > > > with doing sw checksum. At least in theory.
> > > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > > the offloads.
> > > >
> > > > To clarify: yes, AF_XDP needs generalized HW offloads.
> > >
> > > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > > I'm fine with switching to adding some fixed af_xdp descriptor format
> > > to enable offloads on tx.
>
> since af_xdp is a primary user let's figure out what is the best api for that.
> If any code can be salvaged for xdp tx, great, but let's not start with xdp tx
> as prerequisite.
>
> > >
> > > > I just don't see how xdp tx offloads are moving a needle in that direction.
> > >
> > > Let me try to explain how both might be similar, maybe I wasn't clear
> > > enough on that.
> > > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > > metadata area (headrom) which then gets executed/interpreted by the
> > > bpf program at devtx (which calls kfuncs to enable particular
> > > offloads).
> > > IOW, instead of defining some fixed layout for the tx offloads, the
> > > userspace and bpf program have some agreement on the layout (and bpf
> > > program "applies" the offloads by calling the kfuncs).
> > > Also (in theory) the same hooks can be used for xdp-tx.
> > > Does it make sense? But, again, happy to scratch that whole idea if
> > > we're fine with a fixed layout for af_xdp.
>
> So instead of defining the csum offload format in xsk metadata we'll
> define it as a set of arguments to a kfunc, and the tx-side xsk prog
> will just copy the args from metadata into kfunc args?
> Seems like an unnecessary step. Such an xsk prog won't be doing
> anything useful. Just copying from one place to another.
> It seems the only purpose of such a bpf prog is to sidestep uapi exposure.
> bpf is not used to program anything. There won't be any control flow.
> Just an odd intermediate copy step.
> Instead we can define a metadata struct for csum nic offload
> outside of uapi/linux/if_xdp.h with a big 'this is not uapi' warning.
> User space can request it via setsockopt.
> And probably feature-query the nic via getsockopt.
>
> Error handling is critical here. With an xsk tx prog the errors
> are messy. What to do when a kfunc returns an error? Store it back into
> packet metadata? And then user space needs to check every single
> packet for errors? Not practical imo.
>
> Feature query via getsockopt would be done once instead and
> user space will fill in "csum offload struct" in packet metadata
> and won't check per-packet errors. If the driver said the csum feature
> is available, it had better work for every packet.
> Notice mlx5e_txwqe_build_eseg_csum() returns void.
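The shape of this proposal could look roughly like the following. Every name here is hypothetical (no such struct or sockopt exists in this series); it only illustrates the "fixed metadata struct outside uapi, plus one-time getsockopt feature query" idea:

```c
/* Hypothetical sketch, not a real kernel API: a fixed tx-metadata layout
 * that would live outside uapi/, with a "this is not uapi" warning. */
#include <stddef.h>
#include <stdint.h>

struct xsk_tx_csum_md {		/* hypothetical name and layout */
	uint16_t csum_start;	/* offset where HW starts summing */
	uint16_t csum_offset;	/* where HW writes the folded result */
	uint16_t csum_type;	/* mirrors NETIF_F_[HW|IP|IPV6]_CSUM */
	uint16_t flags;
};

/* Hypothetical one-time feature query at socket setup, so unsupported
 * offloads surface as a setup error rather than a per-packet one:
 *
 *	uint32_t feats;
 *	socklen_t len = sizeof(feats);
 *	getsockopt(xsk_fd, SOL_XDP, XDP_TX_METADATA_FEATURES, &feats, &len);
 *
 * XDP_TX_METADATA_FEATURES is likewise invented for illustration.
 */
```

If the query succeeds, userspace fills the struct in per-packet headroom and never checks a per-packet status, matching the "it had better work for every packet" contract above.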
>
> >
> > Checksum offload is an important demonstrator too.
> >
> > It is admittedly a non-trivial one. Checksum offload has often been
> > discussed as a pain point ("protocol ossification").
> >
> > In general, drivers can accept every CHECKSUM_COMPLETE skb that
> > matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> > see why this would be different for kfuncs for packets coming from
> > userspace.
> >
> > The problematic drivers are the ones that do not implement
> > CHECKSUM_COMPLETE as intended, but ignore this simple
> > protocol-independent hint in favor of parsing from scratch, possibly
> > zeroing the field, computing multiple layers, etc.
> >
> > All of which is unnecessary with LCO. An AF_XDP user can be expected
> > to apply LCO and only request checksum insertion for the innermost
> > checksum.
> >
> > The biggest problem with these devices that parse in hardware (and
> > possibly also in the driver, to identify and fix up hardware
> > limitations) is that they will fail when encountering an unknown
> > protocol. Which brings us to advertising limited typed support:
> > NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> >
> > The fact that some devices that deviate from industry best practices
> > cannot support more advanced packet formats is unfortunate, but not a
> > reason to hold others back. No different from current kernel path. The
> > BPF program can fall back onto software checksumming on these devices,
> > like the kernel path. Perhaps we do need to pass, along with csum_start
> > and csum_off, a csum_type that matches the existing
> > NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return -EOPNOTSUPP
> > quickly for the generic case.
> >
> > For implementation in essence it is just reordering driver code that
> > already exists for the skb case. I think the ice patch series to
> > support rx timestamping is a good indication of what it takes to
> > support XDP kfuncs: not so much new code, but reordering the driver
> > logic.
> >
> > Which also indicates to me that the driver *is* the right place to
> > implement this logic, rather than reimplement it in a BPF library. It
> > avoids both code duplication and dependency hell, if the library ships
> > independent from the driver.
>
> Agree with all of the above.
> I think defining a CHECKSUM_PARTIAL request struct for af_xdp is doable and
> won't require many changes in the drivers.
> If we do it for more than one driver from the start there is a chance it
> will work for other drivers too. imo ice+gve+mlx5 would be enough.

Basically, add to AF_XDP what we already have for its predecessor
AF_PACKET: setsockopt PACKET_VNET_HDR?

Possibly with a separate new struct, rather than virtio_net_hdr. As
that has dependencies on other drivers, notably virtio and its
specification process.
Alexei Starovoitov July 12, 2023, 7:42 p.m. UTC | #17
On Wed, Jul 12, 2023 at 12:12 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Basically, add to AF_XDP what we already have for its predecessor
> AF_PACKET: setsockopt PACKET_VNET_HDR?
>
> Possibly with a separate new struct, rather than virtio_net_hdr. As
> that has dependencies on other drivers, notably virtio and its
> specification process.

yeah. Forgot about this one.
That's a perfect fit. I would reuse virtio_net_hdr as-is.
Why reinvent the wheel?
It would force uapi, but some might argue it's a good thing.
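For reference, the struct being discussed, as defined in include/uapi/linux/virtio_net.h (reproduced here with plain uint8_t/uint16_t standing in for the kernel's __u8/__virtio16 types):

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* csum_start/csum_offset valid */
#define VIRTIO_NET_HDR_GSO_NONE		0	/* no GSO requested */

struct virtio_net_hdr {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
	uint16_t gso_size;	/* Bytes to append to hdr_len per frame */
	uint16_t csum_start;	/* Position to start checksumming from */
	uint16_t csum_offset;	/* Offset after that to place checksum */
};
```

The csum_start/csum_offset pair is exactly the CHECKSUM_PARTIAL contract discussed earlier in the thread, which is why the struct is a natural candidate for reuse.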
Jakub Kicinski July 12, 2023, 8:09 p.m. UTC | #18
On Wed, 12 Jul 2023 12:42:35 -0700 Alexei Starovoitov wrote:
> > Basically, add to AF_XDP what we already have for its predecessor
> > AF_PACKET: setsockopt PACKET_VNET_HDR?
> >
> > Possibly with a separate new struct, rather than virtio_net_hdr. As
> > that has dependencies on other drivers, notably virtio and its
> > specification process.  
> 
> yeah. Forgot about this one.
> That's a perfect fit. I would reuse virtio_net_hdr as-is.
> Why reinvent the wheel?
> It would force uapi, but some might argue it's a good thing.

I was gonna reply on the other leg of the thread but it sounds like
we're in agreement now? 
Stanislav Fomichev July 12, 2023, 8:53 p.m. UTC | #19
On 07/12, Jakub Kicinski wrote:
> On Wed, 12 Jul 2023 12:42:35 -0700 Alexei Starovoitov wrote:
> > > Basically, add to AF_XDP what we already have for its predecessor
> > > AF_PACKET: setsockopt PACKET_VNET_HDR?
> > >
> > > Possibly with a separate new struct, rather than virtio_net_hdr. As
> > > that has dependencies on other drivers, notably virtio and its
> > > specification process.  
> > 
> > yeah. Forgot about this one.
> > That's a perfect fit. I would reuse virtio_net_hdr as-is.
> > Why reinvent the wheel?
> > It would force uapi, but some might argue it's a good thing.
> 
> I was gonna reply on the other leg of the thread but it sounds like
> we're in agreement now? 

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 879d698b6119..02b617a534c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -6,6 +6,7 @@ 
 
 #include "en.h"
 #include <linux/indirect_call_wrapper.h>
+#include <net/devtx.h>
 
 #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 
@@ -506,4 +507,18 @@  static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
 
 	return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
 }
+
+struct mlx5e_devtx_ctx {
+	struct devtx_ctx devtx;
+	union {
+		struct {
+			struct mlx5_cqe64 *cqe; /* tx completion */
+			struct mlx5e_cq *cq; /* tx completion */
+		};
+		struct mlx5e_tx_wqe *wqe; /* tx */
+	};
+};
+
+DECLARE_DEVTX_HOOKS(mlx5e);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f0e6095809fa..59342d5e7e6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -255,9 +255,60 @@  static int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int mlx5e_devtx_request_tx_timestamp(const struct devtx_ctx *ctx)
+{
+	/* Nothing to do here, CQE always has a timestamp. */
+	return 0;
+}
+
+static int mlx5e_devtx_tx_timestamp(const struct devtx_ctx *_ctx, u64 *timestamp)
+{
+	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
+	u64 ts;
+
+	if (unlikely(!ctx->cqe || !ctx->cq))
+		return -ENODATA;
+
+	ts = get_cqe_ts(ctx->cqe);
+
+	if (mlx5_is_real_time_rq(ctx->cq->mdev) || mlx5_is_real_time_sq(ctx->cq->mdev))
+		*timestamp = mlx5_real_time_cyc2time(&ctx->cq->mdev->clock, ts);
+	else
+		*timestamp = mlx5_timecounter_cyc2time(&ctx->cq->mdev->clock, ts);
+
+	return 0;
+}
+
+static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
+					   u16 csum_start, u16 csum_offset)
+{
+	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
+	struct mlx5_wqe_eth_seg *eseg;
+
+	if (unlikely(!ctx->wqe))
+		return -ENODATA;
+
+	eseg = &ctx->wqe->eth;
+
+	switch (csum_offset) {
+	case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
+	case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
+		/* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
+		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
 	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+	.xmo_request_tx_timestamp	= mlx5e_devtx_request_tx_timestamp,
+	.xmo_tx_timestamp		= mlx5e_devtx_tx_timestamp,
+	.xmo_request_l4_checksum	= mlx5e_devtx_request_l4_checksum,
 };
 
 /* returns true if packet was consumed by xdp */
@@ -453,6 +504,27 @@  mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
 
 	mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
 
+	if (devtx_enabled()) {
+		struct mlx5e_xmit_data_frags *xdptxdf =
+			container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
+
+		struct xdp_frame frame = {
+			.data = p->data,
+			.len = p->len,
+			.metasize = sq->xsk_pool->tx_metadata_len,
+		};
+
+		struct mlx5e_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = sq->cq.netdev,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+			},
+			.wqe = sq->mpwqe.wqe,
+		};
+
+		mlx5e_devtx_submit_xdp(&ctx.devtx, &frame);
+	}
+
 	if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
 		mlx5e_xdp_mpwqe_complete(sq);
 
@@ -560,6 +632,24 @@  mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 		dseg++;
 	}
 
+	if (devtx_enabled()) {
+		struct xdp_frame frame = {
+			.data = xdptxd->data,
+			.len = xdptxd->len,
+			.metasize = sq->xsk_pool->tx_metadata_len,
+		};
+
+		struct mlx5e_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = sq->cq.netdev,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+			},
+			.wqe = wqe,
+		};
+
+		mlx5e_devtx_submit_xdp(&ctx.devtx, &frame);
+	}
+
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
 
 	if (test_bit(MLX5E_SQ_STATE_XDP_MULTIBUF, &sq->state)) {
@@ -607,7 +697,9 @@  mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				  struct mlx5e_xdp_wqe_info *wi,
 				  u32 *xsk_frames,
-				  struct xdp_frame_bulk *bq)
+				  struct xdp_frame_bulk *bq,
+				  struct mlx5e_cq *cq,
+				  struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
 	u16 i;
@@ -626,6 +718,21 @@  static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 			dma_addr = xdpi.frame.dma_addr;
 
+			if (devtx_enabled()) {
+				struct mlx5e_devtx_ctx ctx = {
+					.devtx = {
+						.netdev = sq->cq.netdev,
+						.sinfo = xdp_frame_has_frags(xdpf) ?
+							 xdp_get_shared_info_from_frame(xdpf) :
+							 NULL,
+					},
+					.cqe = cqe,
+					.cq = cq,
+				};
+
+				mlx5e_devtx_complete_xdp(&ctx.devtx, xdpi.frame.xdpf);
+			}
+
 			dma_unmap_single(sq->pdev, dma_addr,
 					 xdpf->len, DMA_TO_DEVICE);
 			if (xdp_frame_has_frags(xdpf)) {
@@ -659,6 +766,24 @@  static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 				page = xdpi.page.page;
 
+				if (devtx_enabled()) {
+					struct xdp_frame frame = {
+						.data = page,
+						.len = PAGE_SIZE,
+						.metasize = sq->xsk_pool->tx_metadata_len,
+					};
+
+					struct mlx5e_devtx_ctx ctx = {
+						.devtx = {
+							.netdev = sq->cq.netdev,
+						},
+						.cqe = cqe,
+						.cq = cq,
+					};
+
+					mlx5e_devtx_complete_xdp(&ctx.devtx, &frame);
+				}
+
 				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 				 * as we know this is a page_pool page.
 				 */
@@ -668,10 +793,32 @@  static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 
 			break;
 		}
-		case MLX5E_XDP_XMIT_MODE_XSK:
+		case MLX5E_XDP_XMIT_MODE_XSK: {
 			/* AF_XDP send */
+			struct xdp_frame frame = {};
+
+			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+			frame.data = xdpi.frame.xsk_head;
+			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+			frame.len = xdpi.frame.xsk_head_len;
+
+			if (devtx_enabled()) {
+				struct mlx5e_devtx_ctx ctx = {
+					.devtx = {
+						.netdev = sq->cq.netdev,
+					},
+					.cqe = cqe,
+					.cq = cq,
+				};
+
+				frame.metasize = sq->xsk_pool->tx_metadata_len;
+
+				mlx5e_devtx_complete_xdp(&ctx.devtx, &frame);
+			}
+
 			(*xsk_frames)++;
 			break;
+		}
 		default:
 			WARN_ON_ONCE(true);
 		}
@@ -720,7 +867,7 @@  bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 
 			sqcc += wi->num_wqebbs;
 
-			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -767,7 +914,7 @@  void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 
 		sq->cc += wi->num_wqebbs;
 
-		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
 	}
 
 	xdp_flush_frame_bulk(&bq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 9e8e6184f9e4..f3ea9f58273a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -82,13 +82,15 @@  enum mlx5e_xdp_xmit_mode {
  *    num, page_1, page_2, ... , page_num.
  *
  * MLX5E_XDP_XMIT_MODE_XSK:
- *    none.
+ *    frame.xsk_head + frame.xsk_head_len for header portion only.
  */
 union mlx5e_xdp_info {
 	enum mlx5e_xdp_xmit_mode mode;
 	union {
 		struct xdp_frame *xdpf;
 		dma_addr_t dma_addr;
+		void *xsk_head;
+		u32 xsk_head_len;
 	} frame;
 	union {
 		struct mlx5e_rq *rq;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..4ea1fc1aa500 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,6 +55,10 @@  static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
 
 	nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
 	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+			     (union mlx5e_xdp_info) { .frame.xsk_head = NULL });
+	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+			     (union mlx5e_xdp_info) { .frame.xsk_head_len = 0 });
 	sq->doorbell_cseg = &nopwqe->ctrl;
 }
 
@@ -106,6 +110,12 @@  bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 			mlx5e_xsk_tx_post_err(sq, &xdpi);
 		} else {
 			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+					     (union mlx5e_xdp_info)
+					     { .frame.xsk_head = xdptxd.data });
+			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+					     (union mlx5e_xdp_info)
+					     { .frame.xsk_head_len = xdptxd.len });
 		}
 
 		flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index c7eb6b238c2b..fc76e8efc9ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -758,6 +758,18 @@  static void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq, struct mlx5e_t
 	for (i = 0; i < wi->num_fifo_pkts; i++) {
 		struct sk_buff *skb = mlx5e_skb_fifo_pop(&sq->db.skb_fifo);
 
+		if (devtx_enabled()) {
+			struct mlx5e_devtx_ctx ctx = {
+				.devtx = {
+					.netdev = skb->dev,
+					.sinfo = skb_shinfo(skb),
+				},
+				.cqe = cqe,
+			};
+
+			mlx5e_devtx_complete_skb(&ctx.devtx, skb);
+		}
+
 		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
 	}
 }
@@ -826,6 +838,18 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			sqcc += wi->num_wqebbs;
 
 			if (likely(wi->skb)) {
+				if (devtx_enabled()) {
+					struct mlx5e_devtx_ctx ctx = {
+						.devtx = {
+							.netdev = wi->skb->dev,
+							.sinfo = skb_shinfo(wi->skb),
+						},
+						.cqe = cqe,
+					};
+
+					mlx5e_devtx_complete_skb(&ctx.devtx, wi->skb);
+				}
+
 				mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
 				mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 88dbea6631d5..ca0c71688b88 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -48,6 +48,7 @@ 
 #include <linux/mlx5/vport.h>
 #include <linux/version.h>
 #include <net/devlink.h>
+#include <net/devtx.h>
 #include "mlx5_core.h"
 #include "thermal.h"
 #include "lib/eq.h"
@@ -73,6 +74,7 @@ 
 #include "sf/dev/dev.h"
 #include "sf/sf.h"
 #include "mlx5_irq.h"
+#include "en/xdp.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -2277,6 +2279,8 @@  static void mlx5_core_verify_params(void)
 	}
 }
 
+DEFINE_DEVTX_HOOKS(mlx5e);
+
 static int __init mlx5_init(void)
 {
 	int err;
@@ -2289,9 +2293,15 @@  static int __init mlx5_init(void)
 	mlx5_core_verify_params();
 	mlx5_register_debugfs();
 
+	err = devtx_hooks_register(&mlx5e_devtx_hook_ids, &mlx5e_xdp_metadata_ops);
+	if (err) {
+		pr_warn("failed to register devtx hooks: %d", err);
+		goto err_debug;
+	}
+
 	err = mlx5e_init();
 	if (err)
-		goto err_debug;
+		goto err_devtx;
 
 	err = mlx5_sf_driver_register();
 	if (err)
@@ -2307,6 +2317,8 @@  static int __init mlx5_init(void)
 	mlx5_sf_driver_unregister();
 err_sf:
 	mlx5e_cleanup();
+err_devtx:
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -2314,6 +2326,7 @@  static int __init mlx5_init(void)
 
 static void __exit mlx5_cleanup(void)
 {
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_sf_driver_unregister();
 	mlx5e_cleanup();