[RFC,bpf-next,0/7] bpf: netdev TX metadata

Message ID 20230612172307.3923165-1-sdf@google.com (mailing list archive)

Message

Stanislav Fomichev June 12, 2023, 5:23 p.m. UTC
--- Use cases ---

The goal of this series is to add two new standard-ish places
in the transmit path:

1. Right before the packet is transmitted (with access to TX
   descriptors)
2. Right after the packet is actually transmitted and we've received the
   completion (again, with access to TX completion descriptors)

Accessing TX descriptors unlocks the following use-cases:

- Setting device hints at TX: XDP/AF_XDP programs might use these new
hooks to trigger device offloads. The example in this series implements
TX timestamps (see the sketch below).
- Observability: global per-netdev hooks can be used for tracing
packets and examining completion descriptors for all sorts of
device errors.

Accessing TX descriptors also means that the hooks have to be called
from the drivers.

The hooks are a light-weight alternative to XDP at egress and currently
don't provide any packet modification abilities. However, we can
eventually expose new kfuncs to operate on the packet (or, rather, on
the actual descriptors, for performance's sake).
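
For illustration, a minimal sketch of what the submit-side hook could
look like. The hook name, the devtx_frame context and the
bpf_devtx_request_timestamp() kfunc are assumptions modeled on this
series, not a settled API:

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* hypothetical kfunc provided by the driver */
    extern int bpf_devtx_request_timestamp(const struct devtx_frame *frame) __ksym;

    SEC("fentry/devtx_submit")
    int BPF_PROG(tx_submit, const struct devtx_frame *frame)
    {
            /* ask the NIC to timestamp this frame on transmit */
            bpf_devtx_request_timestamp(frame);
            return 0;
    }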

--- UAPI ---

The hooks are implemented HID-BPF style, meaning they don't expose
any UAPI and are implemented as tracing programs that call
a bunch of kfuncs. The attach/detach operations happen via BPF syscall
programs. The series extends the device-bound infrastructure to tracing
programs.
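
To make the attach flow concrete, here is a sketch assuming a
hypothetical bpf_devtx_attach() kfunc; userspace loads this syscall
program and runs it once via BPF_PROG_RUN:

    /* same includes as the sketch above */
    extern int bpf_devtx_attach(int ifindex, int prog_fd) __ksym;

    struct devtx_attach_args {
            int ifindex;
            int prog_fd;
            int ret;
    };

    SEC("syscall")
    int devtx_attach(struct devtx_attach_args *ctx)
    {
            /* the attach itself happens in the kernel, via a kfunc */
            ctx->ret = bpf_devtx_attach(ctx->ifindex, ctx->prog_fd);
            return 0;
    }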

--- skb vs xdp ---

The hooks operate on a new light-weight devtx_frame which contains:
- data
- len
- sinfo

This should allow us to have a unified (from the BPF PoV) place at TX
and not be super-taxing (we only need to copy 2 pointers + len to the
stack for each invocation).
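
In C terms, roughly (field types are an assumption based on the list
above):

    /* light-weight TX context: two pointers plus a length */
    struct devtx_frame {
            void *data;                    /* start of the frame */
            u16 len;                       /* length of the linear part */
            struct skb_shared_info *sinfo; /* frags; may be absent */
    };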

--- Multiprog attachment ---

Currently, attach/detach don't expose links and don't support multiple
programs. I'm planning to use Daniel's bpf_mprog once it lands.

--- TODO ---

Things that I'm planning to do for the non-RFC series:
- have some real device support to verify xdp_hw_metadata works
- freplace
- documentation similar to Documentation/networking/xdp-rx-metadata.rst

--- CC ---

CC'ing people only on the cover letter. Hopefully can find the rest via
lore.

Cc: willemb@google.com
Cc: dsahern@kernel.org
Cc: john.fastabend@gmail.com
Cc: magnus.karlsson@intel.com
Cc: bjorn@kernel.org
Cc: maciej.fijalkowski@intel.com
Cc: netdev@vger.kernel.org

Stanislav Fomichev (7):
  bpf: rename some xdp-metadata functions into dev-bound
  bpf: resolve single typedef when walking structs
  bpf: implement devtx hook points
  bpf: implement devtx timestamp kfunc
  net: veth: implement devtx timestamp kfuncs
  selftests/bpf: extend xdp_metadata with devtx kfuncs
  selftests/bpf: extend xdp_hw_metadata with devtx kfuncs

 MAINTAINERS                                   |   2 +
 drivers/net/veth.c                            |  94 ++++++-
 include/linux/netdevice.h                     |   6 +
 include/net/devtx.h                           |  76 +++++
 include/net/offload.h                         |  38 +++
 include/net/xdp.h                             |  18 +-
 kernel/bpf/btf.c                              |   2 +
 kernel/bpf/offload.c                          |  40 ++-
 kernel/bpf/verifier.c                         |   4 +-
 net/core/Makefile                             |   1 +
 net/core/dev.c                                |   2 +
 net/core/devtx.c                              | 266 ++++++++++++++++++
 net/core/xdp.c                                |  20 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  82 +++++-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  59 ++++
 .../selftests/bpf/progs/xdp_metadata.c        | 101 +++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 160 ++++++++++-
 tools/testing/selftests/bpf/xdp_metadata.h    |  13 +
 18 files changed, 934 insertions(+), 50 deletions(-)
 create mode 100644 include/net/devtx.h
 create mode 100644 include/net/offload.h
 create mode 100644 net/core/devtx.c

Comments

Toke Høiland-Jørgensen June 12, 2023, 9 p.m. UTC | #1
Some immediate thoughts after glancing through this:

> --- Use cases ---
>
> The goal of this series is to add two new standard-ish places
> in the transmit path:
>
> 1. Right before the packet is transmitted (with access to TX
>    descriptors)
> 2. Right after the packet is actually transmitted and we've received the
>    completion (again, with access to TX completion descriptors)
>
> Accessing TX descriptors unlocks the following use-cases:
>
> - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> use device offloads. The existing case implements TX timestamp.
> - Observability: global per-netdev hooks can be used for tracing
> the packets and exploring completion descriptors for all sorts of
> device errors.
>
> Accessing TX descriptors also means that the hooks have to be called
> from the drivers.
>
> The hooks are a light-weight alternative to XDP at egress and currently
> don't provide any packet modification abilities. However, eventually,
> can expose new kfuncs to operate on the packet (or, rather, the actual
> descriptors; for performance sake).

dynptr?

> --- UAPI ---
>
> The hooks are implemented in a HID-BPF style. Meaning they don't
> expose any UAPI and are implemented as tracing programs that call
> a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> programs. The series expands device-bound infrastructure to tracing
> programs.

Not a fan of the "attach from BPF syscall program" thing. These are part
of the XDP data path API, and I think we should expose them as proper
bpf_link attachments from userspace with introspection etc. But I guess
the bpf_mprog thing will give us that?

> --- skb vs xdp ---
>
> The hooks operate on a new light-weight devtx_frame which contains:
> - data
> - len
> - sinfo
>
> This should allow us to have a unified (from BPF POW) place at TX
> and not be super-taxing (we need to copy 2 pointers + len to the stack
> for each invocation).

Not sure what I think about this one. At the very least I think we
should expose xdp->data_meta as well. I'm not sure what the use case for
accessing skbs is? If that *is* indeed useful, probably there will also
end up being a use case for accessing the full skb?

> --- Multiprog attachment ---
>
> Currently, attach/detach don't expose links and don't support multiple
> programs. I'm planning to use Daniel's bpf_mprog once it lands.
>
> --- TODO ---
>
> Things that I'm planning to do for the non-RFC series:
> - have some real device support to verify xdp_hw_metadata works

Would be good to see some performance numbers as well :)

> - freplace
> - Documentation/networking/xdp-rx-metadata.rst - like documentation
>
> --- CC ---
>
> CC'ing people only on the cover letter. Hopefully can find the rest via
> lore.

Well, I found it there, even though I was apparently left off the Cc
list :(

-Toke
Stanislav Fomichev June 13, 2023, 4:32 p.m. UTC | #2
On 06/12, Toke Høiland-Jørgensen wrote:
> Some immediate thoughts after glancing through this:
> 
> > --- Use cases ---
> >
> > The goal of this series is to add two new standard-ish places
> > in the transmit path:
> >
> > 1. Right before the packet is transmitted (with access to TX
> >    descriptors)
> > 2. Right after the packet is actually transmitted and we've received the
> >    completion (again, with access to TX completion descriptors)
> >
> > Accessing TX descriptors unlocks the following use-cases:
> >
> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > use device offloads. The existing case implements TX timestamp.
> > - Observability: global per-netdev hooks can be used for tracing
> > the packets and exploring completion descriptors for all sorts of
> > device errors.
> >
> > Accessing TX descriptors also means that the hooks have to be called
> > from the drivers.
> >
> > The hooks are a light-weight alternative to XDP at egress and currently
> > don't provide any packet modification abilities. However, eventually,
> > can expose new kfuncs to operate on the packet (or, rather, the actual
> > descriptors; for performance sake).
> 
> dynptr?

Haven't considered, let me explore, but not sure what it buys us
here?

> > --- UAPI ---
> >
> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > expose any UAPI and are implemented as tracing programs that call
> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > programs. The series expands device-bound infrastructure to tracing
> > programs.
> 
> Not a fan of the "attach from BPF syscall program" thing. These are part
> of the XDP data path API, and I think we should expose them as proper
> bpf_link attachments from userspace with introspection etc. But I guess
> the bpf_mprog thing will give us that?

bpf_mprog will just make those attach kfuncs return the link fd. The
syscall program will still stay :-(

> > --- skb vs xdp ---
> >
> > The hooks operate on a new light-weight devtx_frame which contains:
> > - data
> > - len
> > - sinfo
> >
> > This should allow us to have a unified (from BPF POW) place at TX
> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > for each invocation).
> 
> Not sure what I think about this one. At the very least I think we
> should expose xdp->data_meta as well. I'm not sure what the use case for
> accessing skbs is? If that *is* indeed useful, probably there will also
> end up being a use case for accessing the full skb?

skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
a good opportunity to unify? Or it probably won't work because if an
xdp_frame doesn't have frags, it won't have sinfo?

> > --- Multiprog attachment ---
> >
> > Currently, attach/detach don't expose links and don't support multiple
> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
> >
> > --- TODO ---
> >
> > Things that I'm planning to do for the non-RFC series:
> > - have some real device support to verify xdp_hw_metadata works
> 
> Would be good to see some performance numbers as well :)

+1 :-)

> > - freplace
> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
> >
> > --- CC ---
> >
> > CC'ing people only on the cover letter. Hopefully can find the rest via
> > lore.
> 
> Well, I found it there, even though I was apparently left off the Cc
> list :(
> 
> -Toke

Sure, I'll CC you explicitly next time! But I know you diligently follow
the bpf list, so I decided to explicitly cc mostly the netdev folks that
might miss it otherwise.
Toke Høiland-Jørgensen June 13, 2023, 5:18 p.m. UTC | #3
Stanislav Fomichev <sdf@google.com> writes:

> On 06/12, Toke Høiland-Jørgensen wrote:
>> Some immediate thoughts after glancing through this:
>> 
>> > --- Use cases ---
>> >
>> > The goal of this series is to add two new standard-ish places
>> > in the transmit path:
>> >
>> > 1. Right before the packet is transmitted (with access to TX
>> >    descriptors)
>> > 2. Right after the packet is actually transmitted and we've received the
>> >    completion (again, with access to TX completion descriptors)
>> >
>> > Accessing TX descriptors unlocks the following use-cases:
>> >
>> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
>> > use device offloads. The existing case implements TX timestamp.
>> > - Observability: global per-netdev hooks can be used for tracing
>> > the packets and exploring completion descriptors for all sorts of
>> > device errors.
>> >
>> > Accessing TX descriptors also means that the hooks have to be called
>> > from the drivers.
>> >
>> > The hooks are a light-weight alternative to XDP at egress and currently
>> > don't provide any packet modification abilities. However, eventually,
>> > can expose new kfuncs to operate on the packet (or, rather, the actual
>> > descriptors; for performance sake).
>> 
>> dynptr?
>
> Haven't considered, let me explore, but not sure what it buys us
> here?

API consistency, certainly. Possibly also performance, if using the
slice thing that gets you a direct pointer to the pkt data? Not sure
about that, though, haven't done extensive benchmarking of dynptr yet...
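
Something like the below, perhaps: bpf_dynptr_slice() is the existing
kfunc, but the frame-to-dynptr constructor is purely made up here:

    /* hypothetical constructor turning the TX frame into a dynptr */
    extern int bpf_devtx_frame_dynptr(const struct devtx_frame *frame,
                                      struct bpf_dynptr *ptr) __ksym;
    extern void *bpf_dynptr_slice(const struct bpf_dynptr *p, __u32 offset,
                                  void *buffer, __u32 buffer__szk) __ksym;

    SEC("fentry/devtx_submit")
    int BPF_PROG(tx_peek, const struct devtx_frame *frame)
    {
            struct bpf_dynptr ptr;
            struct ethhdr buf, *eth;

            if (bpf_devtx_frame_dynptr(frame, &ptr))
                    return 0;

            /* direct pointer into pkt data when contiguous,
             * a stack copy in buf otherwise */
            eth = bpf_dynptr_slice(&ptr, 0, &buf, sizeof(buf));
            if (!eth)
                    return 0;
            return 0;
    }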

>> > --- UAPI ---
>> >
>> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> > expose any UAPI and are implemented as tracing programs that call
>> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> > programs. The series expands device-bound infrastructure to tracing
>> > programs.
>> 
>> Not a fan of the "attach from BPF syscall program" thing. These are part
>> of the XDP data path API, and I think we should expose them as proper
>> bpf_link attachments from userspace with introspection etc. But I guess
>> the bpf_mprog thing will give us that?
>
> bpf_mprog will just make those attach kfuncs return the link fd. The
> syscall program will still stay :-(

Why does the attachment have to be done this way, exactly? Couldn't we
just use the regular bpf_link attachment from userspace? AFAICT it's not
really piggy-backing on the function override thing anyway when the
attachment is per-dev? Or am I misunderstanding how all this works?

>> > --- skb vs xdp ---
>> >
>> > The hooks operate on a new light-weight devtx_frame which contains:
>> > - data
>> > - len
>> > - sinfo
>> >
>> > This should allow us to have a unified (from BPF POW) place at TX
>> > and not be super-taxing (we need to copy 2 pointers + len to the stack
>> > for each invocation).
>> 
>> Not sure what I think about this one. At the very least I think we
>> should expose xdp->data_meta as well. I'm not sure what the use case for
>> accessing skbs is? If that *is* indeed useful, probably there will also
>> end up being a use case for accessing the full skb?
>
> skb_shared_info has meta_len, buf afaik, xdp doesn't use it. Maybe I
> a good opportunity to unify? Or probably won't work because if
> xdf_frame doesn't have frags, it won't have sinfo?

No, it won't. But why do we need this unification between the skb and
xdp paths in the first place? Doesn't the skb path already have support
for these things? Seems like we could just stick to making this xdp-only
and keeping xdp_frame as the ctx argument?

>> > --- Multiprog attachment ---
>> >
>> > Currently, attach/detach don't expose links and don't support multiple
>> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
>> >
>> > --- TODO ---
>> >
>> > Things that I'm planning to do for the non-RFC series:
>> > - have some real device support to verify xdp_hw_metadata works
>> 
>> Would be good to see some performance numbers as well :)
>
> +1 :-)
>
>> > - freplace
>> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
>> >
>> > --- CC ---
>> >
>> > CC'ing people only on the cover letter. Hopefully can find the rest via
>> > lore.
>> 
>> Well, I found it there, even though I was apparently left off the Cc
>> list :(
>> 
>> -Toke
>
> Sure, I'll CC you explicitly next time! But I know you diligently follow bpf
> list, so decided to explicitly cc mostly netdev folks that might miss
> it otherwise.

Haha, fair point! And no big deal, I did obviously see it. I was just
feeling a bit left out, that's all ;)

-Toke
Stanislav Fomichev June 13, 2023, 6:39 p.m. UTC | #4
On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On 06/12, Toke Høiland-Jørgensen wrote:
> >> Some immediate thoughts after glancing through this:
> >>
> >> > --- Use cases ---
> >> >
> >> > The goal of this series is to add two new standard-ish places
> >> > in the transmit path:
> >> >
> >> > 1. Right before the packet is transmitted (with access to TX
> >> >    descriptors)
> >> > 2. Right after the packet is actually transmitted and we've received the
> >> >    completion (again, with access to TX completion descriptors)
> >> >
> >> > Accessing TX descriptors unlocks the following use-cases:
> >> >
> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> >> > use device offloads. The existing case implements TX timestamp.
> >> > - Observability: global per-netdev hooks can be used for tracing
> >> > the packets and exploring completion descriptors for all sorts of
> >> > device errors.
> >> >
> >> > Accessing TX descriptors also means that the hooks have to be called
> >> > from the drivers.
> >> >
> >> > The hooks are a light-weight alternative to XDP at egress and currently
> >> > don't provide any packet modification abilities. However, eventually,
> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
> >> > descriptors; for performance sake).
> >>
> >> dynptr?
> >
> > Haven't considered, let me explore, but not sure what it buys us
> > here?
>
> API consistency, certainly. Possibly also performance, if using the
> slice thing that gets you a direct pointer to the pkt data? Not sure
> about that, though, haven't done extensive benchmarking of dynptr yet...

Same. Let's keep it on the table, I'll try to explore. I was just
thinking that having less abstraction here might be better
performance-wise.

> >> > --- UAPI ---
> >> >
> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> >> > expose any UAPI and are implemented as tracing programs that call
> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> >> > programs. The series expands device-bound infrastructure to tracing
> >> > programs.
> >>
> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> >> of the XDP data path API, and I think we should expose them as proper
> >> bpf_link attachments from userspace with introspection etc. But I guess
> >> the bpf_mprog thing will give us that?
> >
> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > syscall program will still stay :-(
>
> Why does the attachment have to be done this way, exactly? Couldn't we
> just use the regular bpf_link attachment from userspace? AFAICT it's not
> really piggy-backing on the function override thing anyway when the
> attachment is per-dev? Or am I misunderstanding how all this works?

It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
us an opportunity to fix things.
We can do it via a regular syscall path if there is a consensus.

> >> > --- skb vs xdp ---
> >> >
> >> > The hooks operate on a new light-weight devtx_frame which contains:
> >> > - data
> >> > - len
> >> > - sinfo
> >> >
> >> > This should allow us to have a unified (from BPF POW) place at TX
> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> >> > for each invocation).
> >>
> >> Not sure what I think about this one. At the very least I think we
> >> should expose xdp->data_meta as well. I'm not sure what the use case for
> >> accessing skbs is? If that *is* indeed useful, probably there will also
> >> end up being a use case for accessing the full skb?
> >
> > skb_shared_info has meta_len, buf afaik, xdp doesn't use it. Maybe I
> > a good opportunity to unify? Or probably won't work because if
> > xdf_frame doesn't have frags, it won't have sinfo?
>
> No, it won't. But why do we need this unification between the skb and
> xdp paths in the first place? Doesn't the skb path already have support
> for these things? Seems like we could just stick to making this xdp-only
> and keeping xdp_frame as the ctx argument?

For the skb path, I'm assuming we can read sinfo->meta_len; it feels
nice to make it work for both cases?
We can always export the metadata len via some kfunc, sure.
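
E.g. something along these lines on the kernel side (a sketch; per the
above, sinfo may be absent for frag-less frames, hence the NULL check):

    /* hypothetical accessor kfunc for the metadata length */
    __bpf_kfunc u32 bpf_devtx_frame_meta_len(const struct devtx_frame *frame)
    {
            return frame->sinfo ? frame->sinfo->meta_len : 0;
    }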

> >> > --- Multiprog attachment ---
> >> >
> >> > Currently, attach/detach don't expose links and don't support multiple
> >> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
> >> >
> >> > --- TODO ---
> >> >
> >> > Things that I'm planning to do for the non-RFC series:
> >> > - have some real device support to verify xdp_hw_metadata works
> >>
> >> Would be good to see some performance numbers as well :)
> >
> > +1 :-)
> >
> >> > - freplace
> >> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
> >> >
> >> > --- CC ---
> >> >
> >> > CC'ing people only on the cover letter. Hopefully can find the rest via
> >> > lore.
> >>
> >> Well, I found it there, even though I was apparently left off the Cc
> >> list :(
> >>
> >> -Toke
> >
> > Sure, I'll CC you explicitly next time! But I know you diligently follow bpf
> > list, so decided to explicitly cc mostly netdev folks that might miss
> > it otherwise.
>
> Haha, fair point! And no big deal, I did obviously see it. I was just
> feeling a bit left out, that's all ;)
>
> -Toke
Toke Høiland-Jørgensen June 13, 2023, 7:10 p.m. UTC | #5
Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On 06/12, Toke Høiland-Jørgensen wrote:
>> >> Some immediate thoughts after glancing through this:
>> >>
>> >> > --- Use cases ---
>> >> >
>> >> > The goal of this series is to add two new standard-ish places
>> >> > in the transmit path:
>> >> >
>> >> > 1. Right before the packet is transmitted (with access to TX
>> >> >    descriptors)
>> >> > 2. Right after the packet is actually transmitted and we've received the
>> >> >    completion (again, with access to TX completion descriptors)
>> >> >
>> >> > Accessing TX descriptors unlocks the following use-cases:
>> >> >
>> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
>> >> > use device offloads. The existing case implements TX timestamp.
>> >> > - Observability: global per-netdev hooks can be used for tracing
>> >> > the packets and exploring completion descriptors for all sorts of
>> >> > device errors.
>> >> >
>> >> > Accessing TX descriptors also means that the hooks have to be called
>> >> > from the drivers.
>> >> >
>> >> > The hooks are a light-weight alternative to XDP at egress and currently
>> >> > don't provide any packet modification abilities. However, eventually,
>> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
>> >> > descriptors; for performance sake).
>> >>
>> >> dynptr?
>> >
>> > Haven't considered, let me explore, but not sure what it buys us
>> > here?
>>
>> API consistency, certainly. Possibly also performance, if using the
>> slice thing that gets you a direct pointer to the pkt data? Not sure
>> about that, though, haven't done extensive benchmarking of dynptr yet...
>
> Same. Let's keep it on the table, I'll try to explore. I was just
> thinking that having less abstraction here might be better
> performance-wise.

Sure, let's evaluate this once we have performance numbers.

>> >> > --- UAPI ---
>> >> >
>> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> >> > expose any UAPI and are implemented as tracing programs that call
>> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> >> > programs. The series expands device-bound infrastructure to tracing
>> >> > programs.
>> >>
>> >> Not a fan of the "attach from BPF syscall program" thing. These are part
>> >> of the XDP data path API, and I think we should expose them as proper
>> >> bpf_link attachments from userspace with introspection etc. But I guess
>> >> the bpf_mprog thing will give us that?
>> >
>> > bpf_mprog will just make those attach kfuncs return the link fd. The
>> > syscall program will still stay :-(
>>
>> Why does the attachment have to be done this way, exactly? Couldn't we
>> just use the regular bpf_link attachment from userspace? AFAICT it's not
>> really piggy-backing on the function override thing anyway when the
>> attachment is per-dev? Or am I misunderstanding how all this works?
>
> It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> us an opportunity to fix things.
> We can do it via a regular syscall path if there is a consensus.

Yeah, the API exposed to the BPF program is kfunc-based in any case. If
we were to at some point conclude that this whole thing was not useful
at all and deprecate it, whether that means "you can no longer create a
link attachment of this type via BPF_LINK_CREATE" or "you can no longer
create a link attachment of this type via BPF_PROG_RUN of a syscall-type
program" doesn't really seem like a significant detail to me...

>> >> > --- skb vs xdp ---
>> >> >
>> >> > The hooks operate on a new light-weight devtx_frame which contains:
>> >> > - data
>> >> > - len
>> >> > - sinfo
>> >> >
>> >> > This should allow us to have a unified (from BPF POW) place at TX
>> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
>> >> > for each invocation).
>> >>
>> >> Not sure what I think about this one. At the very least I think we
>> >> should expose xdp->data_meta as well. I'm not sure what the use case for
>> >> accessing skbs is? If that *is* indeed useful, probably there will also
>> >> end up being a use case for accessing the full skb?
>> >
>> > skb_shared_info has meta_len, buf afaik, xdp doesn't use it. Maybe I
>> > a good opportunity to unify? Or probably won't work because if
>> > xdf_frame doesn't have frags, it won't have sinfo?
>>
>> No, it won't. But why do we need this unification between the skb and
>> xdp paths in the first place? Doesn't the skb path already have support
>> for these things? Seems like we could just stick to making this xdp-only
>> and keeping xdp_frame as the ctx argument?
>
> For skb path, I'm assuming we can read sinfo->meta_len; it feels nice
> to make it work for both cases?
> We can always export metadata len via some kfunc, sure.

I wasn't referring to the metadata field specifically when talking about
the skb path. I'm wondering why we need these hooks to work for the skb
path at all? :)

-Toke
Stanislav Fomichev June 13, 2023, 9:17 p.m. UTC | #6
On Tue, Jun 13, 2023 at 12:10 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On 06/12, Toke Høiland-Jørgensen wrote:
> >> >> Some immediate thoughts after glancing through this:
> >> >>
> >> >> > --- Use cases ---
> >> >> >
> >> >> > The goal of this series is to add two new standard-ish places
> >> >> > in the transmit path:
> >> >> >
> >> >> > 1. Right before the packet is transmitted (with access to TX
> >> >> >    descriptors)
> >> >> > 2. Right after the packet is actually transmitted and we've received the
> >> >> >    completion (again, with access to TX completion descriptors)
> >> >> >
> >> >> > Accessing TX descriptors unlocks the following use-cases:
> >> >> >
> >> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> >> >> > use device offloads. The existing case implements TX timestamp.
> >> >> > - Observability: global per-netdev hooks can be used for tracing
> >> >> > the packets and exploring completion descriptors for all sorts of
> >> >> > device errors.
> >> >> >
> >> >> > Accessing TX descriptors also means that the hooks have to be called
> >> >> > from the drivers.
> >> >> >
> >> >> > The hooks are a light-weight alternative to XDP at egress and currently
> >> >> > don't provide any packet modification abilities. However, eventually,
> >> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
> >> >> > descriptors; for performance sake).
> >> >>
> >> >> dynptr?
> >> >
> >> > Haven't considered, let me explore, but not sure what it buys us
> >> > here?
> >>
> >> API consistency, certainly. Possibly also performance, if using the
> >> slice thing that gets you a direct pointer to the pkt data? Not sure
> >> about that, though, haven't done extensive benchmarking of dynptr yet...
> >
> > Same. Let's keep it on the table, I'll try to explore. I was just
> > thinking that having less abstraction here might be better
> > performance-wise.
>
> Sure, let's evaluate this once we have performance numbers.
>
> >> >> > --- UAPI ---
> >> >> >
> >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> >> >> > expose any UAPI and are implemented as tracing programs that call
> >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> >> >> > programs. The series expands device-bound infrastructure to tracing
> >> >> > programs.
> >> >>
> >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> >> >> of the XDP data path API, and I think we should expose them as proper
> >> >> bpf_link attachments from userspace with introspection etc. But I guess
> >> >> the bpf_mprog thing will give us that?
> >> >
> >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> >> > syscall program will still stay :-(
> >>
> >> Why does the attachment have to be done this way, exactly? Couldn't we
> >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> >> really piggy-backing on the function override thing anyway when the
> >> attachment is per-dev? Or am I misunderstanding how all this works?
> >
> > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > us an opportunity to fix things.
> > We can do it via a regular syscall path if there is a consensus.
>
> Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> we were to at some point conclude that this whole thing was not useful
> at all and deprecate it, it doesn't seem to me that it makes much
> difference whether that means "you can no longer create a link
> attachment of this type via BPF_LINK_CREATE" or "you can no longer
> create a link attachment of this type via BPF_PROG_RUN of a syscall type
> program" doesn't really seem like a significant detail to me...

In this case, why do you prefer it to go via a regular syscall? Seems
like we can avoid a bunch of boilerplate syscall work with a kfunc that
does the attachment?
We might as well abstract it at, say, the libbpf layer, which would
generate/load this small bpf program to call a kfunc.

> >> >> > --- skb vs xdp ---
> >> >> >
> >> >> > The hooks operate on a new light-weight devtx_frame which contains:
> >> >> > - data
> >> >> > - len
> >> >> > - sinfo
> >> >> >
> >> >> > This should allow us to have a unified (from BPF POW) place at TX
> >> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> >> >> > for each invocation).
> >> >>
> >> >> Not sure what I think about this one. At the very least I think we
> >> >> should expose xdp->data_meta as well. I'm not sure what the use case for
> >> >> accessing skbs is? If that *is* indeed useful, probably there will also
> >> >> end up being a use case for accessing the full skb?
> >> >
> >> > skb_shared_info has meta_len, buf afaik, xdp doesn't use it. Maybe I
> >> > a good opportunity to unify? Or probably won't work because if
> >> > xdf_frame doesn't have frags, it won't have sinfo?
> >>
> >> No, it won't. But why do we need this unification between the skb and
> >> xdp paths in the first place? Doesn't the skb path already have support
> >> for these things? Seems like we could just stick to making this xdp-only
> >> and keeping xdp_frame as the ctx argument?
> >
> > For skb path, I'm assuming we can read sinfo->meta_len; it feels nice
> > to make it work for both cases?
> > We can always export metadata len via some kfunc, sure.
>
> I wasn't referring to the metadata field specifically when talking about
> the skb path. I'm wondering why we need these hooks to work for the skb
> path at all? :)

Aaah. I think John wanted them to trigger for the skb path, so I'm trying
to explore whether having both makes sense.
But also, if we go purely xdp_frame, what about AF_XDP in copy mode?
That's still skb-driven, right?
Not sure this skb vs xdp split is clear-cut :-/
Alexei Starovoitov June 13, 2023, 10:32 p.m. UTC | #7
On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> > >> >> > --- UAPI ---
> > >> >> >
> > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > >> >> > expose any UAPI and are implemented as tracing programs that call
> > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > >> >> > programs. The series expands device-bound infrastructure to tracing
> > >> >> > programs.
> > >> >>
> > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > >> >> of the XDP data path API, and I think we should expose them as proper
> > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > >> >> the bpf_mprog thing will give us that?
> > >> >
> > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > >> > syscall program will still stay :-(
> > >>
> > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > >> really piggy-backing on the function override thing anyway when the
> > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > >
> > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > us an opportunity to fix things.
> > > We can do it via a regular syscall path if there is a consensus.
> >
> > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > we were to at some point conclude that this whole thing was not useful
> > at all and deprecate it, it doesn't seem to me that it makes much
> > difference whether that means "you can no longer create a link
> > attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > program" doesn't really seem like a significant detail to me...
>
> In this case, why do you prefer it to go via regular syscall? Seems
> like we can avoid a bunch of boileplate syscall work with a kfunc that
> does the attachment?
> We might as well abstract it at, say, libbpf layer which would
> generate/load this small bpf program to call a kfunc.

I'm not sure we're on the same page here.
imo using a syscall bpf prog that calls a kfunc to do a per-device attach
is overkill here.
It's an experimental feature, but you're already worried about
multiple netdevs?

Can you add an empty nop function and attach to it tracing-style
with fentry?
It won't be per-netdev, but do you have to do the per-device demux
in the kernel? Can your tracing bpf prog do that instead?
It's just an ifindex compare (see the sketch below).
This way the non-uapi bits will be even smaller and there's no need
to change struct net_device.
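
Roughly (illustrative names; the stub has to be noinline so fentry has
something to hook):

    /* kernel side: a single do-nothing hook point */
    __weak noinline void devtx_submit(struct net_device *dev,
                                      struct devtx_frame *frame)
    {
    }

    /* BPF side: per-device demux is just an ifindex compare */
    const volatile int target_ifindex; /* set by the loader */

    SEC("fentry/devtx_submit")
    int BPF_PROG(tx_hook, struct net_device *dev, struct devtx_frame *frame)
    {
            if (dev->ifindex != target_ifindex)
                    return 0;
            /* ... */
            return 0;
    }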
Stanislav Fomichev June 13, 2023, 11:16 p.m. UTC | #8
On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > > >> >> > --- UAPI ---
> > > >> >> >
> > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > >> >> > expose any UAPI and are implemented as tracing programs that call
> > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > > >> >> > programs. The series expands device-bound infrastructure to tracing
> > > >> >> > programs.
> > > >> >>
> > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > > >> >> of the XDP data path API, and I think we should expose them as proper
> > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > > >> >> the bpf_mprog thing will give us that?
> > > >> >
> > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > > >> > syscall program will still stay :-(
> > > >>
> > > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > > >> really piggy-backing on the function override thing anyway when the
> > > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > > >
> > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > > us an opportunity to fix things.
> > > > We can do it via a regular syscall path if there is a consensus.
> > >
> > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > > we were to at some point conclude that this whole thing was not useful
> > > at all and deprecate it, it doesn't seem to me that it makes much
> > > difference whether that means "you can no longer create a link
> > > attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > > program" doesn't really seem like a significant detail to me...
> >
> > In this case, why do you prefer it to go via regular syscall? Seems
> > like we can avoid a bunch of boileplate syscall work with a kfunc that
> > does the attachment?
> > We might as well abstract it at, say, libbpf layer which would
> > generate/load this small bpf program to call a kfunc.
>
> I'm not sure we're on the same page here.
> imo using syscall bpf prog that calls kfunc to do a per-device attach
> is an overkill here.
> It's an experimental feature, but you're already worried about
> multiple netdevs?
>
> Can you add an empty nop function and attach to it tracing style
> with fentry ?
> It won't be per-netdev, but do you have to do per-device demux
> by the kernel? Can your tracing bpf prog do that instead?
> It's just an ifindex compare.
> This way than non-uapi bits will be even smaller and no need
> to change struct netdevice.

It's probably going to work if each driver has a separate set of tx
fentry points, something like:
  {veth,mlx5,etc}_devtx_submit()
  {veth,mlx5,etc}_devtx_complete()
Because I still need to have those netdev-bound tracing programs to
get access to driver kfuncs.

I can try to sketch something together for a v2.
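
Roughly, with all names assumed, the driver side would look like:

    /* e.g. in drivers/net/veth.c: noinline stubs that exist purely as
     * fentry attach points, called from the xmit/completion paths
     */
    __weak noinline void veth_devtx_submit(struct devtx_frame *frame)
    {
    }

    __weak noinline void veth_devtx_complete(struct devtx_frame *frame)
    {
    }

Netdev-bound programs would then fentry-attach to the stubs of the
driver they target, which also scopes that driver's kfuncs.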
Jakub Kicinski June 14, 2023, 3:31 a.m. UTC | #9
On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:
> The goal of this series is to add two new standard-ish places
> in the transmit path:
> 
> 1. Right before the packet is transmitted (with access to TX
>    descriptors)

I'm not sure that the Tx descriptors can be populated piecemeal.
If we were ever to support more standard offload features, which
require packet geometry (hdr offsets etc.) to be described, "call
per feature" will end up duplicating arguments, and there will be
a lot of args...

And if there is an SKB path in the future, combining the normal SKB
offloads with the half-rendered descriptors may be a pain.

> 2. Right after the packet is actually transmitted and we've received the
>    completion (again, with access to TX completion descriptors)

For completions trace-style attach makes perfect sense.

> Accessing TX descriptors unlocks the following use-cases:
> 
> - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> use device offloads. The existing case implements TX timestamp.
> - Observability: global per-netdev hooks can be used for tracing
> the packets and exploring completion descriptors for all sorts of
> device errors.
> 
> Accessing TX descriptors also means that the hooks have to be called
> from the drivers.
> 
> The hooks are a light-weight alternative to XDP at egress and currently

Hm, a lot of lightweight features lately. Unbearable lightness of
features.
David Ahern June 14, 2023, 3:54 a.m. UTC | #10
On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:
>> The goal of this series is to add two new standard-ish places
>> in the transmit path:
>>
>> 1. Right before the packet is transmitted (with access to TX
>>    descriptors)

If a device requires multiple Tx descriptors per skb or multibuf frame,
how would that be handled within the XDP API?

> 
> I'm not sure that the Tx descriptors can be populated piecemeal.

If it is host memory before the pidx move, why would that matter? Do you
have a specific example in mind?

> If we were ever to support more standard offload features, which
> require packet geometry (hdr offsets etc.) to be described "call
> per feature" will end up duplicating arguments, and there will be
> a lot of args..
> 
> And if there is an SKB path in the future combining the normal SKB
> offloads with the half-rendered descriptors may be a pain.

Once the descriptor(s) is (are) populated, the skb is irrelevant, is it
not? The only complication that comes to mind is wanting to add or remove
headers (e.g., tunnels), which will be much more complicated at this
point, but might still be possible on a per-NIC (and maybe version) basis.
Alexei Starovoitov June 14, 2023, 4:19 a.m. UTC | #11
On Tue, Jun 13, 2023 at 4:16 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > > >> >> > --- UAPI ---
> > > > >> >> >
> > > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > > >> >> > expose any UAPI and are implemented as tracing programs that call
> > > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > > > >> >> > programs. The series expands device-bound infrastructure to tracing
> > > > >> >> > programs.
> > > > >> >>
> > > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > > > >> >> of the XDP data path API, and I think we should expose them as proper
> > > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > > > >> >> the bpf_mprog thing will give us that?
> > > > >> >
> > > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > > > >> > syscall program will still stay :-(
> > > > >>
> > > > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > > > >> really piggy-backing on the function override thing anyway when the
> > > > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > > > >
> > > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > > > us an opportunity to fix things.
> > > > > We can do it via a regular syscall path if there is a consensus.
> > > >
> > > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > > > we were to at some point conclude that this whole thing was not useful
> > > > at all and deprecate it, it doesn't seem to me that it makes much
> > > > difference whether that means "you can no longer create a link
> > > > attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > > > program" doesn't really seem like a significant detail to me...
> > >
> > > In this case, why do you prefer it to go via regular syscall? Seems
> > > like we can avoid a bunch of boileplate syscall work with a kfunc that
> > > does the attachment?
> > > We might as well abstract it at, say, libbpf layer which would
> > > generate/load this small bpf program to call a kfunc.
> >
> > I'm not sure we're on the same page here.
> > imo using syscall bpf prog that calls kfunc to do a per-device attach
> > is an overkill here.
> > It's an experimental feature, but you're already worried about
> > multiple netdevs?
> >
> > Can you add an empty nop function and attach to it tracing style
> > with fentry ?
> > It won't be per-netdev, but do you have to do per-device demux
> > by the kernel? Can your tracing bpf prog do that instead?
> > It's just an ifindex compare.
> > This way than non-uapi bits will be even smaller and no need
> > to change struct netdevice.
>
> It's probably going to work if each driver has a separate set of tx
> fentry points, something like:
>   {veth,mlx5,etc}_devtx_submit()
>   {veth,mlx5,etc}_devtx_complete()

Right. And per-driver descriptors.
The 'generic' xdp-tx metadata is unlikely to be practical.
Marshaling in and out of it is going to be too perf-sensitive.
I'd just add an attach point in the driver with enough
args for bpf progs to make sense of the context, and extend
the verifier to make a few safe fields writable.
kfuncs to read/request the timestamp are probably too slow.
Jakub Kicinski June 14, 2023, 5:05 a.m. UTC | #12
On Tue, 13 Jun 2023 20:54:26 -0700 David Ahern wrote:
> On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> > On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:  
> >> The goal of this series is to add two new standard-ish places
> >> in the transmit path:
> >>
> >> 1. Right before the packet is transmitted (with access to TX
> >>    descriptors)  
> 
> If a device requires multiple Tx descriptors per skb or multibuf frame,
> how would that be handled within the XDP API?
> 
> > I'm not sure that the Tx descriptors can be populated piecemeal.  
> 
> If it is host memory before the pidx move, why would that matter? Do you
> have a specific example in mind?

I don't mean it's impossible to implement, but it may get cumbersome.
TSO/CSO/crypto may all need to know where the L4 header starts, f.e.
Some ECN marking in the NIC may also want to know where L3 is.
So the offsets will get duplicated in each API.

> > If we were ever to support more standard offload features, which
> > require packet geometry (hdr offsets etc.) to be described "call
> > per feature" will end up duplicating arguments, and there will be
> > a lot of args..
> > 
> > And if there is an SKB path in the future combining the normal SKB
> > offloads with the half-rendered descriptors may be a pain.  
> 
> Once the descriptor(s) is (are) populated, the skb is irrelevant is it
> not? Only complication that comes to mind is wanting to add or remove
> headers (e.g., tunnels) which will be much more complicated at this
> point, but might still be possible on a per NIC (and maybe version) basis.

I guess one can write the skb descriptors first, then modify them from
BPF. Either way, I feel like the helper approach for Tx will result
in drivers saving the info into some local struct and then rendering
the descriptors after. We'll see.
Toke Høiland-Jørgensen June 14, 2023, 11:59 a.m. UTC | #13
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Jun 13, 2023 at 4:16 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> >
>> > On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
>> > >
>> > > > >> >> > --- UAPI ---
>> > > > >> >> >
>> > > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> > > > >> >> > expose any UAPI and are implemented as tracing programs that call
>> > > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> > > > >> >> > programs. The series expands device-bound infrastructure to tracing
>> > > > >> >> > programs.
>> > > > >> >>
>> > > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
>> > > > >> >> of the XDP data path API, and I think we should expose them as proper
>> > > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
>> > > > >> >> the bpf_mprog thing will give us that?
>> > > > >> >
>> > > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
>> > > > >> > syscall program will still stay :-(
>> > > > >>
>> > > > >> Why does the attachment have to be done this way, exactly? Couldn't we
>> > > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
>> > > > >> really piggy-backing on the function override thing anyway when the
>> > > > >> attachment is per-dev? Or am I misunderstanding how all this works?
>> > > > >
>> > > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
>> > > > > us an opportunity to fix things.
>> > > > > We can do it via a regular syscall path if there is a consensus.
>> > > >
>> > > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
>> > > > we were to at some point conclude that this whole thing was not useful
>> > > > at all and deprecate it, it doesn't seem to me that it makes much
>> > > > difference whether that means "you can no longer create a link
>> > > > attachment of this type via BPF_LINK_CREATE" or "you can no longer
>> > > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
>> > > > program" doesn't really seem like a significant detail to me...
>> > >
>> > > In this case, why do you prefer it to go via regular syscall? Seems
>> > > like we can avoid a bunch of boileplate syscall work with a kfunc that
>> > > does the attachment?
>> > > We might as well abstract it at, say, libbpf layer which would
>> > > generate/load this small bpf program to call a kfunc.
>> >
>> > I'm not sure we're on the same page here.
>> > imo using syscall bpf prog that calls kfunc to do a per-device attach
>> > is an overkill here.
>> > It's an experimental feature, but you're already worried about
>> > multiple netdevs?
>> >
>> > Can you add an empty nop function and attach to it tracing style
>> > with fentry ?
>> > It won't be per-netdev, but do you have to do per-device demux
>> > by the kernel? Can your tracing bpf prog do that instead?
>> > It's just an ifindex compare.
>> > This way than non-uapi bits will be even smaller and no need
>> > to change struct netdevice.
>>
>> It's probably going to work if each driver has a separate set of tx
>> fentry points, something like:
>>   {veth,mlx5,etc}_devtx_submit()
>>   {veth,mlx5,etc}_devtx_complete()

I really don't get the opposition to exposing proper APIs; as a
dataplane developer I want to attach a program to an interface. The
kernel's role is to provide a consistent interface for this, not to
require users to become driver developers just to get at the required
details.

> Right. And per-driver descriptors.
> The 'generic' xdptx metadata is unlikely to be practical.
> Marshaling in and out of it is going to be too perf sensitive.
> I'd just add an attach point in the driver with enough
> args for bpf progs to make sense of the context and extend
> the verifier to make few safe fields writeable.

This is a rehashing of the argument we had on the RX side: just exposing
descriptors is a bad API because it forces BPF programs to deal with
hardware errata - which is exactly the kind of thing that belongs in a
driver. Exposing kfuncs allows drivers to deal with hardware quirks
while exposing a consistent API to (BPF) users.

> kfuncs to read/request timestamp are probably too slow.

Which is why we should be inlining them :)

-Toke
Alexei Starovoitov June 14, 2023, 4:27 p.m. UTC | #14
On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> >>
> >> It's probably going to work if each driver has a separate set of tx
> >> fentry points, something like:
> >>   {veth,mlx5,etc}_devtx_submit()
> >>   {veth,mlx5,etc}_devtx_complete()
>
> I really don't get the opposition to exposing proper APIs; as a
> dataplane developer I want to attach a program to an interface. The
> kernel's role is to provide a consistent interface for this, not to
> require users to become driver developers just to get at the required
> details.

A consistent interface can appear only when there is consistency
across nic manufacturers.
I'm suggesting experimenting in the most unstable way and,
if/when the consistency is discovered, then generalizing.
Stanislav Fomichev June 14, 2023, 5:17 p.m. UTC | #15
On 06/13, Jakub Kicinski wrote:
> On Tue, 13 Jun 2023 20:54:26 -0700 David Ahern wrote:
> > On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> > > On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:  
> > >> The goal of this series is to add two new standard-ish places
> > >> in the transmit path:
> > >>
> > >> 1. Right before the packet is transmitted (with access to TX
> > >>    descriptors)  
> > 
> > If a device requires multiple Tx descriptors per skb or multibuf frame,
> > how would that be handled within the XDP API?
> > 
> > > I'm not sure that the Tx descriptors can be populated piecemeal.  
> > 
> > If it is host memory before the pidx move, why would that matter? Do you
> > have a specific example in mind?
> 
> I don't mean it's impossible implement, but it's may get cumbersome.
> TSO/CSO/crypto may all need to know where L4 header starts, f.e.
> Some ECN marking in the NIC may also want to know where L3 is.
> So the offsets will get duplicated in each API.
> 
> > > If we were ever to support more standard offload features, which
> > > require packet geometry (hdr offsets etc.) to be described "call
> > > per feature" will end up duplicating arguments, and there will be
> > > a lot of args..
> > > 
> > > And if there is an SKB path in the future combining the normal SKB
> > > offloads with the half-rendered descriptors may be a pain.  
> > 
> > Once the descriptor(s) is (are) populated, the skb is irrelevant is it
> > not? Only complication that comes to mind is wanting to add or remove
> > headers (e.g., tunnels) which will be much more complicated at this
> > point, but might still be possible on a per NIC (and maybe version) basis.
> 
> I guess one can write the skb descriptors first, then modify them from
> the BPF. Either way I feel like the helper approach for Tx will result
> in drivers saving the info into some local struct and then rendering
> the descriptors after. We'll see.

I agree that it's probably the "easiest" option to implement for the
majority of the devices that were designed without much
programmability this late in the stack. But maybe some devices can,
or at least we can try to influence future designs :-)
Toke Høiland-Jørgensen June 15, 2023, 12:36 p.m. UTC | #16
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> >>
>> >> It's probably going to work if each driver has a separate set of tx
>> >> fentry points, something like:
>> >>   {veth,mlx5,etc}_devtx_submit()
>> >>   {veth,mlx5,etc}_devtx_complete()
>>
>> I really don't get the opposition to exposing proper APIs; as a
>> dataplane developer I want to attach a program to an interface. The
>> kernel's role is to provide a consistent interface for this, not to
>> require users to become driver developers just to get at the required
>> details.
>
> Consistent interface can appear only when there is a consistency
> across nic manufacturers.
> I'm suggesting to experiment in the most unstable way and
> if/when the consistency is discovered then generalize.

That would be fine for new experimental HW features, but we're talking
about timestamps here: a feature that is already supported by multiple
drivers and for which the stack has a working abstraction. There's no
reason why we can't have that for the XDP path as well.

-Toke
Alexei Starovoitov June 15, 2023, 4:10 p.m. UTC | #17
On Thu, Jun 15, 2023 at 5:36 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> >>
> >> >> It's probably going to work if each driver has a separate set of tx
> >> >> fentry points, something like:
> >> >>   {veth,mlx5,etc}_devtx_submit()
> >> >>   {veth,mlx5,etc}_devtx_complete()
> >>
> >> I really don't get the opposition to exposing proper APIs; as a
> >> dataplane developer I want to attach a program to an interface. The
> >> kernel's role is to provide a consistent interface for this, not to
> >> require users to become driver developers just to get at the required
> >> details.
> >
> > Consistent interface can appear only when there is a consistency
> > across nic manufacturers.
> > I'm suggesting to experiment in the most unstable way and
> > if/when the consistency is discovered then generalize.
>
> That would be fine for new experimental HW features, but we're talking
> about timestamps here: a feature that is already supported by multiple
> drivers and for which the stack has a working abstraction. There's no
> reason why we can't have that for the XDP path as well.

... has an abstraction to receive, but has no mechanism to set it
selectively per packet and read it on completion.
Stanislav Fomichev June 15, 2023, 4:31 p.m. UTC | #18
On Thu, Jun 15, 2023 at 9:10 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 15, 2023 at 5:36 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > >>
> > >> >>
> > >> >> It's probably going to work if each driver has a separate set of tx
> > >> >> fentry points, something like:
> > >> >>   {veth,mlx5,etc}_devtx_submit()
> > >> >>   {veth,mlx5,etc}_devtx_complete()
> > >>
> > >> I really don't get the opposition to exposing proper APIs; as a
> > >> dataplane developer I want to attach a program to an interface. The
> > >> kernel's role is to provide a consistent interface for this, not to
> > >> require users to become driver developers just to get at the required
> > >> details.
> > >
> > > Consistent interface can appear only when there is a consistency
> > > across nic manufacturers.
> > > I'm suggesting to experiment in the most unstable way and
> > > if/when the consistency is discovered then generalize.
> >
> > That would be fine for new experimental HW features, but we're talking
> > about timestamps here: a feature that is already supported by multiple
> > drivers and for which the stack has a working abstraction. There's no
> > reason why we can't have that for the XDP path as well.
>
> ... has an abstraction to receive, but has no mechanism to set it
> selectively per packet and read it on completion.

Timestamp might be a bit of an outlier here where it's just setting
some bit in some existing descriptor.
For some features, the drivers might have to reserve extra descriptors
and I'm not sure how safe it would be to let the programs arbitrarily
mess with the descriptor queues like that.
I'll probably keep this kfunc approach for the v2 RFC for now (will try to
get rid of the complicated attachment at least), but let's keep
discussing.
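
To make the kfunc approach concrete, what I have in mind for v2 is
roughly the following (the hook and kfunc names are placeholders, not
a final API):

	SEC("fentry/veth_devtx_submit")
	int BPF_PROG(tx_submit, struct devtx_frame *frame)
	{
		/* per-packet decision on whether to request an offload */
		if (wants_tx_timestamp(frame)) /* app-specific policy */
			devtx_request_timestamp(frame); /* placeholder kfunc */
		return 0;
	}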
Stanislav Fomichev June 16, 2023, 12:09 a.m. UTC | #19
On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Some immediate thoughts after glancing through this:
>
> > --- Use cases ---
> >
> > The goal of this series is to add two new standard-ish places
> > in the transmit path:
> >
> > 1. Right before the packet is transmitted (with access to TX
> >    descriptors)
> > 2. Right after the packet is actually transmitted and we've received the
> >    completion (again, with access to TX completion descriptors)
> >
> > Accessing TX descriptors unlocks the following use-cases:
> >
> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > use device offloads. The existing case implements TX timestamp.
> > - Observability: global per-netdev hooks can be used for tracing
> > the packets and exploring completion descriptors for all sorts of
> > device errors.
> >
> > Accessing TX descriptors also means that the hooks have to be called
> > from the drivers.
> >
> > The hooks are a light-weight alternative to XDP at egress and currently
> > don't provide any packet modification abilities. However, eventually,
> > can expose new kfuncs to operate on the packet (or, rather, the actual
> > descriptors; for performance sake).
>
> dynptr?
>
> > --- UAPI ---
> >
> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > expose any UAPI and are implemented as tracing programs that call
> > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > programs. The series expands device-bound infrastructure to tracing
> > programs.
>
> Not a fan of the "attach from BPF syscall program" thing. These are part
> of the XDP data path API, and I think we should expose them as proper
> bpf_link attachments from userspace with introspection etc. But I guess
> the bpf_mprog thing will give us that?
>
> > --- skb vs xdp ---
> >
> > The hooks operate on a new light-weight devtx_frame which contains:
> > - data
> > - len
> > - sinfo
> >
> > This should allow us to have a unified (from BPF POV) place at TX
> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > for each invocation).
>
> Not sure what I think about this one. At the very least I think we
> should expose xdp->data_meta as well. I'm not sure what the use case for
> accessing skbs is? If that *is* indeed useful, probably there will also
> end up being a use case for accessing the full skb?

I spent some time looking at the data_meta story on AF_XDP TX and it
doesn't look like it's supported (at least in a general way).
You obviously get some data_meta when you do XDP_TX, but if you want
to pass something to the bpf prog when doing TX via the AF_XDP ring,
it gets complicated.
In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
and pass something in the headroom.
In copy-mode, there is no support for doing skb_metadata_set.

Probably makes sense to have something like tx_metalen on the xsk? And
skb_metadata_set it in copy mode and skip it in zerocopy mode?
Or maybe I'm missing something?
Jakub Kicinski June 16, 2023, 1:50 a.m. UTC | #20
On Thu, 15 Jun 2023 09:31:19 -0700 Stanislav Fomichev wrote:
> Timestamp might be a bit of an outlier here where it's just setting
> some bit in some existing descriptor.
> For some features, the drivers might have to reserve extra descriptors
> and I'm not sure how safe it would be to let the programs arbitrarily
> mess with the descriptor queues like that.

I was gonna say, most NICs will have some form of descriptor chaining,
with strict ordering, to perform reasonably with small packets.

> I'll probably keep this kfunc approach for the v2 RFC for now (will try to
> get rid of the complicated attachment at least), but let's keep
> discussing.

SGTM, FWIW.
Magnus Karlsson June 16, 2023, 8:12 a.m. UTC | #21
On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
> > Some immediate thoughts after glancing through this:
> >
> > > --- Use cases ---
> > >
> > > The goal of this series is to add two new standard-ish places
> > > in the transmit path:
> > >
> > > 1. Right before the packet is transmitted (with access to TX
> > >    descriptors)
> > > 2. Right after the packet is actually transmitted and we've received the
> > >    completion (again, with access to TX completion descriptors)
> > >
> > > Accessing TX descriptors unlocks the following use-cases:
> > >
> > > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > > use device offloads. The existing case implements TX timestamp.
> > > - Observability: global per-netdev hooks can be used for tracing
> > > the packets and exploring completion descriptors for all sorts of
> > > device errors.
> > >
> > > Accessing TX descriptors also means that the hooks have to be called
> > > from the drivers.
> > >
> > > The hooks are a light-weight alternative to XDP at egress and currently
> > > don't provide any packet modification abilities. However, eventually,
> > > can expose new kfuncs to operate on the packet (or, rather, the actual
> > > descriptors; for performance sake).
> >
> > dynptr?
> >
> > > --- UAPI ---
> > >
> > > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > expose any UAPI and are implemented as tracing programs that call
> > > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > > programs. The series expands device-bound infrastructure to tracing
> > > programs.
> >
> > Not a fan of the "attach from BPF syscall program" thing. These are part
> > of the XDP data path API, and I think we should expose them as proper
> > bpf_link attachments from userspace with introspection etc. But I guess
> > the bpf_mprog thing will give us that?
> >
> > > --- skb vs xdp ---
> > >
> > > The hooks operate on a new light-weight devtx_frame which contains:
> > > - data
> > > - len
> > > - sinfo
> > >
> > > This should allow us to have a unified (from BPF POV) place at TX
> > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > for each invocation).
> >
> > Not sure what I think about this one. At the very least I think we
> > should expose xdp->data_meta as well. I'm not sure what the use case for
> > accessing skbs is? If that *is* indeed useful, probably there will also
> > end up being a use case for accessing the full skb?
>
> I spent some time looking at the data_meta story on AF_XDP TX and it
> doesn't look like it's supported (at least in a general way).
> You obviously get some data_meta when you do XDP_TX, but if you want
> to pass something to the bpf prog when doing TX via the AF_XDP ring,
> it gets complicated.

When we designed this some 5 - 6 years ago, we thought that there
would be an XDP for egress action in the "nearish" future that could
be used to interpret the metadata field in front of the packet.
Basically, the user would load an XDP egress program that would define
the metadata layout by the operations it would perform on the metadata
area. But since XDP on egress has not happened, you are right, there
is definitely something missing to be able to use metadata on Tx. Or
could your proposed hook points be used for something like this?

> In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> and pass something in the headroom.

This feature is mainly used to allow for multiple packets on the same
chunk (to save space) and also to be able to have packets spanning two
chunks. Even in aligned mode, you can start a packet at an arbitrary
address in the chunk as long as the whole packet fits into the chunk.
So no problem having headroom in any of the modes.
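
In other words, in aligned mode the producer can already lay out a TX
descriptor along these lines (a sketch; the chunk size and offsets are
just illustrative):

	__u64 chunk_start = frame_nr * 2048;	/* aligned-mode chunk */
	__u32 headroom = 128;			/* reserved for metadata */

	tx_desc->addr = chunk_start + headroom;	/* packet data */
	tx_desc->len = pkt_len;
	/* the metadata then sits at tx_desc->addr - meta_len, i.e.
	 * immediately in front of the packet, inside the same chunk */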


> In copy-mode, there is no support for doing skb_metadata_set.
>
> Probably makes sense to have something like tx_metalen on the xsk? And
> skb_metadata_set it in copy mode and skip it in zerocopy mode?
> Or maybe I'm missing something?
>
Stanislav Fomichev June 16, 2023, 5:32 p.m. UTC | #22
On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > >
> > > Some immediate thoughts after glancing through this:
> > >
> > > > --- Use cases ---
> > > >
> > > > The goal of this series is to add two new standard-ish places
> > > > in the transmit path:
> > > >
> > > > 1. Right before the packet is transmitted (with access to TX
> > > >    descriptors)
> > > > 2. Right after the packet is actually transmitted and we've received the
> > > >    completion (again, with access to TX completion descriptors)
> > > >
> > > > Accessing TX descriptors unlocks the following use-cases:
> > > >
> > > > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > > > use device offloads. The existing case implements TX timestamp.
> > > > - Observability: global per-netdev hooks can be used for tracing
> > > > the packets and exploring completion descriptors for all sorts of
> > > > device errors.
> > > >
> > > > Accessing TX descriptors also means that the hooks have to be called
> > > > from the drivers.
> > > >
> > > > The hooks are a light-weight alternative to XDP at egress and currently
> > > > don't provide any packet modification abilities. However, eventually,
> > > > can expose new kfuncs to operate on the packet (or, rather, the actual
> > > > descriptors; for performance sake).
> > >
> > > dynptr?
> > >
> > > > --- UAPI ---
> > > >
> > > > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > > expose any UAPI and are implemented as tracing programs that call
> > > > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > > > programs. The series expands device-bound infrastructure to tracing
> > > > programs.
> > >
> > > Not a fan of the "attach from BPF syscall program" thing. These are part
> > > of the XDP data path API, and I think we should expose them as proper
> > > bpf_link attachments from userspace with introspection etc. But I guess
> > > the bpf_mprog thing will give us that?
> > >
> > > > --- skb vs xdp ---
> > > >
> > > > The hooks operate on a new light-weight devtx_frame which contains:
> > > > - data
> > > > - len
> > > > - sinfo
> > > >
> > > > This should allow us to have a unified (from BPF POV) place at TX
> > > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > > for each invocation).
> > >
> > > Not sure what I think about this one. At the very least I think we
> > > should expose xdp->data_meta as well. I'm not sure what the use case for
> > > accessing skbs is? If that *is* indeed useful, probably there will also
> > > end up being a use case for accessing the full skb?
> >
> > I spent some time looking at the data_meta story on AF_XDP TX and it
> > doesn't look like it's supported (at least in a general way).
> > You obviously get some data_meta when you do XDP_TX, but if you want
> > to pass something to the bpf prog when doing TX via the AF_XDP ring,
> > it gets complicated.
>
> When we designed this some 5 - 6 years ago, we thought that there
> would be an XDP for egress action in the "nearish" future that could
> be used to interpret the metadata field in front of the packet.
> Basically, the user would load an XDP egress program that would define
> the metadata layout by the operations it would perform on the metadata
> area. But since XDP on egress has not happened, you are right, there
> is definitely something missing to be able to use metadata on Tx. Or
> could your proposed hook points be used for something like this?

Thanks for the context!
Yes, the proposal is to use these new tx hooks to read out af_xdp
metadata and apply it to the packet via a bunch of tbd kfuncs.
AF_XDP and BPF programs would have to have a contract about the
metadata layout (same as we have on rx).
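
E.g. something as small as the following would do (a hypothetical
layout; the actual contract is up to the application and its program,
exactly like on rx):

	/* written by the AF_XDP producer in front of the packet,
	 * consumed by the devtx submission/completion hooks
	 */
	struct my_tx_metadata {
		__u64 flags;		/* e.g. MY_TX_REQUEST_TIMESTAMP */
		__u64 completion_ts;	/* filled in by the completion hook */
	};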

> > In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> > and pass something in the headroom.
>
> This feature is mainly used to allow for multiple packets on the same
> chunk (to save space) and also to be able to have packets spanning two
> chunks. Even in aligned mode, you can start a packet at an arbitrary
> address in the chunk as long as the whole packet fits into the chunk.
> So no problem having headroom in any of the modes.

But if I put it into the headroom it will only be passed down to the
driver in zero-copy mode, right?
If I do tx_desc->addr = packet_start, no metadata (that goes prior to
packet_start) gets copied into the skb in copy mode (it seems).
Or do you suggest that the interface should be tx_desc->addr =
metadata_start and the bpf program should call the equivalent of
bpf_xdp_adjust_head to consume this metadata?
Stanislav Fomichev June 16, 2023, 11:10 p.m. UTC | #23
On 06/16, Stanislav Fomichev wrote:
> On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > > >
> > > > Some immediate thoughts after glancing through this:
> > > >
> > > > > --- Use cases ---
> > > > >
> > > > > The goal of this series is to add two new standard-ish places
> > > > > in the transmit path:
> > > > >
> > > > > 1. Right before the packet is transmitted (with access to TX
> > > > >    descriptors)
> > > > > 2. Right after the packet is actually transmitted and we've received the
> > > > >    completion (again, with access to TX completion descriptors)
> > > > >
> > > > > Accessing TX descriptors unlocks the following use-cases:
> > > > >
> > > > > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > > > > use device offloads. The existing case implements TX timestamp.
> > > > > - Observability: global per-netdev hooks can be used for tracing
> > > > > the packets and exploring completion descriptors for all sorts of
> > > > > device errors.
> > > > >
> > > > > Accessing TX descriptors also means that the hooks have to be called
> > > > > from the drivers.
> > > > >
> > > > > The hooks are a light-weight alternative to XDP at egress and currently
> > > > > don't provide any packet modification abilities. However, eventually,
> > > > > can expose new kfuncs to operate on the packet (or, rather, the actual
> > > > > descriptors; for performance sake).
> > > >
> > > > dynptr?
> > > >
> > > > > --- UAPI ---
> > > > >
> > > > > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > > > expose any UAPI and are implemented as tracing programs that call
> > > > > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > > > > programs. The series expands device-bound infrastructure to tracing
> > > > > programs.
> > > >
> > > > Not a fan of the "attach from BPF syscall program" thing. These are part
> > > > of the XDP data path API, and I think we should expose them as proper
> > > > bpf_link attachments from userspace with introspection etc. But I guess
> > > > the bpf_mprog thing will give us that?
> > > >
> > > > > --- skb vs xdp ---
> > > > >
> > > > > The hooks operate on a new light-weight devtx_frame which contains:
> > > > > - data
> > > > > - len
> > > > > - sinfo
> > > > >
> > > > > This should allow us to have a unified (from BPF POV) place at TX
> > > > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > > > for each invocation).
> > > >
> > > > Not sure what I think about this one. At the very least I think we
> > > > should expose xdp->data_meta as well. I'm not sure what the use case for
> > > > accessing skbs is? If that *is* indeed useful, probably there will also
> > > > end up being a use case for accessing the full skb?
> > >
> > > I spent some time looking at the data_meta story on AF_XDP TX and it
> > > doesn't look like it's supported (at least in a general way).
> > > You obviously get some data_meta when you do XDP_TX, but if you want
> > > to pass something to the bpf prog when doing TX via the AF_XDP ring,
> > > it gets complicated.
> >
> > When we designed this some 5 - 6 years ago, we thought that there
> > would be an XDP for egress action in the "nearish" future that could
> > be used to interpret the metadata field in front of the packet.
> > Basically, the user would load an XDP egress program that would define
> > the metadata layout by the operations it would perform on the metadata
> > area. But since XDP on egress has not happened, you are right, there
> > is definitely something missing to be able to use metadata on Tx. Or
> > could your proposed hook points be used for something like this?
> 
> Thanks for the context!
> Yes, the proposal is to use these new tx hooks to read out af_xdp
> metadata and apply it to the packet via a bunch of tbd kfuncs.
> AF_XDP and BPF programs would have to have a contract about the
> metadata layout (same as we have on rx).
> 
> > > In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> > > and pass something in the headroom.
> >
> > This feature is mainly used to allow for multiple packets on the same
> > chunk (to save space) and also to be able to have packets spanning two
> > chunks. Even in aligned mode, you can start a packet at an arbitrary
> > address in the chunk as long as the whole packet fits into the chunk.
> > So no problem having headroom in any of the modes.
> 
> But if I put it into the headroom it will only be passed down to the
> driver in zero-copy mode, right?
> If I do tx_desc->addr = packet_start, no metadata (that goes prior to
> packet_start) gets copied into the skb in copy mode (it seems).
> Or do you suggest that the interface should be tx_desc->addr =
> metadata_start and the bpf program should call the equivalent of
> bpf_xdp_adjust_head to consume this metadata?

For copy-mode, here is what I've prototyped. That seems to work.
For zero-copy, I don't think we need anything extra (besides exposing
xsk->tx_metadata_len at the hook point, TBD). Does the patch below make
sense?

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..30018b3b862d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
 	struct list_head flush_node;
 	struct xsk_buff_pool *pool;
 	u16 queue_id;
+	u8 tx_metadata_len;
 	bool zc;
 	enum {
 		XSK_READY = 0,
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a8096f4ce..2374eafff7db 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc1e7f15fa73..a95872712547 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -493,14 +493,22 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-		skb_put(skb, len);
+		skb_put(skb, len + xs->tx_metadata_len);
 
 		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		buffer -= xs->tx_metadata_len;
+
-		err = skb_store_bits(skb, 0, buffer, len);
+		/* copy the metadata together with the packet data */
+		err = skb_store_bits(skb, 0, buffer, len + xs->tx_metadata_len);
 		if (unlikely(err)) {
 			kfree_skb(skb);
 			return ERR_PTR(err);
 		}
+
+		if (xs->tx_metadata_len) {
+			skb_metadata_set(skb, xs->tx_metadata_len);
+			__skb_pull(skb, xs->tx_metadata_len);
+		}
 	}
 
 	skb->dev = dev;
@@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_TX_METADATA_LEN:
+	{
+		int val;
+
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_sockptr(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		if (val < 0 || val >= 256)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		if (xs->state != XSK_READY) {
+			mutex_unlock(&xs->mutex);
+			return -EBUSY;
+		}
+		xs->tx_metadata_len = val;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
 	default:
 		break;
 	}
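
With the prototype above, the userspace side boils down to something
like this (a sketch; error handling omitted):

	int meta_len = 16;	/* sizeof() of the agreed-upon metadata struct */

	/* must happen while the socket is still in XSK_READY state */
	setsockopt(xsk_fd, SOL_XDP, XDP_TX_METADATA_LEN,
		   &meta_len, sizeof(meta_len));

	/* the producer then writes meta_len bytes immediately in front
	 * of every desc->addr it puts on the TX ring; copy mode picks
	 * them up via skb_metadata_set() as in the hunk above */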
Magnus Karlsson June 19, 2023, 7:15 a.m. UTC | #24
On Sat, 17 Jun 2023 at 01:10, Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/16, Stanislav Fomichev wrote:
> > On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > > > >
> > > > > Some immediate thoughts after glancing through this:
> > > > >
> > > > > > --- Use cases ---
> > > > > >
> > > > > > The goal of this series is to add two new standard-ish places
> > > > > > in the transmit path:
> > > > > >
> > > > > > 1. Right before the packet is transmitted (with access to TX
> > > > > >    descriptors)
> > > > > > 2. Right after the packet is actually transmitted and we've received the
> > > > > >    completion (again, with access to TX completion descriptors)
> > > > > >
> > > > > > Accessing TX descriptors unlocks the following use-cases:
> > > > > >
> > > > > > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > > > > > use device offloads. The existing case implements TX timestamp.
> > > > > > - Observability: global per-netdev hooks can be used for tracing
> > > > > > the packets and exploring completion descriptors for all sorts of
> > > > > > device errors.
> > > > > >
> > > > > > Accessing TX descriptors also means that the hooks have to be called
> > > > > > from the drivers.
> > > > > >
> > > > > > The hooks are a light-weight alternative to XDP at egress and currently
> > > > > > don't provide any packet modification abilities. However, eventually,
> > > > > > can expose new kfuncs to operate on the packet (or, rather, the actual
> > > > > > descriptors; for performance sake).
> > > > >
> > > > > dynptr?
> > > > >
> > > > > > --- UAPI ---
> > > > > >
> > > > > > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > > > > expose any UAPI and are implemented as tracing programs that call
> > > > > > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > > > > > programs. The series expands device-bound infrastructure to tracing
> > > > > > programs.
> > > > >
> > > > > Not a fan of the "attach from BPF syscall program" thing. These are part
> > > > > of the XDP data path API, and I think we should expose them as proper
> > > > > bpf_link attachments from userspace with introspection etc. But I guess
> > > > > the bpf_mprog thing will give us that?
> > > > >
> > > > > > --- skb vs xdp ---
> > > > > >
> > > > > > The hooks operate on a new light-weight devtx_frame which contains:
> > > > > > - data
> > > > > > - len
> > > > > > - sinfo
> > > > > >
> > > > > > This should allow us to have a unified (from BPF POV) place at TX
> > > > > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > > > > for each invocation).
> > > > >
> > > > > Not sure what I think about this one. At the very least I think we
> > > > > should expose xdp->data_meta as well. I'm not sure what the use case for
> > > > > accessing skbs is? If that *is* indeed useful, probably there will also
> > > > > end up being a use case for accessing the full skb?
> > > >
> > > > I spent some time looking at the data_meta story on AF_XDP TX and it
> > > > doesn't look like it's supported (at least in a general way).
> > > > You obviously get some data_meta when you do XDP_TX, but if you want
> > > > to pass something to the bpf prog when doing TX via the AF_XDP ring,
> > > > it gets complicated.
> > >
> > > When we designed this some 5 - 6 years ago, we thought that there
> > > would be an XDP for egress action in the "nearish" future that could
> > > be used to interpret the metadata field in front of the packet.
> > > Basically, the user would load an XDP egress program that would define
> > > the metadata layout by the operations it would perform on the metadata
> > > area. But since XDP on egress has not happened, you are right, there
> > > is definitely something missing to be able to use metadata on Tx. Or
> > > could your proposed hook points be used for something like this?
> >
> > Thanks for the context!
> > Yes, the proposal is to use these new tx hooks to read out af_xdp
> > metadata and apply it to the packet via a bunch of tbd kfuncs.
> > AF_XDP and BPF programs would have to have a contract about the
> > metadata layout (same as we have on rx).
> >
> > > > In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> > > > and pass something in the headroom.
> > >
> > > This feature is mainly used to allow for multiple packets on the same
> > > chunk (to save space) and also to be able to have packets spanning two
> > > chunks. Even in aligned mode, you can start a packet at an arbitrary
> > > address in the chunk as long as the whole packet fits into the chunk.
> > > So no problem having headroom in any of the modes.
> >
> > But if I put it into the headroom it will only be passed down to the
> > driver in zero-copy mode, right?
> > If I do tx_desc->addr = packet_start, no metadata (that goes prior to
> > packet_start) gets copied into the skb in copy mode (it seems).
> > Or do you suggest that the interface should be tx_desc->addr =
> > metadata_start and the bpf program should call the equivalent of
> > bpf_xdp_adjust_head to consume this metadata?
>
> For copy-mode, here is what I've prototyped. That seems to work.
> For zero-copy, I don't think we need anything extra (besides exposing
> xsk->tx_metadata_len at the hook point, TBD). Does the patch below make
> sense?

Was just going to suggest adding a setsockopt, so this makes perfect
sense to me. Thanks!

> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e96a1151ec75..30018b3b862d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -51,6 +51,7 @@ struct xdp_sock {
>         struct list_head flush_node;
>         struct xsk_buff_pool *pool;
>         u16 queue_id;
> +       u8 tx_metadata_len;
>         bool zc;
>         enum {
>                 XSK_READY = 0,
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index a78a8096f4ce..2374eafff7db 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING       6
>  #define XDP_STATISTICS                 7
>  #define XDP_OPTIONS                    8
> +#define XDP_TX_METADATA_LEN            9
>
>  struct xdp_umem_reg {
>         __u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cc1e7f15fa73..a95872712547 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -493,14 +493,22 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>                         return ERR_PTR(err);
>
>                 skb_reserve(skb, hr);
> -               skb_put(skb, len);
> +               skb_put(skb, len + xs->tx_metadata_len);
>
>                 buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> +               buffer -= xs->tx_metadata_len;
> +
> -               err = skb_store_bits(skb, 0, buffer, len);
> +               /* copy the metadata together with the packet data */
> +               err = skb_store_bits(skb, 0, buffer, len + xs->tx_metadata_len);
>                 if (unlikely(err)) {
>                         kfree_skb(skb);
>                         return ERR_PTR(err);
>                 }
> +
> +               if (xs->tx_metadata_len) {
> +                       skb_metadata_set(skb, xs->tx_metadata_len);
> +                       __skb_pull(skb, xs->tx_metadata_len);
> +               }
>         }
>
>         skb->dev = dev;
> @@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>                 mutex_unlock(&xs->mutex);
>                 return err;
>         }
> +       case XDP_TX_METADATA_LEN:
> +       {
> +               int val;
> +
> +               if (optlen < sizeof(val))
> +                       return -EINVAL;
> +               if (copy_from_sockptr(&val, optval, sizeof(val)))
> +                       return -EFAULT;
> +
> +               if (val < 0 || val >= 256)
> +                       return -EINVAL;
> +
> +               mutex_lock(&xs->mutex);
> +               if (xs->state != XSK_READY) {
> +                       mutex_unlock(&xs->mutex);
> +                       return -EBUSY;
> +               }
> +               xs->tx_metadata_len = val;
> +               mutex_unlock(&xs->mutex);
> +               return 0;
> +       }
>         default:
>                 break;
>         }